**GenoMax** · 07-12-2013, 07:06 AM

You can use the suggestions in this thread to sample reads from your sequence files for testing small subsets: http://seqanswers.com/forums/showthread.php?t=16505

**mattia** · 07-12-2013, 07:25 AM

If you want a small number of reads in your "test sample" (assuming you have a .fastq file), an easy way is:

head -10000 your_big_file.fastq > sample_test_2500_reads.fastq

One read = 4 lines in a .fastq file, so to get (for example) 2500 reads in your "test sample", you have to pass -10000 to head.

By changing this value, you change the number of reads.

This is a first, easy way.
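The arithmetic above (number of reads times 4 lines) can be written out as a small shell sketch. The demo file and the variable names `n_reads` and `n_lines` are made up for illustration, and the sketch assumes every FASTQ record takes exactly 4 lines, which may not hold for files with wrapped sequences.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical demo input: a tiny FASTQ with 5 reads (20 lines).
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done > "$workdir/your_big_file.fastq"

n_reads=2                  # number of reads wanted in the test sample
n_lines=$((n_reads * 4))   # one FASTQ record = 4 lines (assumption)

# Keep only the first n_lines lines, i.e. the first n_reads reads.
head -n "$n_lines" "$workdir/your_big_file.fastq" \
    > "$workdir/sample_test_${n_reads}_reads.fastq"

# Report how many reads ended up in the sample.
grep -c '^@read' "$workdir/sample_test_${n_reads}_reads.fastq"
```

For 2500 reads, `n_reads=2500` gives the `-10000` used in the command above.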

**moritz1** · 07-12-2013, 07:34 AM

Like I wrote, I have several files which end in .txt.gz
One example:
D1n67_GDNA_12s00128-1-2_Hose_lane00128_1_2_sequence.txt.gz
Does that answer the question about the file format? (I'm totally new to genetics)
So can I take my file, gunzip it and then run your line?

**Heisman** · 07-12-2013, 07:41 AM

Yes, you can gunzip it and then do what mattia suggested.
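As a possible shortcut, the decompression and the `head` step can be chained so that the big .gz file is never fully unpacked on disk. This is only a sketch: the demo file name and `n_reads` are made up, and it assumes 4-line FASTQ records inside the .txt.gz file.

```shell
#!/usr/bin/env bash
# pipefail is deliberately not set: head closes the pipe early, so on a
# large file gunzip is expected to be stopped by SIGPIPE.
set -eu

workdir=$(mktemp -d)
# Hypothetical demo input standing in for a real *_sequence.txt.gz file.
for i in 1 2 3 4 5; do
    printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip > "$workdir/demo_sequence.txt.gz"

n_reads=3   # number of reads wanted in the test sample

# Stream-decompress, keep the first n_reads records, and re-compress.
# The original .gz file is left untouched.
gunzip -c "$workdir/demo_sequence.txt.gz" \
    | head -n $((n_reads * 4)) \
    | gzip > "$workdir/sample_test.txt.gz"

# Report how many reads ended up in the sample.
gunzip -c "$workdir/sample_test.txt.gz" | grep -c '^@read'
```

Writing the sample back out as .gz keeps it in the same format as the original input, so a pipeline written for .txt.gz files should accept it unchanged.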

Keep in mind that the first lines of many fastq files will NOT have sequence data of similar quality to the majority of the file. If you only want to make the pipeline work, that may not be a problem (unless the reads are all "NNN...NNN"s and you're trying to align them), but if you want reads that are representative of the file as a whole, then you should look at the thread that GenoMax linked to.
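One way to get a more representative sample than the first lines is to pick reads at random. The sketch below is not taken from the linked thread; it is a minimal illustration that assumes GNU `shuf` is installed, that records are exactly 4 lines, and that no line contains a tab character. The demo file and `n_reads` are made up. For paired-end data the same reads would need to be picked from both files, which this sketch does not do.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
# Hypothetical demo input: a tiny FASTQ with 20 reads.
for i in $(seq 1 20); do
    printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done > "$workdir/demo.fastq"

n_reads=5   # number of reads to draw at random

# paste joins each group of 4 lines into one tab-separated line,
# shuf picks n_reads of those lines at random (without replacement),
# tr turns the tabs back into newlines.
paste - - - - < "$workdir/demo.fastq" \
    | shuf -n "$n_reads" \
    | tr '\t' '\n' > "$workdir/random_sample.fastq"

# Report how many reads ended up in the sample.
grep -c '^@read' "$workdir/random_sample.fastq"
```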

**moritz1** · 07-12-2013, 08:32 AM

Now I am stuck with this failure:

Can anybody help me out of this? This is my Bpipe pipeline:

REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
PLATYPUS_HOME="~/bin/Platypus_0.1.9"
STAMPY_HOME="~/bin/stampy-1.0.17"
BWA_HOME="/home/trr/bwa-0.7.5a"

seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads.txt.gz"

//readgroup information:
//rg_id="lane711s003155"
rg_id="lane712s006433"
rg_lb="nextera"
rg_pl="ILLUMINA"
rg_pu="flowcell-barcode.lane"
rg_sm="GDNA"

//##############################################################

//Alignment

//##############################################################
//BWA
@Transform("sai")
align_bwa = {
    exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
}

@Transform("sam")
sampe_bwa = {
    exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
}

//##############################################################

//SAM to BAM

//##############################################################

//@Transform("bam")
sort = {
    exec "samtools view -bSu $input | samtools sort - $output"
}

//##############################################################

//Remove Duplicates

//##############################################################

@Filter("dedupe")
dedupe = {
    exec """
        java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
            METRICS_FILE=out.metrics
            REMOVE_DUPLICATES=true
            ASSUME_SORTED=true
            VALIDATION_STRINGENCY=LENIENT
            INPUT=$input
            OUTPUT=$output
    """
}

//##############################################################

//Run Pipeline

//##############################################################

Bpipe.run {
    "sample_test_10k_reads*" * [align_bwa] + sampe_bwa + sort + dedupe
}

and I run it with

Code:

bpipe run sample_test.pipe sample_test_10k_reads.txt.gz

**Heisman** · 07-12-2013, 08:52 AM

I may be wrong as I haven't used this, but it looks like your pipeline wants a $seq1 and a $seq2, but you've only supplied a $seq1. Is that possibly the problem?

When I have issues with alignment commands, I try to do it with the bare necessities to make it run and then add in more complicated aspects of the command once I can make sure I'm typing in the basic command correctly.
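Since `bwa sampe` takes two .sai files and two read files (one per mate), a quick sanity check before running the pipeline is to confirm that both mate files are present and hold the same number of reads. The sketch below uses made-up demo files and a hypothetical helper `count_reads`; it assumes 4-line FASTQ records.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
# Hypothetical demo mates; real input would be the two paired .txt.gz files.
for mate in 1 2; do
    for i in 1 2 3; do
        printf '@read%s/%s\nACGTACGT\n+\nIIIIIIII\n' "$i" "$mate"
    done | gzip > "$workdir/demo_reads${mate}.txt.gz"
done

count_reads() {
    # Number of FASTQ records = number of lines / 4 (assumes 4-line records).
    echo $(( $(gunzip -c "$1" | wc -l) / 4 ))
}

n1=$(count_reads "$workdir/demo_reads1.txt.gz")
n2=$(count_reads "$workdir/demo_reads2.txt.gz")

if [ "$n1" -eq "$n2" ]; then
    echo "OK: both files contain $n1 reads"
else
    echo "Mismatch: $n1 vs $n2 reads" >&2
    exit 1
fi
```

If one of the two files is missing, `gunzip` fails and the script stops at that point, which points to the same kind of problem as an undefined second input.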

**moritz1** · 07-12-2013, 09:26 AM

Oh yeah thanks a lot, I think that should be it. I'll edit later when I can try it out!

**moritz1** · 07-13-2013, 02:57 AM

Now I am getting this error, although I have changed the input to two files:

Code:

REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
PLATYPUS_HOME="~/bin/Platypus_0.1.9"
STAMPY_HOME="~/bin/stampy-1.0.17"
BWA_HOME="/home/trr/bwa-0.7.5a"

seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads1.txt.gz"
seq2="/genedata/sample_test_10k_reads/sample_test_10k_reads2.txt.gz"
 
//readgroup information:
rg_id="lane712s006433"
rg_lb="nextera"
rg_pl="ILLUMINA"
rg_pu="flowcell-barcode.lane"
rg_sm="GDNA"
 
//#############################################################
//Alignment
//##############################################################
//BWA
@Transform("sai")
align_bwa = {
      exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
}
 
@Transform("sam")
sampe_bwa = {
      exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
}
 
//##############################################################
//SAM to BAM
//##############################################################
 
//@Transform("bam")
sort = {
        exec "samtools view -bSu $input  | samtools sort - $output"
}
 
//##############################################################
 
//Remove Duplicates
 
//##############################################################
 
@Filter("dedupe")
dedupe = {
        exec """
           java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                           MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                           METRICS_FILE=out.metrics
                           REMOVE_DUPLICATES=true
                           ASSUME_SORTED=true  
                           VALIDATION_STRINGENCY=LENIENT
                           INPUT=$input
                           OUTPUT=$output
       """
}
 
//##############################################################
//Run Pipeline
//##############################################################
 
Bpipe.run {
    "sample_test_10k_reads%" * [align_bwa] + sampe_bwa + sort + dedupe
}

and I run it with:

Code:

bpipe run sample_test_pipeline.pipe sample_test_10k_reads*

**swbarnes2** · 07-13-2013, 12:24 PM

Shouldn't that one line read as follows (the change is `$inputs1.sai $inputs2.sai` in place of `$inputs.sai`):

Code:

exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs1.sai $inputs2.sai $seq1 $seq2 > $output"

**moritz1** · 07-15-2013, 05:35 AM

Yes, I got it working. Thanks a lot.

How do I shorten a genome sequence to secure my workflow is properly functioning?
