**dpryan** · 02-13-2014, 06:48 AM

That's because the second part isn't part of the read name. There's an option in fastq-dump to put the original read name where it should be rather than just numbering things sequentially.

**splaisan** · 02-13-2014, 06:58 AM

Problem is I downloaded the pre-made FASTQ files from the EBI repo and mapped them all :-( without figuring this out. I can fix this by patching the FASTQ files, but I will still need to remap the whole shebang...

Thanks for the info anyway (for next time)
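For anyone in the same spot, the "patching" idea could be sketched roughly as below. This is only an illustration, not something tested in this thread: the function name is made up, and it assumes the pre-made header lines look like `@SRR000000.1 HWI-ST188:1:1101:1222:2140/1` (sequential ID first, original name second), which may not match every file in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite FASTQ headers so only the original read name remains.
# ASSUMPTION: header lines look like "@SRR000000.1 HWI-ST188:1:1101:1222:2140/1".
patch_fastq_headers() {
  awk '
    NR % 4 == 1 {
      if (NF >= 2) {
        name = $2
        sub(/\/[12]$/, "", name)   # drop a trailing /1 or /2 mate suffix
        print "@" name
      } else {
        print                      # header has no second field: leave it alone
      }
      next
    }
    NR % 4 == 3 { print "+"; next }  # a bare "+" separator line is valid FASTQ
    { print }                        # sequence and quality lines pass through
  '
}
```

It reads FASTQ on stdin and writes to stdout, so it could be used as `zcat reads_1.fastq.gz | patch_fastq_headers | gzip > reads_1.patched.fastq.gz` (file names are placeholders). The reads would still need to be remapped afterwards.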

**splaisan** · 02-15-2014, 01:59 AM

Picard MarkDuplicates-compatible reads from SRA data

A few days later, the issue is fixed by:

1. NOT downloading the FASTQ files from SRA, but instead fetching the .sra-formatted data using Aspera (I used the browser link).
2. Using the sratoolkit command fastq-dump (thanks Devon) to convert .sra to .fastq and split the reads into paired files. The trick here was to use the specific parameter -F|--origfmt, which ensures 'Defline contains only original sequence name' and discards the remaining text.

The resulting command in my case was (after correcting typo!):

fastq-dump -F --split-3 --gzip *.sra -O fastq_read_folder

TIP: I used PPSS (Parallel Processing Shell Script) to speed this up dramatically for the 26 input files on my 24-thread machine.
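The tip above names its own tool. As a swapped-in alternative, the same fan-out (one fastq-dump process per .sra file, several running at once) could be sketched with `find` and `xargs -P`. The function name, the `FASTQ_DUMP` variable, and the default job count are illustrative assumptions. The fastq-dump flags are the ones from the command above.

```shell
#!/usr/bin/env bash
# Sketch: convert every .sra file in a directory, several at a time, using xargs -P.
# ASSUMPTIONS: function name, FASTQ_DUMP variable and default job count are illustrative.
# Note: -maxdepth, -print0, -0 and -P are GNU/BSD extensions, not strict POSIX.
FASTQ_DUMP="${FASTQ_DUMP:-fastq-dump}"

convert_sra_dir() {
  local sra_dir="$1" out_dir="$2" jobs="${3:-4}"
  mkdir -p "$out_dir"
  # One fastq-dump process per file, with at most $jobs running concurrently.
  find "$sra_dir" -maxdepth 1 -name '*.sra' -print0 |
    xargs -0 -n 1 -P "$jobs" "$FASTQ_DUMP" -F --split-3 --gzip -O "$out_dir"
}
```

On a 24-thread machine this might be called as `convert_sra_dir . fastq_read_folder 24`. Whether that many concurrent jobs helps will depend on disk throughput as much as on CPU count.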

My reads now have a header line like this:

@HWI-ST188:1:1101:1222:2140
NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA
+HWI-ST188:1:1101:1222:2140
#1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA

which, after TopHat mapping, gives the following record for that particular read:

HWI-ST188:1:1101:1222:2140 99 chr10 59953037 50 101M = 59953061 125 NAGACGAAGGTTCTTCAGTTAAACAGTTTAGAGCCCCATAAGAGCAAACTGTAGTGTAAAGAGGAAAAGTAAGTACAATCTTTCCAGACACACAACTAATA #1:BDDDDHHHHHIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIHGIFHCGIEHIIIHIIIIDEHHCHEHEEEEEECCECCCBCCBBBBCCCCA AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C100 YT:Z:UU NH:i:1

Running Picard on this BAM data now identifies a few thousand optical duplicates in the full sample.

QED
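Why the read name matters: as far as I understand it, Picard finds optical duplicates by reading tile and x/y coordinates out of the read name, and by default it expects a colon-separated name whose last three fields are tile, x and y. A quick sanity check on a read name could look like the sketch below. The function name is hypothetical, and this only mimics the idea, not Picard's actual parser.

```shell
#!/usr/bin/env bash
# Sketch: pull tile/x/y out of a colon-delimited read name.
# ASSUMPTION: the last three fields are tile, x, y, as in the example read above.
read_name_coords() {
  local name="$1"
  local IFS=':'
  local -a parts
  read -r -a parts <<< "$name"   # split the name on colons
  local n=${#parts[@]}
  if (( n < 5 )); then
    echo "no coordinates in read name: $name" >&2
    return 1
  fi
  echo "tile=${parts[n-3]} x=${parts[n-2]} y=${parts[n-1]}"
}
```

`read_name_coords 'HWI-ST188:1:1101:1222:2140'` prints `tile=1101 x=1222 y=2140`, whereas a sequentially numbered name (say, a made-up `SRR000000.1`) fails the check. That is consistent with optical duplicates only being found once the original names were restored.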

**GenoMax** · 02-15-2014, 05:41 AM

Don't see a "-F" in your fastq-dump command above. Typo?

**splaisan** · 02-15-2014, 06:15 AM

shame on me! corrected now (thanks)


keep read address using tophat
