**mastal** · 03-05-2014, 02:55 AM

Could the problem be with the formatting of the files as well as the quality values?

I notice that the read header lines say length=216, are these paired end reads that have been merged together, and is that why they don't align?

**Gordo2B** · 03-05-2014, 03:15 AM

Here is a description from the ModEncode website.

"RNA-seq analysis was performed on poly(A)+ RNA from various tissues dissected from various stages of D. melanogaster. Total RNA was isolated by the Peter Cherbas group. Isolation of poly(A)+ RNA and strand-specific library construction were performed in the Brenton Graveley lab. Libraries were distributed among 3 labs in the Drosophila Transcriptome group for sequencing (Celniker, Gingeras, and Graveley) for paired-end RNA sequencing (2x76+ nt) on the GAIIx and HiSeq platforms. Fastq files were generated using pipeline version 1.5."

I'll admit I had just assumed it wasn't paired-end. I've worked with paired-end data before and I thought you needed two .fastq files for that. My bad.

**Brian Bushnell** · 03-05-2014, 07:58 AM

Paired-end data doesn't need two fastq files; it can be interleaved in a single file. But as mastal said, it looks like that read was paired data that was merged.
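For anyone unfamiliar with interleaved files, here is a minimal Python sketch of splitting one back into two mate lists. It assumes plain 4-line FASTQ records with mates strictly alternating (read 1, read 2, read 1, ...); the function names are made up for illustration and are not from any toolkit.

```python
from itertools import islice
from typing import Iterable, Iterator


def fastq_records(lines: Iterable[str]) -> Iterator[tuple[str, ...]]:
    """Yield 4-line FASTQ records (header, sequence, '+', quality)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(lines: Iterable[str]):
    """Split alternating records into (read1_records, read2_records)."""
    read1, read2 = [], []
    for index, record in enumerate(fastq_records(lines)):
        # Assumption: even-numbered records are mate 1, odd-numbered are mate 2.
        (read1 if index % 2 == 0 else read2).append(record)
    if len(read1) != len(read2):
        raise ValueError("odd number of records; file is not cleanly interleaved")
    return read1, read2
```

In practice you would pass an open file handle as `lines` and write each list to its own file.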

Tophat can't handle really low-quality data (which that read is). If most of your reads look like that, with mostly N's and Q2 bases, it's not surprising almost nothing aligned. I suggest you go back to the original raw fastq files and do quality-control on them: quality trimming and adapter trimming (in case of insert sizes being shorter than read length), and try again, but it's possible the library is just a complete failure.
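To make the idea of quality trimming concrete, here is a rough Python sketch that removes bases from the 3' end while they are N or below a quality cutoff. It is a simplification for illustration only, not the algorithm used by any particular trimmer; the `min_q=10` cutoff and the function name are assumptions.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 10, offset: int = 33):
    """Drop trailing bases that are N or have Phred quality below min_q.

    Assumes Phred+33 quality encoding unless another offset is given.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]
```

A read that is all N's and Q2 bases trims down to an empty string, which is why such reads have no chance of aligning.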

If you download BBMap, you can do quality- and adapter-trimming with bbduk.sh and also try aligning with bbmap.sh, which is much more robust to errors than Tophat. Still, at the default settings, it would also not align that particular read unless it was at least trimmed to remove the Ns.

**yueluo** · 03-05-2014, 05:18 PM

Is it this study?

http://www.ncbi.nlm.nih.gov/sra/SRX029215

Study summary: SRP003893 • D. melanogaster Cell Line Stranded RNASeq

So it kind of looks like merged reads. And I think the quality is qual-64 judging from the '!'s for base N.

**Brian Bushnell** · 03-05-2014, 05:33 PM

Originally posted by yueluo:

And I think the quality is qual-64 judging from the '!'s for base N.

Nope, that's ASCII-33:
! = 33 (Q0)
# = 35 (Q2)
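The arithmetic can be checked with a few lines of Python. This is a small sketch: `phred_scores` simply subtracts the offset from each character code, and `guess_offset` is a hypothetical heuristic relying on the fact that the 64-based encodings never use characters below ';' (ASCII 59). Returning 64 otherwise is only a guess, since high-quality Phred+33 data can also lack low characters.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality: str) -> int:
    """Heuristic: any character below ASCII 59 can only be Phred+33."""
    return 33 if any(ord(ch) < 59 for ch in quality) else 64


print(phred_scores("!#"))  # [0, 2]
```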

**yueluo** · 03-05-2014, 06:12 PM

Oops, sorry, my bad.

**Gordo2B** · 03-07-2014, 01:54 AM

Hey guys,

Thanks for the input. I kinda knew something was off about the fragment length, but I wouldn't have known it was paired-end. I still don't see the point in making a .fastq like that, where you can't discern where one read ends and the other begins.

Anyway, I took the quick and dirty way out and trimmed the reads to ~50bp (thank you fastx toolkit), and achieved >98% of reads mapped and post-processing is going well so far.

All the best,

Gordon
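For reference, the fixed-length trim described above amounts to keeping only the first N bases of each read. A minimal Python sketch of that idea follows; it is not the fastx toolkit's code, and the function name is hypothetical. Note that on a merged 216 bp read, the first 50 bases come entirely from the first mate, so the second mate is discarded.

```python
def hard_trim(seq: str, qual: str, keep: int = 50):
    """Keep only the first `keep` bases and their quality values."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[:keep], qual[:keep]
```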

**yueluo** · 03-07-2014, 02:22 AM

Usually, PE reads are merged when you have a small-insert library. For example, if you have a ~300bp library but do PE250 on a MiSeq, the two reads will overlap at their 3' ends. After trimming the low-quality bases from both reads, you can merge them into one using the overlapping bases.
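Here is a toy Python sketch of that kind of overlap merging. It reverse-complements read 2 and looks for the longest exact match between the end of read 1 and the start of the reverse-complemented read 2. Real merging tools tolerate mismatches and use quality scores; the exact-match rule, the `min_overlap=10` default, and the function names are simplifying assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence (uppercase A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge a read pair by exact 3' overlap; return None if none is found."""
    read2_rc = revcomp(read2)
    # Try the longest possible overlap first, then shorter ones.
    for olen in range(min(len(read1), len(read2_rc)), min_overlap - 1, -1):
        if read1[-olen:] == read2_rc[:olen]:
            return read1 + read2_rc[olen:]
    return None
```

This is a different operation from the simple end-to-end concatenation seen in the reads discussed earlier in this thread, where no overlap is involved.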

**mastal** · 03-07-2014, 02:50 AM

I think in this case the reads were merged because the data came from the NCBI's repository (SRA), and that is the format that they use for paired reads.

I think there are tools to split the data back into R1 and R2, there have previously been discussions about that here on SEQanswers. I will try to find a link and post it.

See for example this thread on biostars

How To Convert Sra-Lite Paired-End Submission To Fastq?

http://www.biostars.org/p/11111/

You need to use fastq-dump from the SRA-Toolkit.
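If you only have an already-dumped file with the two mates concatenated, a rough Python sketch of splitting each record looks like this. It assumes both mates have the same length (e.g. a length=216 record is 108 + 108) unless you pass the read 1 length explicitly. That is an assumption about the data, which is why re-dumping with fastq-dump is the safer route: the toolkit can split reads using the lengths recorded in the archive. The function name is hypothetical.

```python
def split_spot(seq: str, qual: str, read1_len=None):
    """Split a concatenated paired read into ((seq1, qual1), (seq2, qual2))."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if read1_len is None:
        # Assumption: both mates are the same length.
        if len(seq) % 2:
            raise ValueError("odd length; cannot assume equal-length mates")
        read1_len = len(seq) // 2
    return (seq[:read1_len], qual[:read1_len]), (seq[read1_len:], qual[read1_len:])
```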

TopHat not aligning (pretty much) anything.
