
**Bukowski** · 10-26-2015, 09:01 AM

The disturbance in the first 12 bases is not at all unusual for an Illumina RNA-Seq experiment. It has been documented countless times in threads on SeqAnswers, there is a paper describing it (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/), and it is discussed in the FastQC documentation.

As you probably won't deduplicate the data (normally something you do after *alignment* to a reference genome for DNA-Seq), why are you concerned about it? Again, there's plenty of discussion on SeqAnswers about what to do with duplicates in RNA-Seq data.

R2 files are always of a lower quality than R1 files; it's the nature of the Illumina chemistry. I didn't realise it was quite so pronounced on the 4000, though.

FastQC is really not optimised for QC of RNA-Seq data, be warned.

**Brian Bushnell** · 10-26-2015, 09:52 AM

I'd be interested in seeing some QC metrics of your data. The only 2x150bp HS4000 data I've seen so far looked terrible (and read 2 is FAR worse than read 1), and would need stringent trimming and filtering before being usable, but that was just one run. In order to get QC metrics, I suggest you try this:

First, acquire the raw data so that you can preprocess it optimally. Then adapter-trim it as indicated here.

From your histograms, it looks like you did not get back 2x150bp reads, since the max position is around 130bp or a little higher. So either the libraries were made incorrectly, or size-selection was done improperly (or skipped), or something went wrong in sequencing. Depending on your actual target insert size, and who did those steps, you may be eligible for a free replacement run, considering that you got less than half of the data you paid for. A length histogram and insert size histogram would be helpful, in fact. Using the BBMap package, and assuming you have files r1.fq and r2.fq containing your adapter-trimmed (not quality-trimmed) reads:

readlength.sh in=r#.fq bin=1 nzo out=lengthhist.txt

bbmerge.sh in=r#.fq ihist=inserthist.txt xloose
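As a rough cross-check of the length histogram, the same kind of table can be produced with standard tools. This is only a minimal sketch, not a replacement for readlength.sh: `length_hist` is a hypothetical helper name, and it assumes an uncompressed FASTQ file with exactly four lines per record (unwrapped sequence lines).

```shell
# Print a read-length histogram (length<TAB>count), sorted by length.
# Assumption: plain 4-line FASTQ records, so the sequence is line 2 of each record.
length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Usage (r1.fq is the file name assumed in the post above):
#   length_hist r1.fq > lengthhist_r1.txt
```

If most reads sit at a single length well below 150, that points to a fixed crop rather than adapter trimming, which would normally produce a spread of lengths.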

Then, if you have a reference (genome or transcriptome), I suggest you map to it to determine your actual read error rates. You can do so like this:

bbmap.sh in=r#.fq ref=reference.fa mhist=mhist.txt ihist=ihist_mapping.txt qhist=qhist.txt qahist=qahist.txt bhist=bhist.txt slow minid=0.2

If you don't have a reference you can make a quick assembly like this:

tadpole.sh in=r#.fq out=contigs.fa

...which will not be ideal, but adequate for measuring quality metrics. Once you have those metrics it will be more clear how to proceed.

P.S. I assumed you were trying to assemble a previously unassembled transcriptome, but I guess I actually don't know what you are trying to do. What is the goal of your experiment?

**BastianOldenkott** · 10-27-2015, 09:12 AM


Hey Bukowski and Brian,

thank you for your quick replies. I guess your opinions are quite contrary (if I interpret Bukowski right).

Both: I analyzed the untrimmed data again using PRINSEQ:

Bukowski: Do you think the data can be assembled as it is? Do you see any problems in the 130 bp kmer peak?
I want to use Trinity for assembly. Can you recommend any settings or extensions, important for evaluation of this particular data?

Brian: The data I received from the company had read lengths of 150 bp, but I had already trimmed them with Trimmomatic-0.33 (HEADCROP:14 --> 136 bp). The 'raw' (adapter-trimmed) files have been checked with PRINSEQ again (see above). Based on my limited knowledge of NextGen sequencing, I cannot really see that the results are too bad (except for the quality drop in the reverse sequences). Can you explain a little further, please?
I will also follow your suggested pipeline and will give you the results asap, thank you very much for the input. My experiment aims mostly for analysis of organellar transcripts and a certain gene family, named 'PPR-DYW'. So far, the transcriptomes of two sister taxa from the same genus are available, but we expect highly divergent sequences in the mitochondrion.

Best regards,

Bastian

**GenoMax** · 10-27-2015, 09:21 AM

@Bastian: Please don't trim data by brute force. You are likely throwing away good data for no reason. As @Bukowski indicated the "bias" seen in the first few cycles is characteristic of RNAseq and is "normal". Instead what you should do is pass your sequences through a trimming program (if the reads have been "cleaned" already every one should come through) and then go ahead and try Trinity out (and Brian's suggestions too).

**Brian Bushnell** · 10-27-2015, 09:32 AM

The untrimmed data looks fine, but you can't get much useful info from those charts, anyway. The question is what it will look like after proper adapter trimming. If the reads end up 130bp, then the insert size was way too short.
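The reasoning behind that last sentence can be sketched as a small calculation: when the insert is shorter than the read, the read runs into the adapter, and adapter trimming cuts it back to the insert length. The function name `trimmed_len` is hypothetical, and the sketch ignores quality trimming and partial adapter matches.

```shell
# Expected read length after adapter trimming only:
# a read cannot be longer than the insert it was sequenced from.
# $1 = sequenced read length, $2 = insert size (both in bp)
trimmed_len() {
    if [ "$2" -lt "$1" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

trimmed_len 150 130   # prints 130: insert shorter than the read
trimmed_len 150 300   # prints 150: read is untouched
```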

**BastianOldenkott** · 10-27-2015, 09:46 AM

@GenoMax: Thank you for your suggestions. Could you recommend a trimming program besides trimmomatic? Or would you just change the parameters? When I trimmed the sequences I used the following: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 HEADCROP:14 SLIDINGWINDOW:4:15 MINLEN:50 (Although, removing the adapter was unnecessary, since the data we got from the company was already cleaned from adapter sequences.)
Or: Since trimmomatic is already integrated in trinity by default, should I not trim before starting the assembly and use the internal trimmomatic?

All the best,

Bastian

**GenoMax** · 10-27-2015, 09:56 AM

@Bastian: Using trimmomatic in trinity is one option or you can use BBDuk :-)

If there are no adapters to begin with all reads should survive.
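One way to check that claim is to count reads before and after the trimming step. The following is a minimal sketch with hypothetical helper names (`count_reads`, `survival_pct`), assuming uncompressed FASTQ files with four lines per record; the file names in the usage comment are placeholders.

```shell
# Number of reads in a 4-line-per-record FASTQ file.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Percentage of reads that survived: $1 = reads before, $2 = reads after.
survival_pct() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", 100 * after / before }'
}

# Usage (placeholder file names):
#   survival_pct "$(count_reads r1.fq)" "$(count_reads r1_trimmed.fq)"
```

A value close to 100 would be consistent with the reads having been adapter-cleaned already; a clearly lower value suggests the trimming step is discarding reads for some other reason, such as the quality or minimum-length settings.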


RNA Seq- Problems with duplicated sequences and kmer content
