**Brian Bushnell** · 07-27-2015, 10:14 AM

First - to clarify, these are single-ended, right? I will use that as an assumption.

Originally posted by vajuli

1) Which reads would be best to use for mapping to reference and DE analysis (I’m guessing trimmed or trimmed&decontaminated, but I cannot decide which).

It should not matter too much between the trimmed and trimmed+decontaminated, but I'd suggest trimmed+decontaminated; there's no reason to leave contamination in, as long as nothing in the contaminants file (other than rRNA) has high similarity to your organisms of interest.
However, I would make a couple of suggestions - first, throw away reads shorter than some value; maybe 40bp or so. Really short reads are uninformative and can skew your results without adding anything of value. Second, Q20 is fairly high for quality-trimming; the higher the value, the more bias is incurred. For mapping, I generally don't recommend going over 10~15 unless the aligner or settings are very intolerant of mismatches.

2) The number of reads in sample UNI-2 is ~3x smaller than in the other samples. Would such a big difference in library size present a problem during DE analysis in DESeq2, edgeR and voom+limma? In other words, should I discard this replicate during DE analysis?

Not sure, but I highly doubt the difference in read count would cause problems, so I would expect including it to increase resolution. It is different in other ways, though - the read-length histogram is very different from the others, and the quality is much lower. Those are more worrisome. It seems like the lane had serious problems, and I would consider tossing it for that reason.

4) Would you agree that the adapter contamination is the driving force behind the nonstandard appearance of the “Per base sequence content” graphs? (I am referring to the rising %C line, not the beginning of reads.)

Yes.

6) Some of the reads have been trimmed to very short sizes. Will STAR appropriately discard reads that are below some reasonable size, or do I have to do it manually? If so, what would be the minimum suggested (allowed) read size to be used during mapping for DE analysis?

Yes, I'd set a size limit. The length depends on the organism - does it have introns, and if so, how long? But even for prokaryotes, with this dataset, I don't see any reason to include stuff below 40bp, at minimum; that won't remove much and will decrease noise.

7) I really don’t know what to make of “Sequence duplication levels” graph. Neither trimming nor decontamination seemed to have any appreciable effects. Does anyone know what could be the usual culprit causing such big differences, even between biological replicates? Do they have to be dealt with in some way?

I don't think that's important in RNA-seq, unless you have amplified data... which, hopefully, you don't. Anyway, there's nothing you can do about it with SE quantitative data.

8) Trimming seems to deal nicely with adapter contamination, but the Kmer content graphs still show overrepresentation of certain Kmers at the end of reads. Did I miss any adapters that need to be removed? Should I trim this out somehow, or can I let STAR soft-clip this during mapping?

The command you used can only find adapters as short as 13bp. Therefore, it is expected that there would still be adapters after position 77. I usually trim down to mink=11 by default with paired reads; you could go a little farther, down to maybe mink=9 (starting with the raw reads, not the trimmed reads) before you start getting false-positives. Eventually, at maybe mink=7, you will start seeing an inverted kmer enrichment at the tip as genomic sequence that happens to match adapter sequence is preferentially removed. So, no matter what you do, you can't solve it completely without hard-trimming. I'd suggest using mink=9 and letting the rest be soft-clipped.

With paired reads, the "tbo" flag will eliminate this, but with single-ended reads, "tbo" doesn't work.
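As a rough illustration of the mink=9 suggestion, here is a minimal Python sketch that only assembles a BBDuk command line for single-ended adapter trimming; it does not run anything. The original command is not shown in the thread, so the file names and the values `ktrim=r`, `k=23` and `hdist=1` are assumptions for illustration, and the helper name `bbduk_adapter_trim_cmd` is hypothetical.

```python
import shlex


def bbduk_adapter_trim_cmd(reads_in, reads_out, adapters, k=23, mink=9, hdist=1):
    """Build (but do not run) a BBDuk command for right-end adapter trimming.

    Assumptions, not taken from the thread: ktrim=r, k=23, hdist=1 and all
    file names. Only mink=9 reflects the suggestion above.
    """
    args = [
        "bbduk.sh",
        f"in={reads_in}",      # raw reads, not the already-trimmed ones
        f"out={reads_out}",
        f"ref={adapters}",     # adapter sequences in FASTA format
        "ktrim=r",             # trim the matching kmer and everything to its right
        f"k={k}",              # full kmer length used in the read interior
        f"mink={mink}",        # shortest kmer allowed at the read tip
        f"hdist={hdist}",      # mismatches tolerated per kmer
    ]
    return " ".join(shlex.quote(a) for a in args)


cmd = bbduk_adapter_trim_cmd("raw.fq.gz", "trimmed.fq.gz", "adapters.fa")
print(cmd)
```

Lowering `mink` further only changes that one argument; "tbo" is left out on purpose because the data here are single-ended.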

I'm not really sure about #3 and #5.

**vajuli** · 07-27-2015, 12:33 PM

Hi Brian, big thanks for all the answers and suggestions. You're right, the data are SE. I hope you won't mind a quick follow-up regarding (6).

Yes, I'd set a size limit. The length depends on the organism - does it have introns, and if so, how long? But even for prokaryotes, with this dataset, I don't see any reason to include stuff below 40bp, at minimum; that won't remove much and will decrease noise.

The organism in question is mouse - does 40 bp still stand? I found in one paper that the average mouse intron is ~4.7 kb, but I don't understand how intron length dictates the read size limit. Also, is it possible to do the length filtering with BBDuk during trimming?

Once again big thanks for your help!

**Brian Bushnell** · 07-27-2015, 01:28 PM

The min length for retaining reads after trimming also depends on whether you are mapping to a genome or transcriptome. For a transcriptome, you can use shorter reads because they will map without splices. For a genome, it's more difficult because when a splice site occurs near the end of a read, say between bases 35 and 36 in a 40bp read, you will only have 5bp overhanging one of the exons - or, you could map it going into the intron with a few mismatches, or clip it. 5bp of anchor is just not enough to confidently support a gap of length 4700bp. The shorter reads are, the higher the "surface area/volume ratio" - the bases near the tips which cannot confidently span splices. So, the more noise and bias you get.
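To put rough numbers on the anchor argument, here is a small Python sketch. It assumes a splice junction is equally likely to fall between any two adjacent bases of a read, and it uses a hypothetical threshold of 8 bp for a "confident" anchor; neither assumption comes from an aligner, and the function names are made up for illustration.

```python
def shortest_anchor(read_len, junction_after):
    """Bases on the shorter side of a junction that falls between
    base `junction_after` and base `junction_after + 1` (1-based)."""
    if not 1 <= junction_after < read_len:
        raise ValueError("junction must fall inside the read")
    return min(junction_after, read_len - junction_after)


def weak_anchor_fraction(read_len, min_anchor):
    """Fraction of possible junction positions that leave fewer than
    `min_anchor` bases on one side (uniform positions assumed)."""
    positions = range(1, read_len)
    weak = sum(1 for p in positions if shortest_anchor(read_len, p) < min_anchor)
    return weak / len(positions)


# The example from the post: junction between bases 35 and 36 of a 40bp read.
print(shortest_anchor(40, 35))  # 5

# Hypothetical 8bp threshold: the weak fraction shrinks as reads get longer.
for n in (40, 70, 100):
    print(n, round(weak_anchor_fraction(n, 8), 2))
```

Under these assumptions about 36% of junction positions in a 40bp read are weakly anchored, versus about 14% in a 100bp read, which is the "surface area/volume" effect in numbers.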

So, 40 or 50 is probably fine if mapping to the transcriptome, or mapping to the transcriptome and then only mapping the unmapped reads to the genome (which is the default behavior of TopHat and probably STAR if you give them gene annotations). If mapping only to the genome and finding all splice sites de novo, I'd recommend maybe ~70bp for that dataset. The upper limit depends on the fraction of your data you will retain and how short the shortest genes of interest are. But 10bp or 20bp is definitely too short outside of small-RNA studies.

BBDuk has a "minlen" flag, e.g. "minlen=40", which will discard short reads and can be used simultaneously with trimming. The default is 10, which is why 10 is the lower bound of your current length distribution.
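For anyone who wants to see what a minimum-length filter amounts to, here is a minimal Python sketch of the same idea applied to plain 4-line FASTQ records. It is not BBDuk's implementation, just an illustration; the function name `filter_min_length` and the demo reads are hypothetical.

```python
import io


def filter_min_length(fastq_in, fastq_out, minlen=40):
    """Copy 4-line FASTQ records whose sequence has at least `minlen` bases.

    Returns (kept, dropped). Assumes well-formed, unwrapped FASTQ.
    """
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:  # end of input
            break
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()
        if len(seq.strip()) >= minlen:
            fastq_out.write(header + seq + plus + qual)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Two made-up reads: one of 50 bases, one of 12 bases.
demo = io.StringIO(
    "@r1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@r2\n" + "C" * 12 + "\n+\n" + "I" * 12 + "\n"
)
out = io.StringIO()
print(filter_min_length(demo, out, minlen=40))  # (1, 1)
```

In practice it is simpler to let the trimmer do this in the same pass, since the length check has to happen after trimming anyway.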

**vajuli** · 07-27-2015, 01:39 PM

Brian, thanks for the help and the lesson, I really appreciate it!

RNASeq premapping QC questions - is it ok to proceed?
