Seqanswers Leaderboard Ad

**Brian Bushnell** · 05-09-2015, 11:26 AM

I think Trimmomatic is always doing right-trimming (perhaps it has a flag for left-trimming; I don't know). If the adapter is present, it's the leftmost 25bp, so the entire read will get trimmed, which is why varying the min length doesn't change anything.

Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.

**BrianU** · 05-11-2015, 09:27 PM

I did the alignments to hg19 on a couple of these files with TopHat2 (default settings) to check the outcome. I tried:

1. No clean-up; untrimmed/no contaminate removal (control)
2. Contaminate removal of CloneTech primers using BBDuk (tpe/tpo = t), min length = 36nt
3. Left trim CloneTech primers using BBDuk, min length = 36nt
4. Left trim CloneTech primers using BBDuk, min length = 25nt

The data is in the attached table. The biggest impact seemed to be the minimum length filter after trimming or decontamination. It had a noticeable effect on run time, aligned pairs, discordant pairs and multiple alignments. Setting a strict minimum length filter at 36nt yields a strikingly similar number of aligned pairs and multiple alignments as not cleaning up the primers at all. The only dramatic difference when comparing trimmed @ min length 36nt to untrimmed was the increased run-time (4 hours vs 16 hours).

By lowering the minimum length filter to 25nt, I 1) retained most of my input reads, 2) got the most mapped reads and 3) increased aligned read pairs by a million. However, 20 million of the retained reads are discarded by TopHat2 before alignment. In addition, it has 200k more multiple alignments and 500k more discordant alignments. It also took twice as much time to process than the others at 7.5 hours, but still better than untrimmed at 16 hours!

A million extra aligned pairs and 500k extra discordant alignments seems to be like a significant number, but I wonder if they worth the increase in multiple alignments?

Attached Files

alignment_summary.png (45.4 KB, 45 views)

**BrianU** · 05-11-2015, 09:40 PM

Originally posted by Brian Bushnell View Post

Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.

How do you go about putting the pairs back together from the output stream and remove the 25nt unpaired ones?

I am guessing in my BBDuk left trim (minLen=25) data, the unpaired reads that are between 25-36nt are contributing to the increase in multiple alignments. Is there is a way for me to filter out singletons below 36nt, but only make exceptions to keep shorter reads if they have a mate that is >24nt? Or am I just splitting hairs at this point?

**Brian Bushnell** · 05-12-2015, 01:13 PM

I am not sure how well TopHat's pairing/rescue functions are for letting one read anchor another, or even if it has them implemented. So it's possible that the increase in multiple alignments is an artifact of the aligner rather than fundamental to the data. I'm not sure why TopHat is discarding so many reads, either, or what the basis is. However, I would trust the alignments of the trimmed reads over the untrimmed reads. You might try a run with BBMap to compare with the TopHat results; it's a lot easier for me to interpret BBMap's output statistics, at least. If you do, then be sure to include the "maxindel=100000" flag for human RNA mapping.

With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

Therefore, with the default behavior, you already should not have any singletons, period.

**BrianU** · 05-12-2015, 04:31 PM

Originally posted by Brian Bushnell View Post

With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

Therefore, with the default behavior, you already should not have any singletons, period.

Thanks for clarifying. You really thought of every option! I might try BBMap on two of the trimmed files before I start with the rest of my dataset. I think I will probably end up just moving on and going with left trimming with BBDuk and filtering minimum length @ 36nt.

Curious to see how the mapped reads are distributed, I ran BBMap's pileup.sh on a few of the bam files. Looking at the stats output I found the majority of the reads are mapping to mitochondria like these guys:

mitochondrial sequences - SEQanswers

http://seqanswers.com/forums/showthread.php?t=13827

Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

My samples are from muscle biopsies. So I expected a lot of mitochondrial sequences, but this much? Or am I interpreting the pileup output wrong?

#ID Avg_fold
chr1 14.6919
chr10 7.4524
chr11 49.4229
chr12 12.4561
chr13 4.3214
chr14 5.2243
chr15 7.1383
chr16 4.2848
chr17 26.1763
chr18 6.6821
chr19 10.6602
chr2 15.9738
chr20 4.3257
chr21 5.8453
chr22 4.7022
chr3 5.1661
chr4 4.4176
chr5 11.3854
chr6 16.339
chr7 5.3753
chr8 6.2763
chr9 13.0348
chrM 189412.3428
chrX 6.3534
chrY 1.1263

**Brian Bushnell** · 05-12-2015, 04:44 PM

That's way more than I would expect for DNA, but for RNA? I don't know. Perhaps this was a very athletic individual!

It could have something to do with the capture efficiency on mito vs genomic sequence, or the sequence composition bias (mito often has a different GC fraction than main genome). But you're reading it correctly; the mitochondria had around 189412x coverage.

Oh - another thing. Human DNA is mostly noncoding, but mito is more like a prokaryote, densely packed genes. So RNA-seq data would be expected to give maybe 100x higher (or whatever the actual human noncoding/coding ratio is) coverage on the mito even if the expression level of all genes were identical. So, maybe the difference can be largely explained just by that plus the fact that there are lots of mitochondria per cell.

**BrianU** · 05-14-2015, 06:21 AM

Originally posted by Brian Bushnell View Post

Human DNA is mostly noncoding, but mito is more like a prokaryote, densely packed genes. So RNA-seq data would be expected to give maybe 100x higher (or whatever the actual human noncoding/coding ratio is) coverage on the mito even if the expression level of all genes were identical. So, maybe the difference can be largely explained just by that plus the fact that there are lots of mitochondria per cell.

Good points. ChrM is 99.95% covered. Other chromosomes (containing introns) are 10-20% covered. The size of the mitochondrial genome is also much smaller at ~16.5kb... Considering all the factors, it still seems pretty crazy to have ~200,000x average coverage, but I'll have to check the other samples to see if its just this sample or library. Thanks.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News