Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    I think Trimmomatic is always doing right-trimming (perhaps it has a flag for left-trimming; I don't know). If the adapter is present, it's the leftmost 25bp, so the entire read will get trimmed, which is why varying the min length doesn't change anything.

    Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.

    Comment


    • #17
      I did the alignments to hg19 on a couple of these files with TopHat2 (default settings) to check the outcome. I tried:

      1. No clean-up; untrimmed/no contaminate removal (control)
      2. Contaminate removal of CloneTech primers using BBDuk (tpe/tpo = t), min length = 36nt
      3. Left trim CloneTech primers using BBDuk, min length = 36nt
      4. Left trim CloneTech primers using BBDuk, min length = 25nt

      The data is in the attached table. The biggest impact seemed to be the minimum length filter after trimming or decontamination. It had a noticeable effect on run time, aligned pairs, discordant pairs and multiple alignments. Setting a strict minimum length filter at 36nt yields a strikingly similar number of aligned pairs and multiple alignments as not cleaning up the primers at all. The only dramatic difference when comparing trimmed @ min length 36nt to untrimmed was the increased run-time (4 hours vs 16 hours).

      By lowering the minimum length filter to 25nt, I 1) retained most of my input reads, 2) got the most mapped reads and 3) increased aligned read pairs by a million. However, 20 million of the retained reads are discarded by TopHat2 before alignment. In addition, it has 200k more multiple alignments and 500k more discordant alignments. It also took twice as much time to process than the others at 7.5 hours, but still better than untrimmed at 16 hours!

      A million extra aligned pairs and 500k extra discordant alignments seems to be like a significant number, but I wonder if they worth the increase in multiple alignments?
      Attached Files

      Comment


      • #18
        Originally posted by Brian Bushnell View Post
        Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.
        How do you go about putting the pairs back together from the output stream and remove the 25nt unpaired ones?

        I am guessing in my BBDuk left trim (minLen=25) data, the unpaired reads that are between 25-36nt are contributing to the increase in multiple alignments. Is there is a way for me to filter out singletons below 36nt, but only make exceptions to keep shorter reads if they have a mate that is >24nt? Or am I just splitting hairs at this point?

        Comment


        • #19
          I am not sure how well TopHat's pairing/rescue functions are for letting one read anchor another, or even if it has them implemented. So it's possible that the increase in multiple alignments is an artifact of the aligner rather than fundamental to the data. I'm not sure why TopHat is discarding so many reads, either, or what the basis is. However, I would trust the alignments of the trimmed reads over the untrimmed reads. You might try a run with BBMap to compare with the TopHat results; it's a lot easier for me to interpret BBMap's output statistics, at least. If you do, then be sure to include the "maxindel=100000" flag for human RNA mapping.

          With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

          Therefore, with the default behavior, you already should not have any singletons, period.

          Comment


          • #20
            Originally posted by Brian Bushnell View Post
            With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

            Therefore, with the default behavior, you already should not have any singletons, period.
            Thanks for clarifying. You really thought of every option! I might try BBMap on two of the trimmed files before I start with the rest of my dataset. I think I will probably end up just moving on and going with left trimming with BBDuk and filtering minimum length @ 36nt.

            Curious to see how the mapped reads are distributed, I ran BBMap's pileup.sh on a few of the bam files. Looking at the stats output I found the majority of the reads are mapping to mitochondria like these guys:

            Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


            My samples are from muscle biopsies. So I expected a lot of mitochondrial sequences, but this much? Or am I interpreting the pileup output wrong?

            #ID Avg_fold
            chr1 14.6919
            chr10 7.4524
            chr11 49.4229
            chr12 12.4561
            chr13 4.3214
            chr14 5.2243
            chr15 7.1383
            chr16 4.2848
            chr17 26.1763
            chr18 6.6821
            chr19 10.6602
            chr2 15.9738
            chr20 4.3257
            chr21 5.8453
            chr22 4.7022
            chr3 5.1661
            chr4 4.4176
            chr5 11.3854
            chr6 16.339
            chr7 5.3753
            chr8 6.2763
            chr9 13.0348
            chrM 189412.3428
            chrX 6.3534
            chrY 1.1263

            Comment


            • #21
              That's way more than I would expect for DNA, but for RNA? I don't know. Perhaps this was a very athletic individual!

              It could have something to do with the capture efficiency on mito vs genomic sequence, or the sequence composition bias (mito often has a different GC fraction than main genome). But you're reading it correctly; the mitochondria had around 189412x coverage.

              Oh - another thing. Human DNA is mostly noncoding, but mito is more like a prokaryote, densely packed genes. So RNA-seq data would be expected to give maybe 100x higher (or whatever the actual human noncoding/coding ratio is) coverage on the mito even if the expression level of all genes were identical. So, maybe the difference can be largely explained just by that plus the fact that there are lots of mitochondria per cell.

              Comment


              • #22
                Originally posted by Brian Bushnell View Post
                Human DNA is mostly noncoding, but mito is more like a prokaryote, densely packed genes. So RNA-seq data would be expected to give maybe 100x higher (or whatever the actual human noncoding/coding ratio is) coverage on the mito even if the expression level of all genes were identical. So, maybe the difference can be largely explained just by that plus the fact that there are lots of mitochondria per cell.
                Good points. ChrM is 99.95% covered. Other chromosomes (containing introns) are 10-20% covered. The size of the mitochondrial genome is also much smaller at ~16.5kb... Considering all the factors, it still seems pretty crazy to have ~200,000x average coverage, but I'll have to check the other samples to see if its just this sample or library. Thanks.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-25-2024, 11:49 AM
                0 responses
                20 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-24-2024, 08:47 AM
                0 responses
                20 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                62 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                61 views
                0 likes
                Last Post seqadmin  
                Working...
                X