
  • #16
    I think Trimmomatic always does right-trimming (perhaps it has a flag for left-trimming; I don't know). Since the adapter, when present, is the leftmost 25bp, the entire read gets trimmed, which is why varying the minimum length doesn't change anything.

    Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.
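
    A sketch of that kind of BBDuk call, assuming paired fastq input (the filenames and primer reference file are placeholders, not from this thread):

    ```shell
    # Left-trim primer sequence from paired reads. "outs" catches reads
    # whose mate fell below minlen, so out1/out2 stay properly paired.
    bbduk.sh in1=reads_R1.fq in2=reads_R2.fq \
        out1=clean_R1.fq out2=clean_R2.fq outs=singletons.fq \
        ref=primers.fa ktrim=l k=25 minlen=25
    ```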

    Comment


    • #17
      I did the alignments to hg19 on a couple of these files with TopHat2 (default settings) to check the outcome. I tried:

      1. No clean-up; untrimmed, no contaminant removal (control)
      2. Contaminant removal of Clontech primers using BBDuk (tpe/tbo = t), min length = 36nt
      3. Left-trim of Clontech primers using BBDuk, min length = 36nt
      4. Left-trim of Clontech primers using BBDuk, min length = 25nt
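
      My best guess at the corresponding BBDuk invocations for conditions 2-4 (filenames and the primer reference are placeholders, and the exact flags may have differed):

      ```shell
      # 2) Contaminant removal: discard reads matching the primer kmers, minlen=36
      bbduk.sh in1=R1.fq in2=R2.fq out1=c2_R1.fq out2=c2_R2.fq \
          ref=primers.fa k=25 tpe tbo minlen=36
      # 3) Left-trim the primers instead of discarding, minlen=36
      bbduk.sh in1=R1.fq in2=R2.fq out1=c3_R1.fq out2=c3_R2.fq \
          ref=primers.fa ktrim=l k=25 minlen=36
      # 4) Same left-trim, but with the looser minlen=25
      bbduk.sh in1=R1.fq in2=R2.fq out1=c4_R1.fq out2=c4_R2.fq \
          ref=primers.fa ktrim=l k=25 minlen=25
      ```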

      The data are in the attached table. The biggest impact seemed to be the minimum-length filter applied after trimming or decontamination: it had a noticeable effect on run time, aligned pairs, discordant pairs, and multiple alignments. A strict minimum-length filter at 36nt yields a strikingly similar number of aligned pairs and multiple alignments to not cleaning up the primers at all. The only dramatic difference between trimming at min length 36nt and leaving the reads untrimmed was the run time (4 hours trimmed vs. 16 hours untrimmed).

      By lowering the minimum-length filter to 25nt, I 1) retained most of my input reads, 2) got the most mapped reads, and 3) gained a million aligned read pairs. However, 20 million of the retained reads are discarded by TopHat2 before alignment. In addition, this run has 200k more multiple alignments and 500k more discordant alignments. It also took twice as long to process as the others at 7.5 hours, but that is still better than untrimmed at 16 hours!

      A million extra aligned pairs and 500k extra discordant alignments seems like a significant gain, but I wonder whether they are worth the increase in multiple alignments?
      Attached Files

      Comment


      • #18
        Originally posted by Brian Bushnell View Post
        Please note, BBDuk has an "outs" stream that you can use to catch the singletons whose mate was trimmed shorter than minlength. And 25bp reads are useful for mapping, when part of a pair - they help anchor the mate. I wouldn't use 25bp reads that map unpaired, though.
        How do you go about putting the pairs back together from the output stream and removing the 25nt unpaired ones?

        I am guessing that in my BBDuk left-trim (minlen=25) data, the unpaired reads between 25 and 36nt are contributing to the increase in multiple alignments. Is there a way for me to filter out singletons below 36nt, but keep shorter reads when they still have a mate that is >24nt? Or am I just splitting hairs at this point?

        Comment


        • #19
          I am not sure how good TopHat's pairing/rescue functions are at letting one read anchor another, or even whether it implements them. So it's possible that the increase in multiple alignments is an artifact of the aligner rather than fundamental to the data. I'm not sure why TopHat is discarding so many reads, either, or on what basis. However, I would trust the alignments of the trimmed reads over the untrimmed reads. You might try a run with BBMap to compare with the TopHat results; it's a lot easier for me to interpret BBMap's output statistics, at least. If you do, be sure to include the "maxindel=100000" flag for human RNA mapping.
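
          A minimal BBMap command along those lines (reference and filenames are placeholders):

          ```shell
          # Map trimmed pairs to hg19. maxindel=100000 allows intron-sized
          # deletions, which RNA reads spanning splice junctions need.
          bbmap.sh ref=hg19.fa in1=clean_R1.fq in2=clean_R2.fq \
              out=mapped.sam maxindel=100000
          ```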

          With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

          Therefore, with the default behavior, you already should not have any singletons, period.
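
          In command form (filenames are placeholders):

          ```shell
          # Default: the pair is discarded if EITHER read trims below 25bp,
          # so the paired outputs contain no singletons.
          bbduk.sh in1=R1.fq in2=R2.fq out1=t_R1.fq out2=t_R2.fq \
              ref=primers.fa ktrim=l k=25 minlen=25 rieb=t
          # rieb=f: discard the pair only if BOTH reads trim below 25bp.
          bbduk.sh in1=R1.fq in2=R2.fq out1=t_R1.fq out2=t_R2.fq \
              ref=primers.fa ktrim=l k=25 minlen=25 rieb=f
          ```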

          Comment


          • #20
            Originally posted by Brian Bushnell View Post
            With BBDuk you can use the "rieb" (removeifeitherbad) flag to determine whether pairs are removed. It defaults to "true". If you set "minlen=25 rieb=t" then pairs will be removed if EITHER is shorter than 25bp (default behavior); if you set rieb=f they will be removed only if BOTH are shorter than 25bp.

            Therefore, with the default behavior, you already should not have any singletons, period.
            Thanks for clarifying. You really thought of every option! I might try BBMap on two of the trimmed files before I start on the rest of my dataset. I will probably end up going with left-trimming in BBDuk and a minimum-length filter at 36nt.

            Curious to see how the mapped reads are distributed, I ran BBMap's pileup.sh on a few of the bam files. Looking at the stats output, I found that the majority of the reads map to the mitochondrial genome:
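
            For reference, a minimal pileup.sh invocation (the bam filename is a placeholder):

            ```shell
            # Per-reference coverage statistics from an alignment file
            pileup.sh in=mapped.bam out=covstats.txt
            ```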

            My samples are from muscle biopsies, so I expected a lot of mitochondrial sequence, but this much? Or am I interpreting the pileup output wrong?

            #ID Avg_fold
            chr1 14.6919
            chr10 7.4524
            chr11 49.4229
            chr12 12.4561
            chr13 4.3214
            chr14 5.2243
            chr15 7.1383
            chr16 4.2848
            chr17 26.1763
            chr18 6.6821
            chr19 10.6602
            chr2 15.9738
            chr20 4.3257
            chr21 5.8453
            chr22 4.7022
            chr3 5.1661
            chr4 4.4176
            chr5 11.3854
            chr6 16.339
            chr7 5.3753
            chr8 6.2763
            chr9 13.0348
            chrM 189412.3428
            chrX 6.3534
            chrY 1.1263

            Comment


            • #21
              That's way more than I would expect for DNA, but for RNA? I don't know. Perhaps this was a very athletic individual!

              It could have something to do with the capture efficiency on mito vs. genomic sequence, or with sequence-composition bias (mito often has a different GC fraction than the main genome). But you're reading it correctly; the mitochondrial genome averaged around 189,412x coverage.

              Oh - another thing. Human DNA is mostly noncoding, but mito is more like a prokaryote: densely packed genes. So RNA-seq data would be expected to give maybe 100x higher coverage on the mito (or whatever the actual human noncoding/coding ratio is) even if the expression level of every gene were identical. So maybe the difference can be largely explained by that, plus the fact that there are lots of mitochondria per cell.
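
              As a back-of-envelope check (the copy numbers and transcribed fractions below are rough assumptions on my part, not measured values from this thread):

              ```shell
              # Muscle cells carry on the order of ~1000 mtDNA copies vs 2
              # nuclear copies, and chrM is ~93% transcribed vs only a few
              # percent of the nuclear genome. Multiplying the two ratios
              # gives an expected chrM/nuclear RNA-seq coverage ratio at
              # equal per-gene expression.
              mito_copies=1000
              nuc_copies=2
              mito_txn_frac=0.93
              nuc_txn_frac=0.025
              ratio=$(awk -v mc="$mito_copies" -v nc="$nuc_copies" \
                          -v mf="$mito_txn_frac" -v nf="$nuc_txn_frac" \
                          'BEGIN { printf "%.0f", (mc / nc) * (mf / nf) }')
              echo "expected chrM/nuclear coverage ratio: ~${ratio}x"
              ```

              With these rough numbers the ratio comes out around 18,600x, the same order of magnitude as ~189,412x on chrM vs ~10x on the autosomes in the table above.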

              Comment


              • #22
                Originally posted by Brian Bushnell View Post
                Human DNA is mostly noncoding, but mito is more like a prokaryote, densely packed genes. So RNA-seq data would be expected to give maybe 100x higher (or whatever the actual human noncoding/coding ratio is) coverage on the mito even if the expression level of all genes were identical. So, maybe the difference can be largely explained just by that plus the fact that there are lots of mitochondria per cell.
                Good points. ChrM is 99.95% covered, while the other chromosomes (which contain introns) are 10-20% covered. The mitochondrial genome is also much smaller, at ~16.5kb. Even considering all those factors, ~200,000x average coverage still seems pretty crazy, but I'll have to check the other samples to see whether it's just this sample or library. Thanks.
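
                One more sanity check on the raw numbers (the chrM length and a 100bp read length are assumptions):

                ```shell
                # How many reads does 189,412x coverage mean on a ~16.5kb reference?
                cov=189412
                chrm_len=16569   # human mitochondrial genome length, bp
                read_len=100
                reads_m=$(awk -v c="$cov" -v l="$chrm_len" -v r="$read_len" \
                              'BEGIN { printf "%.1f", c * l / r / 1e6 }')
                echo "~${reads_m} million reads on chrM alone"
                ```

                So roughly 31 million reads would be sitting on chrM, which is easy to compare against the library's total read count.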

                Comment
