Announcement

Collapse
No announcement yet.

Quality trimming prior to aligning

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quality trimming prior to aligning

    I am aligning reads to the GrCh37 human reference genome using STAR and am confused as to how to get the most reliable results. At first I aligned my reads with no pre-processing using STAR and got essentially 2 groups: 1 group that aligned well, with 85-95+% of reads uniquely mapping, and another group with only about 60% uniquely mapping. I then tried using sickle to trim low quality reads (q<20). Reads that mapped 85-95+% unique lost about 1 million reads and map a similar % uniquely, and reads that mapped about 60% now map >80% but lost millions of uniquely mapped reads. Though millions of reads are lost after quality trimming, on average ~90% of reads survive in every case, with a max of 21% reads lost after quality trimming. My question is which pipeline should I trust? Should I be more interested in the raw number of mapping reads, or the percentage? Are the reads lost after quality trimming untrustworthy anyways?

    I have done a lot of reading on this and have yet to find a definitive consensus. Any help or guidance would be much appreciated!
    Last edited by gstone; 02-28-2017, 03:48 PM.

  • #2
    If you want to do quality-trimming, it should be done conservatively to avoid losing so much information that the reads no longer map uniquely, and increasing bias. Typically, I suggest either not trimming, or to at most Q8~Q12. On the other hand, you should definitely adapter-trim prior to mapping.

    It's also possible to map with a more error-tolerant aligner to avoid the issues incurred by quality-trimming entirely. BBMap is slower than STAR, but yields a higher mapping rate.

    Oh, and one more thing - for overlapping paired-ended reads, it's also possible to increase read quality by using the consensus in the overlapping portion, by using BBMerge like this:

    Code:
    bbmerge.sh in=reads.fq out=corrected.fq ecco mix vstrict adapters=default
    That may negate the need for quality-trimming, and it's a very fast operation.
    Last edited by Brian Bushnell; 02-28-2017, 05:29 PM.

    Comment


    • #3
      Hi Brian, thank you so much for your quick and thoughtful response. If you don't mind taking the time I was hoping to ask 2 questions. 1) How would I map the consensus file, would I treat it as a single-end read? and 2) When I use bbrepair to match the order of my paired end reads and then quality trim I get less input reads to my aligner than if I just quality trim then align, but no reads are placed in the singletons file when I call bbrepair. Why is this?

      Thank you for being so generous with you time!
      Last edited by gstone; 02-28-2017, 06:06 PM.

      Comment


      • #4
        Originally posted by gstone View Post
        1) How would I map the consensus file, would I treat it as a single-end read?
        Sorry, let me clarify -

        This command assumes you have paired reads interleaved in "reads.fq". In that case, you can use BBMerge like this:

        Code:
        bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq
        ...which produces 2 files. merged.fq contains consensus singletons made from overlapping pairs (each pair yields one read), and unmerged.fq contains interleaved pairs that did not confidently overlap so they remain unchanged. This is nice for assembly, but sometimes annoying otherwise (like for mapping). Also, some programs only accept paired reads. So you can also do this:

        Code:
        bbmerge.sh in=reads.fq out=corrected.fq ecco mix
        ...which produces a single output file that still contains paired, interleaved reads. Reads that overlapped will have the overlapping portion corrected to the consensus - where they match, the quality score is increased, and where they don't match, the quality score is set to the difference of the quality scores and the base is set to the base with the higher quality score. With this command, the result is used for mapping just like the input would have been.

        2) When I use bbrepair to match the order of my paired end reads and then quality trim I get less input reads to my aligner than if I just quality trim then align, but no reads are placed in the singletons file when I call bbrepair. Why is this?
        Can you give the exact sequence of commands you are using? Also, how did the order of your paired reads get scrambled in the first place? If you start with the raw reads and use a series of tools that respects pairing, you'll never need to use repair.sh.

        Thank you for being so generous with you time!
        You're welcome!

        Comment


        • #5
          Great, thank you for the clarification on the first point. As for my use of BBRepair, when I was aligning with STAR I got a large amount of reads too short to map, which was due to the paired-end reads being out of order. I used BBRepair and the mapping rate drastically improved (~40% to anywhere from ~65%-97%). I then figured I'd use BBRepair then quality trim just to be safe. My repair command is as follows: bbmap/repair.sh in1=file1 in2=file2 out1=out1 out2=out2 outs=out_singles repair. As to how the reads got scrambled is a mystery to me, but I was very excited to find BBRepair.

          My confusion essentially comes from the fact that using BBRepair prior to quality trimming results in fewer input reads than just quality trimming, even though BBRepair identifies no singleton reads. My understanding is that BBRepair just orders the reads between the pairs.

          Thank you again for all of your help.

          Comment


          • #6
            Originally posted by gstone View Post
            My confusion essentially comes from the fact that using BBRepair prior to quality trimming results in fewer input reads than just quality trimming, even though BBRepair identifies no singleton reads. My understanding is that BBRepair just orders the reads between the pairs.
            Your understanding is correct. The repair tool does not discard any reads - they either get sent to "out1", "out2", or "outs". How are you counting the reads before and after? Can you post the commands and results? It's possible that there's a bug somewhere, and if so, I'd like to fix it ASAP.

            Comment

            Working...
            X