
Trimming SoLiD 50bp Reads = Doubling Our Mapping


  • Trimming SoLiD 50bp Reads = Doubling Our Mapping

    We've all seen how sequencing errors tend to be found toward the 3' end of the read. We took up the task of trimming reads from a highly over-covered study to 25, 30, 35, ... 50 b and mapping with NextGENe under 1, 2, 3, etc. mismatches.

    The results will floor you.

    When we map to (human) reference using all 50bp, we map 25 million reads.

    When we map using 35b trimmed reads, we map DOUBLE that... 50 million reads. 100% more reads mapped, and errors and NA's (i.e., -1's) are avoided.

    The implications are obvious: this halves the average expected (i.e., target) coverage needed for the same, or better, level of accuracy, and thus also cuts the laboratory budget in half.

    We are writing it up for a peer-reviewed publication using a variety of mapping tools.

    I just wanted to share this tidbit, especially for those supporting cancer research. Please let me know if you'd like to see the data; I don't want to post images here that will also appear in a publication.

    If anyone else tries this, note that trimming all reads to 25 b results in the highest number of reads mapped; however, at 25 b it also increases the number of valid adjacent errors. The lowest number of valid adjacent errors occurs between 30 and 35 b.
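    The trimming itself is trivial - no filtering, just cutting every read back to a fixed length. A minimal base-space sketch (the `trim_reads` helper is made up for illustration; our real pipeline trims in colorspace):

```python
# Trim every read to at most `length` bases; reads already shorter than
# the cutoff are kept as-is (trimmed, not filtered out).
def trim_reads(reads, length=35):
    return [r[:length] for r in reads]

reads = ["ACGTACGTAC" * 5,   # a 50 bp read
         "ACGTACGTAC" * 3]   # a 30 bp read
trimmed = trim_reads(reads)
print([len(r) for r in trimmed])  # [35, 30]
```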

    Your feedback is welcome.


    pitt bac

  • #2
    I'm not sure I follow your logic. Surely you mean that you should run shorter read lengths, not that you should change your coverage depth. So if you were running Illumina and getting low-quality 3' ends, then I agree you might want to run with fewer cycles. I am not familiar with SOLiD in particular, so I am not sure whether it has configurable read lengths. Anyway, I agree with your findings: one should always trim reads according to quality, and quality tends to drop off further along each read!


    • #3

      In SOLiD, the reads were generated using the 50 b read kit. The 'default' analysis is to use all the reads in mapping, and to let mapping remove those with >1, 2, or 3 mismatches.

      By trimming all of the reads >35 b down to 35 b, we avoid the errors that tend to occur near the 3' end of the 50 b reads, so these reads are retained during mapping. If the errors occurred evenly along the reads, we would not see a doubling of mapping.

      We start with reads of all lengths, from 25 to 50 bp, and trim all of those >35 b down to 35 b.


      We are trimming in colorspace.

      We have not yet explored read trimming systematically for Illumina.

      I hope this clarifies.


      • #4
        Yeah, I understand that, I just don't understand the bit where you halve the laboratory budget! I guess that would be true if you didn't realise that trimming reads to eliminate error-prone sequence before mapping was a good idea, but I thought everyone did know that.


        • #5
          It's actually fairly intuitive that mapping the shortened reads provides a greater number of mappings. The number of reads with adequate quality out to base N goes up as N shrinks, and the probability of randomly matching "junk" reads also goes up as N shrinks (though it's not very large at N>=25). What would be interesting (but not terribly counter-intuitive) is how this trimming affects SNP calling, genome assembly and/or transcriptome splice-variant calling. The shorter reads are expected to be a major liability in unambiguously calling splice variants, but would probably increase accuracy for SNP calling and possibly for genome assembly (provided adequate coverage). I'm not sure whether the shorter reads would help or hinder ChIP-Seq, as I'm not as familiar with that application.
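          The "junk matching" risk can be put in rough numbers. A back-of-the-envelope sketch (uniform base composition, exact matches only, both strands - deliberately crude assumptions that ignore repeats) of the expected number of purely random hits for an N-mer against a human-sized reference:

```python
# Expected number of chance exact matches of a random N-mer against both
# strands of a genome, assuming uniform base composition (a crude model
# that ignores repeats and mismatch allowances).
def expected_random_hits(n, genome_size=3.2e9):
    return 2 * genome_size * 0.25 ** n

for n in (20, 25, 35, 50):
    print(n, expected_random_hits(n))
```

          Even at N = 25 the expected number of chance hits is far below one, consistent with the observation that random matching is not very large at N >= 25; real genomes have repeats, so this understates multi-hit (ambiguous) mapping.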

          This is interesting, and I wouldn't have thought to do it. I didn't quite follow what application you're targeting, though, and I doubt that this will provide benefits for all sequencing applications.


          • #6
            Actually, I'm surprised to hear that you think everyone knows about this... I've not seen anyone reference this approach in published NGS studies. Then again, the 50 bp kits were added late.

            I presented this result at a conference in Rhode Island last month. No one out of the hundreds in attendance using SOLiD routinely trims their 50 b reads. It's debatable and an open question whether trimming has the same effect on all mapping algorithms, such as Bowtie.

            It cuts your budget in half because, instead of using the recommended (say) 15X, you don't lose half of your reads to mapping, and you can achieve the same realized coverage at 7.5X.
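            The budget arithmetic is just a mapping-rate ratio (illustrative numbers; the 15X/7.5X and roughly-half-vs-96% mapping figures are taken from this thread):

```python
# Raw (sequenced) coverage needed to hit a target realized coverage,
# given the fraction of reads that actually map.
def raw_coverage_needed(target_realized, mapping_rate):
    return target_realized / mapping_rate

# Same realized coverage (7.2X here), with and without trimming:
print(raw_coverage_needed(7.2, 0.48))  # 15.0X raw needed if only ~48% map
print(raw_coverage_needed(7.2, 0.96))  # 7.5X raw needed if ~96% map
```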

            Leaves me wondering what other tricks of the trade are out there that can increase coverage? We only see a modest increase (by comparison) in mapping when we run the SAET tool.

            So let me ask, of ABI SoLiD Users only please, do you routinely trim your reads, and what other observations about the fundamental characteristics of SoLiD data does anyone care to share here?


            • #7
              Well I guess I better caveat and say I was talking about Illumina sequencing, not SOLiD. But in the world of Illumina sequencing, I would say what you describe is a fairly well understood phenomenon.

              I pointed this out in a blog post a long time back, although in the context of de novo assembly - see

              If you say that there are loads of SOLiD users who don't quality filter their reads, then I guess you have a duty to bring it to their attention by any means necessary!


              • #8
                just to clarify, you are counting uniquely mapped reads, correct?
                clearly, shorter reads will be more likely to map ambiguously--that is they can map to multiple places in the genome.
                [fwiw, i work with illumina and *always* check them in FastQC then trim with fastx-toolkit]


                • #9
                  Thanks for the info on the blog.

                  Actually we have found that the quality scores don't quite reflect the information one would hope for with SOLiD - we found empirically that mapping is highest, with the lowest reported errors, between 30 and 35; we also found that qvals only weakly trend with base position. I think a lot of people filter first on qvals... but see, a 50 b read with a low qval can be 'saved' for mapping by trimming to 35; otherwise it would be lost, hence our doubling of the mapping and halving of the cost.

                  We have a better measure than the qval - we call it ambiguity. You might be interested in it; it's the difference between the ideal entropy given a genotype call and the observed entropy. The score has a distinct distribution for homozygotes and heterozygotes. It's calculated after mapping, and we use ambiguity, along with average realized coverage, qvals, and other measures to prioritize variants for validation.
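                  The publication isn't out, so the exact formula isn't public; one plausible reading of the description above (my reconstruction, including the sign convention) is the ideal entropy of the called genotype (0 bits for a homozygote, 1 bit for a heterozygote) minus the Shannon entropy of the observed base counts at the site:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of the observed base counts at one site."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def ambiguity(observed_counts, ideal_entropy):
    """Ideal entropy of the genotype call minus observed entropy; values
    far from zero flag sites whose read support disagrees with the call.
    This is a guess at the described measure, not the authors' formula."""
    return ideal_entropy - entropy(observed_counts)

print(ambiguity([30], 0.0))      # clean homozygote: 0.0
print(ambiguity([27, 3], 0.0))   # noisy homozygote: nonzero
print(ambiguity([15, 15], 1.0))  # clean heterozygote: 0.0
```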

                  also working up a publication on this...


                  • #10
                    I don't remember seeing any reads shorter than the full 50 bp from our SOLiD machine, so I'm a bit surprised that you report seeing them.

                    For our analysis we figure that the first roughly 27 bp (with 3 mismatches) are sufficient to uniquely identify a transcript in our microbes (smaller genomes than human and no alternative splicing issues). We've set BowTie to ignore all bases after that during mapping (although I think it may still use those additional bases to break ties). That works pretty well, but makes it more difficult to handle alternative splicing. I'm not familiar with how NextGene does its mapping, which may affect the results.
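                    The "~27 bp with 3 mismatches is enough for a microbe" intuition checks out on paper (crude uniform-genome model; the 5 Mb genome size is an illustrative microbial figure, not from this thread):

```python
from math import comb

# Expected chance matches of an n-mer with up to `mismatches` mismatches
# against both strands of a uniform-composition genome of the given size
# (ignores repeats, which dominate ambiguity in real genomes).
def expected_random_hits(n, mismatches, genome_size):
    neighborhood = sum(comb(n, k) * 3 ** k for k in range(mismatches + 1))
    return 2 * genome_size * neighborhood * 0.25 ** n

print(expected_random_hits(27, 3, 5e6))    # ~5 Mb microbe: effectively unique
print(expected_random_hits(27, 3, 3.2e9))  # human-sized genome: far less safe
```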

                    Also, what percentage of your reads are mapping? You mentioned you see between 50 and 100 million, but how many reads do you have overall?


                    • #11

                      No, this is total reads mapped - good point. As I said, we see an uptick in valid adjacent errors when we go lower than 30, but we see the lowest number of valid adjacent errors between 30 and 35. This means there is the least arbitrary mapping between 30 and 35. It increases, as one expects, as we allow more reads in, at 40, 45, or 50.

                      We are not filtering reads >35, they are just trimmed to 35.

                      I'll get back to you on the %reads mapped.


                      • #12
                        50/52 million reads mapped....


                        • #13
                          Originally posted by bacdirector View Post
                          50/52 million reads mapped....
                          96% of reads mapping; that sounds impressive to me. I wonder whether that is uniquely mapped reads. For some applications, I think it is better to ignore reads that map to multiple positions in the genome with the same number of mismatches and alignment length. I am thinking that you don't exclude those reads?

                          To be honest, it is not so novel to trim reads, even colourspace reads, as we have been experimenting with that ourselves. However, although we get an improvement, we don't get 96% of total reads mapping hence my slight doubts. If you can get a publication out of it, why not? Also, let me know, I can contribute some data :-). Did Watson and Crick really first elucidate the structure of DNA on their own? Maybe/maybe not but they are famous. So go for a publication if you can.

                          I also wonder if there is a bit of difference with colourspace and Illumina data. I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong but based on that scenario, it is possibly more beneficial for cutting off the ends of colourspace reads than Illumina's.

                          What exactly is your command line for NextGene and/or bowtie?


                          • #14
                            Have you tried using something other than NextGENe for alignment? BFAST? Novoalign? BWA? These are aligners you should definitely assess for SOLiD data. This phenomenon seems like an alignment issue, not a technology issue.

                            If you have tried these algorithms and you're seeing such a phenomenon still, it may be time to call up LifeTech and ask them to come see if something is wrong with your machine. It does not sound right to me.

                            Originally posted by poisson200 View Post
                            I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong but based on that scenario, it is possibly more beneficial for cutting off the ends of colourspace reads than Illumina's.
                            This is misinformation. An error in colorspace does mess with the rest of the read momentarily, but colorspace errors are easily corrected thanks to the same phenomenon because there will be an inconsistency across the different ligations, and colorspace error trends are known and modeled. In actuality, colorspace reads are a lot more accurate in theory than base space (but in chemistry, they may get messier at the ends).


                            • #15

                              You need to allow at least 10% errors to get decent 50 bp alignments. Trimming can help to rescue reads; we have used that strategy in SOLiD publications, and a similar strategy was also used in the Gibbs/Lupski/ABI genome paper in NEJM (

                              Better yet is to use a decent mapping strategy in the first place. If you re-do your alignment with BFAST it will find an anchor in the good part of the read and still use the lower-qv part of the read, which should give you more coverage. Likewise, Bowtie will use the good part as a seed and still use full-length reads. Bowtie (at least early colorspace versions) sometimes gives different base calls than BFAST for the same reads, so it may give you more of a reference bias, or fewer false SNPs.

                              Given that the read is read in 5 bp steps, your errors don't necessarily accumulate towards the end; if you have a bad primer E you will still not be able to map the reads at 35 bp or 25 bp. With BFAST you can easily design an index that handles this.

                              Still, a thorough evaluation of mapping strategies for colorspace would be useful, especially to look into effects on SNP calls.
                              Last edited by Chipper; 11-02-2010, 11:31 PM.