Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

    Dear All,

    Although our 'Subread' read aligner and 'Subjunc' junction detector have been around for a while, we have just got them published on Nucleic Acids Research today.

    Subread and Subjunc employ a novel read mapping paradigm called "seed-and-vote" to achieve a fast mapping speed and a high mapping accuracy. It takes Subread less than 20 minutes to map 10 million 100bp reads using one thread. Its running time remains nearly the same when mapping longer reads thanks to the high scalability of the seed-and-vote paradigm.

    Subread and Subjunc can be used to map reads generated from all major sequencing platforms including Illumina GA/HiSeq, Roche 454, ABI SOLiD and Ion Torrent. They can run on both Linux/unix and Mac computers.

    Subread and Subjunc are included in a software package called 'Subread', which can freely downloaded from http://subread.sourceforge.net/ .

    Our paper for Subread and Subjunc can be freely accessed at: http://nar.oxfordjournals.org/conten...kt214.abstract

    A short tutorial for Subread can be found here - http://bioinf.wehi.edu.au/subread/

    A short tutorial for Subjunc can be found here - http://bioinf.wehi.edu.au/subjunc/

    The User's Guides provides comprehensive descriptions to these two programs.

    Please do not hesitate to contact me if you have any questions ([email protected]).


    Best regards,
    ------------------------
    Wei Shi, Ph.D
    Bioinformatics Division,
    The Walter and Eliza Hall Institute of Medical Research
    1G Royal Parade, Parkville 3052, Victoria
    Australia
    Last edited by shi; 05-15-2013, 08:07 PM.

  • #2
    Is there a user group for these tools or is it best to post questions in this forum?
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      I'll just start putting questions here and see what happens.

      First off, here's what I'm doing:

      I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

      I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

      So on to the question:

      I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

      so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?
      Last edited by sdriscoll; 06-04-2013, 02:49 PM. Reason: forgot to include my simulated read length and type
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #4
        One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

        Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

        Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

        Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.
        Last edited by gringer; 06-04-2013, 03:34 PM.

        Comment


        • #5
          Question/Bug

          I'm running version 1.3.3 on Mac OS 10.7.5. I've used both subread-align and subjunc and both of them crash if I specify the '-T' option with anything other than 1 thread.

          Here's what subjunc returns:
          Code:
          subread-align -J --allow-repeating  -T 2 -i '/Users/pfafflab/alind/ensemble/mm10/subread/mm10e' -r 'reads.1.fq' -R 'reads.2.fq' -o './subjunc-temp-sam-004023-uVJ2Zl' -P 3 -d 50 -D 600   -I 5 -b 0 
          
          Number of selected subreads = 10
          Consensus threshold = 3
          Number of threads=2
          Number of positions reported per multi-mapping read=1
          Number of indels allowed=5
          
          
          Performing paired-end alignment:
          Maximum distance between reads=600
          Minimum distance between reads=50
          Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
          Number of anchors=10
          The directions of the two input files are: forward, reversed
          
          Input file './subjunc-temp-sam-004023-uVJ2Zl' is not found or is in an incorrect format.
          The file '/subjunc-temp-sam-004023-uVJ2Zl' does exist and it has zero content. Passing it through the unix 'file' utility reports:

          Code:
          subjunc-temp-sam-004023-uVJ2Zl: empty
          subread-align is slightly less friendly throwing a seg fault

          Code:
          subread-align -T 2 -i ~/alind/ensemble/mm10/subread/mm10e -r reads.1.fq -R reads.2.fq -o temp.sam
          
          Number of selected subreads = 10
          Consensus threshold = 3
          Number of threads=2
          Number of positions reported per multi-mapping read=1
          Number of indels allowed=5
          
          
          Performing paired-end alignment:
          Maximum distance between reads=600
          Minimum distance between reads=50
          Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
          Number of anchors=10
          The directions of the two input files are: forward, reversed
          
          completed=1.25%; time used=7.1s; rate=29.5k reads/s; time left=103.4s; total=1593k pairs            
          completed=1.88%; time used=7.5s; rate=29.2k reads/s; time left=103.5s; total=1593k pairs            
          completed=2.51%; time used=7.8s; rate=28.9k reads/s; time left=103.7s; total=1592k pairs            
          Segmentation fault: 11
          it's not always the same though. sometimes it throws the seg fault without actually making any progress at all (during the 'Loading the 01-th index file ...') stage. In fact it even throws the fault at different points if I run the program several times in a row with the exact same options.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #6
            Originally posted by gringer View Post
            One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.
            That's funny I was thinking the same thing when I was looking at their plots. When I run my tests I like to use "precision" and "recall" where I define precision as the ratio of "correct" results to total results and recall as the ratio of "correct" results to total possible results (as in correctly aligned simulated reads to total simulated reads). Simple terms that don't include the confusion of trying to define both true and false positives and negatives. Since in these simulations we don't include any reads that intentionally should NOT align it's hard to define all four of those.

            I was wondering if they really defined accuracy by strict terms: (TP+TN)/(TP+TN+FP+FN) which, depending on how you define things, can really only have three values in that equation instead of four. So it might become (TP)/(TP+FP+FN) if you define false negatives as unaligned reads. now the equation has become equal to recall or sensitivity comparing correctly aligned reads to the total number of reads attempted to be aligned. So...accuracy isn't the right word anymore.

            I suppose you can be extra strict and say that since there are no "true" negatives then there can't be false negatives either in which case the equation for accuracy becomes the same as that for precision. This also makes it impossible to define specificity (aka the true negative rate) since it would always be 0.

            In my opinion it would be best to avoid any terms that include true or false negatives for aligner benchmarking. Then we can avoid bastardizing these basic terms.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #7
              Another question:

              Is it really necessary to have a limit to the number of references allowed when building an aligner index? I know people who work with poorly defined genomes who have to deal with 10's or sometimes over 100,000 scaffolds. This limit seems to completely exclude those possible users and puts a limitation on this aligner that doesn't exist in any other aligners that I've used.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • #8
                Originally posted by sdriscoll View Post
                I'll just start putting questions here and see what happens.

                First off, here's what I'm doing:

                I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

                I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

                So on to the question:

                I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

                so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?
                Dear sdriscoll,

                The definition of 'multi-mapping reads' in Subread is that if a read is called a multi-mapping read, it must have the same number of votes (number of maximum consensus subheads) and also the same length of genomic regions spanned by these consensus subheads in more than one location. This definition is apparently different from the definitions used by other aligners. I guess this to some extend explains the discrepancy you observed between Subread and Bowtie, although this does not explain why these reads were wrongly mapped by Subread.

                I noticed that there were exon-spanning reads in your simulation data. Subread determines the read mapping locations by using the largest mapped regions in the reads. For an exon-spanning read, the continuous mappable region is shorter than those of reads falling within an exon and thus the exon-spanning reads are more likely to be wrongly mapped. Given that you introduced SNP to the reference genome when you generated simulated reads, it is possible that Subread assigned such reads to locations where it got higher mapping quality according to its mapping criteria (and it found only one such location).

                You may try the Subjunc program (an exon-exon junction detecting program included in the Subread package) to align your reads. Subjunc is more accurate but less sensitive than Subread for the mapping of exon-spanning reads. Subjunc uses two best mappable regions in each read (instead of one as used by Subread) to determine the read mapping location. It also checks the donor/receptor sites. Your simulated reads must have the correct donor/receptor sites if you want to run Subjunc on them (Subjunc only recognizes GT/AG).

                I hope this explanation makes sense to you. We will be happy to have a close look at your simulation data if you'd like make it available to us.

                Best wishes,

                Wei

                Comment


                • #9
                  Originally posted by gringer View Post
                  One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

                  Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

                  Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

                  Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

                  Dear David Eccles,

                  Thanks for your comments.

                  I hope we made it clear in the paper what we mean by accuracy and recall. But I think I agree with you that these computer science terms may confuse people from other fields. So here I just try to describe a bit more these two terms to make them more clearer.

                  The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate. The recall was calculated as the percentage of correctly mapped reads out of all the reads included in the test dataset. This is equivalent to the sensitivity (the sensitivity in my dictionary). Please note that the denominators for the calculation of accuracy and recall are different, but the nominators are the same. So as you can see, our paper focused on the correctly mapped reads.

                  The size of the index built for human genome hg19 is 6.8GB. But please note that the amount of memory needed for mapping using this index ranges from 1GB to 7.6GB, depending on the memory size selection when you built the index.

                  As sdriscoll pointed out, there is currently a limit (50,000) on the number of chromosomes/contigs allowed when building the index. A massive database such as NCBI nucleotide database certainly has a lot more sequences than this number. So we are now working on removing this limitation (this should not be too hard). Once this is fixed, we may try to do a search similar to the blast search to see how Subread behaves. I guess there are many issues we need to address for this, such as not only reporting the best location/s but also reporting the second, third or fourth locations. This will need quite a bit changes to our program. But I think the exciting part for us is that we can test if the seed-and-vote paradigm is going to be useful in this search (like many short read aligners, blast also uses the seed-and-vote paradigm). I will be very happy to hear any suggestions for this.

                  Please let me know if you have any further questions.

                  Best wishes,

                  Wei
                  Last edited by shi; 06-04-2013, 08:53 PM.

                  Comment


                  • #10
                    Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

                    Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

                    I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #11
                      Originally posted by shi View Post
                      The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate.
                      isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #12
                        Originally posted by sdriscoll View Post
                        isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

                        http://en.wikipedia.org/wiki/Receive...characteristic
                        Sorry for my typo. It meant to be 1 - FDR. And you are correct that this is exactly the same as the 'precision' mentioned in one of your posts. We did not use precision in the paper because we use it to refer to the variation between replicates, which is irrelevant in this context to us.

                        Wei

                        Comment


                        • #13
                          Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

                          Code:
                          Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
                          subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
                          subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16
                          subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

                          Here's the read if you'd like to take a look (again this is in mm10):

                          Code:
                          ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment


                          • #14
                            Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #15
                              Originally posted by sdriscoll View Post
                              Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

                              Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

                              I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.
                              Hi sdriscoll,

                              Yes, Subread is quite different from other aligners in determining if a read is a multi-mapping reads. This stems from the way in which it finds the best mapping locations for the reads. It does not perform a full alignment for a read to determine its mapping location. It takes a number of equally-spaced 16bp subreads from the read, mapping them to the reference and then using the mapping location of these subreads to vote for the mapping location of the read (ie. the largest set of consensus subreads determines the mapping location of the read). So for example, if the largest set of consensus subreads include 5 subreads and the read length is 100bp, then the total number of known matched bases at the voting time is <= 80. So the read may have more than 20bp mismatched bases but still get mapped. On the other hand, if the largest consensus set includes 10 subreads, then all 100 bases are perfectly matched with the reference. However, Subread does perform a full read alignment in the end with the determined position.

                              You could possibly use the number of votes (-m option) to try to mimic the behavior of the edit distance cutoff. But I'm not really sure if you can find a good correlation between them.

                              Best wishes,

                              Wei

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              103 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              111 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              1 response
                              114 views
                              0 likes
                              Last Post EmiTom
                              by EmiTom
                               
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X