
  • #31
    Question...

    Wei mentioned that one difference between subread-align and subjunc is that subjunc has lower sensitivity. In other words, if you compare the two on the same set of RNA-seq-like reads, subread-align will align more of the data than subjunc. This seems to be the opposite of what I'm used to. For example, if you align RNA-seq reads to a genome with bowtie and then with Tophat, you'll have more reads aligned with Tophat, almost guaranteed, thanks to the additional spliced alignments. So intuitively something doesn't make sense. Why wouldn't the output of subjunc basically be the same as that of subread-align, plus the alignments of reads that span junctions?
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #32
      Wei everything you fixed yesterday seems to be working great. Thanks. Got another one for you. This is maybe simpler to deal with though. When using the '-J' option with subread-align and aligning paired-end reads it looks like if one of the mates is soft-clipped on the left side of the alignment the other mate's "mate position" field isn't updated to include the offset from the soft-clipping.

      Here's an example:

      Code:
      ENST00000367469_153_348_0_1:0_3	83	1	4557898	121	34M66S	=	4557683	-315	CGATCTGGGACCGCAGCTGAAGTGACGTGGGGCTAGAATCGGGTTTCTCCACTTCCAGGTCCTGGGAAACCCGCCGTTTCCGCAGCTCCTCCATCCTCTC	????????????????????????????????????????????????????????????????????????????????????????????????????	AS:i:3	NM:i:0	NH:i:1
      ENST00000367469_153_348_0_1:0_3	163	1	4557719	147	36S24M40S	=	4557898	315	ACCTTCTTGGAAGGTGGTCCTGGGCAGAGGGAGAAAGACTTACTTTCTTTCCACTTCTGGGGTTGACACGGCGCTACAGAAGCCAAGCGACTCTTCGATC	????????????????????????????????????????????????????????????????????????????????????????????????????	AS:i:1	NM:i:0	NH:i:1
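
The discrepancy in the two records above can be checked with a few lines of Python. This is only a sketch: the helper names `leading_soft_clip` and `pnext_error` are made up for illustration, and the numbers are copied by hand from the POS, CIGAR and mate-position fields shown above.

```python
import re

def leading_soft_clip(cigar: str) -> int:
    """Number of soft-clipped bases at the left end of a CIGAR string."""
    match = re.match(r"(\d+)S", cigar)
    return int(match.group(1)) if match else 0

def pnext_error(pnext_reported: int, mate_pos: int) -> int:
    """How far the reported mate position falls short of the mate's POS."""
    return mate_pos - pnext_reported

# Values copied from the two SAM records above (hypothetical dict layout).
read1 = {"pos": 4557898, "cigar": "34M66S", "pnext": 4557683}
read2 = {"pos": 4557719, "cigar": "36S24M40S", "pnext": 4557898}

# The first record's mate position is off by exactly the second
# record's left soft clip.
offset = pnext_error(read1["pnext"], read2["pos"])
print(offset, leading_soft_clip(read2["cigar"]))
```

The second record has no such problem, because its mate (the first record) has no soft clip on the left.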
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #33
        No offense but I find the usage message in the terminal to be a mess. Allowing the argument descriptions to wrap and mix in with the arguments makes it very difficult to read and find information. I reformatted the usage function (only in aligner.c) to wrap text at about 80 characters. The difference is like night and day...
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #34
          Dear sdriscoll,

          Thanks for your helpful suggestion and for reporting the bug. We will fix it when we are back at work on Tuesday. This is a long weekend in Australia.

          For the comparison of subread vs subjunc, firstly they can both map exon-spanning reads, so they are both splicing-aware aligners. But the difference is that subread performs local alignments, meaning that you will not get full alignments for exon-spanning reads from it, while Subjunc can give you full alignments for such reads.

          Subjunc applies more stringent criteria to the mapping of exon-spanning reads. This is mainly because the aim of Subjunc is to detect exon-exon junctions, and we found that using exon-spanning reads with higher mapping confidence to detect junctions significantly reduced its false discovery rate.

          So my recommendation for choosing between subread and subjunc to align your RNA-seq data is this: if the purpose of your analysis is gene expression analysis (e.g. discovering differentially expressed genes), subread is the better choice (its slightly lower accuracy is outweighed by its higher sensitivity). Otherwise you should use subjunc.

          I hope this makes sense to you.

          Best wishes,

          Wei

          Comment


          • #35
            Hey Wei,

            another question:
            Is there a way to add read-group information to the SAM file from the command line (rather than doing it after the alignment, in order to save IO)?

            Comment


            • #36
              Hey Wei,

              Just finished aligning one color-space exome in 25min on 16 cores, that is insanely fast!

              Problem is the resulting SAM file is still in color-space encoding... Am I missing some parameter here to output basespace, or would I have to do the conversion to basespace with a script? Also this poses the same problem as my question above: if the colorspace-to-basespace conversion is not done directly, one would have to write another file and thus increase disk usage, which slows down the process...
              Last edited by Bernt.Popp; 06-09-2013, 11:21 AM.

              Comment


              • #37
                Also during runtime this section of the output message should be updated to match the redefined meaning of the -d and -D options:


                Performing paired-end alignment:
                Maximum distance between reads=600
                Minimum distance between reads=50
                Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
                Number of anchors=10
                The directions of the two input files are: forward, reversed
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #38
                  Thanks sdriscoll,

                  They will be changed in the next release. We are now testing the changes we have made. A new release should be available fairly soon.


                  Best wishes,

                  Wei

                  Comment


                  • #39
                    We have just released a new version 1.3.5-p2 which mainly includes the following changes:

                    (1) Fixed a bug in reporting the mapping location of the mate read when it contains soft-clipped bases.
                    (2) Reformatted the program usage info and updated the program output info.
                    (3) A '-b' option was added to subread-align to output base-space reads when mapping color-space reads.

                    Please check it out from http://subread.sourceforge.net

                    Thanks again for your helpful comments.

                    Best wishes,

                    Wei

                    Comment


                    • #40
                      awesome, thanks!

                      I was wondering what the expected change in the aligner's performance would be by tweaking the -n option. I have seen that the alignments are more strict when I increase the -m value but I don't understand what should happen when we increase or decrease -n. I do understand that these values are defaulted to relatively optimal settings.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #41
                        Hi sdriscoll,

                        From our evaluations based on 100bp reads, we found that -n should be neither too small (<7) nor too big (>20). My explanation is that if -n is too small, you will not have enough power to map the reads accurately. When -n is too big, quite a bit of noise seems to be introduced into the mapping locations, which may also decrease the mapping accuracy.

                        We found that using -n=10 (default setting) or a very close number yielded the best results in terms of sensitivity, accuracy and speed. However, the differences are quite minor for different -n values if you always keep the ratio of -m/-n at ~30% for different -n values.

                        You are correct that the alignments become more stringent when the -m value is increased. The false positive rate will be reduced with larger -m values, but you will get fewer mapped reads. So the -n and -m options allow you to strike the balance you want between sensitivity and accuracy. With the default setting, Subread leans a little towards the accuracy end of the spectrum.
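
As a rough illustration of the "keep -m/-n at ~30%" rule of thumb described above, the arithmetic could be sketched as follows. The helper `suggest_m` is hypothetical and not part of Subread; it only rounds 30% of -n to the nearest integer.

```python
def suggest_m(n_subreads: int, ratio_percent: int = 30) -> int:
    """Return an -m value that keeps -m/-n near ratio_percent (assumed rule of thumb).

    Integer arithmetic is used so that halves always round up.
    """
    return max(1, (n_subreads * ratio_percent + 50) // 100)

# Try the range of -n values mentioned in the post (7 to 20).
for n in (7, 10, 14, 20):
    print(f"-n {n} -> -m {suggest_m(n)}")
```

This is arithmetic only. Whether a given pair of values suits a particular dataset would still need to be checked empirically.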

                        Hope this makes sense to you.

                        Best wishes,

                        Wei

                        Comment


                        • #42
                          Thank you for the explanation. Does it seem logical, then, to adjust these settings if I have 50bp or 75bp reads instead of 100?
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment


                          • #43
                            Hi sdriscoll,

                            I wouldn't recommend changing the subread settings for the mapping of shorter reads. We have run subread on quite a few datasets containing shorter reads and found the mapping performance to be quite good. The default setting, which is also our recommended setting, is largely insensitive to read length because subread determines read mapping locations primarily from the number of votes, rather than from other metrics such as the number of mismatched bases.

                            In the subread paper, we also ran subread on a 202bp dataset using its default setting and subread was found to perform very well. I suspect that the default setting may work well with even longer reads.

                            So I think the default setting should deliver the best mapping results for Subread in most cases.


                            Best regards,

                            Wei

                            Comment


                            • #44
                              Hey Wei,

                              I am trying to align SOLiD colorspace reads with subread (1.4.0).
                              The commands used are:
                              1)
                              subread-buildindex -c -o human_g1k_v37_decoy human_g1k_v37_decoy.fasta
                              2)
                              subread-align -T 16 -I 16 -b -i $ref -r $myfilename".csfasta" -o $mydnaID.$myslide.subread.sam
                              3) adding readgroup information, sorting and converting to BAM with picard.

                              Unfortunately, either there is some bug in the conversion from colorspace to basespace (option -b) or I am doing something wrong, as the alignments are totally messy when viewed in IGV (although the reads seem to be at the right position).
                              Here is an example with a comparison to CUSHAW2 and novoalignCS alignments:
                              https://www.dropbox.com/s/4vgi0c7ev1...%20subread.jpg
                              Do you have any idea what could be wrong?

                              Also the new Indel feature does not emit any variants for the colorspace exomes analyzed...

                              Cheers,

                              Bernt

                              Comment


                              • #45
                                My guess is that this is due to the colour-space to basespace mapping being too strict.

                                It is possible that the "bad" alignments that you are seeing are the result of errors in the color-space sequence -- any error in the sequence will cause all the following bases to be incorrect. You're not showing the reference sequence, so I can't work out if this is the case in this situation.

                                Any colour-space to base-space conversion needs to take into account (and correct) errors so that the base-space sequences are correct. When there is a sequence difference, the conversion needs to make sure that only that position is changed in the base-space version.

                                Consider the following sequences that map to the same position:
                                Code:
                                reference: G101320112
                                sequence1: X101120112
                                sequence2: X101312011
                                sequence3: X201320312
                                [I chucked an INDEL in there to make things a bit harder]

                                A naive base-space conversion would convert these sequences as follows:

                                Code:
                                reference: GTTGCTTGTC
                                sequence1: GTTGTCCACT
                                sequence2: GTTGCAGGTG
                                sequence3: GAACGAATGA
                                [apologies if my conversion is incorrect. Fixes appreciated]

                                Very similar colour-space sequences, but very different base-space sequences.
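
The naive conversion above can be sketched in Python. This assumes the standard two-bit colour code (0 = same base, 1 = A<->C / G<->T, 2 = A<->G / C<->T, 3 = A<->T / C<->G), and it assumes the reads' leading "X" stands for the same primer base G as the reference. The function name is illustrative only.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of each base

def decode_colour_space(read: str, primer_base: str = "G") -> str:
    """Naively translate a colour-space read into base space.

    Each colour is XORed with the 2-bit code of the previous base, so a
    single wrong colour changes every base that follows it.
    """
    first = primer_base if read[0] == "X" else read[0]
    current = BASES.index(first)
    decoded = [first]
    for colour in read[1:]:
        current ^= int(colour)
        decoded.append(BASES[current])
    return "".join(decoded)

for name, read in [("reference", "G101320112"),
                   ("sequence1", "X101120112"),
                   ("sequence2", "X101312011"),
                   ("sequence3", "X201320312")]:
    print(f"{name}: {decode_colour_space(read)}")
```

Because each base depends on all preceding colours, a single colour difference (sequence1, position 5) scrambles the rest of the decoded read. This is why an error-aware conversion has to work relative to the reference.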

                                A more correct conversion would notice where the errors were in the sequences relative to the index, and modify the next colour-space base as well to something that looks appropriate:

                                Code:
                                reference: G101320112
                                sequence1: X101100112
                                sequence2: X101313011
                                sequence3: X231320332
                                This would end up with these converted sequences:

                                Code:
                                reference: GTTGCTTGTC
                                sequence1: GTTGTTTGTC
                                sequence2: GTTGCATTGT
                                sequence3: GATGCTTATC
                                Which look considerably better.

                                I hate colour-space because the conversions are very unintuitive, and difficult to explain to other people. About the only nice thing is that reverse complement is just the reverse, but this also means that aligners need to be modified to account for that when working in colour-space (or double-encoded colour-space), and you can get weird unexpected chimeras (e.g. poly-A tails and poly-T heads merging). You can save a lot of pain and confusion by sticking with a base-space sequencer.
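
The remark that a reverse complement is just the reverse in colour-space can be checked with a short sketch, again assuming the standard two-bit colour code. The helper names are illustrative.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of each base
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def encode_colours(bases: str) -> str:
    """Colour of each adjacent base pair (XOR of the two 2-bit base codes)."""
    return "".join(str(BASES.index(a) ^ BASES.index(b))
                   for a, b in zip(bases, bases[1:]))

def reverse_complement(bases: str) -> str:
    """Reverse complement of a base-space sequence."""
    return bases.translate(COMPLEMENT)[::-1]

seq = "GTTGCTTGTC"  # the base-space reference from the example above
forward = encode_colours(seq)
backward = encode_colours(reverse_complement(seq))
print(forward, backward, backward == forward[::-1])
```

Complementing both bases of a pair leaves their colour unchanged, so only the order of the colours flips.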

                                Comment
