Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • IrisZhu
    Member
    • Jul 2010
    • 25

    transcriptome mapping with bowtie -- only 22% reads hit

    To check how well the sequencing parts were done, I used bowtie to map all the reads from RNA-seq directly to human refseqs, and I got only 22% (of ) of reads have at least one hit, 78% failed to align, which I think is impossible.

    my command is:
    "bowtie -p 5 -m 10 -f trx -1 mate1.fa -2 mate2.fa >output"

    The output:
    # reads processed: 130653415
    # reads with at least one reported alignment: 28765542 (22.02%)
    # reads that failed to align: 101882364 (77.98%)
    # reads with alignments suppressed due to -m: 5509 (0.00%)
    Reported 28765542 paired-end alignments to 1 output stream(s)


    Any idea why so few reads got mapped? what did I do wrong?

    Thank you very much,

    Iris
  • Ben Langmead
    Senior Member
    • Sep 2008
    • 200

    #2
    Hi Iris,

    Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

    Hope that helps,
    Ben

    Comment

    • IrisZhu
      Member
      • Jul 2010
      • 25

      #3
      Originally posted by Ben Langmead View Post
      Hi Iris,

      Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

      Hope that helps,
      Ben
      Thanks Ben.
      I see. I'll map the two mates separately.

      Iris

      Comment

      • Zigster
        Jeremy Leipzig
        • May 2009
        • 117

        #4
        A more useful way to pose this question in the future is to find one example of a read or a pair of reads that you expected would map, but did not.
        --
        Jeremy Leipzig
        Bioinformatics Programmer
        --
        My blog
        Twitter

        Comment

        • Lee Sam
          Member
          • Oct 2008
          • 57

          #5
          Originally posted by Ben Langmead View Post
          Hi Iris,

          Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

          Hope that helps,
          Ben
          If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?

          Comment

          • IrisZhu
            Member
            • Jul 2010
            • 25

            #6
            Originally posted by Lee Sam View Post
            If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?
            I am supposed to. That's why I don't use tophat.

            Comment

            • IrisZhu
              Member
              • Jul 2010
              • 25

              #7
              Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

              ===mate1=====
              reads processed: 130653415
              # reads with at least one reported alignment: 47875358 (36.64%)
              # reads that failed to align: 82659307 (63.27%)
              # reads with alignments suppressed due to -m: 118750 (0.09%)
              Reported 47875358 alignments to 1 output stream(s)

              ====mate2=====
              # reads processed: 130653415
              # reads with at least one reported alignment: 35773578 (27.38%)
              # reads that failed to align: 94798249 (72.56%)
              # reads with alignments suppressed due to -m: 81588 (0.06%)
              Reported 35773578 alignments to 1 output stream(s)

              If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
              Any comments are welcome

              Iris

              Comment

              • Lee Sam
                Member
                • Oct 2008
                • 57

                #8
                Originally posted by IrisZhu View Post
                I am supposed to. That's why I don't use tophat.
                that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?

                Originally posted by IrisZhu View Post
                Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

                ===mate1=====
                reads processed: 130653415
                # reads with at least one reported alignment: 47875358 (36.64%)
                # reads that failed to align: 82659307 (63.27%)
                # reads with alignments suppressed due to -m: 118750 (0.09%)
                Reported 47875358 alignments to 1 output stream(s)

                ====mate2=====
                # reads processed: 130653415
                # reads with at least one reported alignment: 35773578 (27.38%)
                # reads that failed to align: 94798249 (72.56%)
                # reads with alignments suppressed due to -m: 81588 (0.06%)
                Reported 35773578 alignments to 1 output stream(s)

                If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
                Any comments are welcome

                Iris
                Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?
                Last edited by Lee Sam; 08-10-2010, 12:06 PM.

                Comment

                • IrisZhu
                  Member
                  • Jul 2010
                  • 25

                  #9
                  Anyone knows how to make bowtie report unmapped reads?
                  What's the option pamameter?
                  Thanks!
                  Iris

                  Comment

                  • IrisZhu
                    Member
                    • Jul 2010
                    • 25

                    #10
                    Originally posted by Lee Sam View Post
                    that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?



                    Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?
                    I mapped two mates seperately, i.e. in a unpaired manner. So no inset size here.

                    Comment

                    • Zigster
                      Jeremy Leipzig
                      • May 2009
                      • 117

                      #11
                      Originally posted by IrisZhu View Post
                      Anyone knows how to make bowtie report unmapped reads?
                      What's the option pamameter?
                      Thanks!
                      Iris
                      no but that's easy enough to infer:

                      #get defline, strip >
                      grep '>' mate1.fa | sed -e 's/>//' > in.seqs

                      #get first col with seqnames
                      cut -f1 output.bowtie > out.seqs

                      #get symmetric difference
                      sort in.seqs out.seqs | uniq -u
                      --
                      Jeremy Leipzig
                      Bioinformatics Programmer
                      --
                      My blog
                      Twitter

                      Comment

                      • babati
                        Junior Member
                        • Aug 2010
                        • 1

                        #12
                        IrisZhu,

                        to make bowtie report the unmapped reads, use the command line option:
                        --un <output_file.fa>

                        Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

                        Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

                        In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!
                        Last edited by babati; 08-10-2010, 01:15 PM.

                        Comment

                        • Zigster
                          Jeremy Leipzig
                          • May 2009
                          • 117

                          #13
                          Originally posted by babati View Post
                          IrisZhu,
                          to make bowtie report the unmapped reads, use the command line option:
                          --unmapped <output_file.fa>
                          can you point me to where that feature is described? it is not working for me.
                          --
                          Jeremy Leipzig
                          Bioinformatics Programmer
                          --
                          My blog
                          Twitter

                          Comment

                          • IrisZhu
                            Member
                            • Jul 2010
                            • 25

                            #14
                            Originally posted by babati View Post
                            IrisZhu,

                            to make bowtie report the unmapped reads, use the command line option:
                            --un <output_file.fa>

                            Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

                            Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

                            In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!
                            My sample is supposed to be mRNA. I mapped them to the whole genome, ~30% was mapped. If i used bowtie properly , then it must be the problems of sequencing data itself.
                            And I just tried mapping another batch of data to human refseqs, the results are much more reasonble:
                            reads processed: 76269225
                            # reads with at least one reported alignment: 46230181 (60.61%)
                            # reads that failed to align: 29912348 (39.22%)
                            # reads with alignments suppressed due to -m: 126696 (0.17%)


                            BTW, is my command "bowtie -p 10 -m 10 -f hg19 myfile.fa (for unpaired reads)" right?
                            Thanks!

                            Comment

                            • IrisZhu
                              Member
                              • Jul 2010
                              • 25

                              #15
                              Originally posted by Zigster View Post
                              no but that's easy enough to infer:

                              #get defline, strip >
                              grep '>' mate1.fa | sed -e 's/>//' > in.seqs

                              #get first col with seqnames
                              cut -f1 output.bowtie > out.seqs

                              #get symmetric difference
                              sort in.seqs out.seqs | uniq -u
                              Oh yes we can do it this way .... Thank you!

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 10:09 AM
                              0 responses
                              9 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              24 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...