Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • transcriptome mapping with bowtie -- only 22% reads hit

    To check how well the sequencing parts were done, I used bowtie to map all the reads from RNA-seq directly to human refseqs, and I got only 22% (of ) of reads have at least one hit, 78% failed to align, which I think is impossible.

    my command is:
    "bowtie -p 5 -m 10 -f trx -1 mate1.fa -2 mate2.fa >output"

    The output:
    # reads processed: 130653415
    # reads with at least one reported alignment: 28765542 (22.02%)
    # reads that failed to align: 101882364 (77.98%)
    # reads with alignments suppressed due to -m: 5509 (0.00%)
    Reported 28765542 paired-end alignments to 1 output stream(s)


    Any idea why so few reads got mapped? what did I do wrong?

    Thank you very much,

    Iris

  • #2
    Hi Iris,

    Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

    Hope that helps,
    Ben

    Comment


    • #3
      Originally posted by Ben Langmead View Post
      Hi Iris,

      Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

      Hope that helps,
      Ben
      Thanks Ben.
      I see. I'll map the two mates separately.

      Iris

      Comment


      • #4
        A more useful way to pose this question in the future is to find one example of a read or a pair of reads that you expected would map, but did not.
        --
        Jeremy Leipzig
        Bioinformatics Programmer
        --
        My blog
        Twitter

        Comment


        • #5
          Originally posted by Ben Langmead View Post
          Hi Iris,

          Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

          Hope that helps,
          Ben
          If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?

          Comment


          • #6
            Originally posted by Lee Sam View Post
            If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?
            I am supposed to. That's why I don't use tophat.

            Comment


            • #7
              Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

              ===mate1=====
              reads processed: 130653415
              # reads with at least one reported alignment: 47875358 (36.64%)
              # reads that failed to align: 82659307 (63.27%)
              # reads with alignments suppressed due to -m: 118750 (0.09%)
              Reported 47875358 alignments to 1 output stream(s)

              ====mate2=====
              # reads processed: 130653415
              # reads with at least one reported alignment: 35773578 (27.38%)
              # reads that failed to align: 94798249 (72.56%)
              # reads with alignments suppressed due to -m: 81588 (0.06%)
              Reported 35773578 alignments to 1 output stream(s)

              If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
              Any comments are welcome

              Iris

              Comment


              • #8
                Originally posted by IrisZhu View Post
                I am supposed to. That's why I don't use tophat.
                that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?

                Originally posted by IrisZhu View Post
                Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

                ===mate1=====
                reads processed: 130653415
                # reads with at least one reported alignment: 47875358 (36.64%)
                # reads that failed to align: 82659307 (63.27%)
                # reads with alignments suppressed due to -m: 118750 (0.09%)
                Reported 47875358 alignments to 1 output stream(s)

                ====mate2=====
                # reads processed: 130653415
                # reads with at least one reported alignment: 35773578 (27.38%)
                # reads that failed to align: 94798249 (72.56%)
                # reads with alignments suppressed due to -m: 81588 (0.06%)
                Reported 35773578 alignments to 1 output stream(s)

                If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
                Any comments are welcome

                Iris
                Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?
                Last edited by Lee Sam; 08-10-2010, 12:06 PM.

                Comment


                • #9
                  Anyone knows how to make bowtie report unmapped reads?
                  What's the option pamameter?
                  Thanks!
                  Iris

                  Comment


                  • #10
                    Originally posted by Lee Sam View Post
                    that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?



                    Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?
                    I mapped two mates seperately, i.e. in a unpaired manner. So no inset size here.

                    Comment


                    • #11
                      Originally posted by IrisZhu View Post
                      Anyone knows how to make bowtie report unmapped reads?
                      What's the option pamameter?
                      Thanks!
                      Iris
                      no but that's easy enough to infer:

                      #get defline, strip >
                      grep '>' mate1.fa | sed -e 's/>//' > in.seqs

                      #get first col with seqnames
                      cut -f1 output.bowtie > out.seqs

                      #get symmetric difference
                      sort in.seqs out.seqs | uniq -u
                      --
                      Jeremy Leipzig
                      Bioinformatics Programmer
                      --
                      My blog
                      Twitter

                      Comment


                      • #12
                        IrisZhu,

                        to make bowtie report the unmapped reads, use the command line option:
                        --un <output_file.fa>

                        Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

                        Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

                        In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!
                        Last edited by babati; 08-10-2010, 01:15 PM.

                        Comment


                        • #13
                          Originally posted by babati View Post
                          IrisZhu,
                          to make bowtie report the unmapped reads, use the command line option:
                          --unmapped <output_file.fa>
                          can you point me to where that feature is described? it is not working for me.
                          --
                          Jeremy Leipzig
                          Bioinformatics Programmer
                          --
                          My blog
                          Twitter

                          Comment


                          • #14
                            Originally posted by babati View Post
                            IrisZhu,

                            to make bowtie report the unmapped reads, use the command line option:
                            --un <output_file.fa>

                            Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

                            Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

                            In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!
                            My sample is supposed to be mRNA. I mapped them to the whole genome, ~30% was mapped. If i used bowtie properly , then it must be the problems of sequencing data itself.
                            And I just tried mapping another batch of data to human refseqs, the results are much more reasonble:
                            reads processed: 76269225
                            # reads with at least one reported alignment: 46230181 (60.61%)
                            # reads that failed to align: 29912348 (39.22%)
                            # reads with alignments suppressed due to -m: 126696 (0.17%)


                            BTW, is my command "bowtie -p 10 -m 10 -f hg19 myfile.fa (for unpaired reads)" right?
                            Thanks!

                            Comment


                            • #15
                              Originally posted by Zigster View Post
                              no but that's easy enough to infer:

                              #get defline, strip >
                              grep '>' mate1.fa | sed -e 's/>//' > in.seqs

                              #get first col with seqnames
                              cut -f1 output.bowtie > out.seqs

                              #get symmetric difference
                              sort in.seqs out.seqs | uniq -u
                              Oh yes we can do it this way .... Thank you!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X