Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Cirno
    Junior Member
    • Jun 2011
    • 6

    Questions about Top-Hat and Paired End Reads

    Hello.

    I am working with transcriptome data from a new HiSeq Machine, 100xbp Paired End Reads, and trying to map it to the newly released Brassica rapa genome using tophat.

    A)General question - I get the BAM output after using both mate-pairs and convert it to SAM; the Bit-wise flags are bizarre numbers; 99, 147,163, etc, rather than what I am used to seeing in runs where there are no mate pairs, i.e. just seeing 0 and 16, etc...
    If I just put one of the mate pairs into Tophat, I get the normal bit flags. What do these new bitflags tell me? Is there a table somewhere of these new bitflags? I realize they have to do with the fact that each 100bp read in this case is a new line of the SAM file, even if it is only one half of a mate-pair. Mate pairs also seem to have different flags, i.e. one half of the pair will have one flag, and the next member of the pair on the next line will have another (99,147, for example). What is all this about?

    I like to have the BAM and SAM files, and converting it this way leaves the SAM file free of a header, so its ready to use in a custom pipe.

    B)I took the first 10000 reads (first 40k lines of each .fastq file) and tried to play with mapping settings to get an idea of how rough the genome was, what settings would be best, etc, as it is newly published and potentially has gaps. I have few questions here:

    B1)So I tried doing both mate pairs separately at first,then together...
    Code:
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_2.fq
    tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
    I get 5009/10000 mapping (by lines in the headerless SAM file) for the first half of the mate pair.
    I get 4880/10000 mapping for the second half of the mate pair.
    When I give it both pieces of information at once, I get 9987/20000 reads mapping, wheras the sum of the two is only 9889/20000 - So Tophat is using information from one mate pair to help uniquely assign(-g 1) the other it seems, as it did pick up another 98 reads, by my understanding. Is this correct reasoning? I had always been told that Tophat maps things independently.

    B2)Still confused about how Top-hat interprets the mate-pair, i tried not setting unique, i.e *not* setting the max multi-hits to 1;
    Code:
    tophat --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 100 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
    And then got my BAM->SAM, and that upped the number of lines to ~14000, out of a possible 20000 (i.e. 10000 from both parts of the mate pairs), ~75% mapping. That's acceptable.

    My question is basically is about how tophat interprets each member of the mate pair; when I am using -g 1, I get significantly less reads mapping, which is normally very understandable compared to not using it, especially given a duplicate rich genome such as this one. However, again, working on the premise that tophat (as I have been told) maps the mate pairs independently, does it apply the no multi-mapping requirement to the entire PAIR or to each 100bp member individually?

    For example, if I have a mate pair read for which P_1 maps to gene A or gene B, but P_2 maps exclusively to gene B, forcing the mate-pair to actually land on gene B, what happens?
    1)Tophat maps P_1, cant find a unique hit, throws it away, maps P_2, finds it exclusive to Gene B, and keeps P_2 Exclusively (1/2 of the reads map)
    2)Tophat maps P_1 and P_2, finds that the only rational place of origin for this mate pair is Gene B, and maps both properly to Gene B
    3)Or even more complex, if P_1 was in Gene A and B, P_2 was In Gene B and C, if P_1 and P_2 are mate paired, would tophat throw out each in turn, or would it correctly map the mate pair to Gene B?

    Basically, is -g 1 applying to the PAIR or each member individually? I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.

    Thanks in advance for any input on this problem. =)
  • lintianfeng
    Junior Member
    • Sep 2011
    • 3

    #2
    I have the same question, does anyone have any clue? Thanks

    Comment

    • biznatch
      Senior Member
      • Nov 2010
      • 124

      #3
      See the SAM spec for flags, http://samtools.sourceforge.net/, link at the top right. You can also use this to explain them http://picard.sourceforge.net/explain-flags.html.

      You don't have to use "--library-type fr-unstranded" that is the default setting (I know it doesn't say that in the manual but one of the Tophat devs posted that on here somewhere).

      I'm not sure about the -g stuff.
      Last edited by biznatch; 12-02-2011, 09:32 AM.

      Comment

      • polyatail
        Member
        • Dec 2010
        • 25

        #4
        Basically, is -g 1 applying to the PAIR or each member individually?
        From what I can tell, -g is passed by TopHat directly to bowtie as the -k (maximum number of alignments reported) and -m (suppress all alignments if >n are found) parameters. As each mate is aligned separately, -g 1 is applied to each member individually. Information from one mate is not used to direct the other mate to the correct location.

        TopHat splits each read into a number of segments (set by --segment-length) that are aligned independently. Interestingly, it appears the values of -k and -m passed to bowtie in these alignments are twice that specified by -g. Anyone know why that is?

        I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.
        My sense is that you're going to bias expression quantification with this approach. Just because one copy of the gene has more unique sequences than a few of its paralogs doesn't necessarily mean it is more highly expressed. Cufflinks is pretty good at estimating expression in situations like this--have your -g 20 (default) runs produced bad results? If there are only a handful of paralogs, you might try identifying unique regions from a multiple alignment and going from there.

        And then got my BAM->SAM, and that upped the number of lines to ~14000, out of a possible 20000 (i.e. 10000 from both parts of the mate pairs), ~75% mapping. That's acceptable.
        Each line can contain only one mapping, and reads aligning to multiple locations, or multiple ways a single read can align to the same position (i.e. they have different CIGAR strings) will appear on multiple lines. Can you confirm that the number of lines accurately represents the number of aligned reads?

        Perhaps compare the results to:

        Code:
        samtools view accepted_hits.bam | awk '{printf $1"\n"}' | sort -u | wc -l
        Hope this helps!


        Andrew

        Comment

        • lintianfeng
          Junior Member
          • Sep 2011
          • 3

          #5
          Hello, I'm still not very clear about the -g stuff for the paired end reads. For example, if one mate has only one match to the reference genome, but the other mating read have several matches, will tophat report the paired-end read when setting -g=1?

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Pathogen Surveillance with Advanced Genomic Tools
            by seqadmin




            The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
            03-24-2025, 11:48 AM
          • seqadmin
            New Genomics Tools and Methods Shared at AGBT 2025
            by seqadmin


            This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

            The Headliner
            The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
            03-03-2025, 01:39 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 03-20-2025, 05:03 AM
          0 responses
          49 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-19-2025, 07:27 AM
          0 responses
          57 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-18-2025, 12:50 PM
          0 responses
          50 views
          0 reactions
          Last Post seqadmin  
          Started by seqadmin, 03-03-2025, 01:15 PM
          0 responses
          200 views
          0 reactions
          Last Post seqadmin  
          Working...