Hello.
I am working with transcriptome data from a new HiSeq Machine, 100xbp Paired End Reads, and trying to map it to the newly released Brassica rapa genome using tophat.
A)General question - I get the BAM output after using both mate-pairs and convert it to SAM; the Bit-wise flags are bizarre numbers; 99, 147,163, etc, rather than what I am used to seeing in runs where there are no mate pairs, i.e. just seeing 0 and 16, etc...
If I just put one of the mate pairs into Tophat, I get the normal bit flags. What do these new bitflags tell me? Is there a table somewhere of these new bitflags? I realize they have to do with the fact that each 100bp read in this case is a new line of the SAM file, even if it is only one half of a mate-pair. Mate pairs also seem to have different flags, i.e. one half of the pair will have one flag, and the next member of the pair on the next line will have another (99,147, for example). What is all this about?
I like to have the BAM and SAM files, and converting it this way leaves the SAM file free of a header, so its ready to use in a custom pipe.
B)I took the first 10000 reads (first 40k lines of each .fastq file) and tried to play with mapping settings to get an idea of how rough the genome was, what settings would be best, etc, as it is newly published and potentially has gaps. I have few questions here:
B1)So I tried doing both mate pairs separately at first,then together...
I get 5009/10000 mapping (by lines in the headerless SAM file) for the first half of the mate pair.
I get 4880/10000 mapping for the second half of the mate pair.
When I give it both pieces of information at once, I get 9987/20000 reads mapping, wheras the sum of the two is only 9889/20000 - So Tophat is using information from one mate pair to help uniquely assign(-g 1) the other it seems, as it did pick up another 98 reads, by my understanding. Is this correct reasoning? I had always been told that Tophat maps things independently.
B2)Still confused about how Top-hat interprets the mate-pair, i tried not setting unique, i.e *not* setting the max multi-hits to 1;
And then got my BAM->SAM, and that upped the number of lines to ~14000, out of a possible 20000 (i.e. 10000 from both parts of the mate pairs), ~75% mapping. That's acceptable.
My question is basically is about how tophat interprets each member of the mate pair; when I am using -g 1, I get significantly less reads mapping, which is normally very understandable compared to not using it, especially given a duplicate rich genome such as this one. However, again, working on the premise that tophat (as I have been told) maps the mate pairs independently, does it apply the no multi-mapping requirement to the entire PAIR or to each 100bp member individually?
For example, if I have a mate pair read for which P_1 maps to gene A or gene B, but P_2 maps exclusively to gene B, forcing the mate-pair to actually land on gene B, what happens?
1)Tophat maps P_1, cant find a unique hit, throws it away, maps P_2, finds it exclusive to Gene B, and keeps P_2 Exclusively (1/2 of the reads map)
2)Tophat maps P_1 and P_2, finds that the only rational place of origin for this mate pair is Gene B, and maps both properly to Gene B
3)Or even more complex, if P_1 was in Gene A and B, P_2 was In Gene B and C, if P_1 and P_2 are mate paired, would tophat throw out each in turn, or would it correctly map the mate pair to Gene B?
Basically, is -g 1 applying to the PAIR or each member individually? I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.
Thanks in advance for any input on this problem. =)
I am working with transcriptome data from a new HiSeq Machine, 100xbp Paired End Reads, and trying to map it to the newly released Brassica rapa genome using tophat.
A)General question - I get the BAM output after using both mate-pairs and convert it to SAM; the Bit-wise flags are bizarre numbers; 99, 147,163, etc, rather than what I am used to seeing in runs where there are no mate pairs, i.e. just seeing 0 and 16, etc...
If I just put one of the mate pairs into Tophat, I get the normal bit flags. What do these new bitflags tell me? Is there a table somewhere of these new bitflags? I realize they have to do with the fact that each 100bp read in this case is a new line of the SAM file, even if it is only one half of a mate-pair. Mate pairs also seem to have different flags, i.e. one half of the pair will have one flag, and the next member of the pair on the next line will have another (99,147, for example). What is all this about?
I like to have the BAM and SAM files, and converting it this way leaves the SAM file free of a header, so its ready to use in a custom pipe.
B)I took the first 10000 reads (first 40k lines of each .fastq file) and tried to play with mapping settings to get an idea of how rough the genome was, what settings would be best, etc, as it is newly published and potentially has gaps. I have few questions here:
B1)So I tried doing both mate pairs separately at first,then together...
Code:
tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_2.fq tophat -g 1 --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 200 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
I get 4880/10000 mapping for the second half of the mate pair.
When I give it both pieces of information at once, I get 9987/20000 reads mapping, wheras the sum of the two is only 9889/20000 - So Tophat is using information from one mate pair to help uniquely assign(-g 1) the other it seems, as it did pick up another 98 reads, by my understanding. Is this correct reasoning? I had always been told that Tophat maps things independently.
B2)Still confused about how Top-hat interprets the mate-pair, i tried not setting unique, i.e *not* setting the max multi-hits to 1;
Code:
tophat --segment-mismatches 3 --library-type fr-unstranded --butterfly-search -r 100 /home/quorteth/bowtie-0.12.7/indexes/Brassica_Rapa Rapa_1_1.fq Rapa_1_2.fq
My question is basically is about how tophat interprets each member of the mate pair; when I am using -g 1, I get significantly less reads mapping, which is normally very understandable compared to not using it, especially given a duplicate rich genome such as this one. However, again, working on the premise that tophat (as I have been told) maps the mate pairs independently, does it apply the no multi-mapping requirement to the entire PAIR or to each 100bp member individually?
For example, if I have a mate pair read for which P_1 maps to gene A or gene B, but P_2 maps exclusively to gene B, forcing the mate-pair to actually land on gene B, what happens?
1)Tophat maps P_1, cant find a unique hit, throws it away, maps P_2, finds it exclusive to Gene B, and keeps P_2 Exclusively (1/2 of the reads map)
2)Tophat maps P_1 and P_2, finds that the only rational place of origin for this mate pair is Gene B, and maps both properly to Gene B
3)Or even more complex, if P_1 was in Gene A and B, P_2 was In Gene B and C, if P_1 and P_2 are mate paired, would tophat throw out each in turn, or would it correctly map the mate pair to Gene B?
Basically, is -g 1 applying to the PAIR or each member individually? I wish to have the power to resolve expression differences between genes with similar sequences, and I only want to quantify reads as originating from a gene if that read can ultimately only be attributed to one gene.
Thanks in advance for any input on this problem. =)
Comment