Hi everyone,
I'm working on a university project about "resequencing" a small genome (the reference genome is laidlawii).
I have the reference genome and two FASTQ files containing the reads of an Illumina mate-pair library from a target genome.
Getting to the point: I've been asked to generate a track for IGV representing the "percentage of oriented mates", and I simply can't work out which read in each pair is the left one and which is the right one.
Each read in the two FASTQ files has an ID and is also marked with a /1 or /2 tag: one file contains all the /1 reads and the other all the /2 reads.
The question is whether there is a strong relation between the tag and whether the read is the left or the right one.
For alignment I use PASS (pass.cribi.unipd.it), which outputs a SAM file that reports, among other things, whether a read aligned reverse complemented (flag bit 0x10 set).
In almost every pair one read aligns left->right while the other aligns reverse complemented (maybe Illumina sequences the two ends from different strands?).
To put it simply: can I say that every mate pair with /1 aligning left->right and /2 aligning reverse complemented is left->right oriented on the reference?
And that in the opposite case the mate pair aligns reversed on the reference?
(The assumption to prove is that /1 is always the left (or always the right) mate.)
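One way to test that assumption directly on the aligner output is to tally, for the /1 record of every pair, whether it lies left or right of its mate and on which strand. This is only a minimal Python sketch, assuming whitespace-separated SAM columns in the standard order. The function name `tally_first_read_side` is made up for illustration, and the two sample pairs are taken from my PASS output with the SEQ column dropped:

```python
from collections import Counter

# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN (SEQ dropped for brevity).
SAM_LINES = """\
sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe 83 Chromosome 3842 50 50M = 1612 -2280
sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe 163 Chromosome 1612 49 49M = 3842 2280
sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e 99 Chromosome 220 49 9M1D41M = 1612 1442
sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e 147 Chromosome 1612 50 50M = 220 -1442
"""


def tally_first_read_side(lines):
    """Count (side, strand) combinations over the /1 record of each mapped pair."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split()
        flag, pos, pnext = int(fields[1]), int(fields[3]), int(fields[7])
        if flag & 0x4 or flag & 0x8:
            continue  # read or mate unmapped: no orientation to speak of
        if not flag & 0x40:
            continue  # keep only the first segment (/1) of each pair
        # POS is the leftmost aligned base; compare it with the mate's POS.
        side = "left" if pos < pnext else "right"
        strand = "reverse" if flag & 0x10 else "forward"
        counts[(side, strand)] += 1
    return counts


counts = tally_first_read_side(SAM_LINES.splitlines())
for key, n in sorted(counts.items()):
    print(key, n)
```

If /1 were always the left mate, only "left" keys would show up. On these two pairs /1 appears once on the left (forward) and once on the right (reverse).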
Thank you,
I hope I've made myself understood (English is not my first language and I'm a poor informatician :P)
Edit: to be as exhaustive as possible, here is a situation that is driving me crazy. This is the output of the PASS aligner, with mates ordered by ID:
Code:
sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5 83 Chromosome 4547 50 50M = 1607 -2990 GACTACATCGGTTCCGGAGGGGAAACGAAGTATTTTTTATATGAGCATAA
sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5 163 Chromosome 1607 49 5M1D45M = 4547 2990 ACTCGTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTA
sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe 83 Chromosome 3842 50 50M = 1612 -2280 ATACCCGGATACAGCAAAAATCATACCTGTTAATTTTCCTACTGTCATTA
sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe 163 Chromosome 1612 49 49M = 3842 2280 GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAA
sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e 99 Chromosome 220 49 9M1D41M = 1612 1442 TAATAAATTGTCGTTTCTTATGCTATCATAGTTTTACATAAATTATTAAC
sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e 147 Chromosome 1612 50 50M = 220 -1442 GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA
sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b 83 Chromosome 4420 50 50M = 1612 -2858 AAGCGTTAAAAAGTGCGCTTTTTTACTTATATTATGTTATAATATAATAG
sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b 163 Chromosome 1612 50 50M = 4420 2858 GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA
sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d 83 Chromosome 4456 50 50M = 1617 -2889 TTATAATATAATAGGTAGGTGAATGAAGCGTATGAATCATTTTGAGTTAG
sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d 163 Chromosome 1617 50 50M = 4456 2889 CAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAAAATGA
As you can see, the first two mate pairs align so that the "first segment in the template" is reverse complemented (flag = 83, with bits 5 and 7, i.e. 0x10 and 0x40, set according to the SAM spec) and the "second segment" is forward aligned (flag = 163). The same holds for the hundreds of pairs preceding that point, so for the first 1612 bases I have /2 forward aligned and /1 reverse complemented.
The third mate pair in the example is different: flag = 99 means "this is the first segment and it is forward aligned", and flag = 147 means "this is the second segment and it is reverse complemented". So in this case /2 is reversed and /1 is forward.
After that, everything goes back to normal...
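To double-check my reading of the four flags, here is a small Python sketch that splits a SAM FLAG into its bits. The bit values follow the SAM specification; the short labels and the helper name `decode_flag` are my own, and only the lower eight bits are covered (secondary, QC-fail, duplicate and supplementary bits are left out):

```python
# (bit value, label) pairs for the lower eight SAM FLAG bits.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


for flag in (83, 163, 99, 147):
    print(flag, decode_flag(flag))
```

For these values, 83 and 99 both carry "first_in_pair" and differ in strand ("reverse" versus "mate_reverse"), while 163 and 147 are the matching "second_in_pair" records.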
This example makes me think that there is no strong relation between the /1 /2 tag and whether a read is the left or the right one.
If there were, how could I explain that /2 of the second mate pair aligns forward at position 1612 while /2 of the third mate pair aligns reversed at the same position?
The only possible explanation is that another region of my genome contains the same sequence reversed, but in that case I'd get multiple alignments for the read, and that is not the case here (the reads are sorted by ID, so I would notice...).
An example of a read with multiple alignments is this:
Code:
sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68 99 Chromosome 73696 50 50M = 76178 2532 ATTTATCGGTTTAAGAGGGGTCTGCGGCGCATTAGTTAGTTGGTGGGGTA
sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68 147 Chromosome 76178 49 50M = 73696 -2532 AATATATGCTAAGTGGAAACGGAAGTAGAGATGCACAAACAGCCAGGAGG
sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68 83 Chromosome 1204898 50 50M = 1202206 -2742 TACCCCACCAACTAACTAATGCGCCGCAGACCCCTCTTAAACCGATAAAT
sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68 163 Chromosome 1202206 49 50M = 1204898 2742 CCTCCTGGCTGTTTGTGCATCTCTACTTCCGTTTCCACTTAGCATATATT
In this case you can see that the same mate pair aligns correctly in two different parts of the genome: in the first alignment /1 is forward and /2 is reversed, while in the second alignment /1 is reversed and /2 is forward.
This is plausible, considering that the sequence around position 1200000 is probably (actually, I know it is) the same as the one around position 70000, but reverse complemented.
But if I don't know which of /1 and /2 is the left mate, I can't say where in my target genome the inversion is.
Anyway, do you know whether that read identifier can be split to get more information? I've noticed a regularity, as if the first two "boolean" values could tell which read is the left one (if "_0_1_" meant that the second read is the left one and "_1_0_" that the first read is the left one, my question would be solved). However, I have no documentation about that, and it does not match the Illumina FASTQ standards.
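Since the identifier format is undocumented, the most I can do is check the regularity empirically. The Python sketch below splits each ID on "_" and cross-tabulates the two "boolean" fields against the strand of the /1 record. The field positions are my guess from looking at the IDs, the helper names are made up, and the (ID, flag) pairs are the /1 records from the first listing:

```python
from collections import Counter

# (read ID, FLAG of the first-in-pair record) taken from the PASS output.
FIRST_SEGMENTS = [
    ("sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5", 83),
    ("sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe", 83),
    ("sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e", 99),
    ("sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b", 83),
    ("sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d", 83),
]


def id_booleans(read_id):
    """Return the 4th and 5th '_'-separated fields (assumed to be the two 'booleans')."""
    parts = read_id.split("_")
    return parts[3], parts[4]


def crosstab(records):
    """Count (ID booleans, strand of /1) combinations."""
    table = Counter()
    for read_id, flag in records:
        strand = "reverse" if flag & 0x10 else "forward"
        table[(id_booleans(read_id), strand)] += 1
    return table


table = crosstab(FIRST_SEGMENTS)
for key, n in sorted(table.items()):
    print(key, n)
```

On these five pairs every "_0_1_" ID goes with a reverse-aligned /1 and the single "_1_0_" ID with a forward-aligned /1. That is far too little data to prove anything, and a read with multiple alignments (like the one above) would show both strands for the same ID.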