  • Mate Pair orientation in illumina

    Hi everyone,
    I'm working on a university project on "resequencing" a small genome (the reference genome is laidlawii).
    I have the reference genome and two FASTQ files containing the reads of an Illumina mate-pair library from a target genome.
    To get to the point: I have been asked to generate a track for IGV representing the "percentage of oriented mates", and I can't work out which read in each pair is the left one and which is the right one.
    Each read in the two FASTQ files has an ID and is tagged /1 or /2: one file holds all the /1 reads and the other holds all the /2 reads.
    My question is whether there is a fixed relation between the tag and whether the read is the left or the right one.
    For alignment I use PASS (pass.cribi.unipd.it), which outputs a SAM file that records, among other things, whether a read aligned reverse-complemented (flag bit 0x10 set).
    In almost every pair one read aligns left->right while the other aligns reverse-complemented (maybe Illumina sequences the two ends from different strands?).

    To put it more simply: can I say that every mate pair with /1 aligning left->right and /2 aligning reverse-complemented is oriented left->right on the reference?
    And that in the opposite case the mate pair is reversed on the reference?
    (The assumption to prove is that /1 is always the left (or always the right) mate.)

    Thank you,
    I hope I've made myself understood (English is not my first language and I'm a poor informatician :P)


    Edit: to be as thorough as possible, here is a situation that is driving me crazy:
    Code:
    sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5	83	Chromosome	4547	50	50M	=	1607	-2990	GACTACATCGGTTCCGGAGGGGAAACGAAGTATTTTTTATATGAGCATAA	
    sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5	163	Chromosome	1607	49	5M1D45M	=	4547	2990	ACTCGTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTA	
    sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe	83	Chromosome	3842	50	50M	=	1612	-2280	ATACCCGGATACAGCAAAAATCATACCTGTTAATTTTCCTACTGTCATTA	
    sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe	163	Chromosome	1612	49	49M	=	3842	2280	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAA
    sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e	99	Chromosome	220	49	9M1D41M	=	1612	1442	TAATAAATTGTCGTTTCTTATGCTATCATAGTTTTACATAAATTATTAAC	
    sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e	147	Chromosome	1612	50	50M	=	220	-1442	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA	
    sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b	83	Chromosome	4420	50	50M	=	1612	-2858	AAGCGTTAAAAAGTGCGCTTTTTTACTTATATTATGTTATAATATAATAG	
    sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b	163	Chromosome	1612	50	50M	=	4420	2858	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA
    sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d	83	Chromosome	4456	50	50M	=	1617	-2889	TTATAATATAATAGGTAGGTGAATGAAGCGTATGAATCATTTTGAGTTAG	
    sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d	163	Chromosome	1617	50	50M	=	4456	2889	CAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAAAATGA
    This is the output of the PASS aligner, with mates ordered by ID.
    As you can see, the first two mate pairs align so that the "first segment in the template" is reverse-complemented (flag = 83, with bits 0x10 and 0x40 set according to the SAM spec) and the "second segment" is forward-aligned (flag = 163). The same holds for the hundreds of pairs before this point, so for the first 1612 bases I have /2 forward-aligned and /1 reverse-complemented.
    The third mate pair in the example is different: flag = 99 means "this is the first segment and it is forward-aligned", and flag = 147 means "this is the second segment and it is reverse-complemented". So in this case /2 is reversed and /1 is forward.
    After that, everything goes back to normal...
    This example makes me think that there is no fixed relation between the /1 /2 tag and whether a read is the left or the right one.
    If there were, how could I explain that /2 of the second mate pair aligns forward at position 1612 while /2 of the third mate pair aligns reverse-complemented at the same position?
    The only other explanation would be another region of my genome containing the same sequence reversed, but then I would see multiple alignments for the read, and I don't (the reads are sorted by ID, so I would notice...).
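For reference, the four FLAG values discussed above (83, 163, 99, 147) can be decoded bit by bit. The sketch below uses the bit meanings from the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for this example and do not come from PASS or any library.

```python
# Bit meanings as listed in the SAM specification (FLAG field).
# The dictionary and function names are hypothetical, chosen for this sketch.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The four flag values that appear in the alignments quoted in this thread.
for flag in (83, 163, 99, 147):
    print(flag, decode_flag(flag))
```

Read this way, 83 and 147 mark the read itself as reverse-complemented, while 163 and 99 mark the mate as reverse-complemented, so each pair in the example has exactly one read on each strand.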
    An example of a read with multiple alignments is this:
    Code:
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	99	Chromosome	73696	50	50M	=	76178	2532	ATTTATCGGTTTAAGAGGGGTCTGCGGCGCATTAGTTAGTTGGTGGGGTA
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	147	Chromosome	76178	49	50M	=	73696	-2532	AATATATGCTAAGTGGAAACGGAAGTAGAGATGCACAAACAGCCAGGAGG
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	83	Chromosome	1204898	50	50M	=	1202206	-2742	TACCCCACCAACTAACTAATGCGCCGCAGACCCCTCTTAAACCGATAAAT	
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	163	Chromosome	1202206	49	50M	=	1204898	2742	CCTCCTGGCTGTTTGTGCATCTCTACTTCCGTTTCCACTTAGCATATATT
    Here the same mate pair aligns correctly at two different places in the genome: in the first alignment /1 is forward and /2 is reverse-complemented, while in the second /1 is reverse-complemented and /2 is forward.
    This is plausible because the sequence around position 1200000 is (I know it is) the same as the sequence around position 70000, but reverse-complemented.
    But if I don't know which of /1 and /2 is the left mate, I can't say where in my target genome the inversion is.
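The reverse-complement relation between the two loci can be checked directly on the SEQ fields quoted above, since SAM stores SEQ on the reference forward strand. This is a minimal sketch; `reverse_complement` and the variable names are hypothetical helpers written for this example.

```python
# Translation table for DNA bases; anything outside ACGT (e.g. N) is left as-is.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# SEQ of the first-in-pair read at its two reported loci (copied from the SAM lines above).
seq_at_73696 = "ATTTATCGGTTTAAGAGGGGTCTGCGGCGCATTAGTTAGTTGGTGGGGTA"
seq_at_1204898 = "TACCCCACCAACTAACTAATGCGCCGCAGACCCCTCTTAAACCGATAAAT"

# True means the two loci carry the same sequence on opposite strands.
print(reverse_complement(seq_at_73696) == seq_at_1204898)
```

The same check can be run on the second-in-pair SEQ fields at 76178 and 1202206.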

    Anyway, do you know whether the read identifier can be split to extract more information? I've noticed a regularity: the first two "boolean" values might say which read is the left one (if "_0_1_" meant that the second read is the left one and "_1_0_" that the first read is the left one, my question would be solved). However, I have no documentation for this, and the IDs do not match the standard Illumina FASTQ format.
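One possibility, offered strictly as an assumption: the ID layout resembles the read-name scheme documented for the DWGSIM read simulator (contig, start of one end, start of the other end, a strand value for each end with 0 = forward and 1 = reverse, two random-read flags, two error-count triplets, and a hex counter). Nothing in this thread confirms which tool produced the reads, so the field names below are hypothetical, and which end corresponds to /1 is deliberately left open.

```python
from typing import NamedTuple


class SimReadName(NamedTuple):
    """Fields of a simulated read name, ASSUMING a DWGSIM-like layout (unconfirmed)."""
    contig: str
    start_a: int    # start coordinate of one end
    start_b: int    # start coordinate of the other end
    strand_a: int   # assumed: 0 = forward, 1 = reverse
    strand_b: int
    counter: str    # trailing hex counter


def parse_sim_read_name(name):
    """Split a read name from the right so the contig itself may contain underscores."""
    parts = name.rsplit("_", 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected read name layout: {name!r}")
    contig, start_a, start_b, strand_a, strand_b = parts[:5]
    return SimReadName(contig, int(start_a), int(start_b),
                       int(strand_a), int(strand_b), parts[9])


print(parse_sim_read_name("sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5"))
```

If the assumption holds, the two "boolean" values would be per-end strands, which is at least consistent with the "_0_1_" versus "_1_0_" pattern noted above. It would still need to be checked against the simulator's documentation or against whoever supplied the data.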
    Last edited by d3mux; 08-10-2014, 05:36 AM.

  • #2
    Do you mean paired-end or mate pair? For Illumina technology, the orientation of the two reads relative to each other is different for paired end and mate pair.

    The read identifiers in your sam file don't look like typical Illumina read IDs.

    Anyways, if your reads are paired end Illumina reads, it is just random whether the reads in file /1 align to the + strand or the - strand, some of the /1 reads will align to one strand, and some to the other strand.
    The reads in file /2 will align to the opposite strand from the paired read in file /1.

    • #3
      Originally posted by mastal
      Do you mean paired-end or mate pair? For Illumina technology, the orientation of the two reads relative to each other is different for paired end and mate pair.

      The read identifiers in your sam file don't look like typical Illumina read IDs.

      Anyways, if your reads are paired end Illumina reads, it is just random whether the reads in file /1 align to the + strand or the - strand, some of the /1 reads will align to one strand, and some to the other strand.
      The reads in file /2 will align to the opposite strand from the paired read in file /1.
      I was told Mate Pair (I also received written instructions).
      Well... if there's no relation between /1 /2 and orientation, I really wonder how I can decide which orientation my mate pairs have...
      About the reads, I can't rule out that they were generated artificially (by simulation on a sample genome).
      Thanks for the reply

      • #4
        Mate-pair reads should be RF (reverse-forward, i.e. the two reads point away from each other). That being said, MP libraries generally contain significant contamination from paired-end (FR) fragments. Check out NextClip and the Illumina technical bulletin on MP library analysis.
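A rough way to get at the "percentage of oriented mates" is to classify each pair by position and strand instead of by the /1 /2 tag. The sketch below is one possible definition, not the method of any particular tool: a pair is FR when the forward-strand read starts at or left of the reverse-strand read, RF when it starts to the right, and TANDEM when both reads are on the same strand. It assumes both mates map to the same reference sequence and compares leftmost coordinates only. `pair_orientation` is a hypothetical helper.

```python
from collections import Counter


def pair_orientation(flag, pos, pnext):
    """Classify a pair as FR, RF or TANDEM from one record's FLAG, POS and PNEXT.

    Returns None if the read is unpaired or either mate is unmapped.
    Assumes both mates are on the same reference (RNEXT is "=").
    """
    if not flag & 0x1 or flag & 0x4 or flag & 0x8:
        return None
    read_reverse = bool(flag & 0x10)
    mate_reverse = bool(flag & 0x20)
    if read_reverse == mate_reverse:
        return "TANDEM"
    # Work out where the forward-strand and reverse-strand reads start.
    forward_pos = pnext if read_reverse else pos
    reverse_pos = pos if read_reverse else pnext
    return "FR" if forward_pos <= reverse_pos else "RF"


# (FLAG, POS, PNEXT) of the first-in-pair records from the first SAM excerpt in this thread.
# Using one record per pair avoids counting each pair twice.
records = [(83, 4547, 1607), (83, 3842, 1612), (99, 220, 1612),
           (83, 4420, 1612), (83, 4456, 1617)]

counts = Counter(pair_orientation(*rec) for rec in records)
total = sum(counts.values())
for label, n in counts.items():
    print(f"{label}: {100.0 * n / total:.1f}%")
```

Under this definition every pair in that excerpt classifies as FR regardless of which read carries /1, so the orientation call does not depend on the tag. Aggregating the same counts over windows along the reference would give values that could be written out as a track for IGV.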
