  • How to tag reads for alignment

    Hi all,

    I have raw FASTQ files of paired-end short reads for two samples. I would like to tag the reads of each sample with an identifier, pool them into a single FASTQ file, and then perform alignment and variant calling. Below is a detailed explanation of what I want to achieve.

    In general, we align the reads of the two samples independently, add read groups to the BAM files using samtools or Picard tools, and perform variant calling using GATK or samtools. The variant caller then treats them as two different samples based on the read group information.

    Instead, I would like to tag the reads of both samples with two different read groups before alignment, which would produce a single BAM file carrying the read group information of both samples. This BAM file would then be used for variant calling, where the caller treats it as two samples based on the read group information.

    Could anyone help?

  • #2
    Hi Meher,
    That may need a small script. Please copy and paste the first sequence from both files so that we can get an idea of how to help you.



    • #3
      Originally posted by Apexy:
      Hi Meher,
      That may need a small script. Please copy and paste the first sequence from both files so that we can get an idea of how to help you.

      Hi,

      I can provide a few lines, but before that I have a question.

      Does it make any difference to the final alignment result if we tag the reads, align them together, and generate a single BAM file, compared with aligning each sample independently and merging the two BAM files into a single BAM file?

      Would there be any bias in the alignment if we choose one method over the other?



      • #4
        A better way is to perform the alignments separately, assigning unique read group IDs (some aligners, e.g. bowtie, will add read group IDs during alignment), and then merge the BAM files before proceeding to variant detection. Pay attention to the header attached to the merged output: every read group ID present in the file must be referenced in the header. samtools merge does not handle this automatically; you have to supply a properly formatted header. I'm not sure whether Picard MergeSamFiles properly merges the header.

        But I do wonder why you want to do this. GATK does not require merged BAM files; from the GATK Best Practices document:

        Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams.
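To make the header point concrete, here is a rough Python sketch (it is not taken from any of the posts). It assumes the two headers are available as plain SAM header text, for example as printed by `samtools view -H`, and that both samples were aligned to the same reference. The function names `read_group_ids`, `merge_headers` and `unreferenced_read_groups` are made up for illustration.

```python
def read_group_ids(header_text):
    """Return the set of ID values found on @RG lines of a SAM header."""
    ids = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.add(field[3:])
    return ids


def merge_headers(header_a, header_b):
    """Keep every line of header_a and append the @RG lines of header_b
    whose ID is not already present.

    Assumption: both files share one reference, so the @SQ lines of
    header_a are valid for both.
    """
    merged = header_a.splitlines()
    seen = read_group_ids(header_a)
    for line in header_b.splitlines():
        if line.startswith("@RG"):
            new_ids = read_group_ids(line) - seen
            if new_ids:
                merged.append(line)
                seen |= new_ids
    return "\n".join(merged) + "\n"


def unreferenced_read_groups(header_text, alignment_lines):
    """Return RG tag values used by alignment records but absent from the header."""
    known = read_group_ids(header_text)
    used = set()
    for line in alignment_lines:
        # Optional tags start at the 12th tab-separated column of a SAM record.
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("RG:Z:"):
                used.add(field[5:])
    return used - known
```

The merged text could be written to a file and passed to the merge step (samtools merge accepts a header file via `-h`). An empty result from `unreferenced_read_groups` on the merged output is a quick sanity check that every read group in the records is declared in the header.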



        • #5
          Hello Meher,
          I do not think it matters, provided the insert size is expected to be the same in both samples. At least with bowtie, all you need is to tell it which file is which (using -1 and -2). However, you must pay particular attention to the header information when merging. There is an extensive manual here
          Last edited by Apexy; 11-16-2012, 06:07 AM.



          • #6
            Originally posted by kmcarr:
            A better way is to perform the alignments separately, assigning unique read group IDs (some aligners, e.g. bowtie, will add read group IDs during alignment), and then merge the BAM files before proceeding to variant detection. Pay attention to the header attached to the merged output: every read group ID present in the file must be referenced in the header. samtools merge does not handle this automatically; you have to supply a properly formatted header. I'm not sure whether Picard MergeSamFiles properly merges the header.

            But I do wonder why you want to do this. GATK does not require merged BAM files; from the GATK Best Practices document:

            Yes, merging the BAMs is not required. What I actually want to accomplish is to detect the variants from the two samples in a single VCF file and infer how much each sample contributes to the depth at a variant (i.e. if a variant has depth 100, I would like to find how many of those reads came from each sample). Performing multi-sample variant calling on the two BAM files using GATK will accomplish this.

            But I would really like to know whether the approach described above could introduce any bias compared with tagging the reads before alignment, doing a single alignment, and then performing variant calling.

            Which of the two would avoid such biases, if there are any?
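As an illustration of the "how many reads came from each sample" idea, here is a small Python sketch that tallies reads by the sample named in their read group. It is only a sketch under stated assumptions: the input is SAM text already restricted to reads overlapping the variant position (for example the lines `samtools view` prints for a region of an indexed BAM), and the names `sample_by_read_group` and `depth_per_sample` are hypothetical. It counts overlapping reads, not bases supporting an allele, so the numbers can differ from what a variant caller reports after its own filtering.

```python
from collections import Counter


def sample_by_read_group(header_text):
    """Map each @RG ID in a SAM header to its SM (sample) value."""
    mapping = {}
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            # Fall back to the read group ID when no SM field is present.
            mapping[fields["ID"]] = fields.get("SM", fields["ID"])
    return mapping


def depth_per_sample(header_text, overlapping_records):
    """Count SAM records per sample, using each record's RG:Z: tag."""
    rg_to_sample = sample_by_read_group(header_text)
    counts = Counter()
    for line in overlapping_records:
        # Optional tags start at the 12th tab-separated column.
        tags = line.rstrip("\n").split("\t")[11:]
        rg = next((t[5:] for t in tags if t.startswith("RG:Z:")), None)
        counts[rg_to_sample.get(rg, "unknown")] += 1
    return dict(counts)
```

Reads without an RG tag, or with one that the header does not declare, end up under "unknown", which also makes header problems visible.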



            • #7
              Originally posted by Apexy:
              Hello Meher,
              I do not think it matters, provided the insert size is expected to be the same in both samples. At least with bowtie, all you need is to tell it which file is which (using -1 and -2). However, you must pay particular attention to the header information when merging. There is an extensive manual here

              Hi, anyway, these are the first few lines:
              sample1_1.fastq

              @HWI-ST188:1:1101:1225:2112#0/1
              AGANAGTAAGTAAAATCTATTATGATATTCTTATAAAGAAAAGCCCACTTTTGAAGATTTCAGAAGTGCTTCTAAAGGAGGTAGCGCGGCATAATACTGGG
              +
              Z^_BS\ccgg`eghhhhhhhhhhhhhhhhhhhhhhhhgggdcfhhhhhhhhhdhghhfhbghhff]]egfdghf]cdgfbdTZacebbababb_bb]`cb`
              @HWI-ST188:1:1101:1221:2160#0/1
              TTCNAATAAAATAAATAAAAGATGAGATGAATATTCATTTTGACTTCATTTTCTACTTTTTTTTCAGAATACTTAAAGTTTGAGAGAAATGTGAGACAACT
              +
              __bBS`ccggcggiihhfghicghhiieghihehihfibghifhehhffhiiiiffghiiiiihdggg_b`bddbbcbabdd`_`bc``Y_bbZ_T_^BBB


              sample1_2.fastq

              @HWI-ST188:1:1101:1225:2112#0/2
              ATGAATCAGATTGAAAATGCAAACTGTGACATGAGGCAGAGGCATTTATTTTATTTNGTGGGGAATCGGGAAAGGAAATTGCTAGGTTTCTGCAGCCCCAG
              +
              bbbeeeeegffgcgifhhihihiiif`agh`ghifhhhhiiihhcffXagXcce_cBL[Z_eaghfeedcS\^`dcbZZZ`b`bY^T]_bb]RGYba^[^_
              @HWI-ST188:1:1101:1221:2160#0/2
              TTAAATCTTAAAAGTGTATGTAAAAATGTTCAAAATATTAGTTTTCTTTAAATTTTNGTAGAAAAGGCATTATCTTCACATTAAGTGACATGAGATAACGC
              +
              bbbeeeeegggfghQbK`hhbigiiieh[ddgdgfhbgfffS^fddgiiidXaeSXBOO^eg`efbghfYHWbee_cffgccV`g]b_gHZZZZ^Y_bBBB
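The read names above differ only in their trailing /1 and /2, which is how the two files of one paired-end sample line up. A minimal Python sketch of that check follows; `mate_key` and `mates_in_sync` are hypothetical helper names, and the sketch assumes the mate suffix is always written as /1 or /2 at the end of the name, as in these records.

```python
def mate_key(header_line):
    """Return the read name without the leading '@' and the /1 or /2 suffix."""
    name = header_line.strip()
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def mates_in_sync(headers_1, headers_2):
    """True if both files list the same reads in the same order."""
    headers_1 = list(headers_1)
    headers_2 = list(headers_2)
    if len(headers_1) != len(headers_2):
        return False
    return all(mate_key(a) == mate_key(b) for a, b in zip(headers_1, headers_2))


# Read names copied from the two files shown above.
forward = ["@HWI-ST188:1:1101:1225:2112#0/1", "@HWI-ST188:1:1101:1221:2160#0/1"]
reverse = ["@HWI-ST188:1:1101:1225:2112#0/2", "@HWI-ST188:1:1101:1221:2160#0/2"]
print(mates_in_sync(forward, reverse))  # True
```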



              • #8
                Hello,

                I somehow forgot to reply to this. If you want to merge two FASTQ files, use the attached script. However, I cannot relate the information you gave (two samples) to the sequences you provided. If you have two paired-end samples, you should in fact have four files. I think you have just provided forward sequences from sample1_1.fastq and reverse sequences from sample1_2.fastq.
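The attached script is not reproduced in this thread, so the following is only a guess at the general idea, not the original attachment. It is a Python sketch that prefixes a sample label to every read name and pools several FASTQ inputs into one output. It assumes strict four-line FASTQ records with no blank lines; `tag_fastq` and `pool_fastq` are hypothetical names. The label is placed at the front of the name so that the /1 and /2 mate suffixes stay at the end.

```python
def tag_fastq(lines, sample):
    """Yield FASTQ lines with `sample` prefixed to each read name.

    The header is identified by position (every 4th line), not by its
    leading '@', because quality strings may also start with '@'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1}: expected a FASTQ header, got {line!r}")
            line = f"@{sample}:{line[1:]}"
        yield line + "\n"


def pool_fastq(inputs, out):
    """Write all inputs to `out`; `inputs` is an iterable of (sample, lines) pairs."""
    for sample, lines in inputs:
        out.writelines(tag_fastq(lines, sample))


# Possible usage with hypothetical file names, run once for the _1 files
# and once for the _2 files so that mates stay in the same order:
#
# with open("sample1_1.fastq") as a, open("sample2_1.fastq") as b, \
#         open("pooled_1.fastq", "w") as out:
#     pool_fastq([("sample1", a), ("sample2", b)], out)
```

A label in the read name is not a read group by itself. A later step would still have to turn it into RG tags and matching @RG header lines, which is one reason the separate-alignment route suggested earlier in the thread is simpler.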
