  • mboursnell
    Member
    • Jul 2012
    • 17

    Data Processing Pipeline question

    Can anyone suggest an optimum data processing pipeline for analysing dog next-gen sequencing data? We start with BAM files, and generally have about 5 cases and 5 controls (10 samples = 10 BAM files). We don't want to ignore known SNPs, since different breeds have different SNPs.

    Is there a way to analyse all the BAM files in parallel so that information from all of them can be used in producing aligned cleaned deduped BAM files?
  • Rocketknight
    Member
    • Sep 2011
    • 86

    #2
    There is indeed, if you're using GATK. Make sure every BAM file has a unique read group (you can use Picard's AddOrReplaceReadGroups function), mark duplicates with Picard, then merge all the BAM files into one (each read will retain its read group identifier, so you can distinguish them later).

    After that, proceed with the standard GATK pipeline on this single merged BAM file. This has several advantages over single-sample processing: Firstly, novel indels from one sample can be used to help realignment in other samples. Secondly, you can call variants for all samples simultaneously with GATK's UnifiedGenotyper. You can then do VQSR on this multi-sample VCF file, which allows you to use population-level information (InbreedingCoeff) to find false-positive SNPs.
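As a rough illustration of the per-sample preparation described above (unique read groups, then duplicate marking), here is a minimal Python sketch that only builds the Picard command lines. It assumes a single-jar Picard build invoked as `java -jar picard.jar <Tool>` with the classic `KEY=value` argument style; the jar path, the `prep` output directory, the file naming, and the placeholder library/platform values are all hypothetical and should be replaced with your own.

```python
PICARD_JAR = "picard.jar"  # assumption: path to a single-jar Picard build


def read_group_cmd(bam, sample, out_dir="prep"):
    """Build an AddOrReplaceReadGroups command giving one BAM its own read group."""
    out = f"{out_dir}/{sample}.rg.bam"
    return [
        "java", "-jar", PICARD_JAR, "AddOrReplaceReadGroups",
        f"I={bam}",
        f"O={out}",
        f"RGID={sample}",   # read group ID, unique per BAM file
        f"RGSM={sample}",   # sample name, used later to tell samples apart
        "RGLB=lib1",        # placeholder library name
        "RGPL=illumina",    # placeholder platform
        f"RGPU={sample}",   # placeholder platform unit
    ]


def mark_duplicates_cmd(sample, out_dir="prep"):
    """Build a MarkDuplicates command for the read-grouped BAM of one sample."""
    return [
        "java", "-jar", PICARD_JAR, "MarkDuplicates",
        f"I={out_dir}/{sample}.rg.bam",
        f"O={out_dir}/{sample}.dedup.bam",
        f"M={out_dir}/{sample}.dup_metrics.txt",  # duplication metrics file
    ]


# Print the commands for two hypothetical samples instead of running them.
for name in ["case1", "control1"]:
    print(" ".join(read_group_cmd(f"{name}.bam", name)))
    print(" ".join(mark_duplicates_cmd(name)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths are real. Building the commands separately from running them makes it easy to inspect what will be executed first.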

    • mboursnell
      Member
      • Jul 2012
      • 17

      #3
      Thanks. Do I use picard/MergeSamFiles to do the merging?

      • Rocketknight
        Member
        • Sep 2011
        • 86

        #4
        Yep, that will work. (Edit: Make sure you have SORT_ORDER=coordinate set when you merge, as GATK will expect your BAMs to be coordinate-sorted.)
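The merge step with the sort order set could be sketched as below. This again only builds the command line and assumes the same `java -jar picard.jar <Tool>` invocation style; the jar path and the file names are hypothetical.

```python
def merge_cmd(bams, merged="merged.bam", picard_jar="picard.jar"):
    """Build a MergeSamFiles command that writes one coordinate-sorted BAM."""
    if not bams:
        raise ValueError("need at least one input BAM")
    cmd = ["java", "-jar", picard_jar, "MergeSamFiles"]
    cmd += [f"I={bam}" for bam in bams]  # one I= argument per input file
    cmd += [f"O={merged}", "SORT_ORDER=coordinate"]
    return cmd


# Hypothetical deduplicated inputs from the per-sample steps.
inputs = ["prep/case1.dedup.bam", "prep/control1.dedup.bam"]
print(" ".join(merge_cmd(inputs)))
```

Because each read keeps its read group, the samples remain distinguishable inside the merged file.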

        • mboursnell
          Member
          • Jul 2012
          • 17

          #5
          Do you use Queue.jar and the DataProcessingPipeline.scala file to run the standard GATK pipeline, or do you make your own pipeline (e.g. in Perl) to do the same thing?

          • Rocketknight
            Member
            • Sep 2011
            • 86

            #6
            I tinkered with GATK-Queue, but I had a couple of problems (and I'm not too familiar with Java/Scala), so in the end I just went with a simple Python script to run everything. I used Python's multiprocessing module to run multiple samples at once, which takes advantage of multiple cores without having to split single samples by region and recombine them. This won't be possible if you're merging all your BAM files into one, though (unless you have several multi-sample BAM files you'd like to process concurrently).
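The script itself is not posted in this thread, so the following is not that script. It is only a minimal sketch of the general idea of running several samples at once with Python's `multiprocessing` module. The per-sample steps are `echo` placeholders, and the function names and sample names are hypothetical.

```python
import subprocess
from multiprocessing import Pool


def sample_commands(sample):
    """Hypothetical per-sample steps; swap in real tool invocations."""
    return [
        ["echo", f"step 1 for {sample}"],
        ["echo", f"step 2 for {sample}"],
    ]


def run_sample(sample):
    """Run one sample's steps in order; check=True stops at the first failure."""
    for cmd in sample_commands(sample):
        subprocess.run(cmd, check=True)
    return sample


def run_all(samples, workers=4):
    """Process several samples concurrently, one sample per worker process."""
    with Pool(processes=workers) as pool:
        return pool.map(run_sample, samples)
```

From a script, this would be called as something like `run_all(["case1", "case2", "control1"], workers=2)` inside an `if __name__ == "__main__":` guard. Steps within a sample stay sequential; only whole samples run in parallel, which is why the approach stops helping once everything is merged into a single BAM.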

            • mboursnell
              Member
              • Jul 2012
              • 17

              #7
              Would it be possible to have a look at your Python script to help me set up my Perl script? [email protected]

              • Rocketknight
                Member
                • Sep 2011
                • 86

                #8
                Sure thing, just sent it there.

                • angelinasusan
                  Junior Member
                  • Dec 2012
                  • 3

                  #9
                  Could I take a look at your script? I badly need some help with a pipeline I am building, and this would be very helpful. My ID: [email protected]

                  • mrood
                    Junior Member
                    • Feb 2013
                    • 5

                    #10
                    script please?

                    Hi, would anyone be willing to send me their script to look at? I am new to programming and would love an example to build my own off of! [email protected]
                    Thanks in advance!

                    • Jeremy
                      Senior Member
                      • Nov 2009
                      • 190

                      #11
                      You could also use samtools mpileup and vcftools; mpileup treats each BAM file as a separate sample.
