  • Running HaplotypeCaller in small buckets

    Hi all,

    I am going to be running HaplotypeCaller on a bunch of genomes. I estimate that each sample will take around 24 hours to run on our machines, which is too long. I was thinking of splitting the genome into smaller batches (say, individual chromosomes) and running the caller on each one separately. This way I can use multiple machines and many cores at once, which should greatly speed up the process. My question is: after everything has run, I will have many gVCF files for each sample. Will GenotypeGVCFs be able to handle all of those files? Specifically, is the program smart enough to match the sample names in the gVCF headers and merge the individual files that way? Or should I first combine all of the files for each sample, then run GenotypeGVCFs on the merged files?

    Thanks for any help

  • #2
    Hi Ire1234,
    1) If you have "a bunch of genomes", why aren't you running the genomes in parallel instead of the chromosomes of each genome? This would give the same speed improvement without the need to split anything, wouldn't it?
    2) I don't know whether GenotypeGVCFs can handle these multiple files (though I doubt it), but file concatenation would be an easy and straightforward approach here.



    • #3
      Thanks. I am looking at human genomes. I can run the jobs across multiple machines in a cluster, and I thought that breaking each genome into smaller pieces and running many of them at the same time would speed everything up.

      Your point #2 was really my question: can GenotypeGVCFs handle multiple files from the same individual? I have been looking around but haven't been able to find an answer. Perhaps concatenating the files first would be the best approach.



      • #4
        Let's just assume the following:
        You have 24 genomes with 24 chromosomes each, and you can use 24 nodes. Why would it be faster to run 1 genome with 24 chromosomes on each node compared with running 24 × 1 chromosome on each node? The latter will actually be slower. Or do I misunderstand your question?
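
One possible reading of the comparison above can be put into numbers. The sketch below is only an illustration under stated assumptions: runtime is taken to scale with chromosome length, the chromosome sizes are rounded approximations, and in the split strategy each genome is assumed to finish completely (all 24 chromosome jobs) before the next genome starts. None of these assumptions come from the thread except the 24-hour estimate and the 24 genomes / 24 nodes setup.

```python
from typing import Dict

# Approximate human chromosome lengths in megabases (rounded values,
# used here only as an illustrative assumption).
CHROM_MB: Dict[str, int] = {
    "1": 248, "2": 242, "3": 198, "4": 190, "5": 182, "6": 171,
    "7": 159, "8": 145, "9": 138, "10": 134, "11": 135, "12": 133,
    "13": 114, "14": 107, "15": 102, "16": 90, "17": 83, "18": 80,
    "19": 59, "20": 64, "21": 47, "22": 51, "X": 156, "Y": 57,
}

HOURS_PER_GENOME = 24.0  # per-sample estimate from the first post
N_GENOMES = 24           # setup proposed in this post
N_NODES = 24


def whole_genome_per_node_hours() -> float:
    """Strategy A: each node runs one whole genome; all nodes run in parallel."""
    rounds = -(-N_GENOMES // N_NODES)  # ceiling division
    return rounds * HOURS_PER_GENOME


def split_by_chromosome_hours() -> float:
    """Strategy B: one genome at a time, one chromosome per node.

    Each genome takes as long as its slowest (longest) chromosome,
    assuming runtime is proportional to chromosome length.
    """
    total_mb = sum(CHROM_MB.values())
    longest_mb = max(CHROM_MB.values())
    hours_per_genome = HOURS_PER_GENOME * longest_mb / total_mb
    return N_GENOMES * hours_per_genome


if __name__ == "__main__":
    print(f"one genome per node:     {whole_genome_per_node_hours():.1f} h")
    print(f"one chromosome per node: {split_by_chromosome_hours():.1f} h")
```

Under these assumptions the split strategy loses time because nodes holding short chromosomes sit idle while the longest one finishes. If all chromosome jobs from all genomes were queued at once, a scheduler could fill those idle slots and the gap would shrink.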



        • #5
          So essentially, I would be running all 24 chromosomes simultaneously on one machine, which should be faster than running the whole genome as a single job on one node. Also, the GATK docs say that there may be issues with using -nct to multithread the run.

          At the end, for one subject, I will have 24 gVCF files (one per chromosome). The question is about the next step, GenotypeGVCFs: would I have to concatenate all of the individual gVCFs first, or can I run GenotypeGVCFs on all of the individual files and have it match the sample across them?
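
For what it's worth, the split-then-concatenate route described here could be scripted roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes GATK 3.x-style command lines (the thread mentions -nct, which suggests that generation), and the jar path, file names, and chromosome names are all hypothetical placeholders. The flags are written as I understand them from the GATK 3 documentation and should be checked against the installed version. The script only builds the command lines; it does not run them, and it does not settle whether GenotypeGVCFs would accept the per-chromosome files directly.

```python
from typing import List

# Hypothetical contig names; they must match the names in the reference
# dictionary (for example "chr1" instead of "1" for some references).
CHROMOSOMES: List[str] = [str(i) for i in range(1, 23)] + ["X", "Y"]

GATK_JAR = "GenomeAnalysisTK.jar"  # hypothetical path to a GATK 3.x jar


def gvcf_path(sample: str, chrom: str) -> str:
    """Per-chromosome gVCF file name (naming scheme is an assumption)."""
    return f"{sample}.{chrom}.g.vcf"


def haplotypecaller_cmd(sample: str, bam: str, reference: str, chrom: str) -> List[str]:
    """HaplotypeCaller in gVCF mode, restricted to one chromosome with -L."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", chrom,
        "--emitRefConfidence", "GVCF",
        "-o", gvcf_path(sample, chrom),
    ]


def catvariants_cmd(sample: str, reference: str) -> List[str]:
    """Concatenate one sample's per-chromosome gVCFs in reference order."""
    cmd = [
        "java", "-cp", GATK_JAR,
        "org.broadinstitute.gatk.tools.CatVariants",
        "-R", reference,
    ]
    for chrom in CHROMOSOMES:
        cmd += ["-V", gvcf_path(sample, chrom)]
    cmd += ["-out", f"{sample}.g.vcf", "-assumeSorted"]
    return cmd


if __name__ == "__main__":
    # Hypothetical sample, BAM, and reference names.
    for chrom in CHROMOSOMES:
        print(" ".join(haplotypecaller_cmd("sampleA", "sampleA.bam", "ref.fasta", chrom)))
    print(" ".join(catvariants_cmd("sampleA", "ref.fasta")))
```

Each printed HaplotypeCaller line is an independent job that can be handed to a cluster scheduler; the final line would be run once per sample after all 24 jobs have finished, giving one gVCF per sample as input to GenotypeGVCFs.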



          • #6
            Have you thought about using Platypus for this task instead of HaplotypeCaller?

            Both are haplotype-based callers, but Platypus is multiple times faster.

