Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pipeline to simulate short reads and call indels

    Hi

    http://biotechies.blogspot.com/2011/...ct-indels.html is a kind of howto for doing the following pipeline:

    1. simulate short reads (wgsim)

    2. align short reads
    2.1. using BWA
    2.2. using Bowtie2

    3. call indels
    3.1. using GATK UnifiedGenotyper
    3.2. using Dindel

    I initially started to write it for my own usage to remember command calls but it may help others since I've spent a lot of time discovering software parameters and pipelines.

    Please help me to correct and complete the post as much as possible. Any suggestion is very welcome too.

    Regards.
    @pascal_biotech

  • #2
    Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

    For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

    In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

    So the bwa/samtools command woud be:

    bwa sampe -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS - > out.bam

    Comment


    • #3
      wow :-O
      Thanks for taking the time to read it and for your very interesting comments. I will modify the post accordingly.

      I believe that wgsim is limited compared to maq simulator and I plan to add a section along wgsim to simulate reads using MAQ. What do you thing about that?
      @pascal_biotech

      Comment


      • #4
        Personally, It's a lot more fun to learn with real data, like off of SRA, but it might be hard to find a data set that's suitable for practice.

        Comment


        • #5
          I confirm your suspicion: it is not that easy to find a data sets suitable for practicing. Simulated data sets is a good complementary source of data.

          Of course you have data sets from 1000g project and co. but I also like very much the flexibility of simulated data sets (eg. indels are well known, you can have as many SVs as you want etc.).
          @pascal_biotech

          Comment


          • #6
            Originally posted by swbarnes2 View Post
            Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

            For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

            In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

            So the bwa/samtools command woud be:
            One comment I'll add is that if you intend to sort the bam file with samtools sort after generating it, you can add the -u option to form an uncompressed bam file. After sorting with samtools it will be compressed. This will cut down on the time of generating the bam file.

            Comment


            • #7
              Thanks Heisman for your comment.

              If I have understood correctly you propose to generate the .bam file in an uncompressed format:

              bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS -u - > out.bam

              because the command

              samtools sort out.bam out_sorted

              will compress it anyway? So it will save time on computing the sort?
              @pascal_biotech

              Comment


              • #8
                Yes, correct. Very simple to test too if you have a stopwatch.

                Comment


                • #9
                  Thank you Sir! I added your comment into the post.
                  @pascal_biotech

                  Comment


                  • #10
                    Can dindel be run on Exome data?

                    Im trying to use dindel on exome human data but am getting about 100 indels/exome. I would like to know if there is something wrong with my pipeline (setting parameter values incorrectly; using re-calibrated, de-duped bams as opposed to using the raw bams, etc..) or if dindel cannot be used on exome data at all.

                    Any information is helpful.

                    Comment


                    • #11
                      Is there a reason you leave out the realignment step (RealignerTargetCreator and IndelRealigner) when using UnifiedGenotyper? The step is recommended in Broad's "best practices" (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v3), and Dindel does the equivalent using the result of getCIGARindels.

                      Comment


                      • #12
                        I have done the realignment step too. The SNPs using GATK pipeline look ok but indels do not. I have the intermediate files and am wondering what indel caller to use?

                        Comment


                        • #13
                          @Alex Renwick: I am not sure whether it helps to do realignment around known indels here (I guess you refer to the steps described in http://www.broadinstitute.org/gsa/wi..._around_indels) as this pipeline starts from simulating reads and simulating indels. My understanding is that indels are scattered randomly so there are probably not "known" sites and not referenced in dbSNP or 1000genomes databases.

                          Please let me know if I misunderstood it and if local realignment is necessary.
                          @pascal_biotech

                          Comment


                          • #14
                            Originally posted by pmaugeri View Post
                            @Alex Renwick: I am not sure whether it helps to do realignment around known indels here ... as this pipeline starts from simulating reads and simulating indels...
                            The Broad documentation does seem to focus on the issue of handling known sites, but you can use the realigner to deal with novel indels.

                            RealignerTargetCreator scans the bam file for likely indel regions (probably looking at the CIGAR string) and then IndelRealigner performs a local alignment in those areas. The local alignment takes all reads into account at once, which requires more computation and finds a better result than the usual practice of instead of handling each read independently.

                            You can use wgsim_eval.pl to compare the correctness of the bam file before and after realignment, and you'll see read placement improve.

                            Comment


                            • #15
                              @Alex,

                              I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

                              Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

                              So is it worth doing local realignment steps prior to call for indels ON simulated data?

                              Or maybe I just did something wrong in the realignement steps?!

                              Here are my commands:

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

                              java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

                              Regards.
                              @pascal_biotech

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                09-09-2024, 10:59 AM
                              • seqadmin
                                Addressing Off-Target Effects in CRISPR Technologies
                                by seqadmin






                                The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                                08-27-2024, 04:44 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 09-11-2024, 02:44 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-06-2024, 08:02 AM
                              0 responses
                              146 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-03-2024, 08:30 AM
                              0 responses
                              153 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 08-27-2024, 04:40 AM
                              0 responses
                              163 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X