Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • pipeline for ion torrent data

    Hi,
    I just received one ion torrent data to test (sample.fastq). I wanted to have the community's point of view on what I am doing.

    1/ aligning the reads
    -------------------------
    I use:
    $bwa -aln hg19ref.fa sample.fastq > sample.sai
    $bwa samse -r "@RG\tID:IDa\tSM:sample\tPL:PL" hg19ref.fa sample.sai sample.fastq > sample.sam
    $java -Xmx4g -jar picard/SortSam.jar SO=coordinate INPUT=sample.sam OUTPUT=sample.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

    looking to flagstat, I get:
    428427 in total
    0 QC failure
    0 duplicates
    373491 mapped (87.18%)

    RMK/questions:
    * should I rather use bwa -bwasw ?? or bowtie2??
    * is the platform header information included in the sam used somewhere?

    2/ gatk1.3 guideline: realigning reads, etc.
    -----------------------------------------------

    ooooo Marking PCR duplicates BAM ooooo
    java -Xmx4g -jar picard/MarkDuplicates.jar INPUT=sample.bam OUTPUT=sample.marked.bam METRICS_FILE=metrics VALIDATION_STRINGENCY=LENIENT

    ooooo Indexing MARKED BAM ooooooooo '
    samtools index sample.marked.bam

    ooooo Realignment around indels (1) create Indels Table ooooo
    java -Xmx4g -jar gatk.jar -T RealignerTargetCreator -R hg19ref.fa -filterMBQ -nt 16 --known GATK_1000G_PHASE1_INDELS --known GATK_MILLS1000G_GOLD_INDELS -o indeltablelist -I sample.marked.bam

    ooooo Realignment around indels (2) realigns reads around those targets ooooo
    java -Xmx4g -jar gatk.jar -I sample.marked.bam -R hg19ref.fa -filterMBQ -known GATK_1000G_PHASE1_INDELS -known GATK_MILLS1000G_GOLD_INDELS -T IndelRealigner -targetIntervals indeltablelist -o sample.marked.realigned.bam

    ooooo Fix mate for pair end reads oooooooo '
    java -Xmx4g -jar picard/FixMateInformation.jar INPUT=sample.marked.realigned.bam
    OUTPUT=sample.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

    ooooo Quality score recalibration (1) Countcovariates oooooooo
    java -Xmx4g -jar gatk.jar -l INFO -R hg19ref.fa -nt 16 -knownSites:dbsnp,VCF dbsnp_135.b37.vcf -I sample.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile sample.recal_data.csv


    RMK: This was failing with gatk 1.5 (complaining that the platform was unknown - wanted illumina or 454 or solid). It seems it works with gatk1.6

    ooooo Quality score recalibration (2) Tablerecalibration oooooooo '
    java -Xmx4g -jar gatk.jar -l INFO -R hg19ref.fa -I sample.marked.realigned.fixed.bam -T TableRecalibration --out sample.marked.realigned.fixed.recal.bam -recalFile sample.recal_data.csv


    Does it looks ok?
    Then calling the variant using the rest of the pipeline guideline.

    any comments?
    thanks
    tuka

  • #2
    hi aituka,
    i was trying to figure out how to use GATK with torrent data as well. More or less i ended up with something similar to your pipeline. But my customers gave me some info about expected deletions which i could not find using that pipeline. Turned out to be a problem with markduplicate step. After the marking lots of SNP and Indels disappear from the vcf even though one can see the expected deletion in the bam file when using IGV.

    So for now after i add readgroup, sort and index the bam file i directly go to UnifiedGenotyper and VariantFiltration from GATK. So i skip all recalibration and etc. Interestingly the results with and without recalibration,both excluding markduplicates are almost identical.

    Btw, i use 'bwa bwasw' to align my data to human.

    Comment


    • #3
      Hi tuka,
      I do agree with kenietz. The markduplicate trick might remove too much reads. The problem relates to this post: http://seqanswers.com/forums/showthread.php?t=6854 (Removing duplicates is it really necessary?). When sequencing the whole exome, you don't expect too much duplicates. Removing them might be a good option to avoid PCR duplicates. However, in targeted sequencing, it might be natural to have a lot of duplicates. Removing them might not be the good option.

      In an example of ion PGM data, I got:
      429,424 reads, among which I could align (bwa bwasw) 383,091 (89.21%) reads.
      But there was 371,738 duplicates !

      As for Kenietz, results with or without duplicates were identical.

      Comment


      • #4
        @colinmolter:
        exactly the same situation. i have this targeted reseq with coverage of 1000-2000x and around 95% of the reads were marked duplicates.
        Its sad that the documentation on GATK is not enough. But unfortunately that is the situation one can not describe all possibilities and hence sometimes when one follows the general guide one is getting lost. Is good that there are many forums around tho

        Cheers

        Comment


        • #5
          Hi guys,
          what argument did you use for UnifiedGenotyper and VariantFiltration?

          Comment


          • #6
            Originally posted by kenietz View Post
            Hi guys,
            what argument did you use for UnifiedGenotyper and VariantFiltration?
            I used something like this:

            Code:
            java -Xmx4g -jar gatk.jar -T UnifiedGenotyper -R bwa6.1/h_g1k_v37.fasta -L targetintervals.bed -nt 16 -A AlleleBalance -A DepthOfCoverage -stand_call_conf 30.0 -stand_emit_conf 10 -glm BOTH --dbsnp dbsnp_135.b37.vcf -o DS.SNV.all.vcf -metrics DS.SNV.all.vcf.metric -I s1.bam -I s2.bam
            what about yours?
            any comments?
            colin

            Comment


            • #7
              Hi colin,
              i followed a guide for GATK for Illumina which i found on this site. But had to modify it just a tiny bit.

              `java -jar /opt/GATK/GenomeAnalysisTK.jar -R $refgen -T UnifiedGenotyper -I realigned.recal.bam --dbsnp /mnt/hd/GATK_recource_bundle/hg19/dbsnp_135.hg19.vcf.reordered -o gatk_var.raw.vcf --num_threads 8 -L pbrm1_intervals.bed --genotype_likelihoods_model BOTH --metrics_file snp.metrics -stand_call_conf 30 -stand_emit_conf 10 -A DepthOfCoverage -A AlleleBalance`;

              `java -Xmx4g -jar /opt/GATK/GenomeAnalysisTK.jar -R $refgen -T VariantFiltration --variant gatk_var.raw.vcf -o gatk_var_filtered.vcf --clusterWindowSize 10 --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" --filterExpression "DP < 5 " --filterName "LowCoverage" --filterExpression "QUAL < 30.0 " --filterName "VeryLowQual" --filterExpression "QUAL > 30.0 && QUAL < 50.0 " --filterName "LowQual" --filterExpression "QD < 1.5 " --filterName "LowQD" --filterExpression "SB > -10.0 " --filterName "StrandBias"`;

              Seems that results from GATK and VariantCaller plugin from torrent suite differ
              Now trying to make the standalone variantcaller to run. Partial success for now tho

              Comment


              • #8
                Has anyone tried TMAP over BWA?

                Comment


                • #9
                  @genericforms

                  I am sorry, but I am not sure if your question is more general or specifically towards this problem.

                  In case it is more general, then yes I have tried both TMAP and BWA on our datasets and it seems like there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw).. As TMAP selects the appropriate algorithm based on the characteristics (most probably read length).. unless off-course TMAP decides to run SSAHA, which would give different results.

                  Thanks,
                  Praful

                  Comment


                  • #10
                    Originally posted by aggp11 View Post
                    @genericforms

                    I am sorry, but I am not sure if your question is more general or specifically towards this problem.

                    In case it is more general, then yes I have tried both TMAP and BWA on our datasets and it seems like there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw).. As TMAP selects the appropriate algorithm based on the characteristics (most probably read length).. unless off-course TMAP decides to run SSAHA, which would give different results.

                    Thanks,
                    Praful
                    Yes, I was curious in your experience how BWA-SW matches up to TMAP...

                    Comment


                    • #11
                      Originally posted by aggp11 View Post
                      @genericforms

                      I am sorry, but I am not sure if your question is more general or specifically towards this problem.

                      In case it is more general, then yes I have tried both TMAP and BWA on our datasets and it seems like there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw).. As TMAP selects the appropriate algorithm based on the characteristics (most probably read length).. unless off-course TMAP decides to run SSAHA, which would give different results.

                      Thanks,
                      Praful

                      Simulate some Ion 100-200bp human data and compare the two software for yourself (included are the simulator, simulation results evaluator, and plotting software
                      . Feel free to create a new post to give feedback

                      Comment


                      • #12
                        We are working on it Nils. In a moment of weakness I got lazy and wanted to know if someone had already looked into this...

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin


                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                          Today, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        37 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        41 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        35 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        54 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X