Is GATK really suited for cancer genomes? Are there options to set the Unified Genotyper to call alleles in a population (of heterogeneous cancer cells) rather than in an individual, which would have at most 2 alleles?
-
The Broad have written some SNP calling software (syzygy) for pooled heterogeneous samples:
They're using it to look for SNPs where many individuals are in the same library and for targeting a smaller set of genes rather than doing whole exomes. So this may be better for cancer genomes than GATK, but I have no experience there.
Chris
Comment
-
Thanks. I am looking forward to trying MuTect when it is available. And here is a possibly silly question: should there be a different approach to genotyping heterogeneous tumors versus pooled populations? In a pooled population you typically know how many genomes are included. Cells in a tumor are likely to be more closely related, and probably differ in a handful of mutations that helped facilitate tumorigenesis and in genomically unstable regions. Any thoughts?
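To make the contrast in the question concrete: in a pool of N diploid individuals the alt allele can only appear at fractions k/(2N), whereas in a tumor the expected fraction depends on how pure the sample is and on what share of the cancer cells carry the mutation. Here is a minimal sketch, assuming a heterozygous somatic mutation in a copy-neutral diploid region; both function names are made up for illustration and do not come from any caller.

```python
from fractions import Fraction


def pooled_allele_fractions(n_individuals):
    """All non-zero alt-allele fractions possible in a pool of diploid genomes."""
    chromosomes = 2 * n_individuals
    return [Fraction(k, chromosomes) for k in range(1, chromosomes + 1)]


def tumor_expected_vaf(purity, cancer_cell_fraction):
    """Expected variant allele fraction of a heterozygous somatic mutation.

    Assumption: the region is diploid with no copy-number change, so each
    mutated cancer cell contributes one alt allele out of two.
    """
    return purity * cancer_cell_fraction / 2.0


# A pool of 3 individuals only allows fractions in steps of 1/6.
print(pooled_allele_fractions(3))
# A subclone in 25% of cancer cells, in a sample that is 60% tumor.
print(tumor_expected_vaf(0.6, 0.25))
```

So a pooled caller can restrict itself to a known, discrete set of fractions, while for a tumor the fraction is effectively continuous and unknown in advance.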
Comment
-
Hi, I am also looking for SNP differences between 2 different cell lines obtained from 3 diseased and 3 normal individuals. First I did the alignment with bowtie. Then I used samtools mpileup to compare multiple BAM files.
My question: how can I change my mpileup parameters to get higher-quality SNPs between these 12 files?
The default is mpileup -uf
Do you suggest any other parameters? Where can I find a good paper on this?
mpileup -6 -uDSf ?
Could you explain? I really appreciate any help.
Thanks
Comment
-
from samtools mpileup:
-6 assume the quality is in the Illumina-1.3+ encoding
Depends on which quality values are in the BAM file:
-D output per-sample DP in BCF (require -g/-u)
This is an output option - depth per sample if you give samtools multiple samples to call SNPs. Usually depth is total depth over all samples.
-S output per-sample strand bias P-value in BCF (require -g/-u)
This is also an output option - strand bias per sample and not just over all samples.
So none of these options changes the SNP calling algorithm, except -6: if the qualities are read as Illumina Phred scores when they are really Sanger Phred scores (or the other way round), this will probably make a big difference.
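To show why the encoding matters, here is a small sketch that decodes the same quality string with both offsets (33 for Sanger, 64 for Illumina 1.3+). The quality string is made up for illustration, and the function name is mine, not part of samtools.

```python
SANGER_OFFSET = 33      # Sanger encoding (Phred+33)
ILLUMINA13_OFFSET = 64  # Illumina 1.3+ encoding (Phred+64), what -6 assumes


def decode_quals(qual_string, offset):
    """Convert an ASCII quality string to Phred scores using the given offset."""
    return [ord(ch) - offset for ch in qual_string]


quals = "hhhhgfB"  # made-up quality string, typical of Phred+64 data

# Read with the right offset, the scores fall in a sensible range.
print(decode_quals(quals, ILLUMINA13_OFFSET))
# Read with the wrong offset, every base looks 31 points better than it is.
print(decode_quals(quals, SANGER_OFFSET))
```

With the wrong offset a poor base looks like a good one, so any base quality cutoff stops filtering the way you expect.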
Read the post of user ulz_peter where he suggests this link if you want to optimise your SNP calling parameters and the papers suggested by the user Simon Anders in this thread.
Chris
Comment
-
Hi Chris,
Thanks so much for your comments. I will read them. I hope I will figure it out soon.
For example, I would like to see whether there are differences between individuals and between tissues among the samples, and I would like a table of DP4 values for every sample so that I can compare them with each other.
So first, after the bowtie alignment:
mpileup -Euf ref.fa sample1.bam sample2.bam sample3.bam (and so on)
view -bcvg
and for filtering, -D 100
To get the DP4 values specifically, I ran mpileup on each sample alone with the same parameters, like
mpileup -Euf ref.fa sample1.bam
So do you recommend any other parameters to get good DP4 values?
Am I missing anything in my parameters? Should I also get AF1 to see SNP differences between samples?
Should I exclude indels by adding -I to the command line?
I really appreciate any help. Thanks Chris
Aslihan
Comment
-
Hello Dear Chris
We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have copied and pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? The header is not suitable for annovar.
#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0
Best
Comment
-
@aslihan
To get better data, I'd first recommend using BWA or Bowtie2 rather than the original Bowtie. See this post for the latest info about these different alignment programs:
In terms of DP4: these are pretty much determined by the alignments you get from bowtie/bwa, although you can filter so that only reads with good alignments (-q flag, see below) and only bases of good quality (-Q flag, below) are counted.
-q INT skip alignments with mapQ smaller than INT [0]
-Q INT skip bases with baseQ/BAQ smaller than INT [13]
In GATK, they set -Q to be 17 by default.
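For the table of DP4 values per sample, here is a rough sketch of pulling DP4 out of the INFO column of a VCF line. DP4 holds four counts of high-quality bases: ref-forward, ref-reverse, alt-forward, alt-reverse. The example line and the function names are made up for illustration; real files may lack DP4 on some records, so the parser returns None in that case.

```python
def parse_dp4(vcf_line):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from a VCF line, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th VCF column
    for entry in info.split(";"):
        if entry.startswith("DP4="):
            return tuple(int(x) for x in entry[len("DP4="):].split(","))
    return None


def alt_fraction(dp4):
    """Fraction of high-quality bases that support the alt allele."""
    total = sum(dp4)
    return (dp4[2] + dp4[3]) / total if total else 0.0


# Made-up record, only to show the layout of the INFO field.
line = "chr1\t13057\t.\tG\tC\t31\t.\tDP=12;AF1=0.5;DP4=4,3,2,3;MQ=40\tGT:PL\t0/1:60,0,80"

dp4 = parse_dp4(line)
print(dp4, alt_fraction(dp4))
```

Running this over each per-sample VCF and keying the results by (chromosome, position) would give the comparison table, assuming every sample was called with the same -q and -Q settings.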
The definition of AF1 is this:
AF1 EM [expectation-maximization] estimate of the site allele frequency of the strongest non-reference allele.
There is a section about this on the mpileup page:
"the procedure to estimate AFS is:
bcftools view -NIbl cond.txt data.bcf > cond.bcf
bcftools view -cGP cond2 cond.bcf > round1.vcf 2> round1.afs
bcftools view -cGP round1.afs cond.bcf > /dev/null 2> round2.afs
bcftools view -cGP round2.afs cond.bcf > /dev/null 2> round3.afs
......
until the AFS converges, which usually takes less than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces iterations and potential overfitting"
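To give a feel for what an EM estimate of a site allele frequency like AF1 involves, here is a minimal sketch. It is not the bcftools implementation: it assumes diploid samples, a Hardy-Weinberg prior on genotypes, and per-sample genotype likelihoods that are already computed. The function name and the input layout are my own.

```python
def em_allele_frequency(genotype_likelihoods, start=0.5, tol=1e-8, max_iter=100):
    """Estimate the alt-allele frequency at one site by EM.

    genotype_likelihoods: list of (L0, L1, L2) per sample, where Lg is
    P(reads | sample carries g copies of the alt allele).
    """
    f = start
    n = len(genotype_likelihoods)
    for _ in range(max_iter):
        # Genotype prior under Hardy-Weinberg for the current frequency.
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_alt = 0.0
        for lik in genotype_likelihoods:
            post = [l * p for l, p in zip(lik, prior)]
            norm = sum(post)
            if norm == 0:
                continue  # sample is uninformative under this prior
            # E-step: expected number of alt copies in this sample.
            expected_alt += (post[1] + 2 * post[2]) / norm
        # M-step: new frequency is expected alt copies over all chromosomes.
        new_f = expected_alt / (2 * n)
        if abs(new_f - f) < tol:
            return new_f
        f = new_f
    return f


# Two confident heterozygotes and two confident hom-ref samples.
print(em_allele_frequency([(0, 1, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0)]))
```

For heterogeneous tumor samples the Hardy-Weinberg and diploid assumptions are questionable, which is part of why this thread is asking about other callers.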
For the -I option, this is only relevant if you are interested in SNPs only. Indels can sometimes be relevant in exome data, so it's probably worth not setting -I.
Chris
Comment
-
Originally posted by afaghalavi
Hello Dear Chris
We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have copied and pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? The header is not suitable for annovar.
#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0
Best
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 12783 . G A
chr1 13057 . G C
chr1 13351 . T G
chr1 14673 . G C
E.g., it looks like they are specifying heterozygous or homozygous SNPs in this way: "AC" or "CC" (where the reference base is A). In VCF, they would say things like ref=A, alt=C, genotype=0/1 for "AC" or genotype=1/1 for "CC". Sometimes the best call may be something like ref=G, best genotype=GG, but I can't tell from your file format without some more explanation.
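The mapping described above can be sketched as a small function. This assumes the two-letter genotypes mean what I guessed (two called alleles per site) and that column 5 is the reference base and column 9 is max_gt|poly_site, counting from 1 in the "#$ COLUMNS" header; the function name is made up, and none of this is an annovar or VCF tool API.

```python
def genotype_to_vcf(ref, genotype):
    """Map a two-letter genotype such as 'CG' to (ALT, GT) relative to ref."""
    alts = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)
    # Allele index 0 is the reference; alts are numbered from 1 in ALT order.
    indices = sorted(0 if base == ref else alts.index(base) + 1 for base in genotype)
    alt_field = ",".join(alts) if alts else "."
    return alt_field, "/".join(str(i) for i in indices)


# Rows copied from the post above, whitespace separated.
rows = [
    "chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0",
    "chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0",
    "chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0",
    "chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0",
]

for row in rows:
    cols = row.split()
    chrom, pos, ref, poly_gt = cols[0], cols[1], cols[4], cols[8]
    alt, gt = genotype_to_vcf(ref, poly_gt)
    print(chrom, pos, ".", ref, alt, gt, sep="\t")
```

Using the max_gt|poly_site column gives the same ALT bases as in the table above; using max_gt instead would report two of these four sites as homozygous reference.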
Chris
Comment
-
You might also consider the BSNP Bayesian Genotype caller. It's been tested on Illumina, 454, SOLiD and Sanger human alignments and has some technology-specific bias correction. It requires a samtools pileup as input, but it is fully Bayesian, considers both alignment and sequence quality, doesn't bias towards the reference, and was designed for comparing data from differing technologies. If it's helpful, have a look at: http://compgen.bscb.cornell.edu/GPhoCS/BSNP/
-- Brad
Comment
-
Originally posted by RDW
Do you know how to annotate the output from MuTect? I have 3800 mutation calls and I have been stuck for almost a day.
Comment
-
Originally posted by cjp
Two SNP and indel callers that you can search for in seqAnswers are samtools mpileup:
and GATK:
sections: 5.1, 5.4 (Unified Genotyper) and 5.5.
Chris
My base data is in ab1, but I'm assuming that can be converted to whatever format is needed.
Comment