Is GATK really suited for cancer genomes? Are there options to set the Unified Genotyper to call alleles in a population (of heterogeneous cancer cells) rather than in an individual that would have up to 2 alleles?
-
The Broad have written some SNP calling software (syzygy) for pooled heterogeneous samples:
They're using it to look for SNPs where many individuals are in the same library and for targeting a smaller set of genes rather than doing whole exomes. So this may be better for cancer genomes than GATK, but I have no experience there.
Chris
-
Thanks. I am looking forward to trying MuTect when it is available. And here is a possibly silly question: should there be a different approach to genotyping heterogeneous tumors versus pooled populations? In a pooled population you typically know how many genomes are included. Cells in a tumor are likely to be more closely related, and probably differ in a handful of mutations that helped facilitate tumorigenesis and in genomically unstable regions. Any thoughts?
-
Hi, I am also looking for SNP differences between 2 different cell lines, which were obtained from 3 diseased and 3 normal individuals. First I aligned the reads with bowtie. Then I used samtools mpileup to compare the multiple BAM files.
My question is: how can I change the parameters of the mpileup command line to get higher-quality SNPs across these 12 files?
The default I have been using is mpileup -uf
Would you suggest other parameters? Where can I find a good paper on this?
mpileup -6 -uDSf ?
Could you explain? I really appreciate any help.
Thanks
-
From the samtools mpileup help:
-6 assume the quality is in the Illumina-1.3+ encoding
Whether you need this depends on which quality encoding is in the BAM file.
-D output per-sample DP in BCF (require -g/-u)
This is an output option - depth per sample if you give samtools multiple samples to call SNPs. Usually depth is total depth over all samples.
-S output per-sample strand bias P-value in BCF (require -g/-u)
This is also an output option - strand bias per sample and not just over all samples.
So none of these options changes the SNP calling algorithm, except -6: if the qualities are treated as Illumina 1.3+ Phred scores when they are really Sanger Phred scores (or the other way round), this will probably make a big difference.
Read the post of user ulz_peter where he suggests this link if you want to optimise your SNP calling parameters, and the papers suggested by the user Simon Anders in this thread.
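As a rough illustration of the encoding point above, here is a minimal Python sketch that guesses the quality offset from a quality string as it appears in FASTQ or SAM text. The function name `guess_phred_offset` and the cut-off values are assumptions based on the commonly used ASCII ranges (Sanger/Phred+33 starts at `!`, Illumina 1.3+/Phred+64 starts at `@`); it is a heuristic and not part of samtools.

```python
def guess_phred_offset(quality_string):
    """Guess the ASCII offset (33 or 64) of a FASTQ/SAM quality string.

    Returns 33, 64, or None when the string is ambiguous.
    Heuristic only: a short read with uniformly high qualities can be
    misclassified, so check many reads before trusting the answer.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    lowest = min(ord(char) for char in quality_string)
    if lowest < 59:
        # Characters below ';' cannot occur in any +64 based encoding.
        return 33
    if lowest >= 64:
        # Consistent with Illumina 1.3+ (Phred+64), though very high
        # quality Phred+33 data would look the same.
        return 64
    # 59..63 is compatible with old Solexa+64 and with Phred+33.
    return None


if __name__ == "__main__":
    print(guess_phred_offset("!''*((((***+))%%%++"))
    print(guess_phred_offset("hhhhhgggfffeeddcc"))
```

If most reads come back as 64, that would suggest -6 is appropriate; if they come back as 33, it probably is not.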
Chris
-
Hi Chris,
Thanks so much for your comments. I will read them. I hope I will figure it out soon.
For example, I would like to see whether there are individual or tissue-specific differences between samples, and I would like a table of DP4 values for every sample so I can compare them.
So first, after bowtie alignment:
mpileup -Euf ref.fa sample1.bam sample2.bam sample3.bam and so on
view -bcvg
for filtering -D 100
And to get DP4 values specifically, I ran mpileup for each sample alone with the same parameters, like
mpileup -Euf ref.fa sample1.bam
So do you recommend any other parameters to get good DP4 values?
Am I missing anything in my parameters? Should I also get AF1 to see SNP differences between samples?
Should I skip indel calling by adding -I to the command line?
I really appreciate any help. Thanks Chris
Aslihan
-
Hello dear Chris,
We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? The file's header does not seem suitable for annovar.
#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0
Best
-
@aslihan
To get better data, I'd first recommend switching to BWA or Bowtie2 rather than the original Bowtie. See this post for the latest info about these different alignment programs:
In terms of DP4, the values are pretty much determined by the alignments you get from bowtie/bwa. You can, however, filter so that only reads with good alignments are counted (the -q flag below) and only bases of good quality (the -Q flag below):
-q INT skip alignments with mapQ smaller than INT [0]
-Q INT skip bases with baseQ/BAQ smaller than INT [13]
In GATK, the equivalent minimum base quality is set to 17 by default.
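To build a per-sample table of DP4 values, something like the following Python sketch could be used on the INFO column of each single-sample VCF. DP4 holds four counts of high-quality bases: ref-forward, ref-reverse, alt-forward and alt-reverse. The helper names `parse_dp4` and `alt_fraction` are made up for illustration, and the INFO string in the example is invented.

```python
def parse_dp4(info_field):
    """Extract DP4 from a VCF INFO string.

    Returns (ref_fwd, ref_rev, alt_fwd, alt_rev) as integers,
    or None when the INFO string carries no DP4 tag.
    """
    for entry in info_field.split(";"):
        if entry.startswith("DP4="):
            counts = tuple(int(value) for value in entry[4:].split(","))
            if len(counts) != 4:
                raise ValueError("DP4 should hold four counts: " + entry)
            return counts
    return None


def alt_fraction(dp4):
    """Fraction of the high-quality bases that support the alternate allele."""
    total = sum(dp4)
    if total == 0:
        return 0.0
    return (dp4[2] + dp4[3]) / total


if __name__ == "__main__":
    # Invented INFO column, for illustration only.
    dp4 = parse_dp4("DP=20;DP4=6,4,5,5;MQ=40")
    print(dp4, alt_fraction(dp4))
```

Comparing the alternate-allele fraction across samples at the same position is one simple way to spot sample-specific differences, keeping in mind that raising -q or -Q will lower all four counts.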
The definition of AF1 is this:
AF1 EM [expectation-maximization] estimate of the site allele frequency of the strongest non-reference allele.
There is a section about this on the mpileup page:
"the procedure to estimate AFS is:
bcftools view -NIbl cond.txt data.bcf > cond.bcf
bcftools view -cGP cond2 cond.bcf > round1.vcf 2> round1.afs
bcftools view -cGP round1.afs cond.bcf > /dev/null 2> round2.afs
bcftools view -cGP round2.afs cond.bcf > /dev/null 2> round3.afs
......
until the AFS converges, which usually takes less than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces iterations and potential overfitting"
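To give a feel for what an EM estimate of a site allele frequency involves, here is a minimal Python sketch. It is not the bcftools implementation: it assumes diploid samples, Hardy-Weinberg genotype priors, and per-sample genotype likelihoods supplied as (hom-ref, het, hom-alt) triples. The function name `em_allele_frequency` and its parameters are hypothetical.

```python
def em_allele_frequency(genotype_likelihoods, init_freq=0.5,
                        max_iter=100, tol=1e-8):
    """Estimate the non-reference allele frequency at one site by EM.

    genotype_likelihoods: one (L_homref, L_het, L_homalt) triple per sample.
    Sketch only: assumes diploid samples and Hardy-Weinberg priors.
    """
    n_samples = len(genotype_likelihoods)
    if n_samples == 0:
        raise ValueError("need at least one sample")
    freq = init_freq
    for _ in range(max_iter):
        expected_alt = 0.0
        for l_ref, l_het, l_alt in genotype_likelihoods:
            # E-step: posterior weight of each genotype given current freq.
            p_ref = l_ref * (1.0 - freq) ** 2
            p_het = l_het * 2.0 * freq * (1.0 - freq)
            p_alt = l_alt * freq ** 2
            total = p_ref + p_het + p_alt
            if total > 0.0:
                expected_alt += (p_het + 2.0 * p_alt) / total
        # M-step: expected alt allele count over 2N chromosomes.
        new_freq = expected_alt / (2.0 * n_samples)
        if abs(new_freq - freq) < tol:
            return new_freq
        freq = new_freq
    return freq


if __name__ == "__main__":
    # Four samples with unambiguous genotypes: ref/ref, het, alt/alt, ref/ref.
    print(em_allele_frequency([(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]))
```

With unambiguous genotypes the estimate is just the allele count divided by the number of chromosomes; the iterations only matter when the likelihoods are uncertain, which is the usual case at low coverage.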
The -I option is only relevant if you are interested in SNPs alone. Indels can be relevant in exome data, so it's probably worth not setting -I.
Chris
-
Originally posted by afaghalavi
Hello dear Chris,
We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? The file's header does not seem suitable for annovar.
#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0
Best
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 12783 . G A
chr1 13057 . G C
chr1 13351 . T G
chr1 14673 . G C
E.g., it looks like they are specifying heterozygous or homozygous SNPs in this way: "AC" or "CC" (where the reference base is A). In VCF, this would be written as ref=A, alt=C, genotype=0/1 for "AC" or genotype=1/1 for "CC". Sometimes the best call may be something like ref=G, best genotype=GG, but I can't tell from your file format without some more explanation.
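As a rough sketch of that conversion in Python, the function below turns one row of the pasted text format into a minimal VCF-style line. Several things here are assumptions rather than known facts about the file: that the columns follow the `#$ COLUMNS` header exactly, that the `max_gt|poly_site` genotype is the one to report (it matches the ALT alleles in the VCF rows above), and that `Q(snp)` can stand in for QUAL. The function name `snp_row_to_vcf` is made up.

```python
def snp_row_to_vcf(line):
    """Convert one whitespace-separated SNP row into a minimal VCF line.

    Assumed column order (from the '#$ COLUMNS' header):
    seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt)
    max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
    Returns None when the genotype contains no non-reference allele.
    """
    fields = line.split()
    chrom, pos, ref, qual = fields[0], fields[1], fields[4], fields[5]
    genotype = fields[8]  # assumption: use the max_gt|poly_site call

    # Collect the distinct non-reference bases in the order they appear.
    alts = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)
    if not alts:
        return None

    # VCF genotypes index into [REF, ALT1, ALT2, ...].
    alleles = [ref] + alts
    indexes = sorted(alleles.index(base) for base in genotype)
    gt = "/".join(str(index) for index in indexes)
    return "\t".join([chrom, pos, ".", ref, ",".join(alts),
                      qual, ".", ".", "GT", gt])


if __name__ == "__main__":
    print(snp_row_to_vcf("chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0"))
    print(snp_row_to_vcf("chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0"))
```

A real VCF would also need the `##fileformat` header line and a sample name in the column header, and it would be worth checking with the data provider what the two genotype columns actually mean before relying on either.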
Chris
-
You might also consider the BSNP Bayesian genotype caller. It's been tested on Illumina, 454, SOLiD and Sanger human alignments and has some technology-specific bias correction. It requires a samtools pileup as input, but it is fully Bayesian, considers both alignment and sequence quality, doesn't bias towards the reference, and was designed for comparing data from differing technologies. If it's helpful, have a look at: http://compgen.bscb.cornell.edu/GPhoCS/BSNP/
-- Brad
-
Originally posted by RDW
Do you know how to annotate the output from MuTect? I have 3800 mutation calls and I have been stuck for almost a day.
-
Originally posted by cjp
Two SNP and indel callers that you can search for on SEQanswers are samtools mpileup:
and GATK:
sections: 5.1, 5.4 (Unified Genotyper) and 5.5.
Chris
My base data is in ab1, but I'm assuming that can be converted to whatever format is needed.