**adaptivegenome** · 10-31-2011, 06:04 PM

Is GATK really suited for cancer genomes? Are there options to set the Unified Genotyper to call alleles in a population (of heterogeneous cancer cells) rather than in an individual, which would have at most 2 alleles?

**cjp** · 11-01-2011, 01:50 AM

The Broad have written some SNP calling software (syzygy) for pooled heterogeneous samples:

http://www.broadinstitute.org/software/syzygy/

They're using it to look for SNPs where many individuals are in the same library and for targeting a smaller set of genes rather than doing whole exomes. So this may be better for cancer genomes than GATK, but I have no experience there.

Chris

**adaptivegenome** · 11-01-2011, 11:01 AM

Thanks Chris. This looks interesting; I just need to figure out how it can plug into our existing GATK-based pipeline.

**RDW** · 11-01-2011, 01:41 PM

If you have matched somatic tumour and germline data, see also:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit#Cancer-specific_Variant_Discovery_Tools

Note, however, that the MuTect somatic SNP caller is still in restricted beta, and the somatic indel detector is based on the old GATK Indel Genotyper v2 (superseded for all other purposes by the Unified Genotyper's indel mode).

**adaptivegenome** · 11-01-2011, 02:05 PM

Thanks. I am looking forward to trying MuTect when it is available. And here is a possibly silly question: should there be a different approach to genotyping heterogeneous tumors versus pooled populations? In a pooled population you typically know how many genomes are included. Cells in a tumor are likely to be more closely related, and probably differ in a handful of mutations that helped facilitate tumorigenesis and in genomically unstable regions. Any thoughts?

**aslihan** · 11-02-2011, 09:31 AM

Hi, I am also looking at SNP differences, in my case between 2 different cell lines obtained from 3 diseased and 3 normal individuals. First I aligned with bowtie. Then I used samtools mpileup to compare multiple BAM files.

My question: how can I change my mpileup parameters to get higher-quality SNPs across these 12 files?

The default is mpileup -uf

Do you suggest other parameters? Where can I find a good paper on this?

mpileup -6 -uDSf ?

Could you explain? I really appreciate any help.

Thanks

**cjp** · 11-02-2011, 11:52 AM

From the samtools mpileup help:

-6 assume the quality is in the Illumina-1.3+ encoding

Depends on which quality values are in the BAM file:

http://en.wikipedia.org/wiki/FASTQ_format#Quality

-D output per-sample DP in BCF (require -g/-u)

This is an output option - depth per sample if you give samtools multiple samples to call SNPs. Usually depth is total depth over all samples.

-S output per-sample strand bias P-value in BCF (require -g/-u)

This is also an output option - strand bias per sample and not just over all samples.

So none of these options changes the SNP calling algorithm. The exception is -6: if you call SNPs assuming Illumina Phred scores when the BAM file actually holds Sanger Phred scores (or the other way round), this will probably make a big difference.
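As a minimal sketch of why the encoding matters (illustrative Python, not part of samtools; the helper names are made up for this example): Illumina-1.3+ quality characters are Phred+64 while Sanger characters are Phred+33, so the same character decodes to a score 31 points higher if the wrong offset is assumed.

```python
SANGER_OFFSET = 33       # Phred+33 (Sanger encoding)
ILLUMINA13_OFFSET = 64   # Phred+64 (Illumina 1.3+ encoding)


def decode(qual, offset):
    """Turn a quality string into a list of Phred scores for a given offset."""
    return [ord(ch) - offset for ch in qual]


def illumina13_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (hypothetical helper)."""
    out = []
    for ch in qual:
        score = ord(ch) - ILLUMINA13_OFFSET
        if score < 0:
            # A character below '@' cannot be a valid Phred+64 quality.
            raise ValueError("not a Phred+64 quality character: %r" % ch)
        out.append(chr(score + SANGER_OFFSET))
    return "".join(out)


# 'h' is Q40 in Illumina 1.3+, but would be read as Q71 under Sanger.
print(decode("h", ILLUMINA13_OFFSET), decode("h", SANGER_OFFSET))
print(illumina13_to_sanger("h@"))
```

Reading Phred+64 data as Phred+33 inflates every base quality, which is why the calls can change so much when the encoding is wrong.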

Read the post of user ulz_peter where he suggests this link if you want to optimise your SNP calling parameters and the papers suggested by the user Simon Anders in this thread.

http://seqanswers.com/wiki/How-to/exome_analysis

Chris

**aslihan** · 11-03-2011, 09:18 PM

Hi Chris,

Thanks so much for your comments. I will read them. I hope I will figure it out soon.

For example, I would like to see whether there are differences between individuals and between tissues across my samples, and I would like a table of DP4 values for every sample so that I can compare them with each other.

So first after bowtie alignment,

mpileup -Euf ref.fa sample1.bam sample2.bam sample3.bam and goes on
view -bcvg
for filtering -D 100

And to get DP4 values specifically, I ran mpileup for each sample alone with the same parameters, like
mpileup -Euf ref.fa sample1.bam

So do you recommend any other parameters to get good DP4 values?

Am I missing anything with these parameters? Should I also get AF1 to see SNP differences between samples?

Should I exclude indels by adding -I to the command line?

I really appreciate any help. Thanks Chris

Aslihan

**afaghalavi** · 11-08-2011, 10:24 AM

Hello Dear Chris

We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? Its header is not suitable for annovar.

#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0

Best

**cjp** · 11-08-2011, 12:41 PM

@aslihan

To get better data, I'd recommend first to use BWA or Bowtie2 rather than the original Bowtie. See this post for the latest info about these different alignment programs:

Bowtie 2 versus BWA - SEQanswers

http://seqanswers.com/forums/showthread.php?t=15200

In terms of DP4: these counts are pretty much determined by the alignments you get from bowtie/bwa, although you can filter so that only reads with good alignments (-q flag, see below) and only bases of good quality (-Q flag, see below) are counted.

-q INT skip alignments with mapQ smaller than INT [0]
-Q INT skip bases with baseQ/BAQ smaller than INT [13]

In GATK, the equivalent base quality cutoff is 17 by default.
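For the per-sample DP4 table asked about above, a minimal Python sketch is shown here. It assumes the usual samtools meaning of DP4 (high-quality ref-forward, ref-reverse, alt-forward and alt-reverse base counts) stored in the VCF INFO column; the function name and the example INFO string are made up for illustration.

```python
def parse_dp4(info):
    """Extract DP4 counts from a VCF INFO string (hypothetical helper).

    Returns a dict with the four counts and the alternate-allele fraction,
    or None if the INFO string carries no DP4 tag.
    """
    # Flag-style tags such as INDEL have no '=' and are skipped here.
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "DP4" not in tags:
        return None
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in tags["DP4"].split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt_fraction = (alt_fwd + alt_rev) / total if total else 0.0
    return {
        "ref_fwd": ref_fwd,
        "ref_rev": ref_rev,
        "alt_fwd": alt_fwd,
        "alt_rev": alt_rev,
        "alt_fraction": alt_fraction,
    }


# Illustrative INFO string, not taken from real data.
print(parse_dp4("DP=20;AF1=0.5;DP4=6,4,5,5;MQ=40"))
```

Running this over the VCF from each single-sample mpileup run gives one row per site and sample, which can then be compared side by side.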

The definition of AF1 is this:

AF1: EM [expectation-maximization] estimate of the site allele frequency of the strongest non-reference allele.
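To give a feel for what an EM estimate of a site allele frequency means, here is a simplified Python sketch. It is not the bcftools implementation: it assumes diploid samples, Hardy-Weinberg genotype priors, and that each sample supplies three genotype likelihoods (0, 1 or 2 copies of the non-reference allele). The function name and inputs are hypothetical.

```python
from math import comb


def em_allele_frequency(genotype_likelihoods, init=0.5, max_iter=100, tol=1e-8):
    """EM estimate of a non-reference allele frequency at one site.

    genotype_likelihoods: one (L0, L1, L2) tuple per sample, where Lg is the
    likelihood of the reads given g copies of the non-reference allele.
    """
    n = len(genotype_likelihoods)
    f = init
    for _ in range(max_iter):
        expected_alt = 0.0
        for lik in genotype_likelihoods:
            # E-step: posterior over genotypes under Hardy-Weinberg priors.
            prior = [comb(2, g) * f ** g * (1 - f) ** (2 - g) for g in range(3)]
            post = [l * p for l, p in zip(lik, prior)]
            total = sum(post)
            if total == 0:
                continue  # sample carries no information under the current f
            expected_alt += sum(g * w for g, w in enumerate(post)) / total
        # M-step: expected non-reference allele count over 2n chromosomes.
        new_f = expected_alt / (2 * n)
        if abs(new_f - f) < tol:
            return new_f
        f = new_f
    return f


# Three samples with unambiguous genotypes: hom-ref, het, hom-alt.
print(em_allele_frequency([(1, 0, 0), (0, 1, 0), (0, 0, 1)], init=0.2))
```

With real data the likelihoods are rarely this clear-cut, so the estimate moves over several iterations before it settles.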

There is a section about this on the mpileup page:

Multisample SNP Calling

http://samtools.sourceforge.net/mpileup.shtml

"the procedure to estimate AFS is:
bcftools view -NIbl cond.txt data.bcf > cond.bcf
bcftools view -cGP cond2 cond.bcf > round1.vcf 2> round1.afs
bcftools view -cGP round1.afs cond.bcf > /dev/null 2> round2.afs
bcftools view -cGP round2.afs cond.bcf > /dev/null 2> round3.afs
......
until the AFS converges, which usually takes less than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces iterations and potential overfitting"

As for the -I option, it is only relevant if you are interested in SNPs only. Sometimes indels can be relevant in exome data, so it's probably worth not setting -I.

Chris

**cjp** · 11-08-2011, 01:00 PM

Originally posted by afaghalavi

Hello Dear Chris

We received our exome data and now I have 2 files (SNPs and indels) in text format.
I have pasted part of one below. Please let me know what the next stage of data analysis is and what I should do. Can I use annovar for the next stage? Its header is not suitable for annovar.

#$ COLUMNS seq_name pos bcalls_used bcalls_filt ref Q(snp) max_gt Q(max_gt) max_gt|poly_site Q(max_gt|poly_site) A_used C_used G_used T_used
chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0
chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0
chr1 13351 1 0 T 1 TT 10 GT 3 0 0 1 0
chr1 14673 2 0 G 32 CC 5 CC 5 0 2 0 0

Best

Do you know what software made this data? I think annovar can start from VCF files - so some of your data in that format could be converted to something like this in VCF (some of the columns in your data need to be explained a bit more to go into the VCF format though):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 12783 . G A
chr1 13057 . G C
chr1 13351 . T G
chr1 14673 . G C

E.g., it looks like they are specifying heterozygous or homozygous SNPs in this way: "AC" or "CC" (where the reference base is A). In VCF, you would write ref=A, alt=C, genotype=0/1 for "AC" or genotype=1/1 for "CC". Sometimes the best call may be the reference itself (e.g. ref=G, best genotype=GG), but I can't tell from your file format without some more explanation.
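A rough Python sketch of that conversion is below. It rests on assumptions that would need checking against the documentation of whatever software wrote the file: that the columns are whitespace-separated in the order given in the `#$ COLUMNS` header, and that the `max_gt|poly_site` column holds a diploid genotype as two base letters. The function names are made up for illustration.

```python
def genotype_to_vcf(ref, genotype):
    """Map a two-letter genotype such as 'CG' to (ALT, GT) in VCF terms."""
    alts = sorted(set(genotype) - {ref})      # non-reference alleles seen
    alleles = [ref] + alts                    # index 0 is always the reference
    indices = sorted(alleles.index(base) for base in genotype)
    alt_field = ",".join(alts) if alts else "."
    return alt_field, "/".join(str(i) for i in indices)


def convert_row(line):
    """Convert one data row to (CHROM, POS, REF, ALT, GT).

    Assumed column order: seq_name pos bcalls_used bcalls_filt ref Q(snp)
    max_gt Q(max_gt) max_gt|poly_site ...
    """
    fields = line.split()
    chrom, pos, ref = fields[0], int(fields[1]), fields[4]
    poly_site_gt = fields[8]                  # the max_gt|poly_site column
    alt, gt = genotype_to_vcf(ref, poly_site_gt)
    return chrom, pos, ref, alt, gt


rows = [
    "chr1 12783 2 0 G 24 AA 5 AA 5 2 0 0 0",
    "chr1 13057 3 1 G 3 GG 4 CG 31 0 1 2 0",
]
for row in rows:
    print(convert_row(row))
```

Whether to use `max_gt` or `max_gt|poly_site` is a judgment call; the sketch uses the latter only because it reproduces the ALT alleles in the VCF lines above.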

Chris

**bgulko** · 05-18-2012, 01:16 PM

You might also consider the BSNP Bayesian Genotype caller. It's been tested on Illumina, 454, SOLiD and Sanger human alignments and has some technology-specific bias correction. It requires a samtools pileup as input, but is fully Bayesian, considers both alignment and sequence quality, doesn't bias towards the reference, and was designed for comparing data from differing technologies. If it's helpful, have a look at: http://compgen.bscb.cornell.edu/GPhoCS/BSNP/
-- Brad

**shyam_la** · 06-12-2012, 09:11 AM

Originally posted by RDW

If you have matched somatic tumour and germline data, see also:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit#Cancer-specific_Variant_Discovery_Tools

Note, however, that the MuTect somatic SNP caller is still in restricted beta, and the somatic indel detector is based on the old GATK Indel Genotyper v2 (superseded for all other purposes by the Unified Genotyper's indel mode).

Hi,

Do you know how to annotate the output from MuTect? I have 3800 mutation calls and I have been stuck for almost a day.

**pag** · 06-29-2012, 06:20 AM

Originally posted by cjp

Two SNP and indel callers that you can search for in seqAnswers are samtools mpileup:

Multisample SNP Calling

http://samtools.sourceforge.net/mpileup.shtml

and GATK:

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

sections: 5.1, 5.4 (Unified Genotyper) and 5.5.

Chris

Does mpileup detect SNPs, indels and the like by finding regions of high homology to each other during alignment, or does it look at the chromatographic data and detect peak-under-peak and offset peaks to come up with alternate calls for regions? Or something else? If it doesn't do peak-under-peak and offsets, is there a tool out there that DOES?

My base data is in ab1, but I'm assuming that can be converted to whatever format is needed.
