Seqanswers Leaderboard Ad

**csoong** · 02-09-2011, 09:51 AM

Either you are very lucky to find a cancer (or at least cell line) specific mutation or perhaps due to the informatics. I'd Sanger sequence the mutation and see if it's indeed what you think it is. Another thing to consider would be pseudogenes and duplicated genomic regions. Both of which may have slight variation with the gene of interest but still pass informatic filters.

**Richard Finney** · 02-09-2011, 11:34 AM

There is something going here. This does appear to be "common" in the many bamfiles I've looked at.

Are you seeing something like this in this region (image is centered on your "snp")? :

The SNP in question is in the middle of the di-sulfide bridge bond motif.
Notes for Summary of A1BG (from refseq)
The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins.

**fpepin** · 02-09-2011, 11:59 AM

csoong: this seems improbable enough that I'm betting against it being a cancer gene (hence my posting here). It would be an incredible case of beginner's luck to find something that prevalent that no one has find before. As I said, this is one example of over a hundred. Resequencing seems like a decent suggestion. I'll check for pseudogenes or duplicated regions and post back.

Richard: It's more focused in my case, very little changes in the other bases. This sample 9/40 reads that have the mutation here. I'm glad to see that I'm not the only one seeing this.

Attached Files

igv_snapshot.jpg (38.1 KB, 83 views)

**Richard Finney** · 02-09-2011, 01:57 PM

I don't think it's real. The noise I see is too consistent.

1) Is this Illumina ?
2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?

**husamia** · 02-09-2011, 02:13 PM

fpepin: What is it? I think its a false positive due to misalignment. How to filter it? depends on first question. I think its a false positive because I also found it in an exom sample with similar ~45 depth, ELAND alignment. 30% of reads the variant was at ends. 10% there was other mismatches withing 5bps from this position. I have it annotated differently from your annotation as (c.832T>G het p.T278TP not T109G can you check your annotation?) are you using illumina? what alignment algorithm did you use?

csoong: i have found handful of pseudogenes that cause false positive calls one of which I was able to confirm with sequencing. My experience been two and more variants within close proximity from each other <10bps usually in same read. I have some examples but I don't want to go into details. What platform are you using? what alignment algorithm did you use?

thanks for this post. This has been worrying me and I havn't found a good way to handle it. Is there other than BWT algorithms that have been used for exom alignment. I have used two different algorithms with same results both use BWT methodology mainly because of large reference one of them is ELAND but I like to try another method of alignment.

**fpepin** · 02-09-2011, 03:20 PM

Originally posted by Richard Finney View Post

I don't think it's real. The noise I see is too consistent.

1) Is this Illumina ?
2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?

1) Yes, it's a GAII.
2) Yes, that is correct, in this case. Checking with some of my other problem genes, this seems to be the case in all but 1 (which I'm willing to believe is actually interesting). What made you think that this might be the case?
3) Generally. There is a trend toward the end of the reads (last 5 bases) and other cases where the quality drops sharply and comes back up but there are also a fair number of cases where the quality remains strong for a while after (10+ bases).

It also seems that some of these reads are mis-aligned, 2 of my problematic genes are on chromosome Y. They account for the totality of the mutations called on chromosome Y on my breast cancer cell lines. These genes also have a lot of mutations close together, leading me to believe that those were mapping errors.

Looking more in depth, it looks like I've got 16 genes with the strange one-strand-only mutations and 193 that are more likely to be due to mapping errors.

Is there an explanation as to what would cause a such a consistent artifact like this only on reads from a given strand over dozens of samples?

--edit--
husamia: The alignment is done using BWA. You are correct, position 109 is wrong, it is indeed 278. Looks like the annotation code that I inherited is wrong.

**Michael.James.Clark** · 02-09-2011, 03:33 PM

Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

Edit: Okay. Which variant caller then?

Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.

**Richard Finney** · 02-09-2011, 03:35 PM

I was getting some hints of a mutation there, in other Illumina samples I have.

Using bambino viewer tool, I see this :

ImageShack - Best place for all of your image hosting and image sharing needs

http://img9.imageshack.us/i/bambino.png/

Unlimited space to host images, easy to use image uploader, albums, photo hosting, sharing, dynamic image resizing on web and mobile.

The little red nib at top between the T and Y is a measure of snpness.
All of the snpness is from reads in one direction and they are all on the border of some quality valley. The quality is likely bad the extra two reads in , just not showing up as bad. I asked the bambino author (tool picture shown) and he said :

I see some reads have runs of nucleotide quality 2, which smells like the Illumina pipeline masking issue ...

Illumina FASTQ Quality Scores - Missing Value - SEQanswers

http://seqanswers.com/forums/showthread.php?t=4721

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

(comment from Torst, 04-24-2010, 09:21 PM)

I'm NOT sure if he's right or wrong,but that wall of quality drop off is very strange to look at. Maybe someone else knows what's up.

**fpepin** · 02-09-2011, 04:55 PM

Originally posted by Michael.James.Clark View Post

Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

Edit: Okay. Which variant caller then?

Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.

This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?

**Michael.James.Clark** · 02-09-2011, 05:09 PM

Originally posted by fpepin View Post

This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?

I think if you're seeing it with GATK, you'll probably see it with samtools, though you could try it.

Why not try clipping the ends off, realigning, and doing variant calling? Just trim five or ten bases off the end. This is something people often do.

Personally, I use Novoalign's hard clipping and adaptor stripping, and that seems to bump up specificity.

Also, for GATK, take a look at the -baq parameter (see their wiki). It's recommended to use that if you didn't.

**Bruins** · 02-10-2011, 01:26 AM

Were these data sequenced one sample per lane (which I expect), or multiple samples per lane?

Were other, non cancer cell line samples sequenced on the same run as cancer cell line samples? (I mean, in our facility there's more than one research group that uses the sequencer, there may be like 5 whole exomes for one group and 3 for another going on)

If that's the case, does that data show the same artefact?

**fpepin** · 02-10-2011, 10:51 AM

Bruins:This is 2 lanes per sample. I haven't checked the results from the individual lanes in my haste to get results, but I'm running it now.

Also, the runs are basically all cancer-related. I could look for some normals were run with the same protocols, but the artificiality of the having all the mutations in reads on one strand strongly suggest that the problem istechnical in nature.

Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

Having not followed the development of GATK, how important is it to be up to date?

**Michael.James.Clark** · 02-11-2011, 11:28 AM

Originally posted by fpepin View Post

Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

Having not followed the development of GATK, how important is it to be up to date?

It seems to be fairly important at this stage because of baq implementation and the -glm DINDEL feature of the UnifiedGenotyper (the UG is used to more sensitively detect indels than the IndelGenotyperV2!).

**fpepin** · 02-11-2011, 01:53 PM

Thanks Micheal, I'll update my pipeline then.

I also talked to more people about this issue and the answer is that this is a well-known issue when you have a specific tetramer (CCGG, if I remember correctly) that ends up causing bases to be skipped.

This explains why the differences are strand specific and why it happens in very specific places in the genome (including in completely different labs). The reads generally end up being clipped soon after because the indel, but there are cases that are repetitive enough that the few other bases match by chance, leading for the trimming to occur later and the read to be kept.

I'm not quite sure where would be best place to recognize this corner case and deal with it, but in my case I'll take the lazy way out and just filter them out at the end.

Thanks for all who offered suggestions.

Topics	Statistics	Last Post
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 20 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 25 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM

Seqanswers Leaderboard Ad

Announcement

very common mutation not in SNP databases

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News