**tahamasoodi** · 02-18-2013, 01:09 AM

There seems to be no response for my post. Is there anybody who can help?

**Zam** · 02-18-2013, 01:33 AM

Depends on your input data (what species, do you have data from multiple samples).
If these things are genuine SVs segregating in the population, then you should see them in many samples, and the two alleles should behave roughly the way you might expect (some people are hom-ref, some are hom-alt, and some are het).
This approach has been put into action in Genome STRiP and in the Cortex population/segregation filter, and it gives both of these methods a low FDR. So if you have data on many samples, or can go and genotype many samples (even 10 or 20 would do), look at the allele balance. Excess heterozygosity is a signal of artefacts caused by mismapping or by repeats missing from the reference genome.
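To make the allele-balance check concrete, here is a minimal sketch in Python. It assumes each sample has already been genotyped at the site and coded as the number of alt alleles (0, 1 or 2), and it compares the observed fraction of hets with the Hardy-Weinberg expectation 2pq. The function names and the 1.5x cutoff are made up for illustration; they are not taken from Genome STRiP or Cortex.

```python
def het_excess(alt_counts):
    """Return (observed_het, expected_het) fractions for one biallelic site.

    alt_counts: one entry per sample, the number of alt alleles (0, 1 or 2).
    expected_het is the Hardy-Weinberg expectation 2*p*q.
    """
    n = len(alt_counts)
    if n == 0:
        raise ValueError("need at least one genotyped sample")
    p_alt = sum(alt_counts) / (2 * n)          # alt allele frequency
    observed = sum(1 for g in alt_counts if g == 1) / n
    expected = 2 * p_alt * (1 - p_alt)
    return observed, expected


def looks_like_artefact(alt_counts, ratio=1.5):
    """Flag sites with far more hets than expected (ratio is an arbitrary cutoff)."""
    observed, expected = het_excess(alt_counts)
    return observed > ratio * expected


# Every sample het: typical of a mismapping artefact.
print(looks_like_artefact([1] * 10))
# Mix of hom-ref, het and hom-alt: behaves like a real segregating variant.
print(looks_like_artefact([0, 0, 0, 0, 1, 1, 1, 1, 2, 2]))
```

With only 10 or 20 samples a proper test (exact HWE test, for example) would be more defensible than a fixed ratio; the sketch only shows what is being compared.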

**tahamasoodi** · 02-18-2013, 02:05 AM

Thanks Zam,
I'm working on human data, with around 60 cancer samples. Finding the SVs common to all the samples is very difficult because, as I mentioned, I'm getting more than 10,000 variants per sample. How can I detect excess heterozygosity?

**tahamasoodi** · 02-19-2013, 11:45 PM

Will anybody give more details regarding this?

**Zam** · 02-19-2013, 11:56 PM

Hi there. I didn't realise you were talking about cancer samples.
1. The thing that bothers you does not seem to me to be the most difficult problem you face. 10,000 variants per sample is not a big deal, especially if you have 60 samples and you are looking for something shared by them all. Just look to see which of the 600,000 calls are present in every sample. Have you genotyped all your samples at all these called sites?
2. The fact that you are essentially sampling a pool from a population of cells which presumably have different genomes makes the problem much harder. Do you expect to have both normal and multiple tumour genomes in there?

By excess heterozygosity I meant: take one specific variant and look at how many of your samples carry both alleles of that variant. But anyway, the test I proposed really applies to germline variants in a population of humans; I wasn't thinking of cancer. To be honest, I think there are people reading this who are better qualified than me to help.

Good luck!

**tahamasoodi** · 02-20-2013, 12:11 AM

Thanks,
The problem I'm facing is that when I check individual SVs against the aligned BAM file in IGV, I can see that most of them are false positives. That means I would have to check all 60,000 variants one by one, which will take a very long time. Is there any way to filter out the false positives? I have both normal and cancer samples (60 pairs).

Thanks

**Zam** · 02-20-2013, 12:14 AM

Is that 10000 variants in tumour but not normal per sample then?

**tahamasoodi** · 02-20-2013, 12:43 AM

BreakDancer and GASV give only breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and the normal sample. A slight difference in breakpoints (say 1-50 base pairs) would still mean the variant is the same, but how can we identify that?

**Bukowski** · 02-20-2013, 01:47 AM

Originally posted by tahamasoodi

BreakDancer and GASV give only breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and the normal sample. A slight difference in breakpoints (say 1-50 base pairs) would still mean the variant is the same, but how can we identify that?

Convert them to bed files and use intersectBed from BedTools to identify regions with any overlap?
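For anyone who wants to see the logic rather than the tool, below is a rough Python sketch of what an overlap-based comparison does: a tumour call is dropped if any normal call on the same chromosome lies within a small tolerance of it. The tuple layout, the `tumour_only` name and the 50 bp `slop` default are assumptions for illustration. In practice intersectBed (with `-v` to report entries without an overlap) does the equivalent on real BED files and far faster.

```python
def overlaps(a, b, slop=0):
    """True if BED-style intervals a and b overlap once a is padded by slop.

    a, b: (chrom, start, end) with 0-based, half-open coordinates.
    """
    return a[0] == b[0] and a[1] - slop < b[2] and b[1] < a[2] + slop


def tumour_only(tumour, normal, slop=50):
    """Keep tumour calls that have no normal call within slop bases."""
    return [t for t in tumour
            if not any(overlaps(t, n, slop) for n in normal)]


tumour = [("chr1", 1000, 1200), ("chr2", 500, 700)]
normal = [("chr1", 1230, 1400)]   # 30 bp away from the first tumour call

print(tumour_only(tumour, normal, slop=50))
print(tumour_only(tumour, normal, slop=0))
```

The comparison here is all-against-all, which is fine for a sketch but slow for tens of thousands of calls per sample.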

**tahamasoodi** · 02-24-2013, 11:08 PM

Are there any more suggestions?

**LiLin** · 02-25-2013, 02:11 AM

1. BreakDancer (paired-end) + TIGRA-SV (local assembly) + cross_match (alignment) may filter out some false positives.
2. Another tool is CREST (split reads), which may give you true-positive somatic SV breakpoints for cancer research. You should also use paired-end reads to confirm the results. However, the SV type reported by CREST is sometimes not accurate. CREST can also detect SV breakpoints in a single sample, but I think you want somatic SVs.

**cwhelan** · 02-25-2013, 09:09 AM

My workflow usually goes like this:

1) convert calls to bedpe format (see the docs for BEDtools for examples)
2) use bedtools pairToPair to subtract germline SVs from the breakpoints from the cancer sample
3) Filter the remaining somatic SV candidates using BEDtools by removing any in which:

a) one end overlaps a simple or low complexity repeat
b) one end is in a segmental duplication
c) both ends match (with some slop) a breakpoint catalogued in the normal population by the 1000 Genomes Project (or your favorite set of previously validated SVs)

Then rank the remaining set by score/number of supporting read pairs and use a cutoff that gets them down to a reasonable number.
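A rough Python sketch of step 2 plus the final ranking, for illustration only. It assumes BEDPE-like records whose score column holds the number of supporting read pairs, and it treats two calls as the same SV when both ends lie within `slop` bases of each other (in either end order). The `Bedpe` class, the function names and the default values are hypothetical; on real data bedtools pairToPair would do the subtraction.

```python
from typing import NamedTuple


class Bedpe(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: float  # assumed: number of supporting read pairs


def _near(c1, s1, e1, c2, s2, e2, slop):
    """One breakpoint interval overlaps another after padding by slop."""
    return c1 == c2 and s1 - slop < e2 and s2 < e1 + slop


def ends_match(a, b, slop):
    """Both ends of a match both ends of b, in direct or swapped order."""
    direct = (_near(a.chrom1, a.start1, a.end1, b.chrom1, b.start1, b.end1, slop)
              and _near(a.chrom2, a.start2, a.end2, b.chrom2, b.start2, b.end2, slop))
    swapped = (_near(a.chrom1, a.start1, a.end1, b.chrom2, b.start2, b.end2, slop)
               and _near(a.chrom2, a.start2, a.end2, b.chrom1, b.start1, b.end1, slop))
    return direct or swapped


def somatic_candidates(tumour, normal, slop=50, min_score=0):
    """Drop tumour calls seen in the normal, apply a score cutoff, rank by score."""
    kept = [t for t in tumour
            if t.score >= min_score
            and not any(ends_match(t, n, slop) for n in normal)]
    return sorted(kept, key=lambda r: r.score, reverse=True)


tumour = [
    Bedpe("chr1", 100, 200, "chr1", 5000, 5100, "t1", 12),
    Bedpe("chr3", 10, 60, "chr7", 900, 950, "t2", 30),
    Bedpe("chr2", 1, 50, "chr2", 400, 450, "t3", 4),
]
normal = [Bedpe("chr1", 130, 230, "chr1", 5040, 5140, "n1", 9)]

for call in somatic_candidates(tumour, normal, slop=50, min_score=5):
    print(call.name, call.score)
```

The repeat and segmental-duplication filters in step 3 would slot in as extra conditions inside the list comprehension.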

Finally, try a local assembly method (TIGRA, etc.) or a method that can refine calls using split-read mappings (DELLY, etc.) to validate in silico. Sometimes I will also use a more sensitive aligner (like MEGABLAST) to find alternative concordant mappings for the supporting read pairs for each candidate SV.

**syfo** · 02-28-2013, 02:18 AM

Originally posted by Bukowski

Convert them to bed files and use intersectBed from BedTools to identify regions with any overlap?

Yes, I would also use something like this to rank the SVs by number of supporting samples and use that criterion in cwhelan's workflow above (after step c).
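As an illustration of counting supporting samples, here is a small Python sketch. It assumes each sample's calls are simple (chrom, start, end) intervals and counts, for every call, how many samples (including its own) have a call within `slop` bases. The names and the 50 bp default are hypothetical, and the all-against-all comparison is only meant to show the idea.

```python
def near(a, b, slop):
    """True if intervals a and b (chrom, start, end) overlap after padding by slop."""
    return a[0] == b[0] and a[1] - slop < b[2] and b[1] < a[2] + slop


def support_counts(calls_by_sample, slop=50):
    """Map (sample, call) to the number of samples with a matching call."""
    counts = {}
    for sample, calls in calls_by_sample.items():
        for call in calls:
            supporters = {
                other
                for other, other_calls in calls_by_sample.items()
                if any(near(call, c, slop) for c in other_calls)
            }
            counts[(sample, call)] = len(supporters)
    return counts


calls_by_sample = {
    "A": [("chr1", 100, 200)],
    "B": [("chr1", 120, 210)],
    "C": [("chr5", 1, 10)],
}
counts = support_counts(calls_by_sample)

# Rank calls from most to least widely supported.
for key, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(key, n)
```

Whether high support counts for or against a call depends on the question: a call seen in most samples, normals included, is more likely germline or a recurrent artefact than a somatic event.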

**wdemos** · 03-07-2014, 12:16 PM

I would like to use bedtools to complete step a as noted in cwhelan's post above. In the bedtools manual the recommended step is:
6.2.3 Retain only paired-end BAM alignments where neither end overlaps simple sequence repeats.
$ pairToBed -abam reads.bam -b SSRs.bed -type neither > reads.noSSRs.bam

I have 'somatic' SVs in BEDPE format from step 2. Can this be done with BEDPE input, or do I need to get that information into BAM format? Thanks.
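As far as I know, the bedtools documentation lists pairToBed as accepting BEDPE input through `-a` as an alternative to BAM through `-abam`, so something like `pairToBed -a somatic.bedpe -b SSRs.bed -type neither` (file names hypothetical) should work without converting to BAM. The Python sketch below only illustrates what the "neither end overlaps" filter means for a BEDPE record; the tuple layout and function name are assumptions for the example.

```python
def neither_end_in(record, regions):
    """True if neither end of a BEDPE-like record overlaps any region.

    record:  (chrom1, start1, end1, chrom2, start2, end2, ...)
    regions: list of (chrom, start, end), e.g. simple sequence repeats.
    Coordinates are 0-based, half-open, as in BED.
    """
    ends = [record[0:3], record[3:6]]
    return not any(
        chrom == r_chrom and start < r_end and r_start < end
        for chrom, start, end in ends
        for r_chrom, r_start, r_end in regions
    )


repeats = [("chr1", 150, 180)]
calls = [
    ("chr1", 100, 200, "chr1", 5000, 5100),   # first end sits on a repeat
    ("chr2", 100, 200, "chr1", 5000, 5100),   # neither end sits on a repeat
]

kept = [c for c in calls if neither_end_in(c, repeats)]
print(kept)
```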

Filtering out false positive structural variants
