  • tahamasoodi
    Success
    • May 2012
    • 130

    Filtering out false positive structural variants

    Hi,
    I'm using BreakDancer and GASV for SV prediction, but they output a huge number of SVs, most of which are false positives (e.g., for one whole genome I'm getting around 10,000 SVs). Is there any way to filter out the false positives?

    Thanks,
  • tahamasoodi
    Success
    • May 2012
    • 130

    #2
    There seems to be no response to my post. Is there anybody who can help?
    Thanks,


    • Zam
      Member
      • Apr 2010
      • 51

      #3
      It depends on your input data (what species? do you have data from multiple samples?).
      If these things are genuine SVs segregating in the population, then you should see them in many samples, and the two alleles should behave roughly the way you might expect (some people are hom-ref, some are hom-alt, and some are het).
      This approach has been put into action in Genome STRiP and in the Cortex population/segregation filter, and results in both of these methods having a low FDR. So if you have data on many samples, or can go and genotype many samples (even 10 or 20 would do), look at the allele balance. Excess heterozygosity is a signal of artefacts caused by mismapping or by repeats missing from the reference genome.
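
A minimal sketch of the allele-balance check described in this post, assuming germline variants and assuming each sample has already been genotyped at one SV site as 0, 1 or 2 copies of the alternate allele. The function names and the `ratio` cutoff are hypothetical and chosen only for illustration; the expected heterozygote count uses plain Hardy-Weinberg proportions, which is one simple way to make "excess heterozygosity" concrete, not necessarily what Genome STRiP or Cortex do internally.

```python
from collections import Counter

def het_excess(genotypes):
    """Return (observed_het, expected_het) for one SV site.

    genotypes: alt-allele counts per sample (0 = hom-ref, 1 = het, 2 = hom-alt).
    expected_het is the Hardy-Weinberg expectation 2 * p * q * n.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotyped sample")
    counts = Counter(genotypes)
    p_alt = (counts[1] + 2 * counts[2]) / (2 * n)  # alt allele frequency
    expected_het = 2 * p_alt * (1 - p_alt) * n
    return counts[1], expected_het

def looks_like_artefact(genotypes, ratio=1.5):
    """Flag a site when observed hets exceed expectation by `ratio`.

    `ratio` is an arbitrary illustrative cutoff, not a recommended value.
    """
    observed, expected = het_excess(genotypes)
    return expected > 0 and observed / expected > ratio
```

A site where every sample is called heterozygous is the typical pattern this flags; with only 10 or 20 samples a formal test would have little power, so a crude ratio like this is only a first-pass screen.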


      • tahamasoodi
        Success
        • May 2012
        • 130

        #4
        Thanks Zam,
        I'm working on human data, with around 60 cancer samples. As I mentioned, I'm getting more than 10,000 variants per sample, so it is very difficult to find the SVs common to all the samples. How can I detect excess heterozygosity?
        Thanks,


        • tahamasoodi
          Success
          • May 2012
          • 130

          #5
          Will anybody give more details regarding this?
          Thanks,


          • Zam
            Member
            • Apr 2010
            • 51

            #6
            Hi there. I didn't realise you were talking about cancer samples.
            1. The thing that bothers you does not seem to me to be the most difficult problem you face. 10,000 variants per sample does not seem like a big deal, especially if you have 60 samples and you are looking for something shared by them all. Just look to see which of the 600,000 calls are present in all of them. Have you genotyped all your samples at all these called sites?
            2. The fact that you are essentially sampling a pool from a population of cells, which presumably have different genomes, makes the problem much harder. Do you expect to have both normal and multiple tumour genomes in there?

            By excess heterozygosity I meant: take one specific variant and look to see how many of your samples carry both alleles of that variant. But anyway, the test I proposed is really applicable to germline variants in a population of humans; I wasn't thinking of cancer. To be honest, I think there are people reading this who are better qualified than me to help.

            Good luck!


            • tahamasoodi
              Success
              • May 2012
              • 130

              #7
              Thanks,
              The problem I'm facing is that when I check individual SVs against the aligned BAM file in IGV, I can see that most of them are false positives. That means I would have to check all 60,000 variants one by one, which would take a very long time. Is there any way to filter out the false positives? I have both normal and cancer samples (60 pairs).

              Thanks,


              • Zam
                Member
                • Apr 2010
                • 51

                #8
                Is that 10,000 variants per sample that are in the tumour but not in the normal, then?


                • tahamasoodi
                  Success
                  • May 2012
                  • 130

                  #9
                  BreakDancer and GASV only report breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and normal samples. A slight difference in breakpoints (say 1-50 base pairs) probably means the variant is the same, but how can we identify that?
                  Thanks,


                  • Bukowski
                    Senior Member
                    • Jan 2010
                    • 388

                    #10
                    Originally posted by tahamasoodi
                    BreakDancer and GASV only report breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and normal samples. A slight difference in breakpoints (say 1-50 base pairs) probably means the variant is the same, but how can we identify that?
                    Convert them to BED files and use intersectBed from BEDTools to identify regions with any overlap?
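
To make the "any overlap" idea concrete, here is a small Python sketch of the interval test that this kind of comparison relies on, with an optional padding so that calls whose coordinates differ by a few dozen bases still count as the same event. It is an illustration of the logic only, not a replacement for intersectBed; the function names are hypothetical, the default `slop` of 50 simply mirrors the 1-50 bp difference mentioned in the question, and intervals are assumed to be 0-based, half-open `(chrom, start, end)` tuples as in BED.

```python
def overlaps(a, b, slop=0):
    """True if BED-style intervals a and b overlap after padding
    both sides of `a` by `slop` bases."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    return max(0, start_a - slop) < end_b and start_b < end_a + slop

def tumour_only(tumour_calls, normal_calls, slop=50):
    """Keep tumour intervals that overlap no interval from the normal.

    Quadratic in the number of calls, which is fine for a sketch;
    an indexed tool such as BEDTools scales much better.
    """
    return [t for t in tumour_calls
            if not any(overlaps(t, n, slop) for n in normal_calls)]
```

For example, a tumour call at `("chr1", 1000, 2000)` and a normal call at `("chr1", 2030, 3000)` do not overlap strictly, but are treated as the same event once 50 bp of padding is allowed.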


                    • tahamasoodi
                      Success
                      • May 2012
                      • 130

                      #11
                      Are there any more suggestions?
                      Thanks,


                      • LiLin
                        Member
                        • May 2011
                        • 15

                        #12
                        1. BreakDancer (paired-end) + TIGRA-SV (local assembly) + cross_match (alignment) may filter out some false positives.
                        2. Another tool is CREST (split reads), which can identify somatic SV breakpoints for cancer studies. You should also use paired-end reads to confirm the results. However, the SV type reported by CREST is sometimes not exact. CREST can also detect SV breakpoints in a single sample, but I think you want somatic SVs.


                        • cwhelan
                          Member
                          • Nov 2010
                          • 23

                          #13
                          My workflow usually goes like this:

                          1) Convert calls to BEDPE format (see the BEDTools docs for examples).
                          2) Use bedtools pairToPair to subtract germline SVs from the breakpoints called in the cancer sample.
                          3) Filter the remaining somatic SV candidates using BEDTools by removing any in which:

                          a) one end overlaps a simple or low-complexity repeat
                          b) one end is in a segmental duplication
                          c) both ends match (with some slop) a breakpoint categorized in the normal population by the 1000 Genomes Project (or your favorite set of previously validated SVs)

                          Then rank the remaining set by score/number of supporting read pairs and use a cutoff that gets them down to a reasonable number.

                          Finally, try a local assembly method (TIGRA, etc.) or a method that can refine calls using split-read mappings (DELLY, etc.) to validate in silico. Sometimes I will also use a more sensitive aligner (like MEGABLAST) to find alternative concordant mappings for the supporting read pairs of each candidate SV.
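
The filtering and ranking part of this workflow (steps 2 and 3, then the ranking by support) can be sketched in pure Python as follows. This is only an illustration of the logic under stated assumptions, not the BEDTools commands themselves: the `Call` record, the function names, the default `slop` of 50 and the `min_support` parameter are all hypothetical, repeats and segmental duplications are passed together as one list of `(chrom, start, end)` regions, and a match is accepted with the two ends in either order.

```python
from collections import namedtuple

# Hypothetical record for one BEDPE call; `support` is the number of
# supporting read pairs (or any score you want to rank by).
Call = namedtuple("Call", "chrom1 start1 end1 chrom2 start2 end2 support")

def _overlap(chrom_a, start_a, end_a, chrom_b, start_b, end_b, slop=0):
    """Half-open interval overlap on the same chromosome, padded by slop."""
    return chrom_a == chrom_b and start_a - slop < end_b and start_b < end_a + slop

def both_ends_match(a, b, slop=50):
    """True if both ends of call a match both ends of call b (either order)."""
    same = (_overlap(a.chrom1, a.start1, a.end1, b.chrom1, b.start1, b.end1, slop)
            and _overlap(a.chrom2, a.start2, a.end2, b.chrom2, b.start2, b.end2, slop))
    swapped = (_overlap(a.chrom1, a.start1, a.end1, b.chrom2, b.start2, b.end2, slop)
               and _overlap(a.chrom2, a.start2, a.end2, b.chrom1, b.start1, b.end1, slop))
    return same or swapped

def either_end_in(call, regions):
    """True if either end of the call overlaps any (chrom, start, end) region."""
    ends = [(call.chrom1, call.start1, call.end1),
            (call.chrom2, call.start2, call.end2)]
    return any(_overlap(*end, *region) for end in ends for region in regions)

def somatic_candidates(tumour, normal, masked_regions, known_svs,
                       slop=50, min_support=0):
    """Apply the filters in order and rank survivors by support, highest first."""
    kept = []
    for call in tumour:
        if any(both_ends_match(call, n, slop) for n in normal):
            continue  # step 2: also present in the matched normal
        if either_end_in(call, masked_regions):
            continue  # steps 3a/3b: an end falls in a repeat or segdup
        if any(both_ends_match(call, k, slop) for k in known_svs):
            continue  # step 3c: matches a previously described SV
        if call.support >= min_support:
            kept.append(call)
    return sorted(kept, key=lambda c: c.support, reverse=True)
```

The survivors of this pass are the candidates that would then go on to local assembly or split-read refinement.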


                          • syfo
                            Just a member
                            • Nov 2012
                            • 103

                            #14
                            Originally posted by Bukowski
                            Convert them to BED files and use intersectBed from BEDTools to identify regions with any overlap?
                            Yes, I would also use something like this to rank the SVs by number of supporting samples and use that criterion in cwhelan's workflow above (after step c).
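
One way to get a "number of supporting samples" per SV is to pool the calls from all samples, merge calls that overlap or lie within a small distance of each other, and count the distinct samples in each merged cluster. The Python sketch below does that with a simple greedy merge; the function name, the `(sample, chrom, start, end)` input layout and the default `slop` of 50 are assumptions made for illustration.

```python
def count_supporting_samples(calls, slop=50):
    """Cluster calls across samples and rank clusters by sample count.

    calls: iterable of (sample, chrom, start, end) tuples.
    Returns (chrom, start, end, n_samples) tuples, most-supported first.
    Greedy single-linkage merging: a call joins the current cluster if it
    starts within `slop` bases of the cluster's current end.
    """
    clusters = []  # each entry: [chrom, start, end, set_of_samples]
    for sample, chrom, start, end in sorted(calls, key=lambda c: (c[1], c[2], c[3])):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start <= last[2] + slop:
            last[2] = max(last[2], end)
            last[3].add(sample)
        else:
            clusters.append([chrom, start, end, {sample}])
    ranked = [(chrom, start, end, len(samples))
              for chrom, start, end, samples in clusters]
    return sorted(ranked, key=lambda r: r[3], reverse=True)
```

Because the merge is single-linkage, a chain of slightly shifted calls can grow into one long cluster, so a small `slop` is the safer choice.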


                            • wdemos
                              Member
                              • Jun 2012
                              • 31

                              #15
                              I would like to use BEDTools to complete step a as noted in cwhelan's post above. In the BEDTools manual the recommended step is:
                              6.2.3 Retain only paired-end BAM alignments where neither end overlaps simple sequence repeats.
                              $ pairToBed -abam reads.bam -b SSRs.bed -type neither > reads.noSSRs.bam

                              I have 'somatic' SVs in BEDPE format from step 2. Can this be done with a BEDPE file, or do I need to get that information into BAM format? Thanks.
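
For what it's worth, the pairToBed usage text lists `-a` as taking a BEDPE file, with `-abam` as the BAM alternative, so the same `-type neither` filter should be applicable to BEDPE input directly; this is worth confirming against the documentation of the installed version. The selection logic itself is simple enough to sketch in Python, shown here as an illustration rather than a substitute for BEDTools. The function name is hypothetical, and coordinates are assumed to be 0-based and half-open as in BED.

```python
def neither_end_overlaps(bedpe_rows, repeats):
    """Keep BEDPE rows where neither end overlaps any repeat interval.

    bedpe_rows: tuples whose first six fields are
                (chrom1, start1, end1, chrom2, start2, end2); extra
                fields such as a name or score are carried through.
    repeats:    (chrom, start, end) tuples, e.g. simple sequence repeats.
    """
    def hits(chrom, start, end):
        # True if this end overlaps at least one repeat interval.
        return any(chrom == r_chrom and start < r_end and r_start < end
                   for r_chrom, r_start, r_end in repeats)

    return [row for row in bedpe_rows
            if not hits(*row[0:3]) and not hits(*row[3:6])]
```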

