  • tez
    Junior Member
    • Jul 2011
    • 4

    Structural variation detection using BreakDancer on Whole Genome SOLiD data

    Hello,

    I have been struggling for the last few weeks to get Breakdancer to run across some whole-genome data. The data was sequenced on SOLiD machines and aligned using Bioscope.

    I have been able to get Breakdancer to build a configuration file using the parameters for SOLiD (the -C color space option). The actual command looks like:

    bam2cfg.pl -n 1000000 -g -h -C normal.bam tumor.bam > breakdancer.cfg

    I am then able to run breakdancer_max using that config file as follows:

    breakdancer_max breakdancer.cfg -g output.GBrowse -d fast_q_evidence.o

    This command runs.. and runs.. and runs... and finally either runs out of memory or computation time.

    The last run I did ran for 100 hours, using 48GB of memory before the job was cancelled for running too long. The output of this was about 6.7 million "detected" structural variations. And it only just got up to chromosome 3!

    This leads me to believe it would need 1,000 hours or so of computation time (about 42 days!) to run fully, which is not feasible at the moment. At that rate it would also find 67 million SVs, which doesn't seem right!
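
The back-of-envelope estimate above can be written out as a small sketch. It assumes run time and call count scale linearly with the fraction of the genome processed, and takes that fraction to be roughly 10%, which is the value implied by the 1,000-hour figure rather than a measured number. The function name is illustrative.

```python
def extrapolate_run(hours_so_far, svs_so_far, fraction_done):
    """Linearly scale elapsed time and SV count to a full run."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = hours_so_far / fraction_done
    total_svs = svs_so_far / fraction_done
    return total_hours, total_hours / 24, total_svs

# Figures from the post: 100 hours and 6.7 million calls.
# fraction_done=0.1 is an assumed value, not a measurement.
hours, days, svs = extrapolate_run(100, 6.7e6, 0.1)
print(f"{hours:.0f} h (~{days:.0f} days), ~{svs / 1e6:.0f} million SVs")
```

If the first two chromosomes make up a larger share of the genome than 10%, the same function gives a proportionally smaller total, so the estimate is only as good as the assumed fraction.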

    Is this in line with anyone else's experience?

    The tumor and normal files are 120GB and 180GB each, so I don't expect it to be a fast process, but 40 days seems excessive.

    I have also attempted to run Breakdancer in single chromosome mode, but this fails with a segmentation fault immediately.

    Has anyone been able to get the single chromosome version to work? Or know why it would segfault?


    Thank you.
  • tez
    Junior Member
    • Jul 2011
    • 4

    #2
    I have now also seen that there is a "-r" option for setting the minimum number of read-pairs required to call an SV.

    There isn't much mention of this in the manual, but looking through the source code I see it defaults to 2, which would explain the huge number of results, the poor run time and the memory usage.

    Does anyone have any experience with this parameter? Our data is supposed to be at ~30x depth. I am now giving it a try at min_read_pair=10, and I'll let you know how it goes.

    Cheers

    • aquinom85
      Research Bioinformaticist
      • Dec 2011
      • 19

      #3
      How did things turn out after tweaking that parameter? I'm looking into BreakDancer too, but there is no FAQ and it's rather hard to get a clear picture of the software's limitations. Do you know if BreakDancer calls samples jointly, or do you have to run it on each sample and then cross-validate the results?

      • tez
        Junior Member
        • Jul 2011
        • 4

        #4
        Hello,

        The results did not look good at all. Basically it called about 10,000 structural variations in the "normal" sample, and about 1,300 in the "tumour" sample.

        The only way I could get these results was to run BreakDancer with the -r 10 option, and then to break each whole genome down into chromosomes and run each chromosome separately. Even then it was still a 3-4 day process, running them all in parallel on a fairly powerful cluster.
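
A minimal sketch of how such a per-chromosome run could be scripted is below. It only builds the command strings and does not execute anything. Using `samtools view` to split the BAM files is an assumption (the post does not say which tool was used), as are the function name, the output file names and the chromosome names. The bam2cfg.pl options are copied from the first post, and -r is the option discussed above.

```python
import shlex

def per_chromosome_commands(bams, chrom, min_pairs=10):
    """Build (but do not run) the shell commands for one chromosome."""
    q = shlex.quote
    sub_bams = []
    commands = []
    for bam in bams:
        sub_bam = f"{bam.rsplit('.', 1)[0]}.{chrom}.bam"
        sub_bams.append(sub_bam)
        # Assumed splitting step: needs a coordinate-sorted, indexed BAM.
        commands.append(f"samtools view -b -o {q(sub_bam)} {q(bam)} {q(chrom)}")
    cfg = f"breakdancer.{chrom}.cfg"
    out = f"breakdancer.{chrom}.sv"
    joined = " ".join(q(b) for b in sub_bams)
    # Same bam2cfg.pl options as the command in the first post.
    commands.append(f"bam2cfg.pl -n 1000000 -g -h -C {joined} > {q(cfg)}")
    # -r is the minimum read-pair option discussed in the thread.
    commands.append(f"breakdancer_max -r {min_pairs} {q(cfg)} > {q(out)}")
    return commands

for chrom in ["chr1", "chr2"]:  # hypothetical chromosome names
    print("\n".join(per_chromosome_commands(["normal.bam", "tumor.bam"], chrom)))
```

Each chromosome's command list is independent of the others, so the lists can be submitted as separate cluster jobs. Note that read pairs whose mates map to different chromosomes are lost by this kind of split, so inter-chromosomal events would not be detected.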

        Looks like the biggest issue is data quality. The alignment / mapping was not done by us, and it looks like it may contain quite a lot of noise. So we are now experimenting with different ways to "clean" up the data.

        Cheers

        • P-Richmond
          Member
          • Oct 2010
          • 13

          #5
          Any luck in "cleaning up the data"? I have a similar problem, but I'm working in S. cerevisiae and keep running across artifacts of the alignments I'm using (read pairs that map to familial genes, i.e. genes with very high sequence identity on different chromosomes).

          One possible methodology would be to generate reads from a perfect genome, then run through breakdancer and call that the noise model. I have a system in place for this read generation if you are interested in trying that. Then by simply creating an intersect with the calls from your data, you could produce a set that is more likely to be structural variations that aren't simply artifacts of the alignment or the underlying sequence.
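
The intersect step could look roughly like the sketch below. It takes one reading of the idea: calls that also appear in the simulated "noise" set are treated as likely artifacts, and the rest are kept as candidates. The record layout, the `slop` tolerance and all names are assumptions for illustration, and the coordinates are made up.

```python
from collections import namedtuple

SV = namedtuple("SV", "chrom start end sv_type")

def overlaps(a, b, slop=0):
    """True if two calls share chromosome and type and their intervals overlap."""
    return (a.chrom == b.chrom and a.sv_type == b.sv_type
            and a.start <= b.end + slop and b.start <= a.end + slop)

def split_by_noise(calls, noise_calls, slop=0):
    """Partition calls into (likely artifacts, remaining candidates)."""
    artifacts, candidates = [], []
    for call in calls:
        if any(overlaps(call, n, slop) for n in noise_calls):
            artifacts.append(call)
        else:
            candidates.append(call)
    return artifacts, candidates

# Toy coordinates, made up for illustration only.
noise = [SV("chrIV", 1000, 1500, "DEL")]
real = [SV("chrIV", 1200, 1600, "DEL"), SV("chrVII", 5000, 5400, "INV")]
artifacts, candidates = split_by_noise(real, noise, slop=100)
print(len(artifacts), len(candidates))
```

The pairwise comparison is quadratic in the number of calls, which is fine for a yeast-sized call set; for millions of calls a dedicated interval tool would be the better choice.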

          -Phil

          • aquinom85
            Research Bioinformaticist
            • Dec 2011
            • 19

            #6
            I just ran BreakDancer on one human genome sample and got 29,500 SVs called. In my naive opinion this seems outrageously high, so I think I'll try raising the -r value. Does anyone know what a normal number of SVs is in a human genome, for comparison? Also, how should the confidence score be interpreted in general?
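
Before re-running with a higher -r, a stricter cutoff can be previewed by filtering the existing output. The sketch below assumes a tab-separated file in which the score is the 9th column and the supporting read-pair count is the 10th; that layout is an assumption and should be checked against the header line of your own output. The function name and the example rows are made up, and filtering after the fact is not guaranteed to match what a fresh run with a higher -r would report.

```python
def filter_calls(lines, min_reads=10, min_score=0, reads_col=9, score_col=8):
    """Yield data lines whose read-pair support and score meet the cutoffs."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[reads_col]) >= min_reads and float(fields[score_col]) >= min_score:
            yield line

# Two made-up rows in the assumed layout.
example = [
    "#Chr1\tPos1\tOrientation1\tChr2\tPos2\tOrientation2\tType\tSize\tScore\tnum_Reads\n",
    "chr1\t1000\t10+0-\tchr1\t2000\t0+10-\tDEL\t950\t99\t12\n",
    "chr1\t5000\t2+0-\tchr1\t5600\t0+2-\tDEL\t550\t40\t2\n",
]
kept = list(filter_calls(example, min_reads=10, min_score=50))
print(len(kept))
```

Trying a few values of `min_reads` and `min_score` on the same file gives a quick feel for how sharply the call count drops as the cutoffs rise.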
