  • mrawlins
    Member
    • Apr 2010
    • 63

    Visualization Tools for Large Datasets

    We have a whole transcriptome dataset from a SOLiD sequencer that is a 100GB .bam file. In some places there is a read depth of greater than 1x10^7 reads. We have not been able to find a tool able to visualize this amount of data. IGV, MagicViewer, Tablet and Artemis have all died when looking at those portions of the genome (which for this experiment contain our genes of interest). Our visualization testing was done allowing up to 12GB of RAM, though we could probably get that up close to 20GB for a few tests.

    Is there a tool that can visualize this sort of data directly? If so, what kind of memory requirements would it have for this much data?
    Are there tools to pre-process or distill the data down to a visualizable summary?
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    I am unaware of a tool that can handle 10 million-fold coverage. How about doing some data reduction by removing all reads that start at the same position (and/or have the same sequence)? You can do the former with Picard's MarkDuplicates. Then you are at least able to visualize some of the data.
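
To make the "same start position" reduction concrete, here is a minimal sketch in plain Python. It is not Picard's actual algorithm, and the `Read` record and `drop_same_start` helper are hypothetical names invented for illustration. It assumes each alignment has already been reduced to a reference name, a leftmost position, a strand flag and a sequence.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record (not a real BAM library type)."""
    ref: str        # reference/contig name
    start: int      # leftmost aligned position
    reverse: bool   # True if aligned to the reverse strand
    seq: str        # read sequence


def drop_same_start(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only the first read seen for each (ref, start, strand) key."""
    seen = set()
    for read in reads:
        key = (read.ref, read.start, read.reverse)
        if key not in seen:
            seen.add(key)
            yield read


if __name__ == "__main__":
    reads = [
        Read("chr1", 100, False, "ACGT"),
        Read("chr1", 100, False, "ACGA"),  # same start and strand: dropped
        Read("chr1", 100, True, "ACGT"),   # other strand: kept
        Read("chr1", 101, False, "ACGT"),
    ]
    for kept in drop_same_start(reads):
        print(kept)
```

Keyed this way, at most two reads survive per position (one per strand), so depth is capped at roughly twice the read length. For transcriptome data that also discards the expression signal, so the result is probably better suited to inspecting alignments than to comparing abundance.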


    • jkbonfield
      Senior Member
      • Jul 2008
      • 146

      #3
      The problem is that a lot of the layout algorithms really slow down on deep data. I'm intrigued to know how well my own code works on this so I'll experiment some, but I suspect with that much depth you're basically going to really struggle with all tools.

      One solution is simply to sample the deep regions at random so you get a representative set. Duplicate removal, as suggested, may be a better option, but it may take a long time to run.
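
One way to do the random sampling in a single pass, without knowing the depth in advance, is reservoir sampling. The sketch below is only an illustration of that technique; `reservoir_sample` is a hypothetical helper, and it works on any iterable of records rather than on a BAM file directly.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Stand-in for a stream of reads from one deep region.
    reads = (f"read_{n}" for n in range(1_000_000))
    subset = reservoir_sample(reads, k=1000, seed=42)
    print(len(subset))
```

Memory use depends on `k` rather than on the number of reads in the region, so the same code works whether a region holds thousands or tens of millions of reads.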

      James


      • jkbonfield
        Senior Member
        • Jul 2008
        • 146

        #4
        So I did a test using gap5 with a short repeated section of a genome, artificially made by replicating the same sequences so it compresses overly well and isn't the optimal test.

        It was 94bp long, with 5.8 million sequences (mostly 36bp) and a peak depth of around 4 million.

        To open up the assembly (note NOT bam format but gap5's own) and view the "template display" showing all 5.8 million reads in a LookSeq style plot took 5 seconds and a shade under 1GB of memory. I'm guessing LookSeq itself would be similarly fast if you convert the bam file to LookSeq's own sqlite format instead. Note that the speed of this plot is very much proportional to the number of objects visible and not their depth. So 10 million deep for 50bp is fine, as long as it's not 10 million deep for a 10kb region. How many sequences do you think would be visible on your plots?
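
A back-of-the-envelope way to answer that question is sketched below. It assumes, unrealistically, that depth is uniform across the window and that all reads have the same length; `estimate_visible_reads` is a hypothetical helper, and the 36bp read length is borrowed from the test above rather than from the original dataset.

```python
def estimate_visible_reads(depth: float, window_bp: int, read_len: int) -> float:
    """Estimate how many reads overlap a window under uniform depth.

    With uniform depth, about depth / read_len reads start at each base.
    A read overlaps the window if it starts at any of the
    window_bp + read_len - 1 positions from (read_len - 1) bases before
    the window through the last base of the window.
    """
    reads_per_start = depth / read_len
    start_positions = window_bp + read_len - 1
    return reads_per_start * start_positions


if __name__ == "__main__":
    for window in (50, 10_000):
        n = estimate_visible_reads(depth=1e7, window_bp=window, read_len=36)
        print(f"{window} bp window: ~{n:,.0f} reads")
```

Under these assumptions a 50bp window at 10 million deep holds roughly 24 million reads, while a 10kb window at the same depth holds roughly 2.8 billion, which is why the width of the deep region matters more than the peak depth.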

        The template display "stacking" mode, where sequences get displayed with a Y coordinate attempting to prevent them all overlapping, takes longer as it has to run a layout algorithm to allocate Y values; but not unusably so - an extra 20 seconds to display.

        To display the contig editor however was tragically sluggish, taking over 2 mins to just come up and rather annoyingly 40ish seconds to highlight what's under the mouse or 1min to scroll a base to the right. Not particularly usable at that level.

        The reason there's a difference between graphically drawing the sequences and actually displaying the alignments is due to how gap5 stores the data. The location, orientation, read-pairing and a few flags for sequences get stored together in the recursive binning system. The actual sequences, quality values and read-names are elsewhere - in the main sequence structures. Typically around only 5% of the database is consumed by the positional binning arrays, although this may differ for extreme depth cases - I haven't checked.

        In contrast, when viewing a bam file you'd need to load the entire data as it's all mingled together. This is why I think both Gap5 and LookSeq are worth investigation. (I believe LookSeq, if not running in bam mode, also only stores location information and not the seqs/qual).
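
To illustrate why keeping only positional information is so much cheaper, the sketch below builds a per-base depth profile from nothing but (start, length) pairs. This is not how Gap5 or LookSeq work internally; `depth_profile` is a hypothetical helper, and the difference-array technique is just one simple way to distill reads into a plottable summary.

```python
from typing import Iterable, List, Tuple


def depth_profile(intervals: Iterable[Tuple[int, int]],
                  region_start: int, region_end: int) -> List[int]:
    """Per-base depth over [region_start, region_end) from (start, length) pairs."""
    size = region_end - region_start
    # diff[i] holds the change in depth when moving onto base i.
    diff = [0] * (size + 1)
    for start, length in intervals:
        # Clip each read to the region, in region-relative coordinates.
        lo = max(start, region_start) - region_start
        hi = min(start + length, region_end) - region_start
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    depth: List[int] = []
    running = 0
    for delta in diff[:size]:
        running += delta
        depth.append(running)
    return depth


if __name__ == "__main__":
    # Three reads given only as (start, length); no sequence or quality needed.
    reads = [(0, 4), (2, 4), (10, 3)]
    print(depth_profile(reads, region_start=0, region_end=8))
    # prints [1, 1, 2, 2, 1, 1, 0, 0]
```

Memory here scales with the width of the region rather than with the number of reads, so a 10 million deep pile-up costs no more to summarize than a shallow one once the positions have been extracted.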

        James


        • imilne
          Member
          • Jan 2010
          • 68

          #5
          I've just run a quick test with Tablet - again with simulated data (I made it load the same set of reads over and over again until it had 10 million of them), and although it took a minute or so to load them from a BAM file, packing only took 5 or 6s, and it was fine during display too.

          Without access to the actual data though, it'll be hard for any of us to genuinely replicate the problem and try to fix it for you.

          Iain
          Our software: Tablet | Flapjack | Strudel | CurlyWhirly | TOPALi

