**nilshomer** · 04-26-2010, 01:10 PM

Originally posted by mrawlins

We have a whole transcriptome dataset from a SOLiD sequencer that is a 100GB .bam file. In some places there is a read depth of greater than 1x10^7 reads. We have not been able to find a tool able to visualize this amount of data. IGV, MagicViewer, Tablet and Artemis have all died when looking at those portions of the genome (which for this experiment contain our genes of interest). Our visualization testing was done allowing up to 12GB of RAM, though we could probably get that up close to 20GB for a few tests.

Is there a tool that can visualize this sort of data directly? If so, what kind of memory requirements would it have for this much data?
Are there tools to pre-process or distill the data down to a visualizable summary?

I am unaware of a tool that can handle 10-million-fold coverage. How about doing some data reduction by removing all reads that start at the same position (and/or have the same sequence)? You can do the former with Picard's MarkDuplicates. Then you would at least be able to visualize some of the data.
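To make the idea concrete, here is a minimal sketch of position-based data reduction in plain Python. It is a simplified stand-in and not what Picard's MarkDuplicates actually does (Picard considers more than the start coordinate). The read tuple layout `(chrom, start, reverse, seq)` and the function name `dedup_by_start` are assumptions made for illustration.

```python
def dedup_by_start(reads, also_by_sequence=False):
    """Keep the first read seen for each (chrom, start, strand) key.

    reads: iterable of (chrom, start, reverse, seq) tuples (hypothetical layout).
    also_by_sequence: if True, reads at the same position are only treated as
    duplicates when their sequences are identical too.
    """
    seen = set()
    kept = []
    for chrom, start, reverse, seq in reads:
        key = (chrom, start, reverse)
        if also_by_sequence:
            key = key + (seq,)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, reverse, seq))
    return kept


reads = [
    ("chr1", 100, False, "ACGT"),
    ("chr1", 100, False, "ACGT"),  # same start, same sequence
    ("chr1", 100, False, "ACGA"),  # same start, different sequence
    ("chr1", 100, True, "ACGT"),   # same start, opposite strand
    ("chr1", 101, False, "CGTA"),
]
by_position = dedup_by_start(reads)
by_position_and_seq = dedup_by_start(reads, also_by_sequence=True)
```

Memory use grows with the number of distinct keys rather than with raw depth, which is the point of the reduction for very deep regions.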

**jkbonfield** · 04-27-2010, 03:05 AM

The problem is that a lot of the layout algorithms really slow down on deep data. I'm intrigued to know how well my own code works on this so I'll experiment some, but I suspect with that much depth you're basically going to really struggle with all tools.

One solution is simply to sample the deep regions at random so you get a representative set. Duplicate removal, as suggested, may be a better option, but it may take a long time to run.
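As a rough illustration of the random sampling idea, the sketch below caps the number of reads kept per window using reservoir sampling, so shallow regions are untouched and deep regions are thinned uniformly. The read layout `(chrom, start, ...)`, the function name and the default window and cap sizes are all assumptions for the example, not taken from any particular tool.

```python
import random
from collections import defaultdict


def downsample_by_window(reads, window=100, max_per_window=1000, seed=0):
    """Keep at most max_per_window reads per (chrom, window) bucket.

    reads: iterable of tuples whose first two fields are (chrom, start).
    Uses reservoir sampling (Algorithm R), so each read in a bucket has the
    same chance of being kept and the input is streamed only once.
    """
    rng = random.Random(seed)
    kept = defaultdict(list)  # bucket -> sampled reads
    seen = defaultdict(int)   # bucket -> number of reads seen so far
    for read in reads:
        key = (read[0], read[1] // window)
        seen[key] += 1
        bucket = kept[key]
        if len(bucket) < max_per_window:
            bucket.append(read)
        else:
            j = rng.randrange(seen[key])
            if j < max_per_window:
                bucket[j] = read
    return [read for key in sorted(kept) for read in kept[key]]


deep = [("chr1", 5, i) for i in range(10000)]    # very deep pile-up
shallow = [("chr1", 500, i) for i in range(3)]   # ordinary coverage
sampled = downsample_by_window(deep + shallow, window=100, max_per_window=100)
```

Only the sampled reads are held in memory, so the cost is bounded by the cap times the number of windows rather than by the peak depth.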

James

**jkbonfield** · 04-27-2010, 03:45 AM

So I did a test using gap5 with a short repeated section of a genome. It was made artificially by replicating the same sequences, so it compresses unrealistically well and isn't an ideal test.

It was 94bp long, with 5.8 million sequences (mostly 36bp) and a peak depth of around 4 million.

To open up the assembly (note NOT bam format but gap5's own) and view the "template display" showing all 5.8 million reads in a LookSeq style plot took 5 seconds and a shade under 1GB of memory. I'm guessing LookSeq itself would be similarly fast if you convert the bam file to LookSeq's own sqlite format instead. Note that the speed of this plot is very much proportional to the number of objects visible and not their depth. So 10 million deep for 50bp is fine, as long as it's not 10 million deep for a 10kb region. How many sequences do you think would be visible on your plots?

The template display "stacking" mode, where sequences get displayed with a Y coordinate attempting to prevent them all overlapping, takes longer as it has to run a layout algorithm to allocate Y values; but not unusably so - an extra 20 seconds to display.
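For readers unfamiliar with this kind of layout step, here is a minimal sketch of one common way to allocate Y values: a greedy interval packing that reuses a row as soon as the read on it has ended. This is a generic technique and an assumption on my part, not a description of gap5's actual algorithm. Intervals are half-open `(start, end)` pairs and the function name is hypothetical.

```python
import heapq


def assign_rows(reads):
    """Assign each (start, end) interval a row so that no two overlap.

    Returns a list of row numbers in the same order as the input.
    Sorting costs O(n log n) and each read does one heap push/pop,
    which is why stacking is slower than a plain unstacked plot.
    """
    order = sorted(range(len(reads)), key=lambda i: reads[i][0])
    rows = [0] * len(reads)
    busy = []    # min-heap of (end_of_last_read, row)
    n_rows = 0
    for i in order:
        start, end = reads[i]
        if busy and busy[0][0] <= start:
            # The row that frees up earliest is already free: reuse it.
            _, row = heapq.heappop(busy)
        else:
            row = n_rows
            n_rows += 1
        heapq.heappush(busy, (end, row))
        rows[i] = row
    return rows


layout = assign_rows([(0, 36), (10, 46), (36, 72), (50, 86)])
```

With this scheme the number of rows used equals the peak depth, so a 4 million deep pile-up still needs 4 million rows; the packing only avoids wasting rows, it does not reduce the depth.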

Displaying the contig editor, however, was tragically sluggish, taking over 2 mins just to come up and, rather annoyingly, 40ish seconds to highlight what's under the mouse or 1 min to scroll a base to the right. Not particularly usable at that level.

The reason there's a difference between graphically drawing the sequences and actually displaying the alignments is due to how gap5 stores the data. The location, orientation, read-pairing and a few flags for sequences get stored together in the recursive binning system. The actual sequences, quality values and read-names are elsewhere - in the main sequence structures. Typically only around 5% of the database is consumed by the positional binning arrays, although this may differ for extreme depth cases - I haven't checked.
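The separation described above can be illustrated with a toy index. This is only a sketch of the general idea (small positional records in hierarchical bins, bulky sequence data stored elsewhere and fetched on demand); it is not gap5's on-disk format, and every name and parameter here (`Placement`, `BinnedIndex`, `min_bin`, `levels`) is hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Placement(NamedTuple):
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    reverse: bool   # orientation
    read_id: int    # key into the separate sequence store


class BinnedIndex:
    """Hierarchical bins: a read lives in the smallest bin that contains it."""

    def __init__(self, min_bin=1024, levels=8):
        self.min_bin = min_bin
        self.levels = levels
        self.bins = defaultdict(list)  # (level, bin_number) -> [Placement]

    def _home(self, start, end):
        for level in range(self.levels):
            size = self.min_bin << level
            if start // size == (end - 1) // size:
                return level, start // size
        return self.levels, 0  # catch-all bin for anything larger

    def add(self, p):
        self.bins[self._home(p.start, p.end)].append(p)

    def query(self, start, end):
        """Return placements overlapping [start, end), positions only."""
        hits = []
        for level in range(self.levels):
            size = self.min_bin << level
            for b in range(start // size, (end - 1) // size + 1):
                for p in self.bins.get((level, b), ()):
                    if p.start < end and p.end > start:
                        hits.append(p)
        for p in self.bins.get((self.levels, 0), ()):
            if p.start < end and p.end > start:
                hits.append(p)
        return hits


# Sequences and qualities live in a separate store, touched only when needed.
payload = {0: ("ACGTACGTAC", "IIIIIIIIII"), 1: ("TTGACA", "IIIIII")}

index = BinnedIndex(min_bin=16, levels=3)
index.add(Placement(0, 10, False, 0))
index.add(Placement(14, 20, True, 1))    # spans a 16bp boundary, goes up a level
index.add(Placement(100, 110, False, 2))
visible = index.query(5, 15)
```

A graphical plot only needs the output of `query`, whereas an editor view would also have to look up `payload` for every visible read, which is where the extra cost comes from.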

In contrast, when viewing a bam file you'd need to load the entire data as it's all mingled together. This is why I think both Gap5 and LookSeq are worth investigation. (I believe LookSeq, if not running in bam mode, also only stores location information and not the seqs/qual).

James

**imilne** · 04-28-2010, 02:53 AM

I've just run a quick test with Tablet - again with simulated data (I made it load the same set of reads over and over again until it had 10 million of them), and although it took a minute or so to load them from a BAM file, packing only took 5 or 6s, and it was fine during display too.

Without access to the actual data though, it'll be hard for any of us to genuinely replicate the problem and try to fix it for you.

Iain

Visualization Tools for Large Datasets
