
**joshuapk** · 06-04-2012, 05:51 AM

You may want to consider a CD-HIT run to lower complexity again by removing duplicate reads. I suggest getting access to a cluster and using Celera Assembler; however, remember that a lot of your contigs will be in the degenerates folder.

**Mark** · 06-04-2012, 06:02 AM

What kind of organism are you sequencing? This, of course, affects strategy.

**hicham** · 06-04-2012, 06:12 AM

Thanks for your answer.
Indeed, in each file reads are duplicated thousands of times, but we can't reduce these repeats because they aren't shared between the two files. For example, a read from the first file has 1000 exact repeats, but the corresponding paired read doesn't have the same repeats.
I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).
I think that Celera Assembler isn't suitable for this assembly because it's for genome assembly.

**hicham** · 06-04-2012, 06:14 AM

This is a microalga.

**ians** · 06-04-2012, 06:54 AM

Originally posted by hicham

we can't reduce these repeats because they aren't shared between the two files. For example, a read from the first file has 1000 exact repeats, but the corresponding paired read doesn't have the same repeats.

Be careful here. Most assemblers do not look at header information to establish pairs. Rather, the 1st read in file a is paired with the 1st read in file b. If you remove any read, be sure you also remove its pair in the other file.
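As a minimal sketch of that point, the following Python filters two paired files in lockstep, so a pair is either kept in both outputs or dropped from both. It assumes plain 4-line FASTQ records with mates in the same order in both files; `read_fastq`, `filter_pairs` and the `keep` predicate are hypothetical names for illustration, not part of any assembler.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples, assuming 4-line FASTQ records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def filter_pairs(left_in, right_in, left_out, right_out, keep):
    """Write a pair only if keep() accepts both mates; return the number of pairs kept.

    Pairing is purely positional: the n-th record of left_in is treated as
    the mate of the n-th record of right_in, and headers are not compared.
    """
    kept = 0
    for left, right in zip(read_fastq(left_in), read_fastq(right_in)):
        if keep(left) and keep(right):
            left_out.write("\n".join(left) + "\n")
            right_out.write("\n".join(right) + "\n")
            kept += 1
    return kept
```

Note that `zip` stops at the shorter input, so files of unequal length would be silently truncated; comparing the record counts beforehand is a sensible extra check.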

Originally posted by hicham

I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).

Do not parse out repeats. The general expression levels are important for transcriptome assemblers. We currently use the Trinity package. SOAPtrans is pretty fast and memory efficient, but I haven't had a chance to assess its correctness.

**hicham** · 06-04-2012, 07:15 AM

Right, after the cleaning step we removed the reads without a pair and put them in a separate file, in order to keep the order of reads in the paired files.
I agree on the importance of expression level in transcriptome assembly. The idea was to make a relative reduction of repeats to reduce the immense amount of data, but it wasn't possible in this case of paired reads.
If we use Trinity, how much RAM would be necessary for this assembly?

**ians** · 06-04-2012, 07:46 AM

Originally posted by hicham

the idea was to make a relative reduction of repeats to reduce the immense amount of data, but it wasn't possible in this case of paired reads.

It is a good idea (computationally) to reduce the sequence data to only as much sampling as you really need. Which organism did you sequence?

Originally posted by hicham

If we use Trinity, how much RAM would be necessary for this assembly?

The authors say,

Ideally, you will have access to a large-memory server, ideally having ~1G of RAM per 1M reads to be assembled (but often, much less memory may be required).

I don't have any numbers to share for paired end, but recently we ran 160 M reads (1x100bp) with Trinity, and memory peaked at 18 GB.
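To put the two figures side by side, here is a rough back-of-the-envelope sketch in Python. The function name is made up for illustration, and the single-end run above is only one data point, so treat the ratio as an observation rather than a prediction for paired-end data.

```python
def rule_of_thumb_ram_gb(n_reads, gb_per_million=1.0):
    """Upper-bound RAM estimate from the '~1G of RAM per 1M reads' guideline."""
    return n_reads / 1_000_000 * gb_per_million


# The guideline applied to the 160 M read run mentioned above.
guideline_gb = rule_of_thumb_ram_gb(160_000_000)

# What that run actually needed, expressed per million reads.
observed_gb_per_million = 18 / 160

print(f"guideline: {guideline_gb:.0f} GB, observed: 18 GB "
      f"({observed_gb_per_million:.4f} GB per million reads)")
```

So the guideline reads as a conservative ceiling: this particular run used roughly a ninth of it.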

**Pseudonym** · 06-04-2012, 09:02 PM

Originally posted by hicham

Does anyone know any strategy or pipeline to assemble such a large amount of Illumina paired-end data?

You might like to try Gossamer. It was designed with memory efficiency in mind, so it can do the same job as other assemblers, using smaller machines. (Or, alternatively, it can handle more data than other assemblers on the same machine.)

Full disclosure: I'm one of the developers.

**arvid** · 06-05-2012, 12:20 AM

I'd second the suggestion to try Trinity on that dataset. You could reduce your dataset with diginorm if necessary, though 81 million reads (pairs?) sounds reasonable to tackle with a ~64 GB server. Generally, though, memory consumption depends more on the transcriptome complexity than on the actual number of reads.
What was wrong with the 159 million reads that you dropped? rRNA, adapters, or just bad quality?
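For readers unfamiliar with diginorm (digital normalization), the idea can be sketched as follows: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. This is a toy illustration under stated assumptions, not the real tool: it uses an exact in-memory `Counter` where the real implementation uses a memory-bounded probabilistic counter, it ignores reverse complements and read pairing, and the defaults for `k` and `cutoff` are placeholders.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while the median abundance of its k-mers is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            kept.append(read)
    return kept
```

With ten copies of one read and `cutoff=3`, only the first three copies survive, while a read from a different transcript is still kept, which is how highly expressed transcripts get thinned without losing rare ones.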

**hicham** · 06-05-2012, 12:21 AM

Hi,
Thank you very much for your answers.
I just read about Gossamer. In the paper it is described as good for genomic data.
Can it also be used for transcriptomic reads?

**hicham** · 06-05-2012, 12:36 AM

Hi arvid,
After the cleaning, I get 3 files: 2 files for pairs, and a file containing reads without a pair. The sum of reads in the three files is 81 million.
For the cleaning operation we used SeqTrimNext; this software removes adapters, contaminants, bad-quality reads, and low-complexity reads.

**arvid** · 06-05-2012, 12:52 AM

Originally posted by hicham

Hi arvid,
After the cleaning, I get 3 files: 2 files for pairs, and a file containing reads without a pair. The sum of reads in the three files is 81 million.
For the cleaning operation we used SeqTrimNext; this software removes adapters, contaminants, bad-quality reads, and low-complexity reads.

For Trinity, you'd want to combine that into one file; it should be able to recognize the pairs on its own (this might have changed recently though, as a paired-end mapping step was introduced which might need a different input, so check the documentation and examples). Otherwise I'd just use the standard parameters, except for setting the k-mer method to "jellyfish" and setting the max memory for Jellyfish and the number of CPUs to use. I wouldn't expect problems with 81 million reads on a server with 70+ GB RAM (as indicated by your initial post), though expect the software to run overnight or even longer.

**Pseudonym** · 06-05-2012, 03:34 PM

Hicham,

Originally posted by hicham

I just read about Gossamer. In the paper it is described as good for genomic data.
Can it also be used for transcriptomic reads?

About as well as ABySS-PE. Which is to say, not anywhere near as well as an actual transcriptome assembler like Trans-ABySS, Trinity or Oases.

The place where most genome assemblers do significantly worse than transcriptome assemblers is in pair threading and scaffolding, where it's useful to make the assumption that there is such a thing as "N times coverage". (This assumption is incorrect in RNA-Seq, because of differing expression levels.)

One thing that you could try is to use Gossamer as a pre-pass for Trinity. The input to Trinity is the output of a k-mer counter (Trinity's driver script uses Meryl by default). It would be fairly straightforward to use Gossamer as the k-mer counter by running its graph build and cleanup passes to bring the graph down to a manageable size, then using dump-graph to report the k-mer counts. You'd need to do a little scripting to convert the output into Meryl format.
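For context, what a k-mer counter produces is conceptually just a table of k-mers and how often each occurs in the reads. The sketch below shows that idea in plain Python; it is not Gossamer, Meryl or Jellyfish, it does not write Meryl format, and merging each k-mer with its reverse complement (`both_strands=True`) is an assumption that may not suit strand-specific libraries.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k, both_strands=True):
    """Count k-mers across reads, skipping any k-mer containing a non-ACGT base."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if not set(kmer) <= _BASES:
                continue  # e.g. contains N
            counts[canonical(kmer) if both_strands else kmer] += 1
    return counts
```

An in-memory dictionary like this is exactly what does not scale to tens of millions of reads, which is why dedicated counters exist.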

Having said that... we are actively working on the problem of resource-efficient transcriptome assembly. Nothing to announce yet, but watch this space.

**westerman** · 06-06-2012, 11:45 AM

Trinity has 'jellyfish' as a mer-counter. It is likely that in the next release jellyfish will become the default and that meryl will be removed since jellyfish is so much faster. So, if you are using Trinity, make sure that you specify jellyfish.


De novo assembly for Illumina HighSeq paired end reads
