**lh3** · 08-05-2013, 02:57 PM

Different assemblers respond to errors and trimming in different ways. For example, if you read the SGA paper, the authors recommend not trimming reads. ALLPATHS-LG does not trim reads either, as I remember. Other assemblers, such as SPAdes, may be less sensitive to trimming as they trim reads by default. Also, I have read somewhere (could be wrong) that the SOAPdenovo developers recommend not correcting reads if you have enough RAM, but SGA, ALLPATHS-LG, etc. always include error correction as a necessary step. At the end of the day, which trimming/error-correction approach to use is assembler dependent. If it were me, I would just use the tools/pipelines recommended by the developers. If I had time, I would combine different strategies/correctors and see what I would get. The result is probably data dependent.

K-mer based error correctors typically use short k-mers. I think that is fine. With shorter k-mers, we more often collapse segmental duplications/repeats and will not be able to correct errors when they occur right at the sites differentiating repeats. However, only a small fraction of errors are not correctable due to repeats. If such errors can be corrected with long k-mers, assemblers can usually handle them well. I would not worry too much about the k-mer length in error correction, unless it is too short.
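To make the k-mer idea concrete, here is a toy sketch of how a k-mer spectrum can flag a likely sequencing error. It is not how SGA, SOAPec, or any particular corrector is implemented; the function names, the reads, and the `min_count` cutoff are all made up for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, min_count=2):
    """Start offsets of k-mers seen fewer than min_count times (assumed cutoff)."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Five error-free copies plus one read with a substitution at index 7 (G -> A).
reads = ["ACGTACGGTC"] * 5 + ["ACGTACGATC"]
counts = count_kmers(reads, k=4)

# Only the k-mers overlapping the substituted base are rare.
print(weak_positions(reads[-1], counts, k=4))
```

A run of rare k-mers like this points at the base they share. If the same short k-mer also occurs in another repeat copy, its count stays high and the error goes unnoticed, which is the repeat-collapsing effect described above.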

**jwag** · 08-05-2013, 03:42 PM

Originally posted by lh3:

> Different assemblers respond to errors and trimming in different ways. For example, if you read the SGA paper, the authors recommend not trimming reads. ALLPATHS-LG does not trim reads either, as I remember. Other assemblers, such as SPAdes, may be less sensitive to trimming as they trim reads by default. Also, I have read somewhere (could be wrong) that the SOAPdenovo developers recommend not correcting reads if you have enough RAM, but SGA, ALLPATHS-LG, etc. always include error correction as a necessary step. At the end of the day, which trimming/error-correction approach to use is assembler dependent. If it were me, I would just use the tools/pipelines recommended by the developers. If I had time, I would combine different strategies/correctors and see what I would get. The result is probably data dependent.

I've been trying to follow the pipeline of the folks who assembled the giant panda genome, as they used SOAPdenovo and exclusively short Illumina reads (I think the group that developed SOAPdenovo is the same one that put out the giant panda assembly). Unfortunately, the SOAPdenovo documentation isn't the most thorough. According to their Nature paper (supplemental section), they did trim low-quality bases at the 3' end before error correction, though they don't specify their threshold.

After looking at the parameters in SOAPec, it seems that they might have trimmed for Q<2 because that is the only threshold available for trimming during error correction in that program.

I'm going to first try letting SOAPec do the trimming and error correction from the full raw data. Right now I have all paired-end data, and I don't think SOAPec can handle both paired-end and single-end data while still keeping pairs together in the output.

If that doesn't work too well, I'll try trimming first with something like Q<10 before error correction.
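For reference, 3' quality trimming at a threshold like Q<10 amounts to something like the sketch below. This is only an illustration: `trim_3prime` is a made-up helper, not part of SOAPec or any trimming tool, and it assumes Phred+33 quality encoding (older Illumina data may be Phred+64, in which case `offset` would need to change).

```python
def trim_3prime(seq, qual, min_q=10, offset=33):
    """Drop trailing bases whose Phred quality is below min_q.

    Assumes Phred+33 encoding by default; stops at the first base
    (scanning from the 3' end) that meets the threshold.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' = Q40, '+' = Q10, '#' = Q2, '!' = Q0 under Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIII+#!", min_q=10)
print(seq, qual)
```

Only the tail is touched, so a low-quality base in the middle of a read is left for the error corrector to deal with.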

**Wallysb01** · 08-05-2013, 05:04 PM

Originally posted by jwag:

> I'm going to first try letting SOAPec do the trimming and error correction from the full raw data. Right now I have all paired-end data, and I don't think SOAPec can handle both paired-end and single-end data while still keeping pairs together in the output.

You just have to do two separate runs with the same output from KmerFreq.

**jwag** · 08-06-2013, 09:47 AM

Originally posted by Wallysb01:

> You just have to do two separate runs with the same output from KmerFreq.

Ah, that makes sense. So I can put all my data into KmerFreq up front, so that it counts every k-mer across the whole dataset, and then run any subset of my data against that same frequency distribution (even in smaller chunks). That will definitely save me some time. Thanks for the tip.
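In other words, the workflow looks roughly like the toy sketch below: build one k-mer table from everything, then check the paired and single reads in separate passes against that shared table. This is just plain Python standing in for the idea; it does not reproduce KmerFreq's or SOAPec's actual file formats or options, and the reads, `K`, and `MIN_COUNT` are invented.

```python
from collections import Counter

def add_kmers(reads, k, table):
    """Add the k-mers of each read to a shared frequency table."""
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1

def has_weak_kmer(read, table, k, min_count):
    """True if any k-mer in the read is rarer than min_count."""
    return any(table[read[i:i + k]] < min_count
               for i in range(len(read) - k + 1))

paired = [("ACGTACGG", "TTGCAAGC"),
          ("ACGTACGG", "TTGCAAGC"),
          ("ACGTACGA", "TTGCAAGC")]   # first mate ends in an error
single = ["ACGTACGG", "TTGCAAGC"]

K, MIN_COUNT = 5, 2                    # assumed values for the example

# Step 1: one frequency table built from ALL reads.
table = Counter()
add_kmers((mate for pair in paired for mate in pair), K, table)
add_kmers(single, K, table)

# Step 2: separate passes that reuse the same table.
# Pairs are judged as a unit so mates stay together.
flagged_pairs = [p for p in paired
                 if any(has_weak_kmer(m, table, K, MIN_COUNT) for m in p)]
flagged_single = [r for r in single
                  if has_weak_kmer(r, table, K, MIN_COUNT)]

print(flagged_pairs)
print(flagged_single)
```

Because the table is built once, each pass (or each smaller chunk) sees the counts from the full dataset rather than only its own reads.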
