**Brian Bushnell** · 08-16-2014, 12:24 PM

Originally posted by illnoobina

1) Is it acceptable to de novo assemble unprocessed reads?
2) If not, which quality enhancement methods are absolutely necessary?
3) How can I apply the FastX filter tools and still run Velvet paired-end assemblies?
4) How can I tell a good assembly from a bad assembly? By contig N50 or number of nodes? K-mer coverage?

1) Yes, particularly if the data is very high quality, but some processing will usually give you a better assembly.
2) I recommend adapter trimming and filtering out artifacts (primers and other synthetic molecules, phiX, human reads). Depending on the quality and assembler, quality-trimming or error-correction may be useful (Velvet is pretty robust, though). Depending on the library type, sometimes normalization is useful (e.g. for highly amplified single-cell data).
3) I suggest you not use FastX. Instead, use a tool like BBDuk which retains pairing when doing trimming/filtering operations.
4) This is a difficult question. You might try using a tool like QUAST, which is designed for evaluating assemblies; it works best if you provide it with a reference, so you can use a known strain of E. coli for that. Also, mapping reads to the assembly is useful; the higher the mapping rate, and the lower the error count, the better the assembly reflects the reads. Looking at the coverage, you may also be able to spot things like collapsed repeats. You can also plot the cumulative length of the assembly as you include more contigs, starting with the longest and ending with the shortest; that line will tell you more than a single number like N50.
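To make the cumulative-length idea in point 4 concrete, here is a minimal Python sketch. It assumes you already have the contig lengths as a plain list of integers (how you extract them from your assembly file is up to you), and the function names `n50` and `cumulative_lengths` are just illustrative, not part of any tool mentioned here.

```python
def n50(lengths):
    """Return N50: the contig length at which the running total of
    contigs, sorted longest first, reaches half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def cumulative_lengths(lengths):
    """Return the running total of assembly length as contigs are
    added from longest to shortest. Plot this against contig rank."""
    totals = []
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        totals.append(running)
    return totals


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    contigs = [120, 537, 80, 234, 20, 9]
    print("N50:", n50(contigs))
    for rank, total in enumerate(cumulative_lengths(contigs), start=1):
        print(rank, total)
```

A curve that climbs steeply and then flattens means most of the assembly sits in a few long contigs; a long, slowly rising tail means many short contigs, which a single N50 value can hide.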

Also, once you have an assembly, you can BLAST it against nt (or another database) to see if all your contigs are E. coli. If some are not, you can remove the contaminant reads and reassemble.

**illnoobina** · 08-17-2014, 09:40 AM

Hi Brian,

Thank you for your fast and very helpful reply. I am using BBDuk as you suggested (I guess you coded it) and it's awesome! I am down 2 nodes compared to my initial assembly, without optimizing kmer lengths. I guess the 2 nodes were the phiX spike-in and TruSeq adapter contamination, which I filtered for.

Estimated Coverage = 17.9
Estimated Coverage cutoff = 8.9
146 nodes
n50 of 234k
max 537k
total 5034k

I used the adapter trim as you did in your tutorial (I hope that's fine with TruSeq paired end):

./bbduk.sh -Xmx1g in1=R1in.fastq in2=R2in.fastq out1=R1out.fastq out2=R2out.fastq ref=truseq.fa ktrim=r k=28 mink=12 hdist=1

The only thing I am worried about is that I still see kmer irregularities in the first 10bp in FastQC. I thought I would get rid of those with adapter trimming. Is that true?

**Brian Bushnell** · 08-17-2014, 10:07 AM

Kmer frequency irregularities in the first 10-20bp are not unusual, depending on your fragmentation methodology. I don't think it happens with sonication, but with other approaches like Nextera (transposon) and "random" hexamer priming, it does. In my testing, reads with highly nonrandom base frequencies in the first 20bp do not show inflated error rates there, so the nonuniformity is not due to artifacts or base-calling problems. You can test this yourself by running BBMap with the "mhist=mhist.txt" flag, mapping your reads against your assembly. This will show the error rate by read position; if it is not elevated for the first 20bp, then the nonuniformity is just an artifact of nonrandom cleavage and does not need trimming.
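As a rough illustration of that check, the Python sketch below compares the mean error rate of the first 20 positions with the rest of the read. It assumes you have already pulled one error rate per read position into a list; parsing the BBMap histogram file is deliberately not shown, and the function names and the 2x threshold are arbitrary assumptions of mine, not anything BBMap defines.

```python
def head_vs_rest(error_rates, head=20):
    """Return (mean error rate of the first `head` positions,
    mean error rate of all remaining positions)."""
    if len(error_rates) <= head:
        raise ValueError("need more positions than the head window")
    head_mean = sum(error_rates[:head]) / head
    rest = error_rates[head:]
    rest_mean = sum(rest) / len(rest)
    return head_mean, rest_mean


def head_is_elevated(error_rates, head=20, ratio=2.0):
    """True if the start of the read looks clearly worse than the rest.
    `ratio` is an arbitrary cutoff chosen for illustration."""
    head_mean, rest_mean = head_vs_rest(error_rates, head)
    return head_mean > ratio * rest_mean


if __name__ == "__main__":
    # Made-up per-position error rates for a 100bp read.
    flat = [0.01] * 100
    print(head_is_elevated(flat))
```

Keep in mind that error rates usually rise toward the end of a read, which inflates the "rest" mean, so looking at the actual curve is still more informative than a single yes/no.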

Adapters (for fragment libraries) are present on the right end, not the left end, so they don't affect the first 10-20bp unless you have a high population of adapter-dimers and so forth. Adapter-dimers should be removed during adapter trimming. There are other artifacts, though, like primer-dimers and various other artificial constructs; you may want to BLAST your assembled contigs against nt (or some other database) to see if any are contaminants. If so, you should filter out the contaminants from the raw reads and reassemble; you get the best assembly when contaminants are removed before assembling. Alternatively, you could just use BBDuk to remove all known Illumina artifacts prior to assembly (which is what we do at JGI). I'm not supposed to distribute the files containing all Illumina contaminant sequences since some are patented, but they're not difficult to find online.

**cement_head** · 09-04-2014, 10:54 AM

Should one remove adapters FIRST, before trimming using PHRED scores?

**Brian Bushnell** · 09-04-2014, 11:06 AM

There are arguments for each approach, but I prefer to remove adapters first; quality trimming never adds information. If you have severe contamination you can always trim adapters both before and after quality trimming, but that's generally a waste of time.

Basically -

If you quality-trim first, then you might remove so much adapter that the remaining little piece is no longer recognized, and therefore not trimmed.
If you quality-trim second, you may remove some error bases that prevented the detection of a shorter-than-K adapter prefix on the very end of the read.

So neither is perfect but quality-trimming second seems to be better.
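The two cases above can be reproduced with a toy model in Python. This is only a sketch under simplifying assumptions: adapter detection here is an exact match between the read's tail and a prefix of the adapter at least `min_k` bases long, which is much cruder than what BBDuk actually does, and the adapter and insert sequences, quality values, and function names are all illustrative.

```python
ADAPTER = "AGATCGGAAGAGCACACGTC"    # illustrative adapter sequence
INSERT = "ACGTTGCAACGTTGCATTGACC"   # made-up genomic insert


def adapter_trim(seq, qual, adapter=ADAPTER, min_k=8):
    """Cut the read at the first position where the remaining tail is
    an exact prefix of the adapter and at least min_k bases long."""
    for start in range(len(seq)):
        tail = seq[start:]
        if len(tail) < min_k:
            break
        if adapter.startswith(tail[:len(adapter)]):
            return seq[:start], qual[:start]
    return seq, qual


def quality_trim(seq, qual, min_q=10):
    """Remove bases from the right end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and qual[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim(seq, qual, adapters_first):
    """Apply both trimming steps in the chosen order; return the sequence."""
    if adapters_first:
        steps = [adapter_trim, quality_trim]
    else:
        steps = [quality_trim, adapter_trim]
    for step in steps:
        seq, qual = step(seq, qual)
    return seq


if __name__ == "__main__":
    # Case 1: 15 adapter bases, the last 10 of them low quality.
    read1 = INSERT + ADAPTER[:15]
    qual1 = [30] * (len(INSERT) + 5) + [2] * 10
    # Case 2: 9 adapter bases followed by one low-quality miscalled base.
    read2 = INSERT + ADAPTER[:9] + "T"
    qual2 = [30] * (len(INSERT) + 9) + [2]
    for name, seq, qual in (("case 1", read1, qual1), ("case 2", read2, qual2)):
        for first in (True, False):
            kept = trim(seq, qual, adapters_first=first)
            leftover = len(kept) - len(INSERT)
            print(name, "adapters first" if first else "quality first",
                  "leftover adapter bases:", leftover)
```

In case 1, quality-trimming first leaves a 5-base adapter stub too short to detect, while in case 2 the miscalled base hides the adapter from the exact-match search until quality trimming removes it. Real adapter trimmers tolerate some mismatches, so case 2 is less of a problem in practice than this toy suggests.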

**cement_head** · 09-05-2014, 03:14 AM

Ok, thanks!

Velvet assembly from MiSeq data - am I doing it right?
