
**kmcarr** · 06-06-2011, 07:36 AM

Short answer: random hexamer priming is "not so random". Illumina has acknowledged this in one of their FAQs:

Q482. Why is GC high in the first few bases?
It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.

There was also a publication which investigated this:

Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010 Apr.
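
For anyone who wants to see the effect in their own data without the IVC plots, here is a minimal sketch that tabulates base composition over the first few cycles. It assumes plain 4-line FASTQ records; the function names `read_fastq_sequences` and `base_composition` are made up for illustration and do not come from any Illumina tool.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each record, assuming strict 4-line FASTQ."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def base_composition(seqs, n_positions=15):
    """Return one {base: fraction} dict per cycle for the first n_positions cycles.

    Fractions are relative to all calls at that cycle, so any N calls
    make the four reported fractions sum to less than 1.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: c[b] / total for b in "ACGT"} if total else {})
    return fractions


# Toy input: two records given as a list of lines instead of an open file.
toy = ["@r1", "GGACT", "+", "IIIII", "@r2", "GCACT", "+", "IIIII"]
for cycle, frac in enumerate(base_composition(read_fastq_sequences(toy), 5), start=1):
    print(cycle, frac)
```

On real data you would pass an open file handle instead of the toy list. If the priming explanation holds, the fractions should be uneven over roughly the first 12 cycles and flatten out afterwards.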

**blindtiger454** · 06-06-2011, 10:57 AM

Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though the "random" priming favours certain reads. The researchers who did the lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor-quality regions at the 5' end??

**kmcarr** · 06-06-2011, 11:29 AM

Originally posted by blindtiger454

Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though the "random" priming favours certain reads. The researchers who did the lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor-quality regions at the 5' end??

I have carefully studied the UC Davis poster in the past, and what strikes me is that the effect of trimming the 5' end appears nearly identical to that of trimming the 3' end, so I'm not convinced of their conclusion that it is important to trim the initial 15 nt. However, I have heard from other researchers that these bases do present a particular problem for de novo assembly with de Bruijn graph assemblers (which covers just about all of the most popular short-read assemblers, including Velvet). The thinking is that the k-mer diversity of the first 15 nt is significantly lower than that of the remainder of the read, which seems to cause problems for the assembler.

If you are doing a de novo assembly why not give it a try both ways and see what your results are?

On the other hand, if I am mapping the reads to a genome (vs. de novo), I never trim the 5' ends of RNA-Seq reads, and I find they map perfectly well.
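
If you want to check the k-mer diversity argument on your own reads before deciding, a rough sketch is below. It counts how many distinct k-mers start at each read offset. The function name `kmer_diversity_by_offset` and the defaults are illustrative assumptions, and this is only a crude proxy: the count depends on how many reads you feed it and can never exceed 4^k.

```python
def kmer_diversity_by_offset(seqs, k=6, max_offset=30):
    """Count distinct k-mers observed at each start offset across all reads.

    Returns a list of length max_offset; entry i is the number of distinct
    k-mers beginning at 0-based offset i. Reads too short to contain a full
    k-mer at an offset simply contribute nothing there.
    """
    seen = [set() for _ in range(max_offset)]
    for seq in seqs:
        for off in range(min(max_offset, len(seq) - k + 1)):
            seen[off].add(seq[off:off + k])
    return [len(s) for s in seen]


# Toy reads that share an identical start and diverge later.
reads = ["AAAACGT", "AAAAGGT"]
print(kmer_diversity_by_offset(reads, k=4, max_offset=4))
```

With enough reads, noticeably lower counts at the first dozen offsets than further into the read would be consistent with the priming bias. Comparing assemblies with and without trimming is still the real test.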

**blindtiger454** · 06-06-2011, 08:15 PM

Thanks for the information. Our reads are 55 bp, and they are from a tetraploid plant. Given the large number of paralogues and the allelic diversity in plants, I want to do minimal trimming for the assembly. It's bad enough having 55 bp. The UC Davis folks had 80 bp reads. If I trimmed my reads down to 40 bp, I'm afraid the assembler would incorrectly assemble paralogues. Sometimes 15 nucleotides is all the difference between two closely related transcripts/genes.

**IBseq** · 07-06-2012, 02:42 AM

FASTQ Trimmer tool

hi guys,
I'm new to this forum... can anyone tell me how I know how many bases I should trim with FASTQ Trimmer? What is the ideal score, and which values do I have to look at (Q1, median or Q3)?

Thanks!

**carmeyeii** · 10-10-2012, 09:50 AM

bump

**IBseq** · 10-10-2012, 10:29 AM

I sorted that out... if anyone needs info, glad to help.

**blanco** · 10-18-2012, 04:23 AM

Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

When I view the 'per base sequence content' with FastQC, I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached PDF.

Now this is all fine and dandy, but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the insert is too short, so that at the end of the read the sequencer starts to sequence the adapter.

So why does the adapter appear at the beginning of the read and not at the end?

Am I misunderstanding something? I would love to have a clarification of this.

Thanks,
blanco

Attached Files

adapter_contaminations.pdf (84.0 KB, 599 views)

**TonyBrooks** · 10-18-2012, 04:54 AM

Originally posted by blanco

Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

When I view the 'per base sequence content' with FastQC, I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached PDF.

Now this is all fine and dandy, but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the insert is too short, so that at the end of the read the sequencer starts to sequence the adapter.

So why does the adapter appear at the beginning of the read and not at the end?

Am I misunderstanding something? I would love to have a clarification of this.

Thanks,
blanco

You can get adapter dimer (where the DNA insert size is effectively 0), meaning that you only sequence adapter (hence it appears at the 5' end). If this is the case, I believe cutadapt will just remove those reads from your fastq file (maybe someone can confirm).
Those peaks don't look like dimer to me, though; they look more like the random priming issue. When you have bad adapter contamination, you can actually read the adapter sequence in your %base graph (see the attached plot of a run that had 10% adapter dimer).
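
A quick way to tell the two cases apart numerically is to count how many reads begin with the adapter itself. The sketch below is only an illustration: `fraction_starting_with` is a made-up helper, and the prefix `AGATCGGAAGAGC` used in the example is the commonly cited start of the TruSeq adapter, which you should replace with whatever your own kit documentation specifies.

```python
def fraction_starting_with(seqs, adapter_prefix, max_mismatches=1):
    """Fraction of reads whose first bases match adapter_prefix.

    A read counts as a hit when its first len(adapter_prefix) bases differ
    from the prefix at no more than max_mismatches positions. Reads shorter
    than the prefix are counted in the total but never as hits.
    """
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        head = seq[:len(adapter_prefix)]
        if len(head) < len(adapter_prefix):
            continue
        mismatches = sum(1 for a, b in zip(head, adapter_prefix) if a != b)
        if mismatches <= max_mismatches:
            hits += 1
    return hits / total if total else 0.0


# Assumed adapter prefix; check it against your library prep documentation.
ADAPTER_PREFIX = "AGATCGGAAGAGC"
reads = ["AGATCGGAAGAGCACAC", "TTTTCGGAAGAGCACAC", "GGACTTCAGGCATTAGC"]
print(fraction_starting_with(reads, ADAPTER_PREFIX))
```

A fraction near zero would suggest that the skew at the start of the reads is not adapter dimer, whereas a value around 0.1 would be in line with a run like the attached one.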

Attached Files

adpater-dimer.png (79.9 KB, 399 views)

**rmred** · 03-27-2013, 05:34 PM

I got the same problem too, with exactly the same ACGT bias over the first 15 bp/cycles. I asked the Illumina representative, who said that this is due to the random hexamer priming mentioned above.

**isett** · 06-25-2013, 09:28 AM

What if it's WGS and not RNA-Seq? I see the same thing with the Nextera XT kit on the MiSeq. Is it a non-random recognition site for the tagmentation enzyme?

**nareshvasani** · 08-05-2013, 10:05 AM

Hi IBseq

Originally posted by IBseq

I sorted that out... if anyone needs info, glad to help.

I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.

**Tengfei Liu** · 09-23-2013, 05:12 AM

Originally posted by nareshvasani

I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.

You can use cutadapt to trim both the 5' and 3' ends. The fastx_clipper can only trim the 3' end. With cutadapt, you must first run cutadapt -g, then run cutadapt -a on the processed output. If you use -g and -a at the same time, it will only cut one end.
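
The two-pass approach can be sketched as follows. This only builds the two command lines and does not run them. The helper name `two_pass_cutadapt_commands`, the file names, and the adapter strings are placeholders; `-g`, `-a`, and `-o` are the cutadapt options for a 5' adapter, a 3' adapter, and the output file.

```python
import shlex


def two_pass_cutadapt_commands(five_prime, three_prime, in_fastq, tmp_fastq, out_fastq):
    """Build two cutadapt invocations: 5' adapter first, then 3' adapter.

    The second command reads the intermediate file written by the first.
    """
    first = ["cutadapt", "-g", five_prime, "-o", tmp_fastq, in_fastq]
    second = ["cutadapt", "-a", three_prime, "-o", out_fastq, tmp_fastq]
    return [first, second]


# Placeholder adapters and file names; substitute your own.
commands = two_pass_cutadapt_commands(
    "FIVE_PRIME_ADAPTER", "THREE_PRIME_ADAPTER",
    "reads.fastq", "reads.5p.fastq", "reads.trimmed.fastq",
)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run them, pass each list in order to `subprocess.run(cmd, check=True)`, assuming cutadapt is installed and on your PATH.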

**Michael.Ante** · 09-25-2013, 07:04 AM

Originally posted by nareshvasani

I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.

I always use the fastx_trimmer; you can use the -f and -l options to set the first and the last base to be kept.
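
To make the coordinates concrete, here is a small Python sketch of the same idea: keep bases `first` through `last`, counted from 1 and inclusive, and cut the quality string identically. It is an illustration of the coordinate logic only, not the fastx_trimmer code, and `trim_record` is a made-up name.

```python
def trim_record(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities.

    last=None keeps everything through the end of the read.
    """
    if first < 1:
        raise ValueError("first must be >= 1")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


# Keep bases 3 through 8 of a 10-base read.
print(trim_record("ACGTACGTAC", "IIIIIHHHHH", first=3, last=8))

# Dropping the first 15 bases of a 55-base read leaves 40 bases.
seq, qual = trim_record("A" * 55, "I" * 55, first=16)
print(len(seq))
```

So removing the first 15 bases means setting the first kept base to 16, which for the 55 bp reads discussed earlier in this thread leaves 40 bp.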


Trimming left end (5') of reads??
