**gaffa** · 03-14-2011, 11:25 AM

What is your read length? Only one mismatch is not super strict if you have longer reads.

You should be able to tell if there's still a lot of adapter in there: for example, the nucleotide distribution plot from FastQC would be spiky, and you might even be able to correlate the spikes with the sequence of your adapters. Also, if you can, take a manual look at some of the reads that are present in many copies and see whether their sequence is close to your adapter or whether something else is going on.
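For anyone who wants to look at the "spiky" composition outside of FastQC, here is a minimal sketch of a per-position base tally. It is not FastQC's implementation; the function name `base_composition` and the toy reads are made up for illustration, and in practice the reads would come from your own FASTQ parser.

```python
from collections import Counter

def base_composition(reads):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())  # first time this position is seen
            counts[i][base] += 1
    return [
        {base: n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]

# Toy input (hypothetical): the first three positions are identical in every read,
# which is the kind of spike leftover adapter or barcode sequence would produce.
reads = ["ACGTAC", "ACGTTT", "ACGAAC", "ACGTGG"]
comp = base_composition(reads)
for pos, fractions in enumerate(comp, start=1):
    print(pos, fractions)
```

A position where one base sits near 100% is a candidate for comparison against the adapter sequence.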

**cedance** · 03-15-2011, 12:45 AM

The read length is 84. After clipping the adapters, I used the ShortRead package to find the sequences that occur more than 10 times and checked them again for adapter sequences with 0 or 1 mismatches, but to no avail. I guess I could run the adapter clipping again with 2 mismatches just to compare the results.
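The check described above was done with ShortRead in R. As a rough Python sketch of the same idea (not the ShortRead API): count identical reads, keep those seen more than 10 times, and scan each one for the adapter allowing a fixed number of mismatches. The adapter and read strings are placeholders, the function names are hypothetical, and the scan only considers full-length adapter windows with substitutions, so partial adapters at the 3' end and indels would be missed.

```python
from collections import Counter

def frequent_reads(reads, min_count=10):
    """Return {sequence: count} for sequences seen more than min_count times."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n > min_count}

def adapter_hits(read, adapter, max_mismatches=1):
    """Return start offsets where the full adapter matches with <= max_mismatches."""
    k = len(adapter)
    hits = []
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

# Placeholder data: substitute your own reads and adapter sequence.
reads = ["AAACCCGGGTTT"] * 11 + ["TTTT"] * 3
adapter = "CCCGGA"

frequent = frequent_reads(reads)
for seq, n in frequent.items():
    print(seq, n, adapter_hits(seq, adapter, max_mismatches=1))
```

Rerunning with `max_mismatches=2` gives the comparison mentioned in the post.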

Thank you.

**simonandrews** · 03-15-2011, 12:46 AM

Originally posted by cedance

In fastQC, the graph seems pretty nice with the number of sequences that occur 1, 2, etc.. 9 times gradually decreasing to about 0 and then for 10+ repeats it rises to about 20%.

That doesn't sound like a particularly nice graph. A nice graph drops almost immediately to zero and stays there. What you're describing is a pervasive low level duplication. This isn't likely to be caused by any kind of contamination, but is more likely the result either of general oversequencing or of PCR duplication artefacts. It could be that you have a particularly enriched library where this would be expected but we'd need to know more about your experiment and preferably see the QC results to be able to comment more specifically.
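To make the shape of that graph concrete, here is a simplified sketch that bins distinct sequences by how many times they occur, with everything at 10 or more collapsed into one bin. It is an assumption-laden approximation and not FastQC's actual algorithm: it counts every read exactly and reports each bin as a share of distinct sequences. The name `duplication_levels` is made up.

```python
from collections import Counter

def duplication_levels(reads, cap=10):
    """Return {level: share of distinct sequences}, with level `cap` meaning cap or more."""
    per_seq = Counter(reads)            # copies of each distinct sequence
    total = len(per_seq)
    if total == 0:
        return {level: 0.0 for level in range(1, cap + 1)}
    levels = Counter(min(n, cap) for n in per_seq.values())
    return {level: levels.get(level, 0) / total for level in range(1, cap + 1)}

# Toy input: two singletons, one sequence seen twice, one seen 12 times.
reads = ["A", "B", "C", "C"] + ["D"] * 12
profile = duplication_levels(reads)
for level, share in profile.items():
    label = f"{level}+" if level == 10 else str(level)
    print(label, share)
```

A "nice" profile in the sense above has nearly everything at level 1; a tail that rises again in the 10+ bin is the pattern being discussed.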

**cedance** · 03-15-2011, 02:24 AM

Hi Simon, thanks for your reply.
The data is from RNA-Seq of tomatoes. The FASTQ files contain reads 84 nucleotides long. I am working on data from paired-end reads at the moment. Here is a link to the zipped FastQC results for just the forward read (I think it's sufficient).

http://dl.dropbox.com/u/3851628/27.100503.Transcriptome_Seedlings_Pool77_84_85bpPE.s_5_1_sequence_fastqc.zip

It would be nice to know your interpretations.

Thank you.

**kmcarr** · 03-15-2011, 10:37 AM

Cedance,

I've looked at quite a few FastQC reports for mRNA-Seq runs from plants, and based on my experience your duplication report doesn't look bad. Depending on the tissue or developmental stage, the transcript diversity in a plant can be low, so, as Simon suggested, you have probably reached the saturation point for sequencing.

More concerning to me would be the drop-off in Q-scores at the ends of your reads. Based on that plot, plan on doing some quality-based trimming of your reads.
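As one illustration of quality-based trimming, the sketch below removes bases from the 3' end until it reaches one at or above a cutoff. This is the simplest possible approach, not the method of any particular trimming tool; the cutoff of 20 and the Phred offset of 33 are assumptions (older Illumina pipelines wrote Phred+64, so check your files), and `trim_3prime` is a made-up name.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record;
    offset is the ASCII offset of the quality encoding (assumed, not detected).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Toy record: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

For paired-end data, reads trimmed to very short lengths usually need to be handled together with their mates so the two files stay in sync.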

**simonandrews** · 03-15-2011, 10:49 AM

Originally posted by cedance

Hi Simon, thanks for your reply.
The data is from RNA-Seq of tomatoes. The FASTQ files contain reads 84 nucleotides long. I am working on data from paired-end reads at the moment. Here is a link to the zipped FastQC results for just the forward read (I think it's sufficient).

http://dl.dropbox.com/u/3851628/27.100503.Transcriptome_Seedlings_Pool77_84_85bpPE.s_5_1_sequence_fastqc.zip

It would be nice to know your interpretations.

As kmcarr said, the duplication doesn't look terrible - it's pretty low level and may just represent oversequencing of the most abundant transcripts in your library.

The bigger concern (which you may easily be able to explain) is the strong initial bias in your sequences. Your first few bases show very strong bias - which is particularly obvious at position 4. Is this a barcoded sample? If not then you might have some kind of adapter contamination at the start of your sample.

The quality is somewhat poor at the end of your sequence and you might want to trim the ends back a bit if you're going to assemble, but it's not too bad, and the overall per-read quality looks pretty good.
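One quick way to tell barcodes apart from adapter contamination at the start of the reads is to tally the leading bases: a barcoded run should show a small set of prefixes with large counts. The sketch below assumes a prefix length of 4 only because position 4 was mentioned above; the real barcode length is not stated in this thread, and `prefix_counts` is a hypothetical helper.

```python
from collections import Counter

def prefix_counts(reads, k=4):
    """Return (prefix, count) pairs for the first k bases, most common first."""
    return Counter(r[:k] for r in reads if len(r) >= k).most_common()

# Toy input: two reads share one prefix, a third has another.
reads = ["ACGTAAA", "ACGTCCC", "TTGACCC"]
top = prefix_counts(reads)
for prefix, n in top:
    print(prefix, n)
```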

**cedance** · 03-16-2011, 01:12 AM

kmcarr, simonandrews,
Yes, I provided the raw data, and yes, they are barcodes. I have clipped the adapters, trimmed for quality, removed the barcodes, and split the reads by barcode as well; I did not provide those files here. For the individual barcode-split files, the duplication for 10+ copies drops to less than 10%. If you would like to look at the individual paired-end reads, I can link them as well. I guess that since the sequenced reads were otherwise fine, it should be all right.

Thank you once again!


repetitive/duplicate reads
