**simonandrews** · 01-25-2011, 12:45 AM

Looking at your results it seems that you have a significant proportion of your library which is composed of CAA multimeric repeats. These would explain the slight peaks running through the per-base composition and per-base GC graphs, as well as the secondary peak on the per sequence GC plot. The Kmer enrichment also shows a few cyclic variants of these - the reason for the sharp peaks is that the repeats seem to be aligned to the start of your sequence, so that certain positions favour starting a new repeat. It seems that there is a bias towards your sequences starting with C (which we've seen in a lot of transcriptome data), so this would make sense.
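One rough way to check how much of a library is made up of such repeats is to count reads containing a run of CAA units in any rotation (CAA, AAC, ACA), which covers the cyclic variants mentioned above. This is only a sketch: the function name `repeat_fraction` and the `min_units=4` cutoff are assumptions chosen for illustration, not anything FastQC itself does.

```python
import re


def repeat_fraction(reads, unit="CAA", min_units=4):
    """Fraction of reads containing >= min_units tandem copies of `unit`.

    All rotations of the unit are searched, so a read in which the repeat
    is phased as AACAAC... or ACAACA... is counted as well.
    """
    rotations = sorted({unit[i:] + unit[:i] for i in range(len(unit))})
    # Builds a pattern such as (?:AAC){4,}|(?:ACA){4,}|(?:CAA){4,}
    pattern = re.compile("|".join(f"(?:{r}){{{min_units},}}" for r in rotations))
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if pattern.search(read))
    return hits / len(reads)


example_reads = [
    "CAACAACAACAACAAGT",  # CAA repeat starting at position 1
    "ACGTACGTACGT",       # no repeat
    "GACAACAACAACAT",     # same repeat, shifted by one base (ACA phase)
    "TTTT",               # no repeat
]
print(repeat_fraction(example_reads))
```

Running this over a sample of reads from the FASTQ file gives a quick estimate of how large the repeat-containing subset is before deciding whether it is worth investigating further.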

The duplication graph shows that you have quite high levels of duplication, but that this is spread over the majority of sequences in your library (so it's not just a few outliers which are being heavily duplicated). This could be simple saturation of your library if you're working with a very small transcriptome, or it could be a more subtle PCR bias. You'd need to look at the coverage you get over transcribed genes to be able to tell these apart.

**kmcarr** · 01-25-2011, 06:09 AM

I'll add some observations from our experience with mRNA-Seq data to Simon's excellent explanation.

First, with regard to the sequence duplication level, your results are very much in line with what we observe for most of the mRNA-Seq samples run in our core. I agree with both you and Simon that this is simply a result of saturating the diversity of the library. Getting a 'Failed' for Sequence Duplication on an mRNA-Seq sample usually doesn't concern me. It is my impression that the pass/fail cutoffs for FastQC are based on sequencing genomic libraries, so they may not be appropriate thresholds for assessing mRNA-Seq data.

We have one submitter for whom we find a low level of 'CA' repeat sequence in every RNA sample they submit for Illumina sequencing. I don't have an explanation for this; I mention it just to let you know that you are not the only person finding simple repeat reads in their mRNA-Seq data.

**simonandrews** · 01-25-2011, 08:10 AM

I suppose the best way of thinking about the duplication level plot is that it's a measure of the amount of sequencing you might have 'wasted' in your library - that is to say that a high duplication level means that you could have got much the same diversity in your library by doing a whole load less sequencing.
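As a rough illustration of the 'wasted sequencing' idea, the share of reads that add no new distinct sequence can be computed directly from the reads. This is a minimal sketch assuming exact-match duplicates over the full read length; `redundant_fraction` is a hypothetical helper, not a FastQC function.

```python
def redundant_fraction(reads):
    """Share of reads that repeat a sequence already seen in the library.

    0.0 means every read is distinct; values near 1.0 mean the same
    diversity could have been obtained with far fewer reads.
    """
    if not reads:
        return 0.0
    distinct = len(set(reads))
    return 1 - distinct / len(reads)


# Ten reads but only five distinct sequences: half the reads are redundant.
toy_reads = ["ACGT"] * 6 + ["CCCC", "GGGG", "TTTT", "AAAA"]
print(redundant_fraction(toy_reads))
```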

For RNA-Seq data you may have to accept that to see the most lowly expressed transcripts you need to oversequence the highly expressed transcripts, so you might consider this to be an expected fail.

The pass/warn/fail categories in FastQC are really just an indication of where you should focus your attention, not an absolute call that things are wrong. I've got a collection of perfectly good datasets which between them fail every test which FastQC does :-)

The repeat stuff is interesting though - we've seen datasets with CA repeats, but where we think this might be functionally relevant. We've not seen CAA repeats before though. I'm still never entirely sure whether to think of these as real effects or whether there's an artefact in the library prep or the sequencing which causes these.

**gconcepcion** · 01-26-2011, 08:02 AM

Thank you both for the insights. It's good to get confirmation that high levels of duplication in mRNA-seq data may be par for the course.

The STRs are interesting. As you could tell by the double peak on the GC content distribution graph, they make up a significant portion of the library. They are not unique to the Illumina data; I also have a 454 dataset (350bp x 94000 reads = 32.9 Mb) with similar repeats comprising a large subset. Also, 'back in the day' when high-throughput Sanger sequencing was all the rage, I helped sequence ~10,000 clones from EST libraries of two similar organisms and found a lot of similar repetitive elements. This definitely pushes me more towards them being biologically significant rather than an artefact, as they are confirmed by 3 sequencing technologies & 3 different library prep protocols on 3 different taxa in the same group.

At any rate, I'm running into the inevitable hurdle of our lab computer with the most memory (32 GB) being woefully inadequate for de novo assembly at this point.

While I await super computer access, would it be an OK strategy for me to circumvent the memory issues by mapping the Illumina reads to the 454 transcripts? Any special considerations?

**FWOS** · 05-25-2011, 11:56 AM

Help Interpreting mRNA Seq Duplicate Sequence Plot FastQC

Hi All,

I recently noticed some strange trends in the duplicate sequence plots generated from a 2x50bp RNA sequencing experiment performed on an Illumina HiSeq. I understand that the libraries will most likely contain some duplicates that might have resulted from oligo(dT) and/or random hexamer priming methods and/or PCR. It also makes sense that the FastQC thresholds are based on libraries created via DNA fragmentation, etc.
What I am trying to figure out is how the duplicate sequence plot calculates the total percentage of non-unique sequences. Specifically, I have a data set with non-unique sequences calculated by FastQC to be > 53% of all sequences, but it seems like only two sequences are listed as "over represented" (>0.1%). I am not sure how it would be possible for such a small percentage of non-unique sequences to have such a large impact on the total number of non-unique sequences. Considering that only the following two over represented sequences are listed in the FastQC report:

1.) 0.673779203474544% TruSeq Adapter, Index 2 (100% over 50bp)
2.) 0.1471451982022855% TruSeq Adapter, Index 2 (100% over 49bp)

... Does anyone know how it is that the total percentage of duplicate sequences is 53% when only ~0.8% can be attributed to the primer contaminants?

Is there a calculation that relates specific contributions of overly expressed duplicate sequences to the total percentage of non-unique sequences, or something similar?

Please see attached, the Duplicate Sequence Plot that I am referring to:

Attached Files

duplication_levels_1.png (7.4 KB, 488 views)

**fkrueger** · 05-25-2011, 12:52 PM

Hi FWOS,

The section "overrepresented sequences" shows only sequences that are present above a certain threshold. I'm not quite sure about the exact value, but 0.1% of the total sequences in the file seems reasonable. So if your input file was, say, 50 million reads, then any sequence present more than 50,000 times would show up. In your case there are only 2 minor adapter contaminations, so the library seems to be reasonably clean.
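The arithmetic behind that threshold is easy to check for any library size. The sketch below assumes the 0.1% figure quoted in this thread; `overrepresented_cutoff` is a hypothetical helper name.

```python
def overrepresented_cutoff(total_reads, fraction=0.001):
    """Read count a single sequence must exceed to pass the threshold.

    fraction=0.001 (0.1%) is the value quoted in this thread; treat it as
    an assumption rather than a guaranteed FastQC setting.
    """
    return round(total_reads * fraction)


for library_size in (20_000_000, 50_000_000):
    print(library_size, overrepresented_cutoff(library_size))
```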

The Duplication Plot shows how many sequences were seen once, twice... up to more than 10 times (exact matches over the entire length). The duplication level is counted as duplicated sequences (present more than once) / (unique sequences (present only once) + duplicated sequences) * 100, in %. Even though the figure you linked is too small to read anything, it seems that a fair number of sequences are present more than once, which is normally due to PCR amplification, but you are right, adapter contamination will also contribute to the overall duplication level.
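A minimal sketch of this kind of calculation is shown here, taking the duplication level to mean the percentage of reads whose exact sequence occurs more than once. That reading, the `duplication_profile` name, and the pooling of everything at 10 or more copies into one bin are assumptions for illustration; this is not FastQC's actual implementation.

```python
from collections import Counter


def duplication_profile(reads, max_level=10):
    """Return (levels, percent_non_unique) for a list of read sequences.

    levels[n] is the number of distinct sequences seen n times, with all
    sequences seen max_level or more times pooled into levels[max_level].
    percent_non_unique is the share of reads whose sequence occurs more
    than once (exact match over the entire read).
    """
    seq_counts = Counter(reads)
    levels = Counter()
    for copies in seq_counts.values():
        levels[min(copies, max_level)] += 1
    total = len(reads)
    non_unique = sum(c for c in seq_counts.values() if c > 1)
    percent = 100 * non_unique / total if total else 0.0
    return dict(levels), percent


# 16 reads: one sequence seen 12 times, one seen twice, two singletons.
toy_reads = ["ACGT"] * 12 + ["CCCC"] * 2 + ["GGGG", "TTTT"]
levels, percent = duplication_profile(toy_reads)
print(levels, percent)
```

Because the top bin pools everything at `max_level` or above, a sequence seen 12 times and one seen 12,000 times look identical in the profile, which is why such a plot cannot show how heavily the most duplicated sequences are duplicated.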

Even though 50% is not great, there are still plenty of reads which are unique or present in low abundance (and we have seen much worse levels than this). Hope this helps.

**simonandrews** · 05-25-2011, 12:55 PM

I've actually just written a blog post which tries to explain the duplicate sequences plot in a bit more detail because sometimes it's not obvious what it's saying.

Looking at the graph you attached I'm surprised to see the overall duplication level as low as 53% as it looks higher than that. Basically you can get really high duplication levels by having a very small number of sequences which dominate an otherwise diverse library (in which case they'll show up in the overrepresented sequences list), or by having a larger number of sequences with moderately high duplication. Only sequences which individually represent more than 0.1% of the library (so 20,000 duplicates in a 20 million read library) are shown in the overrepresented sequence list which is a pretty high barrier. It's therefore easy to get high duplication levels by having sequences duplicated a few hundred times each which won't put anything in the list of overrepresented sequences.
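A toy library makes this concrete: many sequences duplicated a couple of hundred times each can produce a high overall duplication level without any single sequence crossing the 0.1% cutoff. The numbers below are invented for illustration, and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(counts, threshold=0.001):
    """Return (percent of reads that are non-unique, overrepresented list).

    counts maps each distinct sequence to its number of occurrences.
    A sequence is listed as overrepresented when it alone makes up more
    than `threshold` of all reads (0.1% assumed, as quoted in the thread).
    """
    total = sum(counts.values())
    dup_reads = sum(c for c in counts.values() if c > 1)
    over = [seq for seq, c in counts.items() if c / total > threshold]
    return 100 * dup_reads / total, over


# Invented library: 1,000 sequences seen 200 times each (200,000 reads)
# plus 200,000 singletons, for 400,000 reads in total.
toy = Counter({f"dup{i}": 200 for i in range(1_000)})
toy.update({f"uniq{i}": 1 for i in range(200_000)})

dup_pct, overrepresented = summarize(toy)
print(dup_pct, len(overrepresented))
```

Each duplicated sequence here accounts for only 0.05% of the reads, so none of them would be listed individually even though half of the library is duplicated.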

In your case you have a high number of sequences with >10 duplicates (and there's no way to tell how much greater than 10 they are from the plot), but these are going to contribute the majority of the duplication in your particular library.

**Celli** · 07-04-2011, 05:56 AM

Hello All,

I have some Illumina data (single end RNA-Seq) that has a 'funny' bias in Kmer distribution (FastQC plots) even after trimming. I have attached a number of FastQC plots (explained below) of both the raw reads and the reads after adaptor trimming plus trimming 32 bp from the 3' end (due to low quality scores and adaptor sequencing at read ends). If anyone has thoughts on what may be causing these patterns and how to avoid similar data in future Illumina runs, they would be greatly appreciated! I can trim these off altogether by reducing my reads to ~30bp in length, but without knowing what is causing this pattern I can't assess if the short reads would be uncontaminated by whatever this problem is (I would like to do Diff. Expression analysis).

Thanks so much!
Celli

1. PerBaseQualityUntrimmed.pdf: quality seems okay until around 80-85 bp, so reads were trimmed to this length.
2. PerBaseContentUntrimmed.pdf: the 'flared' ends in lanes 5-8 have been mentioned in previous posts as adaptor sequencing. Uncertain what would cause the 'bridges' from ~60 bp to 110 bp in lanes 2 & 3.
3. PerBaseContentTrimmed.pdf: trimming removes all evidence of adaptor sequencing from this diagnostic plot.
4. KmerUntrimmed.pdf: large 'hills' at read ends in lanes 2 & 3 seem to reflect whatever is showing up in PerBaseContentUntrimmed.pdf. Uncertain if I should be concerned about lane 5 as well?
5. KmerTrimmed.pdf: even after trimming, 'hills' in lanes 2 and 3 are apparent from about 35 bp to the read end.

**simonandrews** · 07-04-2011, 11:56 PM

As you are already aware you have adapter contamination in your various libraries, but with quite a lot of variability as to where in the library it starts. You're also seeing some bias at the start of your reads, but this happens in all RNA-Seq libraries, so don't worry about that too much.

It looks like your adapter trimming has mostly fixed the biases you were seeing. Although there are some Kmers still enriched in your trimmed data I'd suspect that these show only low level enrichment (you'd need to look at the table under the graph to see how enriched they are - the graph only shows the pattern of enrichment). No adapter trimmers manage to remove every trace of adapter so you might just be seeing the ones which snuck through your original screen. As long as these are a fairly small proportion of your library you should be OK.
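To get a feel for what 'how enriched' means, one simple measure is the ratio of a Kmer's observed count to the count expected from the overall base composition. The sketch below is only one possible definition under that assumption and is not FastQC's Kmer module; `kmer_obs_exp` is a hypothetical helper.

```python
from collections import Counter
from math import prod


def kmer_obs_exp(reads, k=5):
    """Observed/expected ratio for every k-mer found in the reads.

    The expected count assumes bases are drawn independently with the
    library-wide base frequencies; a ratio well above 1 suggests the
    k-mer is enriched (for example, leftover adapter sequence).
    """
    base_counts = Counter(base for read in reads for base in read)
    total_bases = sum(base_counts.values())
    freq = {base: n / total_bases for base, n in base_counts.items()}

    kmers = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total_kmers = sum(kmers.values())
    return {
        kmer: n / (total_kmers * prod(freq[b] for b in kmer))
        for kmer, n in kmers.items()
    }


# Two homopolymer reads: AA and CC each occur twice as often as expected.
ratios = kmer_obs_exp(["AAAA", "CCCC"], k=2)
print(ratios)
```

Comparing these ratios before and after trimming gives a rough sense of whether the remaining enrichment is large or just residual adapter that slipped through.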

The easiest way to test the quality of your trimmed library is to try to map it. If you get good mapping efficiency then you've probably done OK in removing whatever contaminants were present.

**rpauly** · 10-28-2011, 12:21 PM

Hi...
I have a very similar problem, but I am not sure if the data is of good quality. Also, my overrepresented sequences are almost 15% of the reads in some cases. Should I be concerned?

**simonandrews** · 10-31-2011, 12:39 AM

Originally posted by rpauly

Hi...
I have a very similar problem, but I am not sure if the data is of good quality. Also, my overrepresented sequences are almost 15% of the reads in some cases. Should I be concerned?

It's very difficult to comment specifically without knowing the details of your experiment. In some cases you might expect a few sequences to be hugely overrepresented in your library, but mostly this is a bad thing. The important thing is to try to understand where those sequences come from if they're not automatically identified by FastQC so you can try to avoid them in future. Having said that, 15% is a very high level of contamination by a small number of sequences and probably does indicate a problem in your library preparation - this doesn't mean the rest of the library isn't useful, but it's something you want to look at more closely.

FastQC - strange 'per base sequence content' graph

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News