Seqanswers Leaderboard Ad

**thinkRNA** · 11-11-2010, 08:12 PM

Originally posted by simonandrews View Post

FastQC v0.7.0 has been released. This adds a new QC module which looks for enriched Kmers in the library. This can help find overrepresented sequences which are not aligned in the data, such as a library which is reading through its inserts into the adapters, but at varying points in the data.

You can see an example of the new plot here.

You can get the new version from:

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

[If you don't see the new version of any page hit shift+refresh to force our cache to update]

Hi Simon, I cannot thank you enough for putting in this feature. I was struggling with figuring out what is wrong with my dataset as the replicates were not agreeing. I now see enrichement of k-mers in one sample but not the other.

Question 0: can you explain the graph a little. For example, in your illustration, the orange plot representing CCCCC has a peak at position 11 and a dip at position 7. Does that mean 100% of sequences have CCCCC at position 11 but only 20% have CCCCC at position 7?

Question 1: How exactly do you calculate the expected counts? What is the algorithm you follow to generate a sequence set that has same base content as the library.

Question 2: what is the difference between Obs/exp overall and Obs/Exp Max

Question 3: I don't understand why your program will not capture this: "If you have a partial sequence which is appearing at a variety of places within your sequence then this won't be seen either by the per base content plot or the duplicate sequence analysis."

Question 4: Have you thought about correcting for multiple testing and perhaps applying some statistics to get a p-value. I understand right now, you are reporting any fraction of obs/exp > 3.

Finally, is it possible for you to allow the user to look at 6-mers or even 7-mers (and make that a flexible option) and how will your statistics vary if you increase the kmer.

Once again, thank you so much for putting this out-I am sure it will be tremendously beneficial to the community at large.

Priyam

**Bruins** · 11-12-2010, 01:47 AM

Hi,

I haven't but it is high on my todo list.

**simonandrews** · 11-12-2010, 02:58 AM

Originally posted by thinkRNA View Post

Hi Simon, I cannot thank you enough for putting in this feature. I was struggling with figuring out what is wrong with my dataset as the replicates were not agreeing. I now see enrichement of k-mers in one sample but not the other.

Question 0: can you explain the graph a little. For example, in your illustration, the orange plot representing CCCCC has a peak at position 11 and a dip at position 7. Does that mean 100% of sequences have CCCCC at position 11 but only 20% have CCCCC at position 7?

No. The % relates to the maximum enrichment for that Kmer, so 0 really means no occurrences, but 100% means that that was the position at which the Kmer was most observed. If you use a real scale on the y-axis then you can get really thrown by different levels of enrichment in different Kmers. The intention is to look at the shape of the distribution rather than the total levels (which are given in the table). If you really have 100% of one Kmer in a library then this should be really obvious in the per-base sequence content graph.

Originally posted by thinkRNA View Post

Question 1: How exactly do you calculate the expected counts? What is the algorithm you follow to generate a sequence set that has same base content as the library.

I count the number of occurrences of G,A,T and C and then use these to calculate a probability for each individual Kmer sequence by multiplying the probabilities of the individual bases contained in the Kmer. I then multiply that by the total number of observed Kmers to get an expected value, and then divide the observed value by this to get O/E.

Originally posted by thinkRNA View Post

Question 2: what is the difference between Obs/exp overall and Obs/Exp Max

Overall is the enrichment of that Kmer as a whole, Obs/Exp Max is the highest enrichment of that Kmer at a single position in the read. If the enrichment is consistent throughout the reads (ie its position in a read is not biased) then the two values should be the same, if the enrichment occurs mostly at one position in the read then the max value will be much higher than the overall.

Originally posted by thinkRNA View Post

Question 3: I don't understand why your program will not capture this: "If you have a partial sequence which is appearing at a variety of places within your sequence then this won't be seen either by the per base content plot or the duplicate sequence analysis."

The Kmer analysis will capture this - that was the point I was trying to make, but the other modules may not. If your enriched region is not aligned in the library then it may not show up in the other analyses which rely on very biased composition at a single position, or exact duplication of sequences.

Originally posted by thinkRNA View Post

Question 4: Have you thought about correcting for multiple testing and perhaps applying some statistics to get a p-value. I understand right now, you are reporting any fraction of obs/exp > 3.

I think that would be a useful thing to add. The problem we've had when we've looked into this previously is that it's hard to calculate exact p-values for this kind of enrichment due to the high number of observations. You end up trying to work out permutations for enormous numbers which involves some very non-trivial maths. This is far enough outside our normal area that it's not something I'm comfortable doing - but it anyone on the list is happy to help out with this then please contact me and I'll look into it.

Originally posted by thinkRNA View Post

Finally, is it possible for you to allow the user to look at 6-mers or even 7-mers (and make that a flexible option) and how will your statistics vary if you increase the kmer.

Actually the code for the Kmer analysis lets you set a lower and upper threshold for the Kmer sizes you're interested in. The reason I stuck to 5mers was

If you use longer Kmers you have to store them in memory and the memory usage gets really big really quickly
If you use more than 1 Kmer size you have to do multiple passes through each read to extract the different lengths, and this can dramatically slow down the speed with which you process the file
The stats for long Kmers get a bit silly. You end up with huge enrichments just because you saw something a few times when any individual sequence is unlikely to occur. Proper stats would certainly help this.

5mers seemed to be a good compromise between things you would see some other way (eg through sequence composition) whilst still giving reasonably reliable fold enrichment values. I did play around with trying to assemble the enriched 5mers into longer stretches (ie where you have overlapping Kmers with offset correlation in their enrichment profiles then assume that one is an extension of the other), but whilst this worked really well in some cases, it actually removed interesting stuff in other cases, so I scrapped it.

Originally posted by thinkRNA View Post

Once again, thank you so much for putting this out-I am sure it will be tremendously beneficial to the community at large.

No worries - I'm sure there's still more which can be done with this, but hopefully the tool in its current form will let people get a better feel for what else they might want to see.

**thinkRNA** · 11-12-2010, 08:13 AM

Hi Simon,

thanks for your quick reply-everything makes sense now. Except, I am still not sure how you generate the expected model, can you illustrate this with a small example? may be only 4 short sequences and a 3-mer?

How can I get the code to run this with the option of different lengths of k-mers?

Thank YOU!!
priyamsingh81atg m a i ldot c o m

**simonandrews** · 11-12-2010, 08:33 AM

Originally posted by thinkRNA View Post

Hi Simon,

thanks for your quick reply-everything makes sense now. Except, I am still not sure how you generate the expected model, can you illustrate this with a small example? may be only 4 short sequences and a 3-mer?

OK, let's say that the frequency with with the different bases showed up in your data was:

A 21%
T 19%
G 31%
C 29%

...and that you observed 100,000 5-mers in total

If you were working out the expected value for GGATC you would do:

100,000 * (0.31*0.31*0.21*0.19*0.29) = 111

If you actually saw this sequence 200 times then your O/E would be 200/111 = 1.8

Originally posted by thinkRNA View Post

How can I get the code to run this with the option of different lengths of k-mers?

You can download the source code from our site. You'll need to change the static values:

MAX_KMER_SIZE and MIN_KMER_SIZE

in uk.ac.bbsrc.babraham.FastQC.Modules.KmerContent.java

and then recompile the program. If you're going to make the max size much more than 5 then you'll probably need to change the JVM startup options to allocate more memory so it doesn't run out (change the -Xmx value to something higher)

Good luck!

Simon.

**thinkRNA** · 11-12-2010, 08:42 AM

OK, let's say that the frequency with with the different bases showed up in your data was:

A 21%
T 19%
G 31%
C 29%

...and that you observed 100,000 5-mers in total

If you were working out the expected value for GGATC you would do:

100,000 * (0.31*0.31*0.21*0.19*0.29) = 111

If you actually saw this sequence 200 times then your O/E would be 200/111 = 1.8

I see, do you think a Markov chain would work better here. i.e. generate a base in the expected set based on the probability of occurrence of previous 2 or 3 bases in the observed set (just a suggestion).

You can download the source code from our site. You'll need to change the static values:

MAX_KMER_SIZE and MIN_KMER_SIZE

in uk.ac.bbsrc.babraham.FastQC.Modules.KmerContent.java

and then recompile the program. If you're going to make the max size much more than 5 then you'll probably need to change the JVM startup options to allocate more memory so it doesn't run out (change the -Xmx value to something higher)

Good luck!

Simon.

Thanks!unfortunately I am not a java programmer but pretty good perl one. Can you tell me the command to compile from the source code. Also, is there an option to run the program on all the sequences not just the top 20% and if not, can you make it available? I have tons of memory, so its really not an issue.

**simonandrews** · 11-12-2010, 08:57 AM

Originally posted by thinkRNA View Post

I see, do you think a Markov chain would work better here. i.e. generate a base in the expected set based on the probability of occurrence of previous 2 or 3 bases in the observed set (just a suggestion).

Possibly, but only if we consistently see that the simple approach estimates incorrectly. I think a true probabalistic measure of how likely we were to see the observed level of enrichment given the number of observations is going to be more useful - but I tried looking at that and drowned in the maths involved.

Originally posted by thinkRNA View Post

Thanks!unfortunately I am not a java programmer but pretty good perl one. Can you tell me the command to compile from the source code.

We don't ship build scripts with our code so it's probably easiest to load the project into something like Eclipse or Netbeans and let that handle the compilation for you. Find a friendly local java programmer if you want to play around with this.

Originally posted by thinkRNA View Post

Also, is there an option to run the program on all the sequences not just the top 20% and if not, can you make it available? I have tons of memory, so its really not an issue.

You can make it do the whole file by commenting out line 194 in the KmerContent.java file:

// if (skipCount % 5 != 0) return;

The program actually analyses every 5th sequence, rather than the first 20%. The reason for not doing the whole thing isn't memory (since this won't change anyway), but speed since generating lots of substrings is relatively slow. Unless you have very small datasets I doubt you'll see any difference by reading the whole thing.

**thinkRNA** · 11-12-2010, 11:31 AM

Simon, I changed the MAX_KMER_SIZE=8 and MIN_KMER_SIZE=8 and tried to recompile but get the following error: Can you please help?

~/progs/FastQC_Compile$ javac ./uk/ac/bbsrc/babraham/FastQC/FastQCApplication.java

./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:26: package net.sf.samtools does not exist
import net.sf.samtools.SAMFileReader;
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:27: package net.sf.samtools does not exist
import net.sf.samtools.SAMRecord;
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:34: cannot find symbol
symbol : class SAMFileReader
location: class uk.ac.bbsrc.babraham.FastQC.Sequence.BAMFile
private SAMFileReader br;
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:36: cannot find symbol
symbol : class SAMRecord
location: class uk.ac.bbsrc.babraham.FastQC.Sequence.BAMFile
Iterator<SAMRecord> it;
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:43: package SAMFileReader does not exist
SAMFileReader.setDefaultValidationStringency(SAMFileReader.ValidationStringency.SILENT);
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:43: cannot find symbol
symbol : variable SAMFileReader
location: class uk.ac.bbsrc.babraham.FastQC.Sequence.BAMFile
SAMFileReader.setDefaultValidationStringency(SAMFileReader.ValidationStringency.SILENT);
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:45: cannot find symbol
symbol : class SAMFileReader
location: class uk.ac.bbsrc.babraham.FastQC.Sequence.BAMFile
br = new SAMFileReader(file);
^
./uk/ac/bbsrc/babraham/FastQC/Sequence/BAMFile.java:69: cannot find symbol
symbol : class SAMRecord
location: class uk.ac.bbsrc.babraham.FastQC.Sequence.BAMFile
SAMRecord record = it.next();
^
8 errors

**simonandrews** · 11-12-2010, 12:33 PM

Originally posted by thinkRNA View Post

Simon, I changed the MAX_KMER_SIZE=8 and MIN_KMER_SIZE=8 and tried to recompile but get the following error: Can you please help?

We should take this off list as this is just a generic java problem (you need to add the picard jar file to your sourcepath). I'd suggest looking at some tutorials on java compilation to see some examples of how to recompile specific classes.

If you're still stuck you can email me directly ([email protected]).

**thinkRNA** · 11-12-2010, 03:54 PM

Hi Simon,

I have attached image of top 6 relatively enriched 5-mers in my alignment generated using tophat (mouse RNA-sequencing run from Illumina, length 75, single end). Most of the words are enriched between 2-2.7 fold from the expected set in this sample. I am particularly worried about the yellow, black and green peaks in the middle-have you seen this before? http://picasaweb.google.com/priyamsi...10021440257874.

The technical replicate of this sample shows very high variation in genes that have high expression, so essentially the technical replicates don't agree with each other. When I look at the enrichment in this samples, I find some of the same words (CAGCA, CTGGG) but the positioning is different.http://picasaweb.google.com/priyamsi...14233664056082

I also see that CAGCA and CAGAA are relatively enriched 2-fold in a number of my other samples. Would this be an indication of an error introduced during library preparation (or contamination or biases in sequencing)?

I guess what I am asking is how to apply your k-mer analysis to determine a red flag. Can you please comment on if you have seen such biases in other Illumina samples?

Just want to add that almost all my samples show a duplication of 60%! Does that mean more than half my data is redundant? This will make sense biologically if more than half my genes have a more than a single read representing them, so it isn't all that bad news.

**simonandrews** · 11-13-2010, 01:21 AM

Originally posted by thinkRNA View Post

Hi Simon,

I have attached image of top 6 relatively enriched 5-mers in my alignment generated using tophat (mouse RNA-sequencing run from Illumina, length 75, single end). Most of the words are enriched between 2-2.7 fold from the expected set in this sample. I am particularly worried about the yellow, black and green peaks in the middle-have you seen this before?

Staggered single peaks such as you're seeing with the yellow,black and red peaks are usually indicative of a single longer sequence which is being found mostly in the same place. In your case you can see that you can assemble the 3 staggered peaks to infer an actual sequence of AGCAGTCGGG which starts at position 25. As I said previously I did experiment with doing this kind of assembly in the program, but found that in some cases the assembly gave misleading results.

Originally posted by thinkRNA View Post

The technical replicate of this sample shows very high variation in genes that have high expression, so essentially the technical replicates don't agree with each other. When I look at the enrichment in this samples, I find some of the same words (CAGCA, CTGGG) but the positioning is different.
I also see that CAGCA and CAGAA are relatively enriched 2-fold in a number of my other samples. Would this be an indication of an error introduced during library preparation (or contamination or biases in sequencing)?

I suspect that a lot of these result from more generic bias at the start of your sequences. Pretty much all RNA-seq data we've seen shows an overall sequence bias (which you can see in the per base sequence content plot) over the first 10 bases or so. In your case all of the Kmers you mention start with C, and they all peak at base 1 so I suspect that these are the result of the library prep rather than a contamination.

Originally posted by thinkRNA View Post

I guess what I am asking is how to apply your k-mer analysis to determine a red flag.

Frankly we're still getting to grips with this data so I don't think we can say for sure how best to interpret all of this. If you have problems which were flagged up in the per-base sequence content plot or the overrepresented sequence analysis then the Kmer analysis will find these same things (but possibly in a less obvious way)! Where it's going to be useful is finding unaligned biases, where you have enrichment which spans several starting positions for a single Kmer. It may be that at the moment we're flagging things which where the enrichment high enough to actually be of concern, or maybe there's a better metric than simple enrichment to use. Other people's opinions or experience are very welcome.

Originally posted by thinkRNA View Post

Just want to add that almost all my samples show a duplication of 60%! Does that mean more than half my data is redundant? This will make sense biologically if more than half my genes have a more than a single read representing them, so it isn't all that bad news.

It means that 60% of your data is resequencing of a sequence you've already seen. That's pretty high really - our diverse ChIP-Seq libraries normally run below 20% and RNA-Seq is normally below 30%. Even in RNA-Seq data (unless you're doing tag sequencing) you shouldn't see a high proportion of exact duplicates since there are many possible reads you can place over a transcript. You have to have *really* high coverage of a gene before you start seeing high levels of exact duplication.

The other reason for seeing this is if you have some contamination. Libraries with adapter contamination quickly show high levels of duplicate sequences.

**thinkRNA** · 11-15-2010, 06:00 PM

I suspect that a lot of these result from more generic bias at the start of your sequences. Pretty much all RNA-seq data we've seen shows an overall sequence bias (which you can see in the per base sequence content plot) over the first 10 bases or so. In your case all of the Kmers you mention start with C, and they all peak at base 1 so I suspect that these are the result of the library prep rather than a contamination.

Simon, I also noted that in almost all my 8 samples except 1 (where you see the peak at 25), the overrepresented k-mers are located towards the beginning of the sequence. The per base CG content is also staggering toward the first 120-13 bases and then flattens out. Question: if I trim the first 10 bases and then align the reads, can I assume that this will get rid of the bias?

Finally, what is the algorithm you follow to calculate the sequence duplication levels? Do you allow for mis-matches in reads that you annotate as duplicate?

Also can you explain what you mean by libraries with "adaptor contamination" quickly show high levels of duplication.

Frankly we're still getting to grips with this data so I don't think we can say for sure how best to interpret all of this. If you have problems which were flagged up in the per-base sequence content plot or the overrepresented sequence analysis then the Kmer analysis will find these same things (but possibly in a less obvious way)! Where it's going to be useful is finding unaligned biases, where you have enrichment which spans several starting positions for a single Kmer. It may be that at the moment we're flagging things which where the enrichment high enough to actually be of concern, or maybe there's a better metric than simple enrichment to use. Other people's opinions or experience are very welcome.

None of my samples had overrepresented sequences in them, perhaps because I gave it a really good set of aligned sequences from tophat. The only other metric I can think of is to have an ideal sequence set with no biases and use that as the background or null model (ofcourse this would be species specific and even platform specific- One may generate this computationally also). Then apply a z-test statistic (not sure about this, can you suggest a better test?) to test the hypothesis that our observed sample is different from the expected sample. One can test every possible word and also correct for multiple testing using benjamini hochberg method. I may implement this locally and test it on my set.

**simonandrews** · 11-16-2010, 01:05 AM

Originally posted by thinkRNA View Post

Simon, I also noted that in almost all my 8 samples except 1 (where you see the peak at 25), the overrepresented k-mers are located towards the beginning of the sequence. The per base CG content is also staggering toward the first 120-13 bases and then flattens out. Question: if I trim the first 10 bases and then align the reads, can I assume that this will get rid of the bias?

I think it may get rid of the evidence for the bias, but probably not the bias itself. If the bias is in the priming sites for your random primers then this positional bias will still exist even if you trim the first 10 bases. It just means your reads will now be slightly downstream of a biased region, not over it. I think this is a case where you just have to live with it.

Originally posted by thinkRNA View Post

Finally, what is the algorithm you follow to calculate the sequence duplication levels? Do you allow for mis-matches in reads that you annotate as duplicate?

No, it's exact duplicates only. Anything more sophisticated would be too slow or too memory intensive to be practical for this kind of program.

Originally posted by thinkRNA View Post

Also can you explain what you mean by libraries with "adaptor contamination" quickly show high levels of duplication.

If you have an adapter sequence which makes up 20% of your library then you've instantly got a 20% duplication level since the adapter sequences will all be exactly the same.

Originally posted by thinkRNA View Post

The only other metric I can think of is to have an ideal sequence set with no biases and use that as the background or null model

I've actually done that in some of our other analyses which require this, but the problem is you can't make a generic library to simulate this since the values you work out are dependent on the sequence biases within the library and the length of the sequences. This means you end up having to generate a custom control set for each file you analyse, and this takes up as much resource as the real library. Also, if you're doing it properly you should run the control multiple times to get proper stats for how unlikely a given level of enrichment is, and that's just completely impractical for a general application.

**KevinLam** · 11-16-2010, 01:53 AM

Blazingly fast tool!
initially, I have put off using this as I presumed it worked only with Illumina fastq reads.
But AFAIK it worked beautifully with cs fastq generated with solid2fastq (from bfast)
e.g.

@745_16_144
T312..233.2.323332002101330.132.200.222.23.310132.3
+
20)!!'%%!3!%+*(%3&)(&%(%%-!'')!%*&!&%)!%(!%7)%)(!%

was recognised as
colorspace converted to bases
the nice thing was that the overrepresented sequences were presented in basespace

still understanding the output. but for SOLiD reads, the output is understandably bad as they prefer that you use mapping as a quality filter as opposed to filtering first.

question: should it be strange that I have 1.5 mil overrepresented kmers of TCAGA in a barcoded RNA-seq experiment?

**simonandrews** · 11-16-2010, 05:47 AM

Originally posted by KevinLam View Post

still understanding the output. but for SOLiD reads, the output is understandably bad as they prefer that you use mapping as a quality filter as opposed to filtering first.

Doing base conversion on colorspace data is not ideal, but I think that it can potentially tell you things about your library which sequence calls following mapping can't tell you. Contaminants and overrepresented sequences should still be detectable even with the failings of colorspace conversion.

Originally posted by KevinLam View Post

question: should it be strange that I have 1.5 mil overrepresented kmers of TCAGA in a barcoded RNA-seq experiment?

What is the distribution of the Kmers? We normally see quite a bit of bias in the first 10 bases of RNA-Seq data seemingly due to the random priming of the library. Does this match up with what you get?

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News