**SNPsaurus** · 05-13-2015, 11:13 AM

I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?

Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?

After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?

It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).

**nucacidhunter** · 05-13-2015, 12:30 PM

Could you post the whole FastQC results for the lane? That might give some clues about possible causes. Also, how many samples were multiplexed in the run?

**stickleback** · 05-13-2015, 06:53 PM

Originally posted by SNPsaurus:

> I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?

The top graph is from a single sequencing run. It does look like this is overrepresentation from a single individual but I checked it in more detail. After demultiplexing, this individual does not have a much larger number of reads than the others; furthermore in 100 K randomly sampled reads from the library, 3.8% have this barcode - again consistent with all the other individuals.

I'm finding the FastQC output a little confusing here. Does a relative enrichment of 100 mean that this barcode occurs 100 times more often than any other? That doesn't seem to be the case!
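The barcode check described above (sample reads at random, then count how many start with a given barcode) could be reproduced roughly as follows. This is only a sketch: it assumes plain 4-line FASTQ records with an inline barcode at the very start of each read, and the helper names and the file name in the usage note are made up for illustration.

```python
import random


def read_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record (assumed layout)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def sample_reads(seqs, n, seed=0):
    """Reservoir-sample up to n sequences in a single pass over the input."""
    rng = random.Random(seed)
    reservoir = []
    for i, seq in enumerate(seqs):
        if i < n:
            reservoir.append(seq)
        else:
            # Keep the new read with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = seq
    return reservoir


def barcode_fraction(seqs, barcode):
    """Fraction of reads that begin with the barcode (assumed inline, 5' end)."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(s.startswith(barcode) for s in seqs) / len(seqs)
```

With a hypothetical file `library.fastq`, something like `barcode_fraction(sample_reads(read_sequences(open("library.fastq")), 100_000), "CTGCT")` would give a figure comparable to the 3.8% quoted above.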

Originally posted by SNPsaurus:

> Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?

Yeah this is odd. It does seem that A at that position is enriched. From the 100k reads I randomly selected, this k-mer occurs in ~32% whereas the other possibilities (i.e. CAGGT/CAGGG/CAGGC) are 12-26%.
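For anyone wanting to repeat that comparison, here is a minimal sketch of counting the fraction of reads that contain each candidate k-mer at least once, anywhere in the read. The function name is hypothetical and the input is assumed to be an iterable of plain sequence strings.

```python
def kmer_read_fractions(seqs, kmers):
    """Return {kmer: fraction of reads containing that k-mer at least once}."""
    counts = {k: 0 for k in kmers}
    total = 0
    for seq in seqs:
        total += 1
        for k in kmers:
            if k in seq:
                counts[k] += 1
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


# The four possible bases following the CAGG remnant of the cut site.
CANDIDATES = ["CAGGA", "CAGGC", "CAGGG", "CAGGT"]
```

Because a read can contain more than one of these k-mers, the fractions need not sum to 1.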

Originally posted by SNPsaurus:

> After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?

Is that the adapter barcode? It doesn't appear in the adapter given to me by the sequencing centre. Unless you mean that this is similar to the kmer seen in the top graph?

Originally posted by SNPsaurus:

> It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).

I grepped out some of those reads from the same individual. Here are those with ACACA at 25-29:

@2_1202_17969_100758_1
TGCAGGAACCGCTGACATCCCGACACACACTTCTGCGCCCAGCGCCGAGTTACTCACTCTCCTACAGAACCAAGCAGTGGATCAGCAGGCACACACTTATGCACACAGAGGTTCACATGCAAGCACATGTTCAGGTGCCTCTAGCAACAATACATAGCTGTGCTCTCACTCATTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGEGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG=GG
--
@2_1202_9135_101219_1
TGCAGGATGCTCAGTATGAAGTGTACACATCCAGCTTTTGCTCGACTGTTTTGCATTATTAGAAGCACACTTTGTTTTTGCTGCTACAGAACAAGCGCAATAGCTGCTTTTTAAGCTGTCTGCAGGCATGAGGCACGTTAACCACCAGACAATTTTTGTTCCCTCAAGTGCTTTT
+
GFGGGGGGGGGGGGBGGGGGGGGGGEGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGFFGDEGEGGGGGGDGGGGGGGGGGGGGG0CCGEGGGGGGEGGGGGEGGGGGGGGGGGGGGGGBGCBGG@EGGDGGGGGGGGGGGGEGGGE
--
@2_1203_18858_2620_1
TGCAGGCTCTTTCAAGCCTACAGAACACACATAGGATACATGTCTCATGTACGCCATGTTACATATGTACATTCCACAGTATACTTACTACCATATATGGTAAGGAAGAAGCCGAGAATGTTGTTTATTACATGCTGTAAACTGAGTTTTGTGTAAACCACGTGATCTTATTGTG
+
GGGGGGCEGGFGGGG1FGG>1FGGGFGGGGB1=<1@FG1:F1FFGC1DBGGGFGDGGF<1CFGCGGEGGGG>FBDFGC@FDGGC@C@@DG>FG@F00C@:EFGG00=E>DFGG@...:C=0;@FD@@D=FGCGG0CGGGEGD=EGGEGB..88@@@.8;E,<-5B;GEGGGGG55
--

When I actually count the reads with these k-mers, they don't seem hugely enriched. For example, across the whole de-multiplexed individual, only 0.35% of reads have ACACA at positions 25-29.

Incidentally, my grep counts are far from those reported by FastQC. FastQC reports 2 377 660 occurrences of ACACA at positions 25-29, but grep returns just 15 506! Even being generous and allowing the ACACA k-mer to start anywhere between 25 and 29 bp still gives only 279 124 reads.

The count for the top kmer (i.e. the cut site) is larger than the number of reads present in the fastq file. I am starting to wonder whether FastQC might be the problem...
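A positional count like the grep one above can be cross-checked with a short script. This sketch assumes positions are 1-based (so position 25 is string index 24) and that the input is an iterable of sequence strings only, which avoids accidentally matching header or quality lines. The function name and the `window` parameter are illustrative.

```python
def count_kmer_at(seqs, kmer, start, window=0):
    """Count reads where kmer begins at a 1-based position in start..start+window."""
    hits = 0
    for seq in seqs:
        for pos in range(start, start + window + 1):
            i = pos - 1  # convert 1-based position to 0-based index
            if seq[i:i + len(kmer)] == kmer:
                hits += 1
                break  # count each read once
    return hits
```

`count_kmer_at(seqs, "ACACA", 25)` corresponds to the strict count, and `count_kmer_at(seqs, "ACACA", 25, window=4)` to the more generous one where the k-mer may start anywhere from position 25 to 29.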

**Brian Bushnell** · 05-13-2015, 07:43 PM

I've always found the kmer enrichment graph baffling, due to the unitless Y-axis. The other graphs are useful, but I would not worry too much about this one.

**GenoMax** · 05-14-2015, 01:07 AM

According to Simon, the k-mer module in FastQC only tracks 2% of the data for a sample. Perhaps the way it selects those reads (1 in 50) is what is causing this observation.
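To get a feel for how a 1-in-50 subsample could distort a count, here is a rough sketch. It assumes systematic selection (every 50th read) and a simple scale-up by 50; this thread does not say how FastQC actually picks reads or whether it scales its counts, so both are assumptions, and the function names are hypothetical.

```python
def systematic_sample(seqs, step=50):
    """Yield every step-th read, starting with the first (assumed scheme)."""
    for i, seq in enumerate(seqs):
        if i % step == 0:
            yield seq


def extrapolated_count(seqs, kmer, start, step=50):
    """Count kmer at a 1-based start position in the subsample, scaled by step."""
    i = start - 1
    sampled_hits = sum(
        1 for s in systematic_sample(seqs, step) if s[i:i + len(kmer)] == kmer
    )
    return sampled_hits * step
```

If the k-mer is spread evenly through the file, the scaled estimate lands close to the true count. If the reads carrying it are not evenly distributed relative to the sampling step, the estimate can be far off in either direction.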

Repetitive kmer profile in RAD seq libraries?