  • stickleback
    Junior Member
    • Feb 2013
    • 8

    Repetitive kmer profile in RAD seq libraries?

    Hi all,

    I am analysing data from several RAD-seq libraries. The libraries were digested with SbfI and single-end sequenced on an Illumina HiSeq to 200 bp.

    I ran the libraries through FastQC to screen them and get an idea of the quality. Generally the library looked fine, with only a little filtering and trimming needed. However, the k-mer profile showed a progressive enrichment of TTTTT along the reads:


    After demultiplexing the individuals in the library and removing adapter sequence using the process_radtags module of the Stacks pipeline, this enrichment went away, but the individual k-mer profiles now show the following odd step-wise enrichment:


    Does anyone have any idea what could be causing this? Thanks in advance for any suggestions.
  • SNPsaurus
    Registered Vendor
    • May 2013
    • 525

    #2
    I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?

    Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?

    After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?

    It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    • nucacidhunter
      Jafar Jabbari
      • Jan 2013
      • 1250

      #3
      Could you post the full FastQC results for the lane? That might give some clues to possible causes. Also, how many samples were multiplexed in the run?

      • stickleback
        Junior Member
        • Feb 2013
        • 8

        #4
        Originally posted by SNPsaurus
        I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?
        The top graph is from a single sequencing run. It does look like one individual is overrepresented, so I checked in more detail. After demultiplexing, this individual does not have a much larger number of reads than the others; furthermore, in 100K randomly sampled reads from the library, 3.8% have this barcode, again consistent with all the other individuals.

        I'm finding the FastQC output a little confusing here: does a relative enrichment of 100 mean that this barcode occurs 100 times more often than any other? That doesn't seem to be the case!
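The barcode check on randomly sampled reads described above could be sketched roughly as follows. This is only an illustration, not the poster's actual commands: it assumes the barcode sits at the very start of each read in the undemultiplexed library, and the function names `reservoir_sample` and `barcode_fraction` are made up for this example.

```python
import random

def reservoir_sample(items, k, rng):
    """Draw up to k items uniformly at random from an iterable of unknown length."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def barcode_fraction(reads, barcode):
    """Fraction of reads that start with the given barcode (assumed inline, at position 1)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(r.startswith(barcode) for r in reads) / len(reads)

# Toy library (made up): 38 of 1000 reads carry the CTGCT barcode.
toy_reads = ["CTGCTTGCAGG"] * 38 + ["AAGGATGCAGG"] * 962
subset = reservoir_sample(toy_reads, 100, random.Random(1))
print(f"{100 * barcode_fraction(subset, 'CTGCT'):.1f}% of sampled reads start with CTGCT")
```

With a real FASTQ file the sequence lines would be fed in instead of `toy_reads`; the sampled fraction should sit close to the barcode's true share of the library.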

        Originally posted by SNPsaurus
        Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?
        Yeah, this is odd. It does seem that A at that position is enriched. In the 100k reads I randomly selected, this k-mer occurs in ~32%, whereas the other possibilities (i.e. CAGGT/CAGGG/CAGGC) each occur in 12-26%.
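One way to look at the base that follows the cut site is to tally it directly. The sketch below is an assumption-laden illustration, not what was run in the thread: it assumes demultiplexed reads begin with the cut-site remnant TGCAGG (as the reads quoted further down do), and `base_after_prefix` and `as_percent` are hypothetical helper names.

```python
from collections import Counter

def base_after_prefix(seqs, prefix="TGCAGG"):
    """Tally the base immediately following `prefix` at the start of each read."""
    counts = Counter()
    for seq in seqs:
        if seq.startswith(prefix) and len(seq) > len(prefix):
            counts[seq[len(prefix)]] += 1
    return counts

def as_percent(counts):
    """Convert a tally into percentages of the reads that were counted."""
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()} if total else {}

# First ten bases of the three reads quoted further down in this post.
reads = ["TGCAGGAACC", "TGCAGGATGC", "TGCAGGCTCT"]
counts = base_after_prefix(reads)
print(counts, as_percent(counts))
```

Note that this counts only the base at position 7 of each read, which is a narrower question than how often CAGGA occurs anywhere in a read.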

        Originally posted by SNPsaurus
        After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?
        Is that the adapter barcode? It doesn't appear in the adapter given to me by the sequencing centre. Unless you mean that this is similar to the kmer seen in the top graph?

        Originally posted by SNPsaurus
        It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).
        I grepped out some of those reads from the same individual. Here are those with ACACA at 25-29:

        @2_1202_17969_100758_1
        TGCAGGAACCGCTGACATCCCGACACACACTTCTGCGCCCAGCGCCGAGTTACTCACTCTCCTACAGAACCAAGCAGTGGATCAGCAGGCACACACTTATGCACACAGAGGTTCACATGCAAGCACATGTTCAGGTGCCTCTAGCAACAATACATAGCTGTGCTCTCACTCATTA
        +
        GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGEGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG=GG
        --
        @2_1202_9135_101219_1
        TGCAGGATGCTCAGTATGAAGTGTACACATCCAGCTTTTGCTCGACTGTTTTGCATTATTAGAAGCACACTTTGTTTTTGCTGCTACAGAACAAGCGCAATAGCTGCTTTTTAAGCTGTCTGCAGGCATGAGGCACGTTAACCACCAGACAATTTTTGTTCCCTCAAGTGCTTTT
        +
        GFGGGGGGGGGGGGBGGGGGGGGGGEGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGFFGDEGEGGGGGGDGGGGGGGGGGGGGG0CCGEGGGGGGEGGGGGEGGGGGGGGGGGGGGGGBGCBGG@EGGDGGGGGGGGGGGGEGGGE
        --
        @2_1203_18858_2620_1
        TGCAGGCTCTTTCAAGCCTACAGAACACACATAGGATACATGTCTCATGTACGCCATGTTACATATGTACATTCCACAGTATACTTACTACCATATATGGTAAGGAAGAAGCCGAGAATGTTGTTTATTACATGCTGTAAACTGAGTTTTGTGTAAACCACGTGATCTTATTGTG
        +
        GGGGGGCEGGFGGGG1FGG>1FGGGFGGGGB1=<1@FG1:F1FFGC1DBGGGFGDGGF<1CFGCGGEGGGG>FBDFGC@FDGGC@C@@DG>FG@F00C@:EFGG00=E>DFGG@...:C=0;@FD@@D=FGCGG0CGGGEGD=EGGEGB..88@@@.8;E,<-5B;GEGGGGG55
        --
        Actually counting the reads with these k-mers, they don't seem hugely enriched. For example, for the whole demultiplexed individual, only 0.35% of reads have ACACA at positions 25-29.

        Incidentally, my counts using grep are way off those reported by FastQC. The latter reports 2 377 660 occurrences of ACACA at positions 25-29, but grep returns just 15 506! Even being generous and allowing the ACACA k-mer to start anywhere between 25 and 29 bp still gives only 279 124 reads.
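For anyone wanting to repeat the positional count without grep, a minimal Python sketch is below. It is only an illustration under stated assumptions: positions are treated as 1-based (as in the FastQC plot), the input is plain 4-line FASTQ, and `fastq_sequences`, `count_kmer_starts` and `toy_record` are names invented for this example.

```python
import io

def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def count_kmer_starts(seqs, kmer, first, last=None):
    """Count reads in which `kmer` starts at a 1-based position in [first, last]."""
    last = first if last is None else last
    k = len(kmer)
    hits = 0
    for seq in seqs:
        # Positions are 1-based; Python slices are 0-based.
        if any(seq[p - 1:p - 1 + k] == kmer for p in range(first, last + 1)):
            hits += 1
    return hits

def toy_record(name, seq):
    """Build a made-up FASTQ record with a constant quality string."""
    return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"

# Toy data (not real reads): ACACA starting at position 25, 27 and 1.
toy_fastq = io.StringIO(
    toy_record("r1", "G" * 24 + "ACACA" + "G" * 5)
    + toy_record("r2", "G" * 26 + "ACACA" + "G" * 3)
    + toy_record("r3", "ACACA" + "G" * 29)
)
toy_seqs = list(fastq_sequences(toy_fastq))
print("exact start at 25:", count_kmer_starts(toy_seqs, "ACACA", 25))
print("start within 25-29:", count_kmer_starts(toy_seqs, "ACACA", 25, 29))
```

The exact-position call corresponds to the strict count, and the windowed call to the more generous one where the k-mer may start anywhere between 25 and 29.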

        The top k-mer (i.e. the cut site) count is larger than the number of reads present in the FASTQ file. I am starting to wonder whether FastQC might be the problem...

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          I've always found the kmer enrichment graph baffling, due to the unitless Y-axis. The other graphs are useful, but I would not worry too much about this one.

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            According to Simon, the k-mer module in FastQC only tracks 2% of the data for a sample. Perhaps the way it selects those reads (1 in 50) is what is causing this observation.
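To see why counting on 1 read in 50 and scaling up could disagree with an exact count, here is a toy sketch. It is not FastQC's code: the every-50th-read selection and the multiply-by-50 extrapolation are assumptions based on the description above, and the two toy read sets are deliberately extreme.

```python
def exact_count(reads, kmer, pos):
    """Number of reads with `kmer` starting at 1-based position `pos`."""
    k = len(kmer)
    return sum(1 for r in reads if r[pos - 1:pos - 1 + k] == kmer)

def extrapolated_count(reads, kmer, pos, step=50):
    """Count hits on every `step`-th read only, then scale up by `step`."""
    k = len(kmer)
    hits = sum(
        1
        for i, r in enumerate(reads)
        if i % step == 0 and r[pos - 1:pos - 1 + k] == kmer
    )
    return hits * step

# Toy libraries of 1000 reads, each with 20 reads carrying ACACA at position 1.
on_sampled = ["ACACAGG" if i % 50 == 0 else "GGGGGGG" for i in range(1000)]
off_sampled = ["ACACAGG" if i % 50 == 1 else "GGGGGGG" for i in range(1000)]

for name, reads in (("hits on sampled reads", on_sampled),
                    ("hits off sampled reads", off_sampled)):
    print(name, exact_count(reads, "ACACA", 1), extrapolated_count(reads, "ACACA", 1))
```

Both toy libraries contain the same number of hits, yet the extrapolated figure depends entirely on whether the hits fall on the sampled reads. Real data would be far less extreme, so this only shows that such a discrepancy is possible, not that it explains the numbers in this thread.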
