
  • Repetitive kmer profile in RAD seq libraries?

    Hi all,

    I am analysing data from several RAD-seq libraries. The libraries were digested with SbfI and single-end sequenced on an Illumina HiSeq to 200 bp.

    I ran the libraries through FastQC to screen them and get an idea of the quality. Generally the libraries looked fine, with a little filtering and trimming needed. However, the k-mer profile showed a progressive enrichment of TTTTT along the reads:


    After demultiplexing the individuals in the library and removing adapter sequence using the process_radtags module of the Stacks pipeline, this enrichment went away, but the individual k-mer profiles then show the following odd step-wise enrichment:


    Does anyone have any idea what could be causing this? Thanks in advance for any suggestions.

  • #2
    I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?

    Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?

    After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?

    It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



    • #3
      I wonder if you could post the whole FastQC results for the lane. That might give some clues about possible causes. Also, how many samples were multiplexed in the run?



      • #4
        Originally posted by SNPsaurus:
        I saw your tweet about this and was pretty baffled. Just a few random questions... in the top graph, you have a barcode of CTGCT... this is just one demultiplexed file, right, not that the sequencing run had just one sample or one sample dominate?
        The top graph is from a single sequencing run. It does look like this is overrepresentation from a single individual but I checked it in more detail. After demultiplexing, this individual does not have a much larger number of reads than the others; furthermore in 100 K randomly sampled reads from the library, 3.8% have this barcode - again consistent with all the other individuals.

        I'm finding the FastQC output a little confusing here. Does a relative enrichment of 100 mean that this barcode occurs 100 times more often than any other? That doesn't seem to be the case!
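The barcode check described above (what fraction of a random sample of reads carries a given barcode) could be reproduced along these lines. This is only a sketch: the function names are hypothetical, and it assumes an uncompressed four-line-per-record FASTQ in which the barcode is the first bases of each raw read.

```python
import random
from itertools import islice


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for line in islice(fastq_lines, 1, None, 4):
        yield line.strip()


def reservoir_sample(items, k, seed=0):
    """Return up to k items chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def barcode_fraction(seqs, barcode):
    """Fraction of reads that begin with the given barcode."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(s.startswith(barcode) for s in seqs) / len(seqs)
```

With an open file handle `fh` on the raw library, `barcode_fraction(reservoir_sample(read_sequences(fh), 100_000), "CTGCT")` would give a number comparable to the 3.8% quoted above, provided the 5-mer discussed in the thread really is the start of the barcode.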

        Originally posted by SNPsaurus:
        Why do the SbfI kmers (CTGCA/TGCAG/GCAGG) include a final CAGGA? Is the "A" after the cut site that enriched or is it just showing that one because it is slightly more enriched and it is only showing the top 6?
        Yeah this is odd. It does seem that A at that position is enriched. From the 100k reads I randomly selected, this k-mer occurs in ~32% whereas the other possibilities (i.e. CAGGT/CAGGG/CAGGC) are 12-26%.
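One way to look at the base following the cut site directly is to tally it only in reads that start with the SbfI remnant. The sketch below assumes demultiplexed reads that begin with TGCAGG once the barcode has been removed; the function name is made up for illustration. Because it looks at one fixed position, its fractions are not expected to match counts of a k-mer occurring anywhere in the read.

```python
from collections import Counter


def base_after_prefix(seqs, prefix="TGCAGG"):
    """Return the frequency of each base immediately following `prefix`.

    Only reads that start with `prefix` and extend beyond it are counted.
    """
    counts = Counter()
    for s in seqs:
        if s.startswith(prefix) and len(s) > len(prefix):
            counts[s[len(prefix)]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}
```

A strong skew towards A here would point at the genome or the library prep rather than at FastQC's reporting.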

        Originally posted by SNPsaurus:
        After adapter removal you still see the adapter barcode (CTGCT) in the read. What do the sequences look like that have the barcode in the middle of the read?
        Is that the adapter barcode? It doesn't appear in the adapter given to me by the sequencing centre. Unless you mean that this is similar to the kmer seen in the top graph?

        Originally posted by SNPsaurus:
        It might be worth trimming the cut site away as well and doing the kmer enrichment to see what else is in the middle of the reads. I'd like to see some of the actual reads with the tallest kmer peaks (at 25 and 85 in the second graph).
        I grepped out some of those reads from the same individual. Here are those with ACACA at 25-29:

        @2_1202_17969_100758_1
        TGCAGGAACCGCTGACATCCCGACACACACTTCTGCGCCCAGCGCCGAGTTACTCACTCTCCTACAGAACCAAGCAGTGGATCAGCAGGCACACACTTATGCACACAGAGGTTCACATGCAAGCACATGTTCAGGTGCCTCTAGCAACAATACATAGCTGTGCTCTCACTCATTA
        +
        GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGEGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG=GG
        --
        @2_1202_9135_101219_1
        TGCAGGATGCTCAGTATGAAGTGTACACATCCAGCTTTTGCTCGACTGTTTTGCATTATTAGAAGCACACTTTGTTTTTGCTGCTACAGAACAAGCGCAATAGCTGCTTTTTAAGCTGTCTGCAGGCATGAGGCACGTTAACCACCAGACAATTTTTGTTCCCTCAAGTGCTTTT
        +
        GFGGGGGGGGGGGGBGGGGGGGGGGEGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGFFGDEGEGGGGGGDGGGGGGGGGGGGGG0CCGEGGGGGGEGGGGGEGGGGGGGGGGGGGGGGBGCBGG@EGGDGGGGGGGGGGGGEGGGE
        --
        @2_1203_18858_2620_1
        TGCAGGCTCTTTCAAGCCTACAGAACACACATAGGATACATGTCTCATGTACGCCATGTTACATATGTACATTCCACAGTATACTTACTACCATATATGGTAAGGAAGAAGCCGAGAATGTTGTTTATTACATGCTGTAAACTGAGTTTTGTGTAAACCACGTGATCTTATTGTG
        +
        GGGGGGCEGGFGGGG1FGG>1FGGGFGGGGB1=<1@FG1:F1FFGC1DBGGGFGDGGF<1CFGCGGEGGGG>FBDFGC@FDGGC@C@@DG>FG@F00C@:EFGG00=E>DFGG@...:C=0;@FD@@D=FGCGG0CGGGEGD=EGGEGB..88@@@.8;E,<-5B;GEGGGGG55
        --
        Actually counting the reads with these k-mers, they don't seem hugely enriched. For example, for the whole demultiplexed individual, only 0.35% of reads have ACACA at positions 25-29.

        Incidentally, my counts using grep are way off those reported by FastQC. The latter reports 2 377 660 occurrences of ACACA at positions 25-29, but grep returns just 15 506! Even being generous and allowing the ACACA k-mer to start anywhere between 25 and 29 bp still gives only 279 124 reads.

        The count for the top k-mer (i.e. the cut site) is larger than the number of reads present in the FASTQ file. I am starting to wonder whether FastQC might be the problem...
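The grep-style counts above can be cross-checked with a small positional counter. This is a sketch with a hypothetical function name; positions are 1-based as in the FastQC plot, and `window` is the number of extra start positions allowed beyond `start`.

```python
def count_kmer_at(seqs, kmer, start, window=0):
    """Count reads in which `kmer` begins at a 1-based position
    between `start` and `start + window` inclusive.

    Each read is counted at most once, even if the k-mer occurs at
    several positions inside the window.
    """
    n = 0
    for s in seqs:
        for pos in range(start - 1, start + window):  # 0-based offsets
            if s.startswith(kmer, pos):
                n += 1
                break
    return n
```

`count_kmer_at(seqs, "ACACA", 25)` corresponds to the exact-position count, and `count_kmer_at(seqs, "ACACA", 25, window=4)` to the more generous count where the k-mer may start anywhere from 25 to 29. Since each read contributes at most one hit, neither number can exceed the number of reads, which makes a useful sanity check against a reported total.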



        • #5
          I've always found the kmer enrichment graph baffling, due to the unitless Y-axis. The other graphs are useful, but I would not worry too much about this one.



          • #6
            According to Simon, the k-mer module in FastQC only tracks 2% of the data for a sample. Perhaps the way it selects those reads (1 in 50) is what is causing this observation.
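To see how a 1-in-50 subsample could distort a count, here is a toy sketch of systematic sampling followed by extrapolation. It is not FastQC's actual code: taking every 50th read and multiplying the hits by 50 is an assumption made purely for illustration, and the function name is hypothetical.

```python
def extrapolated_count(seqs, kmer, start, step=50):
    """Estimate how many reads carry `kmer` at 1-based position `start`
    by inspecting every `step`-th read and scaling the hits by `step`."""
    sampled_hits = 0
    for i, s in enumerate(seqs):
        if i % step == 0 and s.startswith(kmer, start - 1):
            sampled_hits += 1
    return sampled_hits * step
```

If reads carrying the k-mer happen to line up with the sampled records, the estimate can be many times the true count; if they fall between sampled records, it can be zero. Comparing this estimate with an exact count on the same file would show whether subsampling alone can account for a discrepancy of the size reported in this thread.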

