  • Kmer spectrum question

    Hi all,

    This is my first post on the forum, and I am new to genomic analysis so please bear with me.

    I am doing a k-mer analysis (k=17) with jellyfish on Illumina reads from 250bp, 500bp and 800bp insert libraries:

    jellyfish count -m 17 -o a.out -C -c 7 -s 1000000000 -t 24 a.fas

    The kmer spectra look normal when I run the analysis separately for each library: a huge number of kmers represented only once or twice (read errors?), a single mode, and a long tail out to the right (non-orthologous kmers arising from repetitive elements?).

    Here is the rub. Only the 250bp analysis yields a sensible estimate of genome size (the independent estimate is 1.8 Gb), calculated as number of kmers/peak/2, and when I combine the spectra

    jellyfish merge -o 250+500.out 250.out 500.out

    I get two peaks. I would have thought that Illumina runs on the same samples using different short-insert libraries would be sampling the same overall sequence, so the unimodal spectra should combine to yield a unimodal spectrum.
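To make that expectation concrete, here is a minimal sketch, not the actual data. It assumes each library's main peak is roughly Poisson-shaped and that a merge sums the counts of identical k-mers across the input files; the coverages 20.5 and 45.2 are made-up values. Under those assumptions a per-k-mer merge gives one peak near the summed coverage, whereas simply adding two histograms bin by bin gives two peaks.

```python
import math

def poisson_pmf(lam, max_depth):
    """Return [P(X = d) for d in 0..max_depth] for X ~ Poisson(lam)."""
    return [math.exp(-lam + d * math.log(lam) - math.lgamma(d + 1))
            for d in range(max_depth + 1)]

def modes(spectrum, min_depth=3):
    """Depths that are strict local maxima, skipping the low-depth region."""
    return [d for d in range(min_depth, len(spectrum) - 1)
            if spectrum[d] > spectrum[d - 1] and spectrum[d] > spectrum[d + 1]]

MAX_DEPTH = 150
COV_A, COV_B = 20.5, 45.2  # hypothetical k-mer coverages of two libraries

spec_a = poisson_pmf(COV_A, MAX_DEPTH)
spec_b = poisson_pmf(COV_B, MAX_DEPTH)

# (1) Adding the two histograms bin by bin: each library keeps its own peak.
added = [a + b for a, b in zip(spec_a, spec_b)]

# (2) Per-k-mer merge: a k-mer seen i times in A and j times in B ends up
#     at depth i + j, so the merged spectrum is the convolution of the two.
merged = [sum(spec_a[i] * spec_b[d - i] for i in range(d + 1))
          for d in range(MAX_DEPTH + 1)]

print("modes when histograms are added:", modes(added))
print("modes after per-k-mer merge:   ", modes(merged))
```

If a merged spectrum from real data is still bimodal, that points away from a pure coverage-scaling effect and towards the libraries differing in content.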

    Any Illumina buffs or bioinformaticists out there who can shed some light on what might be happening here?

    I have attached a file with the spectra.
    kmer_spectra.pdf

  • #2
    Your 250 bp library looks a little weird (nearly bimodal - or a very "broad" peak).

    Other than that you're right, although you should expect up to 4 peaks:

    1. Depth 1-2: sequencing errors
    2. The heterozygous peak - a small bump at low depth from kmers covering heterozygous positions
    3. The large peak - the typical coverage (the one you clearly see in your 500bp library) - used for the genome size estimate
    4. The repeat peak - a small bump at high depth from kmers covering repeat regions

    This is assuming that a random 17mer is typically unique in the genome. But no matter what: two libraries may scale differently (i.e. different coverage due to library size differences), but the shape of the kmer spectrum should NOT be different, and in your case it is.
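As a rough sketch of how the large peak feeds a genome size estimate: the code below assumes the spectrum is available as two-column text (depth, number of distinct kmers), which is the layout a histogram export is assumed to have here. The function names and the toy histogram are hypothetical. The `divisor` argument is only there because the formula quoted earlier in the thread divides by 2; no reason for that factor is given, so it defaults to 1.

```python
def parse_histo(text):
    """Parse 'depth count' lines into a {depth: distinct_kmers} dict."""
    histo = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            histo[int(fields[0])] = int(fields[1])
    return histo

def valley_and_peak(histo):
    """Walk down the error slope to the first local minimum (the valley),
    then take the depth with the most kmers at or beyond it as the peak."""
    depths = sorted(histo)
    valley = depths[0]
    for prev, cur in zip(depths, depths[1:]):
        if histo[cur] > histo[prev]:
            valley = prev
            break
    beyond = [d for d in depths if d >= valley]
    return valley, max(beyond, key=lambda d: histo[d])

def genome_size(histo, divisor=1):
    """Total kmer occurrences beyond the error region / peak depth / divisor."""
    valley, peak = valley_and_peak(histo)
    total = sum(d * n for d, n in histo.items() if d >= valley)
    return total / peak / divisor

# Toy spectrum: error kmers at depth 1-2, main peak at depth 5.
toy = parse_histo("1 1000\n2 100\n3 10\n4 20\n5 40\n6 20\n7 10\n")
print(valley_and_peak(toy), genome_size(toy))
```

This crude peak finder will pick the heterozygous bump instead of the main peak if that bump happens to be taller, so it is worth checking the chosen depth against the plot.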

    Have you run a quality check on the two libraries (fastqc?)



    • #3
      OK, thanks for that advice. I installed fastqc and ran the 250 and 500 fastq files through it, and all looks good. I have attached examples of the fastqc output.

      Maybe there is a double peak in there in the 250 set (leading to the "broad peak") and the double peak becomes better defined as I add in more data from the 500 and 800bp reads. There may be no problem at all?

      Maybe I could do a subtraction between the 250 kmer set and the 500 kmer set to see if there is any systematic difference in representation. Might that clear the issue up? Any idea how to do such a subtraction on jellyfish output files?
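One possible shape for such a subtraction, as a sketch only: it assumes each count file has first been exported to two-column text (kmer, count), and it holds everything in memory, which would not scale to a 1.8 Gb genome without a sorted, streaming comparison. All function names and the toy counts are hypothetical.

```python
def parse_dump(lines):
    """Parse 'KMER COUNT' lines into a {kmer: count} dict."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts

def normalized_difference(counts_a, counts_b):
    """Per-kmer difference in relative depth, after scaling each library
    by its own total count so that coverage differences cancel out."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    return {kmer: counts_a.get(kmer, 0) / total_a - counts_b.get(kmer, 0) / total_b
            for kmer in counts_a.keys() | counts_b.keys()}

def only_in(counts_a, counts_b, min_depth=3):
    """Kmers well supported in library A (depth >= min_depth) but absent from B."""
    return sorted(k for k, n in counts_a.items()
                  if n >= min_depth and k not in counts_b)

# Toy example with 3-mers standing in for 17-mers.
lib_250 = parse_dump(["AAA 10", "CCC 5", "GGG 1"])
lib_500 = parse_dump(["AAA 20", "TTT 4"])
print(only_in(lib_250, lib_500))
print(normalized_difference(lib_250, lib_500)["AAA"])
```

A large set of well-supported kmers that appear in one library only would be a hint that the libraries are not sampling the same sequence.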


      • #4
        I think there is a problem - but maybe that relates to the genome of your sample(?)

        Is it a secret organism - or can you reveal anything? I thought a little more and I have an uglier suggestion: contamination (if you're sampling two genomes with different coverage, you'll also get two peaks).

        I would probably try to assemble it (if it's an unknown organism) and then remap all the 500bp library reads to the assembly - the scaffolds that attract reads are from your target organism.

        The scaffolds that get hits only from the 250bp library and NOT from the 500bp library are then the "contaminant", which you can blast to check.
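The scaffold check could be sketched like this, assuming you already have per-scaffold mapped read counts for each library as dicts. The scaffold names, the counts and the `min_reads` threshold are all hypothetical; the threshold is an added assumption to tolerate a few stray mappings, whereas the suggestion above is a strict hits versus no hits split.

```python
def classify_scaffolds(hits_250, hits_500, min_reads=10):
    """Label each scaffold by which library supports it.

    hits_250 / hits_500: {scaffold_name: mapped_read_count}
    """
    labels = {}
    for scaffold in sorted(set(hits_250) | set(hits_500)):
        n250 = hits_250.get(scaffold, 0)
        n500 = hits_500.get(scaffold, 0)
        if n500 >= min_reads:
            labels[scaffold] = "target"                 # supported by the 500bp library
        elif n250 >= min_reads:
            labels[scaffold] = "candidate_contaminant"  # 250bp library only
        else:
            labels[scaffold] = "low_coverage"           # too few reads to call
    return labels

# Made-up counts for three scaffolds.
hits_250 = {"scaf1": 500, "scaf2": 300, "scaf3": 2}
hits_500 = {"scaf1": 800, "scaf3": 1}
print(classify_scaffolds(hits_250, hits_500))
```

Scaffolds labelled as candidate contaminants are the ones to blast.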

        A lot of work - maybe it's not worth it - depends on your question/project.

        On topic: I don't know how to subtract two jellyfish kmer spectra.
