  • subuhikhan
    Junior Member
    • Dec 2011
    • 4

    Kmer content

    Hello,

    I recently got my Illumina RNA sequencing dataset back and used FastQC to check its quality. What is Kmer content, and what is its significance?

    Thank you
    Subuhi
  • cllorens
    Member
    • Nov 2011
    • 44

    #2
    Hello

    A k-mer is a motif (or small word) of length k in a genomic or sequenced sequence; the interesting ones are usually those observed more than once. The order of the k-mer is defined by its word size.

    Examples for k = 2, 3, and 4:

    For repeats:

    acacacacacac.. (these are "AC" dinucleotides)

    gacgacgacgacgac (these are "GAC" trinucleotides)

    For spaced occurrences:

    tttccGAGGaaggcgtagcgacgacGAGGaagcctca (these are "GAGG" tetranucleotides)


    The content is the number of times the k-mer occurs in the sequence, and its distribution tells you how enriched a genomic sequence is for a particular k-mer.
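To make the counting concrete, here is a minimal Python sketch that tallies every overlapping word of length k in a sequence. It is not taken from FastQC or any other tool; the function name `kmer_counts` is made up for illustration, and the input sequences are the examples given above.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    seq = sequence.upper()
    # A sequence of length L holds L - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# The spaced-occurrence example: "GAGG" appears twice.
seq = "tttccGAGGaaggcgtagcgacgacGAGGaagcctca"
counts = kmer_counts(seq, 4)
print(counts["GAGG"])  # 2

# The repeat examples: one 2-mer and one 3-mer dominate the counts.
print(kmer_counts("acacacacacac", 2).most_common(1))     # [('AC', 6)]
print(kmer_counts("gacgacgacgacgac", 3).most_common(1))  # [('GAC', 5)]
```

A k-mer that turns up far more often than the others, as "AC" does in the first repeat, is the kind of enrichment the counts are meant to expose.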

    Because you can search for k-mers of any size (the concept extends to larger words), the uses are diverse: searching for and masking repeats and mobile elements, preprocessing FASTQ files, de novo assembly, and so on.

    This is a very short explanation covering just the basics, but it should help you when reading papers on software and pipelines that use k-mers to search for repeats and mobile elements de novo, and of course the papers and manuals for de novo assembly software.

    Best
    Carlos

    • xlzhang
      Junior Member
      • Nov 2011
      • 6

      #3
      Hi, Carlos

      I have used SOAPdenovo, and the minimum length of its contigs is the value of Kmer. I also used Cortex; its result file contains the following string: lst_kmer:ATATTTTCTTACATGTTCCAAGGGT. I want to have a deeper understanding of Kmer.

      I am a beginner. Thanks for your help.

      • Zam
        Member
        • Apr 2010
        • 51

        #4
        Hi there

        1. Kmers are just words (chunks of sequence) of length k.
        2. The current version of Cortex contains some unnecessary stuff in the output.
        This text
        lst_kmer:ATATTTTCTTACATGTTCCAAGGGT

        just tells you the last kmer in the contig. "lst" stands for last.
        fst_kmer is the first kmer. It was once useful, but is not any more, and I have just removed it from Cortex - when I make the next release, it will be gone.

        Sorry about this. I've been meaning to remove it for a while; it just confuses new users.
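As a rough illustration of what "first" and "last" k-mer mean, here is a small Python sketch. It is not Cortex code: the helper `first_and_last_kmer` and the contig are hypothetical, built only so that the contig ends in the 25-base k-mer quoted above.

```python
def first_and_last_kmer(contig, k):
    """Return the first and last words of length k in a contig."""
    if len(contig) < k:
        raise ValueError("a sequence shorter than k contains no k-mer")
    return contig[:k], contig[-k:]


LAST = "ATATTTTCTTACATGTTCCAAGGGT"  # the lst_kmer string from the thread
k = len(LAST)                       # 25 bases, so k = 25 in this sketch

# Hypothetical contig: eight made-up bases followed by the k-mer above.
contig = "GGCATTAC" + LAST

fst, lst = first_and_last_kmer(contig, k)
n_kmers = len(contig) - k + 1  # number of overlapping k-mers in the contig

print(fst)      # first k-mer
print(lst)      # last k-mer, identical to LAST
print(n_kmers)  # 9
```

Since a sequence of length L holds L - k + 1 overlapping k-mers, anything built from at least one k-mer is at least k bases long, which fits the observation that the shortest contigs are k bases.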

        • cllorens
          Member
          • Nov 2011
          • 44

          #5
          Hi Zhang

          In addition to Zam's comments (as Zam says, k-mers are words of a particular size that you can find repeated in a genome with a frequency that depends on their size), I am attaching some references on different topics that use k-mers, in case you want to dig deeper.

          I hope you enjoy them
          Carlos

          • cllorens
            Member
            • Nov 2011
            • 44

            #6
            Here is another interesting reference I forgot to attach in the post above.

            • xlzhang
              Junior Member
              • Nov 2011
              • 6

              #7
              Thanks, Zam

              So, what is the meaning of "fst_r:GT fst_f:G" and "lst_r:A lst_f:AT"? I thought "r" stood for reverse and "f" stood for forward. Am I right?

              If I want to get a consensus assembly from a set of reads that may contain a structural variant (SV), should I use Cortex_con or Cortex_var? I don't understand the fundamental difference between the two.

              And if I run with different k-mer sizes, which should I use to judge the better result: "length" or "average_coverage"?

              Thank you for your answer!
              Last edited by xlzhang; 03-04-2012, 07:55 PM.

              • xlzhang
                Junior Member
                • Nov 2011
                • 6

                #8
                Thanks, Carlos.

                • Zam
                  Member
                  • Apr 2010
                  • 51

                  #9
                  Hi xlzhang

                  "fst_r:GT fst_f:G" and "lst_r:A lst_f:AT"

                  This describes the edges going in/out of the contig at the first/last nodes.
                  The first node has G and T edges going out in the reverse-complement direction and a G edge going out forwards. The last node has A and T going out forwards and A in the reverse direction. You don't need to pay attention to this for most uses, though.
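For anyone curious how a node can have edges in two directions, here is a toy Python sketch of a k-mer graph in which each node is stored once, as the smaller of a k-mer and its reverse complement. This is only an illustration under that assumption: the function names, the reads, and k = 5 are all made up, and Cortex's own data structures and output conventions may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Store each k-mer as the smaller of itself and its reverse complement."""
    return min(kmer, revcomp(kmer))


def build_kmer_set(reads, k):
    """Collect the canonical form of every k-mer seen in the reads."""
    return {canonical(read[i:i + k])
            for read in reads
            for i in range(len(read) - k + 1)}


def out_edges(kmer, kmers):
    """Bases that extend kmer by one position to another k-mer in the set."""
    return "".join(b for b in "ACGT" if canonical(kmer[1:] + b) in kmers)


def node_edges(kmer, kmers):
    """Outgoing bases read forwards ('f') and on the reverse strand ('r')."""
    return {"f": out_edges(kmer, kmers),
            "r": out_edges(revcomp(kmer), kmers)}


# Hypothetical reads: ACGGA is followed by T or C, and preceded by T.
reads = ["ACGGAT", "ACGGAC", "TACGGA"]
kmers = build_kmer_set(reads, 5)
print(node_edges("ACGGA", kmers))  # {'f': 'CT', 'r': 'A'}
```

The T that precedes ACGGA in the third read shows up as an A in the reverse direction, because leaving the node on the opposite strand means reading the complement of the preceding base.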

                  As for cortex_con versus cortex_var, the fundamental difference is one of goal. Con is for making a consensus/haploid assembly of a single whole genome; it deals with one sample. Var is for assembling polymorphism, in one or many samples. If you have a set of reads which you know are precisely the reads for an alternate haplotype/SV, then you have effectively reduced your problem to a haploid one, and I would try cortex_con (or any standard assembler of your choice, depending a bit on the size of your region). If you have a set of reads from a structurally variant region, from a sample which might be heterozygous, I would try cortex_var. There is a Cortex_var Google group where you could post more detailed questions if you like.

                  best wishes

                  Zam

                  • xlzhang
                    Junior Member
                    • Nov 2011
                    • 6

                    #10
                    You've given me a lot of help. Thank you.
