  • Kmer content vs overrepresented sequences

    What is the difference between Kmer content and overrepresented sequences in FastQC?

  • #2
    Well, the manual says ...

    "The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences, but there are a different subset of problems where it will not work.

    If you have very long sequences with poor sequence quality then random sequencing errors will dramatically reduce the counts for exactly duplicated sequences.
    If you have a partial sequence which is appearing at a variety of places within your sequence then this won't be seen either by the per base content plot or the duplicate sequence analysis.
    The Kmer module starts from the assumption that any small fragment of sequence should not have a positional bias in its appearance within a diverse library. There may be biological reasons why certain Kmers are enriched or depleted overall, but these biases should affect all positions within a sequence equally. This module therefore measures the number of each 7-mer at each position in your library and then uses a binomial test to look for significant deviations from an even coverage at all positions. Any Kmers with positionally biased enrichment are reported. The top 6 most biased Kmer are additionally plotted to show their distribution."
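To make the binomial test mentioned in the manual more concrete, here is a minimal Python sketch. It is not FastQC's actual code. It assumes all reads have the same length, so that under "even coverage" a Kmer is equally likely to start at any position, and the function names are made up for illustration.

```python
from math import comb


def binomial_upper_tail(observed: int, total: int, p: float) -> float:
    """Exact P(X >= observed) for X ~ Binomial(total, p)."""
    return sum(
        comb(total, i) * p**i * (1 - p) ** (total - i)
        for i in range(observed, total + 1)
    )


def positional_bias_pvalue(counts_by_position: list[int], position: int) -> float:
    """P-value for seeing at least this many copies of one Kmer at `position`.

    counts_by_position[i] is how often the Kmer starts at position i (0-based).
    Assumption: with no positional bias, each position is equally likely.
    """
    total = sum(counts_by_position)
    p_even = 1 / len(counts_by_position)
    return binomial_upper_tail(counts_by_position[position], total, p_even)
```

A small p-value for some position suggests the Kmer is enriched there. A real tool would also have to correct for testing many Kmers at many positions, which this sketch leaves out.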


    • #3
      I see, but I am still confused. As I understand it, each position in a read is allocated one nucleotide, so what does "measures the number of each 7-mer at each position" mean?

      For example, at position 5 in the Kmer content graph there is a peak for CGCCG. What does that mean? My knowledge is very basic, and I think position 5 should be allocated only one of A, C, G or T.


      • #4
        Originally posted by Saeideh
        I see, but I am still confused. As I understand it, each position in a read is allocated one nucleotide, so what does "measures the number of each 7-mer at each position" mean?

        For example, at position 5 in the Kmer content graph there is a peak for CGCCG. What does that mean? My knowledge is very basic, and I think position 5 should be allocated only one of A, C, G or T.
        I am not sure what you mean by the word 'allocated'. We may be having difficulties with English, so I cannot answer your question directly. However, I can try to explain what is happening, at least as far as I understand it.

        Take a read. At base (position) #1 it will have 1 of the 16,384 (4^7) possible 7-mers. At position #2 it will have a (most likely) different one of the 16,384 7-mers. And so on for the entire read.

        Do the same for all of the other reads.

        Then look at position #1. Are any of the 16,384 7-mers found statistically more often at position #1 than at the other positions? If so, report it. Ditto for all of the other positions: report any 7-mers which are found statistically more often at that given position than at any of the other positions.
        Last edited by westerman; 09-16-2015, 08:36 AM. Reason: Better use of the word 'position' instead of 'base'.
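The counting procedure described in this post can be sketched in a few lines of Python. This is only an illustration of the idea, not how FastQC is implemented; the three reads are made up, and `kmers_by_position` is a hypothetical helper name.

```python
from collections import Counter

K = 7  # Kmer length used in the manual's description


def kmers_by_position(reads, k=K):
    """Return one Counter per start position (0-based) tallying the Kmers seen there."""
    longest = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(max(longest - k + 1, 0))]
    for read in reads:
        # Slide a window of length k along the read, one base at a time.
        for start in range(len(read) - k + 1):
            counts[start][read[start:start + k]] += 1
    return counts


# Made-up reads that all begin with the same 7-mer.
reads = ["ACGGTCGGTCG", "ACGGTCGTTAG", "ACGGTCGAATC"]
counts = kmers_by_position(reads)

print(4 ** K)                     # 16384 possible 7-mers
print(counts[0].most_common(1))   # [('ACGGTCG', 3)]
```

Here every read contributes one 7-mer per start position, and the table for the first position shows ACGGTCG three times, which is the kind of positional enrichment the module looks for.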


        • #5
          Originally posted by Saeideh
          For example, at position 5 in the Kmer content graph there is a peak for CGCCG. What does that mean? My knowledge is very basic, and I think position 5 should be allocated only one of A, C, G or T.
          It means the kmer BEGINNING at position 5 (i.e., 5=C, 6=G, 7=C, 8=C, 9=G).


          • #6
            Am I right?

            Originally posted by westerman
            I am not sure what you mean by the word 'allocated'. We may be having difficulties with English, so I cannot answer your question directly. However, I can try to explain what is happening, at least as far as I understand it.

            Take a read. At base (position) #1 it will have 1 of the 16,384 (4^7) possible 7-mers. At position #2 it will have a (most likely) different one of the 16,384 7-mers. And so on for the entire read.

            Do the same for all of the other reads.

            Then look at position #1. Are any of the 16,384 7-mers found statistically more often at position #1 than at the other positions? If so, report it. Ditto for all of the other positions: report any 7-mers which are found statistically more often at that given position than at any of the other positions.
            ------------------------
            Rick, based on what you said, this is what I understand: at each position there might be 7 consecutive bases which are repeated more often, like this:

            Read1: ACGGTCGGTCG
            Read2: GTACCTGTAGC
            Read3: CGGTGCTGGTC
            Read4: CGTTAGCTTCG
            Read5: CGTAAGCTTGC
            Read6: CGTGGACGGAT
            Read7: GGGTCGGCTTA
            Read8: TTTTTCGTCGC
            Read9: CTGAGTTGGGC
            Read10: ACGCCCGGTCG
            Read11: GTACCTGTAGC
            Read12: CGGTGCTGGTC
            Read13: CTTTAGCTTCG
            Read14: CGTAAGAATGC
            Read15: CGTGGACGGAT
            Read16: GCGTCTATTAA

            At position 1 of these 16 reads, "AGCCCCG" is repeated.

            Is that right?


            • #7
              You're misunderstanding Rick. Let's take the example of ACGGTCGGTCG. Its 7-mers are:

              Code:
              ACGGTCG     position 1
               CGGTCGG    position 2
                GGTCGGT   position 3
                 GTCGGTC  position 4
                  TCGGTCG position 5
              Every other read will have similar 7-mers, again, one starting at each position. The whole point of this is to see if you have a bias of some sequence at a given position.
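For anyone who wants to reproduce the listing above for their own reads, a short Python sketch follows. The function name `positional_kmers` is made up for illustration; positions are reported 1-based to match the thread.

```python
def positional_kmers(read: str, k: int = 7):
    """Return (1-based start position, Kmer) pairs for every window of length k."""
    return [(i + 1, read[i:i + k]) for i in range(len(read) - k + 1)]


# Print the 7-mers of the example read as a staircase, one per start position.
for pos, kmer in positional_kmers("ACGGTCGGTCG"):
    print(" " * (pos - 1) + kmer, f"position {pos}")
```

An 11-base read has 11 - 7 + 1 = 5 windows, which is why the listing stops at position 5. Passing `k=5` gives the 5-mers instead, so a 5-mer such as CGCCG "at position 5" is simply the window covering bases 5 to 9.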


              • #8
                Originally posted by dpryan
                You're misunderstanding Rick. Let's take the example of ACGGTCGGTCG. Its 7-mers are:

                Code:
                ACGGTCG     position 1
                 CGGTCGG    position 2
                  GGTCGGT   position 3
                   GTCGGTC  position 4
                    TCGGTCG position 5
                Every other read will have similar 7-mers, again, one starting at each position. The whole point of this is to see if you have a bias of some sequence at a given position.
                --------
                Thank you, Devon.

                Now I know what a Kmer is.

                I have attached the Kmer content graph from my FastQC output. Would you please explain how to relate it to your explanation of Kmers? (For example, what are the peaks in the graph? And at the 2nd position I have G, but why is one higher and the other lower in the graph?)
                Attached Files


                • #9
                  Honestly, that's just recapitulating the "Per sequence GC content" results, which probably indicate an abundance of high-GC content reads. The graphing method is actually a bit weird in my opinion. It graphs the observed/expected ratio of the given kmers at each position. However, it scales everything so that the range is 0-100. The graph isn't flat because there's always going to be variability in things like this.

                  As I said, in your case the graph doesn't mean much of anything. In other cases, people sometimes observe interesting patterns in graphs like this. For example, one of the bench scientists I work with was recently looking for the binding sites (and hopefully the accompanying motif) of a particular protein. I could actually see the motif in FastQC graphs like this (well, the per-base sequence content was more useful, but the motif popped up in this graph too).
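As a rough illustration of what an observed/expected curve scaled to 0-100 could look like, here is a Python sketch. It is one plausible reading of the description in this post, not FastQC's actual calculation: it assumes the expected count is the Kmer's total spread evenly over all positions, and that scaling means dividing by the largest value. Both function names are hypothetical.

```python
def obs_exp_by_position(counts):
    """Observed/expected ratio per position for one Kmer.

    Assumption: expected = total count spread evenly over all positions.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("Kmer was never observed")
    expected = total / len(counts)
    return [c / expected for c in counts]


def scale_to_percent(values):
    """Rescale so the largest value becomes 100 (assumed scaling rule)."""
    peak = max(values)
    return [100 * v / peak for v in values]


# Made-up counts: the Kmer starts at the first position far more often.
ratios = obs_exp_by_position([10, 2, 2, 2, 2, 2])
curve = scale_to_percent(ratios)
```

With counts like these the curve has a single peak at the first position and is low and flat elsewhere. With real data, small ups and downs appear at every position simply from sampling variability, which is the point made above about the graph never being flat.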


                  • #10
                    I can't understand the graph.

                    If someone gives me a Kmer content graph, I can't analyze it. I can only say which Kmers are repeated many times, based on the table beside the graph.
