  • SPAdes: erroneous kmer threshold

    Hello Members,

    I'm working with paired-end Illumina data from E. coli, using SPAdes version 3.5.
    The environment is SLURM, on a node with 128 GB of RAM.

    I checked the QUAST report for one of the isolates, and the assembly size is alarmingly high for E. coli: 6.592727 Mb, with 918 contigs (>= 1000 bp).

    Code:
    zcat file_R1_001.fastq.gz | awk '{if(NR%4==2) print length($1)} ' | head -1
    Read length: 103

    I traced this back to the warnings.log file for its assembly run, which says:

    === Error correction and assembling warnings:
    * 0:03:13.851 484M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 1
    * 0:02:26.558 620M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 4
    * 0:02:26.565 620M / 9G WARN General (kmer_coverage_model.cpp : 366) Failed to determine erroneous kmer threshold. Threshold set to: 4
    ======= Warnings saved to $HOME/Docs/warnings.log
    I have no idea what is causing this, or how.
    Meanwhile, I'll check the FastQC report and see whether it makes sense to me.

    Any guidance would be of great help.

    Thank you.
    Last edited by bio_informatics; 05-14-2015, 07:04 AM.
    Bioinformaticscally calm

  • #2
    With such a large assembly, I'd suspect contamination. I suggest you look at the GC vs coverage distribution of contigs; you may get two distinct clouds for different organisms. Also blasting them may help figure it out.

    If this is an isolate rather than single cell, a kmer frequency histogram could also indicate the presence of multiple organisms. None of these will help much if it's two strains of E. coli, though.


    • #3
      Hi Brian,

      Thanks for your reply.
      I checked the FastQC reports; they looked reasonably good, without much of a quality drop.
      It's an isolate.

      - Is there a free tool to check the GC vs coverage distribution?
      - Should I blast the whole assembly against NCBI?
      - Would a kmer frequency histogram from KmerGenie be useful?

      I apologize for such naive questions. I haven't come across this situation before.

      Thanks for your guidance.

      Originally posted by Brian Bushnell
      With such a large assembly, I'd suspect contamination. I suggest you look at the GC vs coverage distribution of contigs; you may get two distinct clouds for different organisms. Also blasting them may help figure it out.

      If this is an isolate rather than single cell, a kmer frequency histogram could also indicate the presence of multiple organisms. None of these will help much if it's two strains of E. coli, though.
      Bioinformaticscally calm


      • #4
        Was this single cell, or isolate? If single-cell, a kmer-frequency histogram won't help, but it will for isolates. KmerGenie produces the wrong kind of histogram; the kind you need shows the number of unique kmers at each depth for a fixed kmer length. You can generate that using BBNorm like this:

        khist.sh in=reads.fq hist=histogram.txt
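
To illustrate how such a histogram might be read, here is a minimal Python sketch that looks for more than one coverage peak. It assumes the file is tab-separated with the depth in the first column and the unique-kmer count in the last column, and that header lines start with `#`; check this against your actual `histogram.txt`. The function names, the thresholds, and the toy data are hypothetical, and a real histogram is noisy enough that it would probably need smoothing first.

```python
import io

# Toy histogram in the ASSUMED layout: depth, raw count, unique kmers.
EXAMPLE_HIST = (
    "#Depth\tRaw_Count\tUnique_Kmers\n"
    "1\t9000\t9000\n"
    "2\t1000\t500\n"
    "3\t300\t100\n"
    "4\t1200\t300\n"
    "5\t4500\t900\n"
    "6\t2400\t400\n"
    "7\t1400\t200\n"
    "8\t4800\t600\n"
    "9\t1350\t150\n"
    "10\t200\t20\n"
)

def read_khist(handle):
    """Return a list of (depth, unique_kmers) pairs, skipping '#' header lines."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        rows.append((int(fields[0]), int(fields[-1])))
    return rows

def find_peaks(rows, min_depth=5, min_count=1000):
    """Return depths that are local maxima of the unique-kmer count.

    Depths below min_depth are skipped because the low-depth spike is
    mostly sequencing-error kmers; min_count filters out tiny bumps.
    """
    peaks = []
    for i in range(1, len(rows) - 1):
        depth, count = rows[i]
        if depth < min_depth or count < min_count:
            continue
        if count > rows[i - 1][1] and count >= rows[i + 1][1]:
            peaks.append(depth)
    return peaks

if __name__ == "__main__":
    rows = read_khist(io.StringIO(EXAMPLE_HIST))
    print(find_peaks(rows, min_depth=2, min_count=50))
```

Two well-separated peaks in the toy data stand in for two organisms at different coverage. A single clean peak would not rule out contamination by a close relative at similar depth.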

        The GC versus coverage plot can be generated fairly easily with BBMap:

        bbmap.sh ref=assembly.fa in=reads.fq covstats=covstats.txt fast

        The covstats file will list the length, coverage, and gc content of all contigs.
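
As a rough sketch of what to do with that file, the Python below flags contigs whose GC content is far from what you'd expect for E. coli (about 51%). It assumes a tab-separated table with a header row containing the columns `#ID`, `Avg_fold`, `Length`, and `Ref_GC`, with GC given as a fraction; verify those names against your own `covstats.txt`. The GC window, the length cutoff, and the toy data are arbitrary assumptions for illustration.

```python
import csv
import io

# Toy table in the ASSUMED covstats layout (real files have more columns).
EXAMPLE_COVSTATS = (
    "#ID\tAvg_fold\tLength\tRef_GC\n"
    "contig_1\t85.2\t152000\t0.5080\n"
    "contig_2\t12.4\t48000\t0.3510\n"
    "contig_3\t80.1\t900\t0.4000\n"
    "contig_4\t90.7\t23000\t0.5150\n"
)

def read_covstats(handle):
    """Yield (contig, length, avg_fold, gc) tuples from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield (
            row["#ID"],
            int(row["Length"]),
            float(row["Avg_fold"]),
            float(row["Ref_GC"]),
        )

def flag_gc_outliers(contigs, gc_range=(0.48, 0.54), min_length=1000):
    """Return contigs of at least min_length whose GC lies outside gc_range."""
    low, high = gc_range
    return [
        c for c in contigs
        if c[1] >= min_length and not (low <= c[3] <= high)
    ]

if __name__ == "__main__":
    contigs = list(read_covstats(io.StringIO(EXAMPLE_COVSTATS)))
    for name, length, fold, gc in flag_gc_outliers(contigs):
        print(f"{name}\tlen={length}\tcov={fold}\tgc={gc}")
```

Plotting `Ref_GC` against `Avg_fold` for all contigs gives the "two clouds" picture; the filter above is only a quick text-based stand-in for that plot.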

        The simplest thing to do, though, is probably to blast the entire assembly against nt and see what you get. I have never personally done that, though; I think when we do it here we use some kind of wrapper that summarizes which taxa are hit in which amounts. I'm not sure how complicated that wrapper is; I only interact with BLAST via a browser, one sequence at a time.
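
This is not the wrapper mentioned above, only a minimal Python sketch of the same idea: keep the best hit per contig and add up contig lengths per taxon. It assumes BLAST+ tabular output requested as `-outfmt "6 qseqid qlen bitscore sscinames"` (the taxonomy names need the BLAST taxonomy database to be available), so each row has exactly those four tab-separated fields. The function names and toy rows are hypothetical.

```python
import io
from collections import defaultdict

# Toy rows in the ASSUMED layout: qseqid, qlen, bitscore, sscinames.
EXAMPLE_BLAST = (
    "contig_1\t152000\t2500\tEscherichia coli\n"
    "contig_1\t152000\t1800\tShigella flexneri\n"
    "contig_2\t48000\t900\tStaphylococcus aureus\n"
    "contig_4\t23000\t1200\tEscherichia coli\n"
)

def best_hits(handle):
    """Return {contig: (bitscore, contig_length, taxon)} for the top-scoring hit."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        qseqid, qlen, bitscore, taxon = line.split("\t")[:4]
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, int(qlen), taxon)
    return best

def bases_per_taxon(best):
    """Sum contig lengths by the taxon of each contig's best hit."""
    totals = defaultdict(int)
    for _score, qlen, taxon in best.values():
        totals[taxon] += qlen
    return dict(totals)

if __name__ == "__main__":
    totals = bases_per_taxon(best_hits(io.StringIO(EXAMPLE_BLAST)))
    for taxon, bases in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{bases}\t{taxon}")
```

Using the best hit alone is a simplification: close relatives such as Shigella can outscore E. coli on individual contigs, so the per-taxon totals are a rough guide rather than a classification.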


        • #5
          Not sure if I am following the thought here. Blasting will identify contamination because some of those 918 contigs would not have E. coli as the best hit?


          • #6
            Originally posted by GenoMax
            Not sure if I am following the thought here. Blasting will identify contamination i.e. some of those 918 contigs would not have E coli as the best hit?
            Yep. E. coli should only account for ~4.5 Mbp of the assembly, so the remaining ~2 Mbp is probably either misassembly or contamination. That's enough for another complete (small) genome.
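
The arithmetic behind that estimate, as a small Python sketch. The 6.592727 Mbp total is the QUAST figure from the first post and 4.5 Mbp is the rough expected size quoted here; the function name is hypothetical.

```python
def excess_assembly(assembly_mbp, expected_mbp):
    """Return (excess in Mbp, excess as a fraction of the whole assembly)."""
    excess = assembly_mbp - expected_mbp
    return excess, excess / assembly_mbp

if __name__ == "__main__":
    excess, fraction = excess_assembly(6.592727, 4.5)
    print(f"excess: {excess:.2f} Mbp ({fraction:.0%} of the assembly)")
```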


            • #7
              Or one or more plasmids.

              But this may get complicated since, depending on how good the assembly is, E. coli may not be the top/best/only hit. We shall see what the OP finds.


              • #8
                @Genomax, Brian: Thank you for your inputs.
                I spoke with my teammates. The data we have is contaminated. The isolate will be re-sequenced in a couple of weeks.

                I skipped the k-mer and GC vs coverage plots, and BLAST.

                Thanks again for your valuable suggestions, and guidance.
                Bioinformaticscally calm


                • #9
                  The plot that Brian mentioned can be produced with GAEMR (http://www.broadinstitute.org/softwa...erence-manual/). See an example of the plot in Figure 6.1, 'Blast Bubbles'.


                  • #10
                    @boetsie:
                    Thanks for pointing me to this tool.
                    It looks like it has a whole lot of utility.

                    Originally posted by boetsie
                    The plot that Brian mentioned can be produced with GAEMR (http://www.broadinstitute.org/softwa...erence-manual/). See an example of the plot at Figure 6.1 'Blast Bubbles'
                    Bioinformaticscally calm
