  • BLAST contamination search help

    Hi everyone.

    First of all, I'm new to the forum and new to the realm of bioinformatics.

    I'm currently working on a project where I'm set to analyse several sequenced E. coli strains.

    My first task is to check for contamination.
    Running web BLAST, I find many hits to bacteria other than E. coli with an E-value of 0 and 95-100% Max ident.

    E. coli hits are, however, by far the dominant ones.

    I need to get a general impression of contamination in all contigs for the 6 different E. coli strains I have, so I can decide whether I can do further analyses with the contigs unmodified or whether contamination needs to be removed.

    Since there are >100 contigs for each strain and web BLAST output is limited to one strain at a time, doing this through the web interface is not feasible.

    Therefore I've installed blast+ and blastall locally (unix) and downloaded the nr database.

    When running

    Code:
    blastall -i trh9.fna -p blastn -d nr -o result.txt
    I get an almost empty result.txt file as output.

    Have I installed the nr database correctly, or is something wrong with my syntax?

    I've downloaded all the archives and put them in a db directory (nr.00, nr.01, etc.).

    The input file is a standard(?) fasta formatted file.

    Tips, pointers, help would be greatly appreciated.


    Anders

  • #2
    Just a minor point, you can indeed run NCBI "legacy" standalone BLAST like this:

    Code:
    blastall -p blastn ...
    If you want to use the "new" standalone BLAST+ it would be:

    Code:
    blastn ...
    As to the fact you are getting an almost empty result file, this is probably due to using different settings compared to the web blast. Check things like the gap parameters, evalue threshold, and so on.
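
To make the legacy versus BLAST+ difference concrete, here is a minimal sketch that rewrites a legacy blastall command line as its BLAST+ counterpart. The helper name `legacy_to_plus` is hypothetical, and the sketch only covers the four legacy options mapped below (`-i`, `-d`, `-o`, `-e`) plus `-p`; anything else is rejected rather than guessed.

```python
# Hypothetical helper: translate a few legacy blastall options to BLAST+.
# Only the options listed here are covered by this sketch.
LEGACY_TO_PLUS = {
    "-i": "-query",   # input FASTA file
    "-d": "-db",      # database name
    "-o": "-out",     # output file
    "-e": "-evalue",  # expectation value threshold
}


def legacy_to_plus(argv):
    """Convert e.g. ['blastall', '-p', 'blastn', ...] to a BLAST+ argument list."""
    if not argv or argv[0] != "blastall":
        raise ValueError("expected a blastall command line")
    args = argv[1:]
    if len(args) % 2 != 0:
        raise ValueError("expected option/value pairs")
    program = None
    options = []
    for flag, value in zip(args[0::2], args[1::2]):
        if flag == "-p":
            program = value  # blastall -p blastn becomes the blastn executable
        elif flag in LEGACY_TO_PLUS:
            options += [LEGACY_TO_PLUS[flag], value]
        else:
            raise ValueError(f"option {flag} is not handled by this sketch")
    if program is None:
        raise ValueError("blastall needs -p <program>")
    return [program] + options


print(" ".join(legacy_to_plus(
    "blastall -i trh9.fna -p blastn -d nr -o result.txt".split())))
```

Run on the command from the first post, this prints the BLAST+ form with `-query`, `-db` and `-out` in place of `-i`, `-d` and `-o`.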

    • #3
      To be more precise; the short output file that is produced only contains

      BLASTN 2.2.24 [Aug-08-2010]


      Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
      Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
      "Gapped BLAST and PSI-BLAST: a new generation of protein database search
      programs", Nucleic Acids Res. 25:3389-3402.

      Query= contig00001 length=3139 numreads=1128
      (3139 letters)


      So it seems the query is only the first contig of the fasta file, which contains >100 contigs. I need to get all the contigs to be processed.

      So basically there's zero output, and the computational time is very brief.
      Obviously, I'm doing something incorrectly.

      I don't know whether adjusting the E-value or gap scores would do anything here.

      Also should I go with blast+ instead of legacy?


      Sorry if I'm asking obvious questions, but I've googled my butt off lately, and there seems to be little info to be found.

      Also, am I using the right blast program?
      I'm supposed to run the nucleotide data against the nr database.
      Since nr is a protein database, should I be running blastx?
      But when I ran the search with web BLAST and got ample results, I was using nucleotide blast (i.e. blastn)...
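
One quick sanity check for the "only the first contig shows up" symptom is to confirm how many records the input FASTA file actually holds. The following is a minimal sketch, assuming a plain-text FASTA file in which every record starts with a `>` header line; the function names and the second example record are made up for illustration.

```python
def fasta_headers(lines):
    """Return the header text (without the leading '>') of each FASTA record."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def count_records(path):
    """Count the records in a FASTA file on disk."""
    with open(path) as handle:
        return len(fasta_headers(handle))


# Tiny in-memory example; the second record is made up for illustration.
example = [
    ">contig00001 length=3139 numreads=1128\n",
    "ACGTACGT\n",
    ">contig00002 length=8 numreads=2\n",
    "GGCCGGCC\n",
]
print(len(fasta_headers(example)), "records")
```

If `count_records("trh9.fna")` reports more than 100 records, the input is fine and the truncated output points at the search itself rather than at the query file.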

      • #4
        Originally posted by Anders Myrvold Dahl:
        To be more precise; the short output file that is produced only contains

        BLASTN 2.2.24 [Aug-08-2010]


        Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
        Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
        "Gapped BLAST and PSI-BLAST: a new generation of protein database search
        programs", Nucleic Acids Res. 25:3389-3402.

        Query= contig00001 length=3139 numreads=1128
        (3139 letters)

        That looks truncated - you'd normally then get some matches or it would say "no hits", then the next queries, and a footer at the end.

        There were no error messages? This is odd - but see below.

        Originally posted by Anders Myrvold Dahl:
        Also should I go with blast+ instead of legacy?

        I would certainly recommend you try it. The NCBI are (I think) currently still supporting legacy BLAST, but only in the short term. You'll have to switch to BLAST+ at some point, so it would be sensible to do it now.

        Originally posted by Anders Myrvold Dahl:
        Also, am I using the right blast program?
        I'm supposed to run the nucleotide data against the nr database.
        Since nr is a protein database, should I be running blastx?
        But when I ran the search with web BLAST and got ample results, I was using nucleotide blast (i.e. blastn)...

        Yes, use blastx; blastn is for a nucleotide query against a nucleotide database. There is a nice summary of the different blast programs here:
        The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
        Last edited by maubp; 11-18-2010, 01:25 PM. Reason: typo
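
The rule of thumb in this reply (match the program to the query type and the database type) can be written down as a small lookup table. This is an illustrative sketch only; the function name `choose_program` and the type labels are assumptions, and tblastx (translated nucleotide query against a translated nucleotide database) is left out because it shares the nucleotide/nucleotide key with blastn.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAM = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in six frames
}


def choose_program(query_type, db_type):
    """Return the BLAST program for a query/database type combination."""
    try:
        return BLAST_PROGRAM[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unknown combination: {query_type!r} vs {db_type!r}")


# Nucleotide contigs searched against the protein nr database:
print(choose_program("nucleotide", "protein"))
```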

        • #5
          I've tried running both blastn and blastx from the blast+ package against the nr database now.

          It seems blastn gives me an indexing error (because the database contains proteins?).

          Blastx executes, but nothing happens.

          I.e. I have to ctrl+c to break the process. There is no output, either in the command window or in the output file.

          And yes, I have tried letting the process run for a while...

          • #6
            Hi,

            Don't you get the database you're using listed in the output after the query? For example:


            Database: genome.fa
            139,530 sequences; 107,332,603 total letters

            Maybe it is the path to the database...

            • #7
              I'm pretty confident the database path is correct.

              The database should also be blast-formatted; i.e. I've downloaded the nr.00.tar.gz, etc. archives from the ftp://ftp.ncbi.nlm.nih.gov/blast/db/ site.

              I've run blastdbcheck and get the following output:


              Writing messages to file (test.txt) at verbosity (Summary)
              ISAM testing is ENABLED.
              Legacy testing is DISABLED.
              By default, testing 200 randomly sampled OIDs.

              Testing 5 volume(s).
              /home/andersmy/Blast/db/nr.00 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.01 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.02 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.03 / MetaData: [ERROR] caught exception.
              /home/andersmy/Blast/db/nr.04 / MetaData: [ERROR] caught exception.
              Result=FAILURE. 5 errors reported in 5 volume(s).
              Testing 1 alias(es).
              Result=SUCCESS. No errors reported for 1 alias(es).

              Total errors: 5

              Is there something wrong with the database that makes blastx crash?

              When I run

              Code:
              blastx -query Oppgave/trh52.fna -db Blast/db/nr -out result.txt
              result.txt is written to disk, but there is no command window output, and the command window freezes.

              • #8
                Five errors from the five chunks of the NR database: something is messed up.

                Can you also download the nr.*.md5 files and use the md5sum command line tool to verify the nr.*.tar.gz files downloaded correctly? They are just tiny little text files which contain a list of md5 checksums and filenames. e.g. "md5sum --check nr.00.tar.gz.md5" should calculate the md5 checksum for nr.00.tar.gz, and thus spot if it was corrupted on download.
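
For anyone without the md5sum tool at hand, the same check can be sketched in Python with the standard hashlib module. This is a minimal sketch, assuming each line of the .md5 file has the usual "checksum, whitespace, filename" layout; the function names are hypothetical, and unlike md5sum (which resolves names against the current directory) this helper looks for each file next to the .md5 file.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_file(md5_path):
    """Return {filename: True/False} for each entry in an md5sum-style file."""
    base_dir = os.path.dirname(os.path.abspath(md5_path))
    results = {}
    with open(md5_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            # md5sum prefixes the name with '*' when it was run in binary mode.
            name = name.strip().lstrip("*")
            actual = md5_of_file(os.path.join(base_dir, name))
            results[name] = (actual == expected.lower())
    return results
```

Calling `check_md5_file("nr.00.tar.gz.md5")` would then report whether nr.00.tar.gz matches its published checksum. A passing checksum only shows the download is intact; it says nothing about whether the archive was extracted correctly afterwards.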

                • #9
                  I've downloaded the nr.0*.tar.gz files once more as well as the md5 files, and reinstalled the database files.

                  I've performed the md5sum --check on all files and they're all ok.

                  Still I get the same error message from blastdbcheck after extracting these archives to my database directory.

                  And when I run blastx with the nr database, again the command interface just freezes.

                  I've tested downloading another nucleotide fasta file from NCBI, and blastx still freezes, so the input should not be to blame here. So somehow there's something funky with the database...

                  • #10
                    Hmm. Have you tried another database? e.g. the NCBI vector nucleotide database is very small.

                    • #11
                      I've run blastn successfully with my Fasta files using the vector database.

                      blastn checks all the contigs in my fasta file against the vector database and produces a smooth output file!


                      I've been told to use the non-redundant one though, and more importantly, I have to assess which of the hits are probable contamination, and not horizontal gene transfer.

                      I'm pretty blank as to how to tell these two apart. But I was told that any eukaryotic matches would very likely be contamination of the E. coli strains.

                      Perhaps I should start a new thread regarding the contamination issue?

                      Or are there any good sources I should check out on the web?

                      Also, seeing there are >100 contigs in each file, is there an easy way to make a truncated list with only the best hits in each contig based on some conditions, say only eukaryotic hits?
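
One possible way to get such a truncated list is to write the BLAST+ results in tabular form (`-outfmt 6`) and keep only the highest-scoring hit per contig. The sketch below assumes the default twelve-column layout of that format; the function name `best_hits` and the `keep` filter are hypothetical. Filtering for eukaryotic subjects needs taxonomy information that the default columns do not carry, so here it is left to whatever test the caller passes in as `keep`.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(lines, keep=lambda hit: True):
    """Return {query_id: hit} with the highest-bitscore hit per query.

    `keep` is an optional filter; hits for which it returns False are ignored.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if not keep(hit):
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

With a results file on disk, something like `best_hits(open("result.tsv"))` gives one entry per contig; passing `keep=lambda hit: hit["pident"] >= 95` would restrict the list to high-identity hits first.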

                      • #12
                        It is good that blastn worked with the small NCBI provided vector database. That seems to confirm your installation of BLAST+ is OK.

                        My guess is that your machine does not have enough RAM to do a search against a large database like NR.
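
A rough way to test this guess is to compare the on-disk size of the extracted database files with the machine's physical memory. The sketch below is an assumption-laden illustration: the function names are hypothetical, the `sysconf` names used for the RAM figure are available on Linux but not on every platform, and a database larger than RAM is only an indicator of heavy paging, not proof that a search must fail.

```python
import os

# Downloaded archives and checksum files are not part of the formatted database.
SKIP_SUFFIXES = (".tar.gz", ".md5")


def database_size_bytes(db_dir, db_name="nr"):
    """Sum the sizes of the files belonging to one BLAST database in db_dir."""
    total = 0
    for name in os.listdir(db_dir):
        if not name.startswith(db_name + ".") or name.endswith(SKIP_SUFFIXES):
            continue
        path = os.path.join(db_dir, name)
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total


def physical_ram_bytes():
    """Total physical memory; relies on sysconf names available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For the setup in this thread, comparing `database_size_bytes("/home/andersmy/Blast/db")` with `physical_ram_bytes()` would show how far apart the two numbers are.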
