  • Protein ID that blast could not identify

    Hi,
    I downloaded a proteome in FASTA format, which contains hundreds of proteins (http://labs.umassmed.edu/chlamyfp/in...p?content=help). I want to BLAST my data against these proteins using BLAST+. However, when I ran makeblastdb on the proteome dataset, an error occurred:
    *******************************************************************
    Error: NCBI C++ Exception:
    "/am/ncbiapdata/release/blast/src/2.2.26/IntelMAC-universal/c++/GCC401-ReleaseMT--IntelMAC-universal/../src/objects/seq/../seqloc/Seq_id.cpp", line 1679: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type C_1150005
    *******************************************************************
    I think there must be something wrong with the proteome data, because BLAST+ worked well when I used data downloaded directly from NCBI.

    Therefore, I opened the proteome data with TextEdit. For example, the header of each sequence looks like this:
    *****************************************************************
    >C_680011|168600 FAP45, Flagellar Associated Protein Weakly Similar to Nasopharyngeal Epithelium Specific Protein 1
    MPQTPPRSGGYRSGKQSYVDESLFGGSKRTGAAQVETLDSLKLTAPTRTISPKDRDVVTLTKGDLTRMLKASPIMTAEDVAAAKREAEAKREQLQAVSKA
    RKEKMLKLEEEAKKQAPPTETEILQRQLNDATRSRATHMMLEQKDPVKHMNQMMLYSKCVTIRDAQIEEKKQMLAEEEEEQRRLDLMMEIERVKALEQYE
    ARERQRVEERRKGAAVLSEQIKERERERIRQEELRDQERLQMLREIERLKEEEMQAQIEKKIQAKQLMEEVAAANSEQIKRKEGMKVREKEEDLRIADYI
    LQKEMREQ
    *****************************************************************

    Here "C_680011|168600" should be the protein ID, I think, but nothing was found when I searched for it in NCBI. I wonder what kind of ID it is and what I should do to make BLAST+ recognise it.

    Thanks!

  • #2
    Are you using the -parse_seqids option? If so, try it without this. I only ever use this if my FASTA file identifiers follow the NCBI naming conventions.

    It would be useful to show the command you used to run makeblastdb as that might help us to understand what you are doing.
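
One likely explanation for the "Unsupported ID type" error: with -parse_seqids, makeblastdb appears to read a bar-delimited identifier as an NCBI-style `db|accession` pair, so the text before the first "|" (here C_1150005) is taken as a database tag it does not know. If -parse_seqids is needed, a minimal sketch of a workaround is to replace the "|" inside each ID before building the database. The helper name sanitize_fasta_ids is hypothetical, and the sketch assumes the ID is the first whitespace-delimited token of each header line.

```shell
# Sketch: make FASTA IDs safe for -parse_seqids by replacing every "|" in the
# ID (the first whitespace-delimited token of each header) with "_".
# sanitize_fasta_ids is a hypothetical helper name, not part of BLAST+.
sanitize_fasta_ids() {
    awk '
        /^>/ {
            id = $1                              # ">" plus the identifier
            rest = substr($0, length($1) + 1)    # description, if any
            gsub(/[|]/, "_", id)                 # only the ID is rewritten
            print id rest
            next
        }
        { print }                                # sequence lines are unchanged
    ' "$@"
}
```

For example, `sanitize_fasta_ids CrFP.fasta > CrFP.clean.fasta` (the output file name is made up here) would turn `>C_680011|168600 FAP45, ...` into `>C_680011_168600 FAP45, ...` and leave the sequence lines as they are.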


    • #3
      Originally posted by maubp
      Are you using the -parse_seqids option? If so, try it without this. I only ever use this if my FASTA file identifiers follow the NCBI naming conventions.

      It would be useful to show the command you used to run makeblastdb as that might help us to understand what you are doing.
      Dear Maubp,
      Thanks for your reply.
      Yes, I used -parse_seqids. Following your suggestion, I ran it without -parse_seqids, and another error showed up:
      *******************************************************************
      Error: (CArgException::eNoArg) Argument "dbtype". Mandatory value is missing: `String, `nucl', `prot''
      Error: (CArgException::eNoArg) Application's initialization failed
      *****************************************************************

      The command I used was
      makeblastdb -in CrFP.fasta -out CrFP

      Thanks


      • #4
        That error is clear isn't it? You have to tell makeblastdb if your FASTA file is protein or nucleotides. i.e. either:

        Code:
        makeblastdb -in CrFP.fasta -out CrFP -dbtype nucl
        or,

        Code:
        makeblastdb -in CrFP.fasta -out CrFP -dbtype prot
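
If it is unclear which value to pass, the sequence letters themselves usually settle it. The following is only a rough heuristic sketch, not something BLAST+ provides: guess_dbtype is a hypothetical helper, and the 10% threshold is an arbitrary assumption. It reports "prot" when a noticeable share of the letters fall outside the nucleotide alphabet.

```shell
# Sketch: guess whether a FASTA file holds protein or nucleotide sequences.
# guess_dbtype is a hypothetical helper; the 10% cutoff is an assumption.
guess_dbtype() {
    awk '
        !/^>/ {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)      # keep letters only
            total += length(seq)
            gsub(/[ACGTUN]/, "", seq)    # drop nucleotide letters
            other += length(seq)         # what remains is protein-only
        }
        END {
            if (total == 0) { print "unknown"; exit }
            type = (other / total > 0.1) ? "prot" : "nucl"
            print type
        }
    ' "$@"
}
```

With a helper like this, `makeblastdb -in CrFP.fasta -out CrFP -dbtype "$(guess_dbtype CrFP.fasta)"` would pick the value automatically, though checking the file by eye is just as reliable for a single dataset.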


        • #5
          Originally posted by maubp
          That error is clear isn't it? You have to tell makeblastdb if your FASTA file is protein or nucleotides. i.e. either:

          Code:
          makeblastdb -in CrFP.fasta -out CrFP -dbtype nucl
          or,

          Code:
          makeblastdb -in CrFP.fasta -out CrFP -dbtype prot
          YES!
          What a stupid mistake I made. It succeeded now!

          Thank you!


          • #6
            Originally posted by Tsuyoshi
            It succeeded now!
            Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice


            • #7
              Originally posted by maubp
              Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice
              Yep!

              I couldn't agree more. Many thanks!


              • #8
                Originally posted by maubp
                Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice
                Hi Maubp,
                I still have a question about the protein IDs. It seems that no database names proteins in this way. Take several proteins as examples:

                C_1620015|156900
                C_10830001|152917
                C_2020008|159281
                C_510029|166481
                C_510029|166481
                C_510029|166481
                C_510029|166481

                I do not think they are NCBI accession numbers for Chlamydomonas, but I want to find their real NCBI accession numbers. Do you have any idea how to do that?


                • #9
                  That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:


                  (square brackets in the URL confuse the forum software)

                  Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?
                  Last edited by maubp; 09-10-2012, 03:10 AM. Reason: Trying to fix link
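
The perfect-match idea can be sketched as a filter over BLAST+ tabular output. This assumes the search was run with the custom column list `-outfmt "6 qseqid sseqid pident length qlen slen"`; the helper name perfect_hits and any file names are hypothetical. It keeps only hits that are 100% identical and span both the query and the subject from end to end, and prints the private ID next to the matching database ID.

```shell
# Sketch: list query/subject pairs that match perfectly over their full length.
# Expected input columns (tab-separated):
#   qseqid sseqid pident length qlen slen
# perfect_hits is a hypothetical helper name, not part of BLAST+.
perfect_hits() {
    awk -F '\t' '
        # 100% identity, and the alignment covers the whole query and subject
        ($3 + 0) == 100 && $4 == $5 && $4 == $6 { print $1 "\t" $2 }
    ' "$@"
}
```

A run might look like `blastp -query CrFP.fasta -db nr -outfmt "6 qseqid sseqid pident length qlen slen" -out hits.tsv` followed by `perfect_hits hits.tsv`, assuming a local copy of the NR database is available. Requiring full-length coverage is a deliberately strict choice: it avoids reporting a short identical fragment as if it were the same protein.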


                  • #10
                    Originally posted by maubp
                    That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:

                    http://www.ncbi.nlm.nih.gov/sites/gq...=chlamydomonas[orgn]

                    Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?
                    The sequences themselves perfectly match the submitted Chlamydomonas data. I just have no idea what kind of IDs the authors used.


                    • #11
                      If you can work out how to get the data from the NCBI with their accessions, that might be simpler than working with the original author's private identifiers.


                      • #12
                        Originally posted by maubp
                        If you can work out how to get the data from the NCBI with their accessions, that might be simpler than working with the original author's private identifiers.
                        That's right.
                        Anyway, I will try to extract the accession numbers from NCBI.
                        Thank you very much Maubp !
