  • angeloulivieri
    Member
    • Jul 2012
    • 30

    Extract only sequence ids from fasta file with makeblastdb

    Hi all,
    I'm new to BLAST and am currently exploring its functions from the command line.
    I already know that to run blastx I first have to index my FASTA database with makeblastdb.
    I have already used BLAST to learn how it works, and I would like the output to contain only the sequence code rather than all the information about each sequence (code, description, etc.).
    How can I do it? I read somewhere that I have to pass some parameter to the makeblastdb command. Does anyone here know which one?

    Thanks to all.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    When you do a BLAST search (e.g. blastp or blastn), there are several different output formats. The plain text and XML formats include the original FASTA record descriptions, but these are not (currently) available in the tabular output.
    This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremo...


    Is that what you meant?


    • angeloulivieri
      Member
      • Jul 2012
      • 30

      #3
      Yes, that may be useful. I think I could perhaps also do it with makeblastdb, because what I want is for BLAST not to use the complete file with all the information for each sequence, but only the sequence ID.
      So, for example, the command could be this:

      makeblastdb -in db.fasta -title db -parse_seqids -gi_mask

      What do you think?

      Maybe later I could then run blastx with -outfmt "6 qgi sgi"
      to get only a table of results showing just the GI for the query and the subject sequence.

      I'm trying to run them now, since I don't know of a way to inspect how makeblastdb has built the database.


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        I only use -parse_seqids if my FASTA files are labeled using the NCBI style with pipe characters (the vertical bars, |, are called pipes). Otherwise I find it doesn't work very well.


        • angeloulivieri
          Member
          • Jul 2012
          • 30

          #5
          The headers in my FASTA file are from NCBI and look like this:

          tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

          I want BLAST to use only the first sequence code: H3ISY8

          and to show me only that in the results.

          The command I wrote gives me a file containing only "0 0 0". I don't know why.

          If I remove the "6 qgi sgi" and pass only -outfmt "6", it returns a correct table.
          I'm still trying different input parameters.


          • angeloulivieri
            Member
            • Jul 2012
            • 30

            #6
            So, in the end, I have looked at a lot of parameters and cannot do it. Can I conclude that it is not possible to create the binary database that BLAST uses with only the sequence ID?

            And is there also no way to get blastx to report only this code in the results, instead of the three parts separated by pipes (|)?


            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Originally posted by angeloulivieri
              The headers in my FASTA file are from NCBI and look like this:

              tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params

              I want BLAST to use only the first sequence code: H3ISY8

              The simplest way to do that is to make a new FASTA file using that as the ID, and make a BLAST database from that.

              Personally I'd use the database as is and process the BLAST output in a script instead.
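
As a rough illustration of processing the BLAST output in a script, here is a minimal Python sketch. It assumes tab-separated output (as from -outfmt 6) with the subject ID in the second column, and identifiers laid out as db|ACCESSION|NAME like the tr|H3ISY8|H3ISY8_STRPU example above. The function names and the sample hit line are made up for this example and are not part of BLAST.

```python
import io


def accession_only(seq_id):
    """Return the middle field of a 'db|ACCESSION|NAME' identifier.

    Identifiers without that three-part pipe layout are returned unchanged.
    """
    parts = seq_id.split("|")
    return parts[1] if len(parts) >= 3 else seq_id


def simplify_tabular(in_handle, out_handle, subject_column=1):
    """Copy tab-separated BLAST rows, shortening the subject ID column."""
    for line in in_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        fields[subject_column] = accession_only(fields[subject_column])
        out_handle.write("\t".join(fields) + "\n")


# One invented hit line, only to show the transformation.
example = io.StringIO("query1\ttr|H3ISY8|H3ISY8_STRPU\t98.5\t200\n")
result = io.StringIO()
simplify_tabular(example, result)
print(result.getvalue(), end="")
```

With real data you would pass open file handles instead of the io.StringIO objects. The database itself stays untouched, so the full descriptions remain available if they are needed later.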


              • angeloulivieri
                Member
                • Jul 2012
                • 30

                #8
                OK, thanks. Someone told me that there is a parameter to pass to makeblastdb for this, but maybe he's wrong.


                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Originally posted by angeloulivieri
                  OK, thanks. Someone told me that there is a parameter to pass to makeblastdb for this, but maybe he's wrong.

                  As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

                  If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.
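
Formatting the identifiers exactly as you want them could look roughly like the Python sketch below. It assumes headers of the form >db|ACCESSION|NAME description, as in the tr|H3ISY8|H3ISY8_STRPU example earlier in the thread, and keeps only the middle field. The function name and the short sequence are placeholders invented for this example.

```python
import io


def shorten_fasta_headers(in_handle, out_handle):
    """Rewrite FASTA headers so that only the accession remains.

    A header such as '>tr|H3ISY8|H3ISY8_STRPU some description' becomes
    '>H3ISY8'. Headers without the three-part pipe layout keep their
    first word as the ID. Sequence lines are copied unchanged.
    """
    for line in in_handle:
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            parts = seq_id.split("|")
            if len(parts) >= 3:
                seq_id = parts[1]
            out_handle.write(">" + seq_id + "\n")
        else:
            out_handle.write(line)


# Placeholder record: the header comes from the thread, the sequence is invented.
example = io.StringIO(
    ">tr|H3ISY8|H3ISY8_STRPU description OrganismType Other params\nMKVL\n"
)
result = io.StringIO()
shorten_fasta_headers(example, result)
print(result.getvalue(), end="")
```

The rewritten file can then be given to makeblastdb with -in in place of the original. Note that the descriptions are dropped entirely, so this only makes sense if they are stored somewhere else.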


                  • angeloulivieri
                    Member
                    • Jul 2012
                    • 30

                    #10
                    Originally posted by maubp
                    As mentioned earlier, you might be able to do it via the makeblastdb -parse_seqids option, but that requires your sequence identifiers follow the NCBI naming conventions with the pipe ("|") symbol.

                    If your FASTA file identifiers are not already in the expected format, you'd have to modify the FASTA file - and in my view in that case you might as well avoid using this option, and simply format the identifiers exactly as you want them.

                    My FASTA file has this kind of header for each sequence:

                    tr|I1GCL2|I1GCL2_AMPQE Uncharacterized protein OS=Amphimedon queenslandica GN=LOC100637533 PE=4 SV=1

                    I would like makeblastdb to use only the ID I1GCL2 as the identifier. This matters to me because I want the database to be as lightweight as possible to manage. I already have the other information collected in a database of my own.

                    I used this command:
                    makeblastdb -in uniprot_kb_2012_06.fasta -title uniprot_kb_2012_06 -parse_seqids

                    but it doesn't work as I expected: it stores all the information in the header :-(
                    Last edited by angeloulivieri; 07-26-2012, 02:53 AM.


                    • angeloulivieri
                      Member
                      • Jul 2012
                      • 30

                      #11
                      Does no one know how to do it?

                      Comment

                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #12
                        You haven't said which output format you are using. The specially formatted identifiers (with the pipe characters) are how BLAST identifies an accession number - which you can ask for explicitly when using the tabular output.
                        Last edited by maubp; 07-30-2012, 02:39 AM. Reason: corrected typo
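
Asking for the accession explicitly in the tabular output might look like the sketch below, which only builds the blastx command line. It assumes the BLAST+ column specifiers qseqid, sacc, evalue and bitscore. The file and database names are hypothetical, and whether sacc comes back as the bare accession depends on how the database was built (for example with -parse_seqids) and on the BLAST+ version.

```python
import shlex


def blastx_tabular_command(query, db, out,
                           columns=("qseqid", "sacc", "evalue", "bitscore")):
    """Build a blastx argument list that requests specific tabular columns."""
    outfmt = "6 " + " ".join(columns)
    return ["blastx", "-query", query, "-db", db, "-out", out,
            "-outfmt", outfmt]


# Hypothetical file and database names, for illustration only.
cmd = blastx_tabular_command("reads.fasta", "my_db", "hits.tsv")
print(shlex.join(cmd))
```

To actually run the search you could pass cmd to subprocess.run(cmd, check=True), provided blastx is installed and on your PATH.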


                        • angeloulivieri
                          Member
                          • Jul 2012
                          • 30

                          #13
                          I know that when I run blastx I can get tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary-format database. The reason is that I already have an accession-to-description mapping in a database, so this would reduce the amount of information to manage when I later run blastx. I hope that is clear.

                          (Maybe something could be done with the formatdb command, but I see that it's an old command.)


                          • maubp
                            Peter (Biopython etc)
                            • Jul 2009
                            • 1544

                            #14
                            Originally posted by angeloulivieri
                            I know that when I run blastx I can get tabular output with only the accession numbers, but that is a different problem. I would like makeblastdb to keep only the accession when it creates its binary-format database. The reason is that I already have an accession-to-description mapping in a database, so this would reduce the amount of information to manage when I later run blastx. I hope that is clear.

                            (Maybe something could be done with the formatdb command, but I see that it's an old command.)

                            The old 'legacy' BLAST suite had the commands 'formatdb' and 'blastall'. In the new BLAST+ suite 'formatdb' is replaced by 'makeblastdb', and for running BLAST you have separate tools such as 'blastp', 'blastn', etc.

                            Anything you could do with 'formatdb' would (I hope) be supported in 'makeblastdb'.
