Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • entomology
    Member
    • May 2011
    • 34

    Extract multiple fasta sequences from a fasta file based on sequenes

    I've searched in the forum and google as well, but most of cases are extract sequences from fasta file based on id list. but now I only have several sequences without ids and i want to extract from a fasta file based on these sequences, does anyone have clues for it? Actually, that means I need the sequences identifier (or id) from the fasta file.

    Thank you very much for your help.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    You can extract sequences that share kmers with your sequences with BBDuk:

    Code:
    bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
    This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

    You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

    Comment

    • cmbetts
      Senior Member
      • Jun 2012
      • 120

      #3
      There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

      In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

      It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

      I'm sure some folks might have some grep based methods for the commandline as well

      Comment

      • entomology
        Member
        • May 2011
        • 34

        #4
        3xs for the info.


        Originally posted by Brian Bushnell View Post
        You can extract sequences that share kmers with your sequences with BBDuk:

        Code:
        bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
        This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

        You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

        Comment

        • entomology
          Member
          • May 2011
          • 34

          #5
          Actually, I do prefer command based under shell environment. grep seems a great way to do it, thanks for tips.


          Originally posted by cmbetts View Post
          There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

          In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

          It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

          I'm sure some folks might have some grep based methods for the commandline as well

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Here are my Python scripts to do this, with Galaxy wrappers:

            This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            This indexes the FASTA file, then goes through the IDs one by one:
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

            Comment

            • entomology
              Member
              • May 2011
              • 34

              #7
              Seems these script also deal with a list of ids, then using these ids to fetch sequences in a fasta file. I wanna use sequences directly and get a subset of fasta from a big fasta file base on provided sequences. Thank you for your information as well.


              Originally posted by maubp View Post
              Here are my Python scripts to do this, with Galaxy wrappers:

              This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              This indexes the FASTA file, then goes through the IDs one by one:
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                What scripting/programming language(s) are you learning?

                Comment

                • entomology
                  Member
                  • May 2011
                  • 34

                  #9
                  Actually, I do more wet lab most of the time. Sometimes, I'll also do simple small rna analysis with ready-to-use perl script and simple shell script. It's enough most of the time with my work, but sometime it's not easy for me.

                  Originally posted by maubp View Post
                  What scripting/programming language(s) are you learning?

                  Comment

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                    Comment

                    • entomology
                      Member
                      • May 2011
                      • 34

                      #11
                      No worry, I'll try to use grep to deal with the problem .
                      Originally posted by maubp View Post
                      Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Brian's solution should work. Did you try it?

                        While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                        Comment

                        • entomology
                          Member
                          • May 2011
                          • 34

                          #13
                          Yes, I've tried bbduk.sh.

                          bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

                          But my situation is that b.fa is not fasta file, it contain one sequence per line. I just want the sequence in b from a.fa, than make a new fasta file (c.fa).

                          since my b.fa is not a fasta file, so bbduk.sh give some error:

                          Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.


                          Originally posted by GenoMax View Post
                          Brian's solution should work. Did you try it?

                          While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                            Code:
                            $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                            Then use the file with BBDuk.

                            Comment

                            • entomology
                              Member
                              • May 2011
                              • 34

                              #15
                              Thank you for the code. It can change my sequences to fasta file. And I try bbduk.fas again, but the result is not as expected. An example will be more easier to understand. there are two fasta

                              original.fas
                              >1123-11234
                              aaaaaa
                              >wer
                              atgcca
                              >ad
                              ctaacg
                              >232-23424
                              tttttt
                              >323-342
                              cacaaa
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc
                              >5-234
                              cggcgtcacgttggttgttga


                              ref.fas(after I make fasta using your awk script)
                              >1
                              aaaaaa
                              >2
                              tttttt
                              >3
                              gggggg
                              >4
                              cccccc

                              I use "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

                              out.fas is like this
                              >5-234
                              cggcgtcacgttggttgttga

                              actually, I want a fasta like this

                              >1123-11234
                              aaaaaa
                              >232-23424
                              tttttt
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc

                              Just like fetch the id from the original.fas


                              Originally posted by GenoMax View Post
                              If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                              Code:
                              $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                              Then use the file with BBDuk.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 08:59 AM
                              0 responses
                              8 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              17 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...