Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • entomology
    Member
    • May 2011
    • 34

    Extract multiple fasta sequences from a fasta file based on sequenes

    I've searched in the forum and google as well, but most of cases are extract sequences from fasta file based on id list. but now I only have several sequences without ids and i want to extract from a fasta file based on these sequences, does anyone have clues for it? Actually, that means I need the sequences identifier (or id) from the fasta file.

    Thank you very much for your help.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    You can extract sequences that share kmers with your sequences with BBDuk:

    Code:
    bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
    This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

    You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

    Comment

    • cmbetts
      Senior Member
      • Jun 2012
      • 120

      #3
      There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

      In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

      It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

      I'm sure some folks might have some grep based methods for the commandline as well

      Comment

      • entomology
        Member
        • May 2011
        • 34

        #4
        3xs for the info.


        Originally posted by Brian Bushnell View Post
        You can extract sequences that share kmers with your sequences with BBDuk:

        Code:
        bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
        This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

        You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

        Comment

        • entomology
          Member
          • May 2011
          • 34

          #5
          Actually, I do prefer command based under shell environment. grep seems a great way to do it, thanks for tips.


          Originally posted by cmbetts View Post
          There's almost certainly lots of ways to do this depending on what tools you're comfortable with.

          In R/Bioconductor, you could read in the fasta using the bioconductor ShortRead package, and then use vcountPattern to identify the hits to your query sequences and write those as a new fasta file.

          It's been a long time since I used it, but BioPython also has some nice iterators for going through fasta files, and would be better suited for a bigger fasta file. You would essentially write a little loop that iterates through the fasta and queries each record for your desired sequence. If it finds the sequence, write the record to a new file, otherwise move on to the next record.

          I'm sure some folks might have some grep based methods for the commandline as well

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Here are my Python scripts to do this, with Galaxy wrappers:

            This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            This indexes the FASTA file, then goes through the IDs one by one:
            Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


            The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

            Comment

            • entomology
              Member
              • May 2011
              • 34

              #7
              Seems these script also deal with a list of ids, then using these ids to fetch sequences in a fasta file. I wanna use sequences directly and get a subset of fasta from a big fasta file base on provided sequences. Thank you for your information as well.


              Originally posted by maubp View Post
              Here are my Python scripts to do this, with Galaxy wrappers:

              This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              This indexes the FASTA file, then goes through the IDs one by one:
              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


              The difference is which order do you want? The order in the FASTA file (faster), or the order in the ID file (slower).

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                What scripting/programming language(s) are you learning?

                Comment

                • entomology
                  Member
                  • May 2011
                  • 34

                  #9
                  Actually, I do more wet lab most of the time. Sometimes, I'll also do simple small rna analysis with ready-to-use perl script and simple shell script. It's enough most of the time with my work, but sometime it's not easy for me.

                  Originally posted by maubp View Post
                  What scripting/programming language(s) are you learning?

                  Comment

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                    Comment

                    • entomology
                      Member
                      • May 2011
                      • 34

                      #11
                      No worry, I'll try to use grep to deal with the problem .
                      Originally posted by maubp View Post
                      Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Brian's solution should work. Did you try it?

                        While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                        Comment

                        • entomology
                          Member
                          • May 2011
                          • 34

                          #13
                          Yes, I've tried bbduk.sh.

                          bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

                          But my situation is that b.fa is not fasta file, it contain one sequence per line. I just want the sequence in b from a.fa, than make a new fasta file (c.fa).

                          since my b.fa is not a fasta file, so bbduk.sh give some error:

                          Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.


                          Originally posted by GenoMax View Post
                          Brian's solution should work. Did you try it?

                          While I like grep and its variants it may not always work for something as intricate as deciphering nucleotide patterns, specially if your sequences wrap around on multiple lines.

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                            Code:
                            $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                            Then use the file with BBDuk.

                            Comment

                            • entomology
                              Member
                              • May 2011
                              • 34

                              #15
                              Thank you for the code. It can change my sequences to fasta file. And I try bbduk.fas again, but the result is not as expected. An example will be more easier to understand. there are two fasta

                              original.fas
                              >1123-11234
                              aaaaaa
                              >wer
                              atgcca
                              >ad
                              ctaacg
                              >232-23424
                              tttttt
                              >323-342
                              cacaaa
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc
                              >5-234
                              cggcgtcacgttggttgttga


                              ref.fas(after I make fasta using your awk script)
                              >1
                              aaaaaa
                              >2
                              tttttt
                              >3
                              gggggg
                              >4
                              cccccc

                              I use "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

                              out.fas is like this
                              >5-234
                              cggcgtcacgttggttgttga

                              actually, I want a fasta like this

                              >1123-11234
                              aaaaaa
                              >232-23424
                              tttttt
                              >416-2
                              gggggg
                              >13424241234-23423
                              cccccc

                              Just like fetch the id from the original.fas


                              Originally posted by GenoMax View Post
                              If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

                              Code:
                              $ awk -F "\n" 'BEGIN{counts=1}{print ">"counts"\n"""$0""; counts++}' your_file > new_file_as_fasta
                              Then use the file with BBDuk.

                              Comment

                              Latest Articles

                              Collapse

                              • GATTACAT
                                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by GATTACAT
                                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                Yesterday, 11:43 AM
                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-30-2026, 05:37 AM
                              0 responses
                              11 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              18 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              52 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              111 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...