
  • How to split a fasta file according to a list of gene IDs

    Hi ALL,

    I have a fasta file and I want to split it into two fasta files according to a list of sequence names in a text file (one seq name per line). Sequences whose names match the list should go to one fasta file and the rest to another.

    Could anybody provide me a script or some programs to perform this work? There are some online tools, but it would take a large amount of time to upload my file.

    Thanks.

  • #2
    If your sequences aren't split across multiple lines, you can do this with grep, I think:

    grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
    grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

    I might be remembering this wrong.

    If you have QIIME, you can do this with filter_fasta.py.
    Last edited by rhinoceros; 08-13-2013, 08:56 AM.
    savetherhino.org



    • #3
      Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so you can only capture one output per run by redirecting STDOUT to a file.

      Code:
      Usage:
      
      % subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]
      
      Example:
      
      % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
      % subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta
      If you do not specify a -mode argument the script defaults to the 'include' mode.

      A note about ID matching: the script bases a match on the first non-white space delimited text on the defline. If your defline is:

      Code:
      >sequenceID sequence description follows
      The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
      Attached Files
      Last edited by kmcarr; 08-13-2013, 10:16 AM. Reason: Add note about default mode.
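
For readers who cannot get the attachment, the behaviour described in this post can be sketched in a few lines of Python. This is not kmcarr's Perl script, only an illustration of the same idea under stated assumptions: the ID is the first whitespace-delimited token after `>`, mode `i` keeps listed records, mode `e` keeps the others, and the default is `i`. The function name `subset_fasta` is made up for this sketch.

```python
from typing import Iterable, Iterator, Set


def subset_fasta(lines: Iterable[str], ids: Set[str], mode: str = "i") -> Iterator[str]:
    """Yield the lines of records whose ID is in `ids` (mode "i") or not in `ids` (mode "e")."""
    if mode not in ("i", "e"):
        raise ValueError("mode must be 'i' (include) or 'e' (exclude)")
    keep = False  # nothing is emitted before the first defline
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token on the defline.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = (seq_id in ids) == (mode == "i")
        if keep:
            # Sequence lines inherit the decision made at their defline,
            # so multi-line records are handled as well.
            yield line
```

Calling it once with `mode="i"` and once with `mode="e"` on the same input gives the two complementary outputs, which mirrors running the script twice.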



      • #4
        In reply to lran2008:
        Try this: https://code.google.com/p/nash-bioin...ta.pl&can=2&q=

        Hopefully it will do the job you need.

        J
        Last edited by JohnN; 08-13-2013, 10:19 AM. Reason: Wrong URL



        • #5
          In reply to rhinoceros (#2):
          Thanks. The second command didn't work.
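
A likely reason the inverted command fails: `-v` selects every line that does not match an ID, which includes all of the sequence lines, and `-A 1` then prints the line after each of those as context, so the listed headers are pulled back in and nearly the whole file comes out. Two further caveats apply to both commands: `grep -f` matches patterns anywhere in the line (an ID `gene1` also hits `gene10` unless something like `-F -w` is added), and GNU grep prints `--` separator lines between non-adjacent context groups.

A minimal Python sketch that avoids these issues by deciding once per header and writing both files in a single pass is shown below. The function names and file arguments are hypothetical, and the ID is assumed to be the first whitespace-delimited token after `>`.

```python
from typing import Iterable, Iterator, Set, Tuple


def tag_lines(lines: Iterable[str], ids: Set[str]) -> Iterator[Tuple[bool, str]]:
    """Yield (in_list, line) for every line that belongs to a FASTA record."""
    in_list = None  # stays None until the first header is seen
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Exact match on the first token, so "a" does not match "a10".
            in_list = bool(fields) and fields[0] in ids
        if in_list is not None:
            yield in_list, line


def split_fasta(fasta_path: str, ids_path: str, matched_path: str, unmatched_path: str) -> None:
    """Stream the FASTA file once and write listed and unlisted records to separate files."""
    with open(ids_path) as handle:
        ids = {line.strip() for line in handle if line.strip()}
    with open(fasta_path) as fasta, \
            open(matched_path, "w") as hit, \
            open(unmatched_path, "w") as miss:
        for in_list, line in tag_lines(fasta, ids):
            (hit if in_list else miss).write(line)
```

With the file names from post #2, this would be called as `split_fasta("yourFastaFile.fasta", "yourSeqIDFile.txt", "SeqsFromIDList.fasta", "TheOtherSeqs.fasta")`. Only the ID list is held in memory, so large FASTA files should not be a problem.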



          • #6
            In reply to kmcarr (#3):
            Thanks very much. The script works perfectly!



            • #7
              In case anyone needs more alternatives, you can also use fastq_select.tcl, which is bundled with MIRA. This was also discussed in an earlier thread, which might be useful.



              • #8
                If you want a Galaxy solution, try this:


                Or this related but subtly different tool which pulls out the reads in the ID order given
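
The difference between filtering (output keeps the file's order) and selecting in the order of the ID list can be illustrated with a short Python sketch. This is not the Galaxy tool's code, just one way the idea could work: the names `index_records` and `select_in_order` are hypothetical, the ID is assumed to be the first whitespace-delimited token after `>`, and the whole file is indexed in memory, which may not suit very large inputs.

```python
from typing import Dict, Iterable, Iterator, List


def index_records(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each record ID to the list of lines that make up that record."""
    records: Dict[str, List[str]] = {}
    current: List[str] = []  # lines before the first header land here and are discarded
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            current = []
            # If an ID occurs twice, the later record replaces the earlier one.
            records[fields[0] if fields else ""] = current
        current.append(line)
    return records


def select_in_order(lines: Iterable[str], ordered_ids: Iterable[str]) -> Iterator[str]:
    """Yield records in the order the IDs are given; IDs not in the file are skipped."""
    records = index_records(lines)
    for seq_id in ordered_ids:
        yield from records.get(seq_id, [])
```

For a file with records a, b, c and an ID list of c then a, this yields record c followed by record a, whereas a plain filter would give a followed by c.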



                • #9
                  Originally posted by maubp
                  If you want a Galaxy solution, try this:


                  Or this related but subtly different tool which pulls out the reads in the ID order given
                  http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id
                  This should work. I haven't tried it, so I don't know whether it can output a fasta file for the unmatched seqs.



                  • #10
                    In reply to lran2008 (#9):
                    Yes, my sequence filter tool can produce a FASTA file with matched IDs, a FASTA file with non-matching IDs, or both (two FASTA files):


                    There is a preview/mockup of the tool available to view within the Tool Shed which should help explain this.
