**rhinoceros** · 08-13-2013, 08:53 AM

If your sequences aren't split across multiple lines, you can do this with grep, I think:

grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

I might be remembering this wrong.

If you have QIIME, you can do this with filter_fasta.py.

**kmcarr** · 08-13-2013, 10:14 AM

Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so each run gives you a single output that you capture by redirecting STDOUT to a file.

Code:

Usage:

% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]

Example:

% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta

If you do not specify a -mode argument the script defaults to the 'include' mode.
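
For readers who cannot download the attachment, the include/exclude behaviour described above can be sketched in a few lines of Python. This is not kmcarr's Perl script (its source is not shown in the thread); `filter_fasta` and its `mode` argument are hypothetical names chosen only to mirror the `-m i` / `-m e` options, and the sketch assumes the record ID is the first whitespace-delimited token after the `>`.

```python
def filter_fasta(lines, ids, mode="i"):
    """Yield FASTA lines of records whose ID is in `ids` (mode "i")
    or not in `ids` (mode "e"). Hypothetical helper, not the Perl script."""
    if mode not in ("i", "e"):
        raise ValueError("mode must be 'i' (include) or 'e' (exclude)")
    ids = set(ids)
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # Keep listed records in include mode, unlisted ones in exclude mode.
            keep = (seq_id in ids) == (mode == "i")
        if keep:
            yield line


fasta = [">seq1 first\n", "ACGT\n", "GGCC\n", ">seq2\n", "TTTT\n"]
included = list(filter_fasta(fasta, {"seq1"}, mode="i"))
excluded = list(filter_fasta(fasta, {"seq1"}, mode="e"))
```

Because the decision is made per header and then applied to every following line, multi-line sequences stay intact.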

A note about ID matching: the script matches on the first whitespace-delimited token of the defline (the text right after the '>'). If your defline is:

Code:

>sequenceID sequence description follows

The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.
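
The matching rule is easy to check against your own deflines before running anything. The snippet below is a minimal Python illustration of "first whitespace-delimited token after the `>`"; `defline_id` is a hypothetical helper, and how the Perl script treats edge cases (for example a space directly after the `>`) is not stated in the thread, so the behaviour here is an assumption.

```python
def defline_id(defline):
    """Return the first whitespace-delimited token after the leading '>'."""
    if not defline.startswith(">"):
        raise ValueError("not a FASTA defline: %r" % defline)
    fields = defline[1:].split()
    # An empty defline (just '>') yields an empty ID.
    return fields[0] if fields else ""


seq_id = defline_id(">sequenceID sequence description follows")
```

Here `seq_id` is `"sequenceID"`, so the list file should contain exactly that text, without the `>` and without the description.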

Attached Files

subSetFasta.pl (2.4 KB, 101 views)

**JohnN** · 08-13-2013, 10:16 AM

Originally posted by lran2008

Hi all,

I have a FASTA file and I want to split it into two FASTA files according to a list of sequence names in a text file (one sequence name per line). Sequences whose names match the list should go to one FASTA file and all the others to another file.

Could anybody provide me a script or some programs to perform this work? There are some online tools, but it would take a large amount of time to upload my file.

Thanks.

Try this: https://code.google.com/p/nash-bioin...ta.pl&can=2&q=

Hopefully it will do the job you need.

J

**lran2008** · 08-13-2013, 12:43 PM

Originally posted by rhinoceros

If your sequences aren't split across multiple lines, you can do this with grep, I think:

grep -A 1 -f yourSeqIDFile.txt yourFastaFile.fasta > SeqsFromIDList.fasta
grep -A 1 -v -f yourSeqIDFile.txt yourFastaFile.fasta > TheOtherSeqs.fasta

I might be remembering this wrong.

If you have QIIME, you can do this with filter_fasta.py.

Thanks. The second command didn't work.
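
A likely reason the second command fails: `-v` makes grep select every line that does not match a pattern, and that includes all the sequence lines, so the sequences of the listed IDs still end up in the output (minus their headers). Two further caveats apply to both commands: the patterns read with `-f` match anywhere in a line, so `seq1` also hits `seq10`, and GNU grep prints `--` separator lines between non-adjacent context groups when `-A` is used. A rough Python sketch that avoids these issues and writes both outputs in one pass is shown below. It assumes the ID is the first whitespace-delimited token of each header and of each list line; `split_fasta` is a hypothetical name, and the file names are borrowed from the grep example.

```python
import os
import tempfile


def split_fasta(fasta_path, id_path, matched_path, unmatched_path):
    """Write records whose ID is listed in id_path to matched_path,
    and all other records to unmatched_path."""
    ids = set()
    with open(id_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines in the ID list
                ids.add(fields[0])
    with open(fasta_path) as fasta, \
            open(matched_path, "w") as matched, \
            open(unmatched_path, "w") as unmatched:
        out = unmatched  # any lines before the first header go here
        for line in fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                seq_id = fields[0] if fields else ""
                # Exact ID comparison, so 'seq1' does not match 'seq10'.
                out = matched if seq_id in ids else unmatched
            out.write(line)


# Small demonstration on throwaway files.
workdir = tempfile.mkdtemp()
fasta_file = os.path.join(workdir, "yourFastaFile.fasta")
id_file = os.path.join(workdir, "yourSeqIDFile.txt")
matched_file = os.path.join(workdir, "SeqsFromIDList.fasta")
unmatched_file = os.path.join(workdir, "TheOtherSeqs.fasta")

with open(fasta_file, "w") as handle:
    handle.write(">seq1 wanted\nACGT\nGGCC\n>seq10 not wanted\nTTTT\n>seq2\nCCCC\n")
with open(id_file, "w") as handle:
    handle.write("seq1\nseq2\n")

split_fasta(fasta_file, id_file, matched_file, unmatched_file)

with open(matched_file) as handle:
    matched_text = handle.read()
with open(unmatched_file) as handle:
    unmatched_text = handle.read()
```

Only the ID list is held in memory; the FASTA file is streamed line by line, so this should also cope with files too large to upload to an online tool.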

**lran2008** · 08-13-2013, 12:49 PM

Originally posted by kmcarr

Here is a script I wrote a while back that does almost what you want. It takes as input a FASTA file, a text file with a list of sequence IDs (one per line), and a mode argument to include or exclude the IDs in your list from the output. You could simply run the script twice, once in each mode, to get the two complementary outputs, or, if you feel like it, modify the code to generate two output files. As it works now, output is written to STDOUT, so each run gives you a single output that you capture by redirecting STDOUT to a file.

Code:

Usage:

% subSetFasta.pl -f <fastaFileName> -l <listFileName> -m [i or e]

Example:

% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m i > inList.fasta
% subSetFasta.pl -f mySeqs.fasta -l myList.txt -m e > notInList.fasta

If you do not specify a -mode argument the script defaults to the 'include' mode.

A note about ID matching: the script matches on the first whitespace-delimited token of the defline (the text right after the '>'). If your defline is:

Code:

>sequenceID sequence description follows

The script will only attempt to match 'sequenceID', so make sure that is the text in your list file.

Thanks very much. The script works perfectly!

**JamieHeather** · 08-14-2013, 03:00 AM

In case anyone needs more alternatives, you can also use fastq_select.tcl, which is bundled with MIRA. This was also discussed in an earlier thread, which might be useful.

**maubp** · 08-14-2013, 06:10 AM

If you want a Galaxy solution, try this:

http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id

Or this related but subtly different tool, which pulls out the reads in the ID order given:

http://toolshed.g2.bx.psu.edu/view/peterjc/seq_select_by_id
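
The difference between the two tools is the output order: filtering keeps the records in FASTA file order, while selecting returns them in the order of the ID list. The Python sketch below illustrates that distinction only; it is not the code of the Galaxy tools. `read_fasta`, `filter_by_id` and `select_by_id` are hypothetical names, the whole file is indexed in memory, and silently skipping IDs that are missing from the FASTA input is an assumption.

```python
def read_fasta(lines):
    """Yield (seq_id, record_text) pairs from an iterable of FASTA lines."""
    seq_id, chunk = None, []
    for line in lines:
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunk)
            fields = line[1:].split()
            seq_id, chunk = (fields[0] if fields else ""), [line]
        elif seq_id is not None:
            chunk.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunk)


def filter_by_id(lines, ids):
    """Keep listed records, in the order they appear in the FASTA input."""
    wanted = set(ids)
    return [rec for seq_id, rec in read_fasta(lines) if seq_id in wanted]


def select_by_id(lines, ids):
    """Return listed records in the order of `ids`; missing IDs are skipped."""
    # If an ID occurs twice in the FASTA input, the last record wins.
    index = dict(read_fasta(lines))
    return [index[seq_id] for seq_id in ids if seq_id in index]


fasta = [">a\n", "AAAA\n", ">b\n", "CCCC\n", ">c\n", "GGGG\n"]
wanted = ["c", "a"]
file_order = filter_by_id(fasta, wanted)
list_order = select_by_id(fasta, wanted)
```

With the ID list `c, a`, `file_order` holds record `a` followed by `c`, whereas `list_order` holds `c` followed by `a`.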

**lran2008** · 08-14-2013, 09:21 AM

Originally posted by maubp

If you want a Galaxy solution, try this:

http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id

Or this related but subtly different tool, which pulls out the reads in the ID order given:
http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id

This should work. I haven't tried it, so I don't know whether it can output a FASTA file for the unmatched sequences.

**maubp** · 08-15-2013, 12:54 AM

Originally posted by lran2008

This should work. I haven't tried it, so I don't know whether it can output a FASTA file for the unmatched sequences.

Yes, my sequence filter tool can produce a FASTA file with matched IDs, a FASTA file with non-matching IDs, or both (two FASTA files):

http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id

There is a preview/mockup of the tool available to view within the Tool Shed which should help explain this.

how to split a fasta file according to a list of gene ID