Extract multiple fasta sequences from a fasta file based on sequenes

entomology replied

12-19-2015, 07:28 PM
Yes, that's does work. And it's more flexible and robust with different situations. Thank you for the great work. I've got a lot of fantastic script in bbmap!

Originally posted by Brian Bushnell View Post

Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.
Leave a comment:
Brian Bushnell replied

12-19-2015, 05:50 PM
Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.
Leave a comment:
entomology replied

12-19-2015, 04:39 PM
Thank you for your great work. I've tried the script with below two fasta files.

a.fasta

>1
aaaaaa
>2
tttttt
>3
gggggg
>4
cccccc

b.fasta

>1123-11234
aaaaaa
>wer
atgcca
>ad
ctaacg
>232-23424
tttttt
>323-342
cacaaa
>416-2
gggggg
>13424241234-23423
cccccc
>5-234
cggcgtcacgttggttgttga

running the script
filterbysequence.sh in=a.fasta ref=b.fas out=c.fasta ow=true

I've got c.fasta exactly same with a.fasta, is it supposed to replace identifier of the sequences in a.fasta from b.fasta?

Originally posted by Brian Bushnell View Post

So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

Code:

filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta

c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. And it can either do an exclusion or inclusion filter. Also, it can optionally reduce very large sequences down to their 128-bit hash-codes for low-memory operation (so, for example, you could easily filter sequences against nt or RefSeq microbial quickly in a small amount of memory to see if they are already present before adding yet another copy of E.coli, which is something NCBI absolutely needs to do). And it's very, very fast.
Leave a comment:
Brian Bushnell replied

12-19-2015, 10:17 AM
So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

Code:

filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta

c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. And it can either do an exclusion or inclusion filter. Also, it can optionally reduce very large sequences down to their 128-bit hash-codes for low-memory operation (so, for example, you could easily filter sequences against nt or RefSeq microbial quickly in a small amount of memory to see if they are already present before adding yet another copy of E.coli, which is something NCBI absolutely needs to do). And it's very, very fast.
Last edited by Brian Bushnell; 12-19-2015, 10:21 AM.
Leave a comment:
entomology replied

12-18-2015, 11:18 PM
Another good method, thank you.

Originally posted by gsgs View Post

only those sequences with a "-" in the name ?

filtentr original 1 - > k1

[filters sequences (including name-lines) with - in the first entry]

names k1 5 5 > result
[only the raw sequence data, omitting the names]

using selfwritten simple programs filtentr.c and names.c
Leave a comment:
entomology replied

12-18-2015, 11:13 PM
Yes, this time it works though the output is not as expected .

more out.fas
>1123-11234
--
gggggg
>13424241234-23423
>1123-11234
aaaaaa
ctaacg
>232-23424
>232-23424
tttttt
tttttt
>323-342
>416-2
gggggg
cacaaa
>416-2
>13424241234-23423
cccccc

Originally posted by GenoMax View Post

Are you running bash shell? If you are not then try explicitly going into bash like this

Code:

$ /bin/bash $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas
Leave a comment:
entomology replied

12-18-2015, 11:06 PM
Admire you with the enthusiasm on programming, which makes an excellent programmer. Different running environment makes tons of unexpected problems that's why a robust code is really not easy to produce .

Originally posted by SES View Post

This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...
Leave a comment:
gsgs replied

12-18-2015, 09:28 PM
only those sequences with a "-" in the name ?

filtentr original 1 - > k1

[filters sequences (including name-lines) with - in the first entry]

names k1 5 5 > result
[only the raw sequence data, omitting the names]

using selfwritten simple programs filtentr.c and names.c
Leave a comment:
SES replied

12-18-2015, 03:46 PM
This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...
Leave a comment:
GenoMax replied

12-18-2015, 03:22 PM
Originally posted by entomology View Post

Forgive my poor programming skill, still I got some error message as below

-bash: syntax error near unexpected token `do'

Are you running bash shell? If you are not then try explicitly going into bash like this

Code:

$ /bin/bash $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas
Leave a comment:
SES replied

12-18-2015, 02:40 PM
Originally posted by entomology View Post

Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

the command test gives the result as below:

perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

perl -MIO::Handle -e 'print IO::Handle->VERSION'
1.28

thanks again for your patience and useful help!!!

Great, thanks for the info. I'm creating a CentOS 6 image now to test your errors above.
Leave a comment:
entomology replied

12-18-2015, 02:23 PM
Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

the command test gives the result as below:

perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

perl -MIO::Handle -e 'print IO::Handle->VERSION'
1.28

thanks again for your patience and useful help!!!

Originally posted by SES View Post

You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

Code:

#!/usr/bin/env perl use strict; use warnings; use File::Basename; my $usage = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas"; my $infilei = shift or die $usage; my $infilej = shift or die $usage; my $hash = index_seq($infilei); open my $inj, '<', $infilej or die $!; { local $/ = '>'; while (my $line = <$inj>) { chomp $line; my ($seqid, @seqparts) = split /\n/, $line; my $seq = join '', @seqparts; next unless defined $seqid && defined $seq; if (exists $hash->{$seq}) { print join "\n", ">".$hash->{$seq}, "$seq\n"; } } } close $inj; sub index_seq { my ($file) = @_; open my $in, '<', $file or die $!; my %hash; { local $/ = '>'; while (my $line = <$in>) { chomp $line; my ($seqid, @seqparts) = split /\n/, $line; my $seq = join '', @seqparts; next unless defined $seqid && defined $seq; $hash{$seq} = $seqid; } } close $in; return \%hash; }

By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.
Leave a comment:
SES replied

12-18-2015, 02:12 PM
Originally posted by entomology View Post

I'm using centos

2.6.32-573.7.1.el6.centos.plus.x86_64

when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

./fetch_fasta.test seq.test seq.test.original.fas
Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
>1
aaaaaa
>2
tttttt
>3
gggggg

I expected the result like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

Code:

#!/usr/bin/env perl use strict; use warnings; use File::Basename; my $usage = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas"; my $infilei = shift or die $usage; my $infilej = shift or die $usage; my $hash = index_seq($infilei); open my $inj, '<', $infilej or die $!; { local $/ = '>'; while (my $line = <$inj>) { chomp $line; my ($seqid, @seqparts) = split /\n/, $line; my $seq = join '', @seqparts; next unless defined $seqid && defined $seq; if (exists $hash->{$seq}) { print join "\n", ">".$hash->{$seq}, "$seq\n"; } } } close $inj; sub index_seq { my ($file) = @_; open my $in, '<', $file or die $!; my %hash; { local $/ = '>'; while (my $line = <$in>) { chomp $line; my ($seqid, @seqparts) = split /\n/, $line; my $seq = join '', @seqparts; next unless defined $seqid && defined $seq; $hash{$seq} = $seqid; } } close $in; return \%hash; }

By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.
Last edited by SES; 12-18-2015, 02:14 PM.
Leave a comment:
entomology replied

12-18-2015, 01:58 PM
I'm using centos

2.6.32-573.7.1.el6.centos.plus.x86_64

when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

./fetch_fasta.test seq.test seq.test.original.fas
Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
>1
aaaaaa
>2
tttttt
>3
gggggg

I expected the result like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Originally posted by SES View Post

Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

Code:

return unless my $entry = $fh->getline;

with

Code:

return unless my $entry = <$fh>;

should do the trick.
Leave a comment:
SES replied

12-18-2015, 01:49 PM
Originally posted by entomology View Post

I've got below message when I running the script, do i miss some module?

Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

Code:

return unless my $entry = $fh->getline;

with

Code:

return unless my $entry = <$fh>;

should do the trick.
Leave a comment:

Previous 1 2 3 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News