Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • entomology
    replied
    Yes, that's does work. And it's more flexible and robust with different situations. Thank you for the great work. I've got a lot of fantastic script in bbmap!

    Originally posted by Brian Bushnell View Post
    Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.

    Leave a comment:


  • Brian Bushnell
    replied
    Yep, it keeps the sequences in a.fasta that match sequences in b.fasta, and retains the names. You could alternatively run "filterbysequence.sh in=b.fasta ref=a.fasta out=c.fasta" to keep the names from b.fasta.

    Leave a comment:


  • entomology
    replied
    Thank you for your great work. I've tried the script with below two fasta files.

    a.fasta

    >1
    aaaaaa
    >2
    tttttt
    >3
    gggggg
    >4
    cccccc



    b.fasta

    >1123-11234
    aaaaaa
    >wer
    atgcca
    >ad
    ctaacg
    >232-23424
    tttttt
    >323-342
    cacaaa
    >416-2
    gggggg
    >13424241234-23423
    cccccc
    >5-234
    cggcgtcacgttggttgttga


    running the script
    filterbysequence.sh in=a.fasta ref=b.fas out=c.fasta ow=true

    I've got c.fasta exactly same with a.fasta, is it supposed to replace identifier of the sequences in a.fasta from b.fasta?




    Originally posted by Brian Bushnell View Post
    So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

    Code:
    filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta
    c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. And it can either do an exclusion or inclusion filter. Also, it can optionally reduce very large sequences down to their 128-bit hash-codes for low-memory operation (so, for example, you could easily filter sequences against nt or RefSeq microbial quickly in a small amount of memory to see if they are already present before adding yet another copy of E.coli, which is something NCBI absolutely needs to do). And it's very, very fast.

    Leave a comment:


  • Brian Bushnell
    replied
    So, I was inspired by this thread to add something into BBTools that could accomplish this. Thus, there's yet another method, "filterbysequence.sh". Usage:

    Code:
    filterbysequence.sh in=a.fasta ref=b.fasta out=c.fasta
    c.fasta will then contain all sequences shared by a.fasta and b.fasta. It supports case-matching or case-insensitive operation, and reverse-complement-aware or forward-only operation. And it can either do an exclusion or inclusion filter. Also, it can optionally reduce very large sequences down to their 128-bit hash-codes for low-memory operation (so, for example, you could easily filter sequences against nt or RefSeq microbial quickly in a small amount of memory to see if they are already present before adding yet another copy of E.coli, which is something NCBI absolutely needs to do). And it's very, very fast.
    Last edited by Brian Bushnell; 12-19-2015, 10:21 AM.

    Leave a comment:


  • entomology
    replied
    Another good method, thank you.

    Originally posted by gsgs View Post
    only those sequences with a "-" in the name ?

    filtentr original 1 - > k1

    [filters sequences (including name-lines) with - in the first entry]

    names k1 5 5 > result
    [only the raw sequence data, omitting the names]

    using selfwritten simple programs filtentr.c and names.c

    Leave a comment:


  • entomology
    replied
    Yes, this time it works though the output is not as expected .

    more out.fas
    >1123-11234
    --
    gggggg
    >13424241234-23423
    >1123-11234
    aaaaaa
    ctaacg
    >232-23424
    >232-23424
    tttttt
    tttttt
    >323-342
    >416-2
    gggggg
    cacaaa
    >416-2
    >13424241234-23423
    cccccc

    Originally posted by GenoMax View Post
    Are you running bash shell? If you are not then try explicitly going into bash like this

    Code:
    $ /bin/bash
    $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

    Leave a comment:


  • entomology
    replied
    Admire you with the enthusiasm on programming, which makes an excellent programmer. Different running environment makes tons of unexpected problems that's why a robust code is really not easy to produce .

    Originally posted by SES View Post
    This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

    The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...

    Leave a comment:


  • gsgs
    replied
    only those sequences with a "-" in the name ?

    filtentr original 1 - > k1

    [filters sequences (including name-lines) with - in the first entry]

    names k1 5 5 > result
    [only the raw sequence data, omitting the names]

    using selfwritten simple programs filtentr.c and names.c

    Leave a comment:


  • SES
    replied
    This is irrelevant now that the question is answered, but I can reproduce the "can't find getline..." error mentioned above that resulted from my first script.

    The simple solution is to put "use IO::File;" at the top of the script. Apparently the behavior of Perl's I/O has changed (I'm using 5.22). Also, there are often OS-specific differences with scripting languages because the "perl" (or whatever) that the vendor distributes is not the same as what you compile yourself. So, it's possible there could be other tweaks required based on the system/versions, but it would be kind of a waste to go down that path since the problem is solved. Okay, now it's time to move on back to work...

    Leave a comment:


  • GenoMax
    replied
    Originally posted by entomology View Post
    Forgive my poor programming skill, still I got some error message as below

    -bash: syntax error near unexpected token `do'
    Are you running bash shell? If you are not then try explicitly going into bash like this

    Code:
    $ /bin/bash
    $ while read i ; do grep -B 1 $i original.fas ; done < sequence_file > out.fas

    Leave a comment:


  • SES
    replied
    Originally posted by entomology View Post
    Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

    the command test gives the result as below:

    perl -v

    This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

    perl -MIO::Handle -e 'print IO::Handle->VERSION'
    1.28

    thanks again for your patience and useful help!!!
    Great, thanks for the info. I'm creating a CentOS 6 image now to test your errors above.

    Leave a comment:


  • entomology
    replied
    Thank you very much for your help. you are right, i get the order wrong. now it works really good with the updated version without warning.

    the command test gives the result as below:

    perl -v

    This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi

    perl -MIO::Handle -e 'print IO::Handle->VERSION'
    1.28

    thanks again for your patience and useful help!!!

    Originally posted by SES View Post
    You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

    Code:
    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use File::Basename;
    
    my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
    my $infilei = shift or die $usage;
    my $infilej = shift or die $usage;
    
    my $hash = index_seq($infilei);
    open my $inj, '<', $infilej or die $!;
    {
        local $/ = '>';
    
        while (my $line = <$inj>) {
    	chomp $line;
    	my ($seqid, @seqparts) = split /\n/, $line;
    	my $seq = join '', @seqparts;
    	next unless defined $seqid && defined $seq;
    	if (exists $hash->{$seq}) {
    	    print join "\n", ">".$hash->{$seq}, "$seq\n";
    	}
        }
    }
    close $inj;
    
    sub index_seq {
        my ($file) = @_;
        open my $in, '<', $file or die $!;
    
        my %hash;
        {
    	local $/ = '>';
    
    	while (my $line = <$in>) {
    	    chomp $line;
    	    my ($seqid, @seqparts) = split /\n/, $line;
    	    my $seq = join '', @seqparts;
    	    next unless defined $seqid && defined $seq;
    	    $hash{$seq} = $seqid;
    	}
        }
        close $in;
    
        return \%hash;
    }
    By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.

    Leave a comment:


  • SES
    replied
    Originally posted by entomology View Post
    I'm using centos

    2.6.32-573.7.1.el6.centos.plus.x86_64

    when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

    ./fetch_fasta.test seq.test seq.test.original.fas
    Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
    >1
    aaaaaa
    >2
    tttttt
    >3
    gggggg

    I expected the result like this:

    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc
    You must be giving the files in the wrong order because it works for me. It needs to be "perl script.pl original.fas other.fas." Though, the approach I posted is odd since there is something wrong with readline on your system and it will generate warnings. Here is a different approach that should work the same:

    Code:
    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use File::Basename;
    
    my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
    my $infilei = shift or die $usage;
    my $infilej = shift or die $usage;
    
    my $hash = index_seq($infilei);
    open my $inj, '<', $infilej or die $!;
    {
        local $/ = '>';
    
        while (my $line = <$inj>) {
    	chomp $line;
    	my ($seqid, @seqparts) = split /\n/, $line;
    	my $seq = join '', @seqparts;
    	next unless defined $seqid && defined $seq;
    	if (exists $hash->{$seq}) {
    	    print join "\n", ">".$hash->{$seq}, "$seq\n";
    	}
        }
    }
    close $inj;
    
    sub index_seq {
        my ($file) = @_;
        open my $in, '<', $file or die $!;
    
        my %hash;
        {
    	local $/ = '>';
    
    	while (my $line = <$in>) {
    	    chomp $line;
    	    my ($seqid, @seqparts) = split /\n/, $line;
    	    my $seq = join '', @seqparts;
    	    next unless defined $seqid && defined $seq;
    	    $hash{$seq} = $seqid;
    	}
        }
        close $in;
    
        return \%hash;
    }
    By the way, can you tell me the output of "perl -v" and "perl -MIO::Handle -e 'print IO::Handle->VERSION'". Thanks. I am also using CentOS, so I'm curious about that message.
    Last edited by SES; 12-18-2015, 02:14 PM.

    Leave a comment:


  • entomology
    replied
    I'm using centos

    2.6.32-573.7.1.el6.centos.plus.x86_64

    when i change as you recommended, this time I got result. but still with some error and the result seems not as expected.

    ./fetch_fasta.test seq.test seq.test.original.fas
    Value of <HANDLE> construct can be "0"; test with defined() at ./fetch_fasta.test line 31.
    >1
    aaaaaa
    >2
    tttttt
    >3
    gggggg

    I expected the result like this:

    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc



    Originally posted by SES View Post
    Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

    Code:
    return unless my $entry = $fh->getline;
    with

    Code:
    return unless my $entry = <$fh>;
    should do the trick.

    Leave a comment:


  • SES
    replied
    Originally posted by entomology View Post
    I've got below message when I running the script, do i miss some module?

    Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30
    Interesting, never seen that. What OS are you using? To fix the Perl issue, replace this line:

    Code:
    return unless my $entry = $fh->getline;
    with

    Code:
    return unless my $entry = <$fh>;
    should do the trick.

    Leave a comment:

Latest Articles

Collapse

  • SEQadmin2
    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
    by SEQadmin2


    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
    ...
    06-02-2026, 10:05 AM
  • SEQadmin2
    Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
    by SEQadmin2


    With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


    Introduction

    Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
    05-22-2026, 06:42 AM
  • SEQadmin2
    Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
    by SEQadmin2

    Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


    Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
    05-06-2026, 09:04 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by SEQadmin2, Today, 08:59 AM
0 responses
11 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-02-2026, 12:03 PM
0 responses
21 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-02-2026, 11:40 AM
0 responses
17 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 05-28-2026, 11:40 AM
0 responses
31 views
0 reactions
Last Post SEQadmin2  
Working...