  • entomology
    replied
    I got the message below when running the script. Am I missing a module?

    Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

    Originally posted by SES:
    That is what the post above (#19) produces. The question and expected results seem pretty simple; maybe you missed the previous post or are looking for another way?



  • entomology
    replied
    Forgive my poor programming skills, but I still get the error message below:

    -bash: syntax error near unexpected token `do'



    Originally posted by GenoMax:
    @entomology: Try the following:

    Use the original file of sequences (i.e., not the FASTA-format file, just the sequences, one per line).

    Code:
    $ while read -r i; do grep -B 1 "$i" original.fas; done < sequence_file > out.fas



  • SES
    replied
    Originally posted by entomology:
    I've uploaded the two files; the expected output looks like this:

    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc

    Thanks!
    That is what the post above (#19) produces. The question and expected results seem pretty simple; maybe you missed the previous post or are looking for another way?



  • entomology
    replied
    Actually, I am dealing with small RNA sequences with lengths of 18-30. Anyway, thank you for your kindness.


    Originally posted by Brian Bushnell:
    Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.



  • entomology
    replied
    I've uploaded the two files; the expected output looks like this:

    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc

    Thanks!

    Originally posted by gsgs:
    I usually write a small basic program for such problems.

    Post or send the file, and I'll send back the result?


  • SES
    replied
    Here is a simple script that uses an iterator to fetch records by sequence. This would likely be faster and less error-prone than grep:

    Code:
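    #!/usr/bin/env perl

    use strict;
    use warnings;
    use File::Basename;
    use IO::Handle;    # needed for $fh->getline on Perl older than 5.14

    my $usage   = "Usage: perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas\n";
    my $infilei = shift or die $usage;
    my $infilej = shift or die $usage;

    # Index the first file: sequence => ID.
    my %hash;
    open my $ini, '<', $infilei or die "Cannot open $infilei: $!";
    while (my ($id, $seq) = fasta_it($ini)) {
        $hash{$seq} = $id;
    }
    close $ini;

    # Print every sequence from the second file that also occurs in the first,
    # labelled with its ID from the first file.
    open my $inj, '<', $infilej or die "Cannot open $infilej: $!";
    while (my ($id, $seq) = fasta_it($inj)) {
        if (exists $hash{$seq}) {
            print ">$hash{$seq}\n$seq\n";
        }
    }
    close $inj;

    # Return the next (ID, sequence) pair from a FASTA filehandle,
    # or an empty list at end of file.
    sub fasta_it {
        my ($fh) = @_;

        local $/ = "\n>";    # read one whole record at a time
        return unless defined(my $entry = $fh->getline);
        chomp $entry;

        my ($id, $seq) = split /\n/, $entry, 2;
        $id  =~ s/^>//;                    # only the first record still starts with '>'
        $seq = '' unless defined $seq;     # header with no sequence
        $seq =~ s/\s+//g;                  # drop line breaks, including the last record's trailing newline
        return ($id, $seq);
    }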
    Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

    The output:

    Code:
    perl fetch_by_seq.pl i.fas j.fas 
    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc
    Depending on the size of the original file, you may want to consider an SQLite database, but this should work fine for most uses.
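
For comparison, the same lookup-table idea can be sketched in awk. This is only a sketch under assumptions not stated in the thread: every sequence sits on a single line, the second file is not empty, and the function name fetch_by_seq and the demo file names are placeholders.

```shell
# Sketch (hypothetical helper): print the records of the first FASTA file whose
# sequence also occurs in the second FASTA file. Assumes one-line sequences.
fetch_by_seq() {
    # $1 = FASTA to search (e.g. original.fas), $2 = FASTA holding the wanted sequences
    awk '
        NR == FNR  { if ($0 !~ /^>/) want[$0] = 1; next }  # first file read: remember wanted sequences
        /^>/       { header = $0; next }                   # second file read: remember the current header
        $0 in want { print header; print $0 }              # hit: emit header and sequence
    ' "$2" "$1"
}

# Small demo with a few records from the thread's example data.
tmp=$(mktemp -d)
printf '%s\n' '>1123-11234' aaaaaa '>wer' atgcca '>232-23424' tttttt > "$tmp/original.fas"
printf '%s\n' '>1' aaaaaa '>2' tttttt > "$tmp/ref.fas"
fetch_by_seq "$tmp/original.fas" "$tmp/ref.fas"
```

Unlike the Perl script, this variant prints records in the order they appear in the searched file, and it reports every record that carries a wanted sequence rather than one ID per sequence.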



  • GenoMax
    replied
    @entomology: Try the following:

    Use the original file of sequences (i.e., not the FASTA-format file, just the sequences, one per line).

    Code:
    $ while read -r i; do grep -B 1 "$i" original.fas; done < sequence_file > out.fas
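
A possible single-pass variant, offered as a sketch rather than a drop-in replacement: grep can read all patterns at once with -f, treat them as fixed strings with -F, and match whole lines only with -x, so a short sequence does not also hit longer sequences that contain it. It assumes every sequence in the FASTA file is on one line and that grep supports -B (GNU and BSD grep do); the function name grep_by_seq is a placeholder.

```shell
# Sketch (hypothetical helper): pull header+sequence pairs out of a FASTA file
# for every sequence listed, one per line, in a plain text file.
grep_by_seq() {
    # $1 = list of sequences, one per line; $2 = FASTA with one-line sequences
    # -F fixed strings, -x whole-line match, -f read patterns from a file,
    # -B 1 also print the header line above each hit.
    grep -B 1 -x -F -f "$1" "$2" |
        grep -v -x -e '--'    # drop the "--" separators grep prints between groups
}

# Small demo: the 7-base run of a's must not match the 6-base pattern.
tmp=$(mktemp -d)
printf '%s\n' '>1123-11234' aaaaaa '>wer' atgcca '>232-23424' tttttt '>x' aaaaaaa > "$tmp/original.fas"
printf '%s\n' aaaaaa tttttt > "$tmp/sequence_file"
grep_by_seq "$tmp/sequence_file" "$tmp/original.fas"
```

Because -x compares whole lines, trailing spaces or Windows line endings in either file would prevent matches and would need to be cleaned up first.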



  • Brian Bushnell
    replied
    Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.
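
To put rough numbers on this (plain arithmetic, not BBDuk output): over the four-letter DNA alphabet there are 4^k distinct k-mers, so 6-mers have very little room to be unique. The helper name kmer_space is made up for illustration.

```shell
# Number of distinct k-mers over the alphabet {A,C,G,T} is 4^k.
kmer_space() {
    n=1
    i=0
    while [ "$i" -lt "$1" ]; do
        n=$((n * 4))
        i=$((i + 1))
    done
    echo "$n"
}

kmer_space 6     # 4096
kmer_space 21    # 4398046511104
```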



  • gsgs
    replied
    I usually write a small basic program for such problems.

    Post or send the file, and I'll send back the result?



  • entomology
    replied
    Thank you for the code. It converts my sequences to a FASTA file. I tried bbduk.sh again, but the result is not as expected. An example will be easier to understand. There are two FASTA files:

    original.fas
    >1123-11234
    aaaaaa
    >wer
    atgcca
    >ad
    ctaacg
    >232-23424
    tttttt
    >323-342
    cacaaa
    >416-2
    gggggg
    >13424241234-23423
    cccccc
    >5-234
    cggcgtcacgttggttgttga


    ref.fas (after converting to FASTA with your awk script)
    >1
    aaaaaa
    >2
    tttttt
    >3
    gggggg
    >4
    cccccc

    I ran "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

    out.fas looks like this:
    >5-234
    cggcgtcacgttggttgttga

    Actually, I want a FASTA file like this:

    >1123-11234
    aaaaaa
    >232-23424
    tttttt
    >416-2
    gggggg
    >13424241234-23423
    cccccc

    That is, just fetch the matching IDs from original.fas.


    Originally posted by GenoMax:
    If your sequences are one per line, use the following command to convert them to a FASTA-format file (change file names as needed):

    Code:
    $ awk '{print ">" NR "\n" $0}' your_file > new_file_as_fasta
    Then use the file with BBDuk.



  • GenoMax
    replied
    If your sequences are one per line, use the following command to convert them to a FASTA-format file (change file names as needed):

    Code:
    $ awk '{print ">" NR "\n" $0}' your_file > new_file_as_fasta
    Then use the file with BBDuk.



  • entomology
    replied
    Yes, I've tried bbduk.sh.

    bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

    But in my case b.fa is not a FASTA file; it contains one sequence per line. I just want to pull the sequences listed in b.fa out of a.fa and write them to a new FASTA file (c.fa).

    Since b.fa is not a FASTA file, bbduk.sh gives an error:

    Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.


    Originally posted by GenoMax:
    Brian's solution should work. Did you try it?

    While I like grep and its variants, they may not always work for something as intricate as deciphering nucleotide patterns, especially if your sequences wrap across multiple lines.



  • GenoMax
    replied
    Brian's solution should work. Did you try it?

    While I like grep and its variants, they may not always work for something as intricate as deciphering nucleotide patterns, especially if your sequences wrap across multiple lines.
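
If wrapped sequences are the concern, one option is to join each record's sequence onto a single line first, so that line-based tools such as grep see the whole sequence. A minimal awk sketch; the function name unwrap_fasta and the demo file are made up for illustration.

```shell
# Sketch (hypothetical helper): rewrite a FASTA file so that every sequence
# occupies exactly one line. Header lines are passed through unchanged.
unwrap_fasta() {
    # $1 = FASTA file, possibly with sequences wrapped over several lines
    awk '
        /^>/ { if (seq != "") print seq; seq = ""; print; next }  # new record: flush the previous sequence
             { seq = seq $0 }                                      # sequence line: append to the current record
        END  { if (seq != "") print seq }                          # flush the last sequence
    ' "$1"
}

# Small demo: record "a" is wrapped over two lines.
tmp=$(mktemp -d)
printf '%s\n' '>a' acgt acgt '>b' tt > "$tmp/wrapped.fas"
unwrap_fasta "$tmp/wrapped.fas"
```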



  • entomology
    replied
    No worries, I'll try to use grep to deal with the problem.
    Originally posted by maubp:
    Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.



  • maubp
    replied
    Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.
