
entomology · 12-18-2015, 01:42 PM

I got the message below when running the script. Am I missing a module?

Can't locate object method "getline" via package "IO::Handle" at ./fetch_fasta.test line 30

Originally posted by SES

That is what the post above (#19) produces. The question and expected results seem pretty simple; maybe you missed the previous post or are looking for another way?

entomology · 12-18-2015, 01:37 PM

Forgive my poor programming skills, but I still get the error message below:

-bash: syntax error near unexpected token `do'

Originally posted by GenoMax

@entomology: Try the following

Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

Code:

$  while read -r i ; do grep -B 1 -x -F -- "$i" original.fas ; done < sequence_file > out.fas

SES · 12-18-2015, 01:36 PM

Originally posted by entomology

I've uploaded the two files; the expected output looks like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Thanks!

That is what the post above (#19) produces. The question and expected results seem pretty simple; maybe you missed the previous post or are looking for another way?

entomology · 12-18-2015, 01:31 PM

Actually, I'm dealing with small RNA sequences with lengths of 18-30. Anyway, thank you for your kindness.

Originally posted by Brian Bushnell

Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

entomology · 12-18-2015, 01:25 PM

I've uploaded the two files; the expected output looks like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Thanks!

Originally posted by gsgs

I usually write a small basic program for such problems.

Post or send the file and I can send you the result?


SES · 12-18-2015, 12:05 PM

Here is a simple script that uses an iterator to fetch records by sequence. This would likely be faster and less error-prone than grep:

Code:

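#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;
use IO::Handle;   # needed on Perl older than 5.14 so that $fh->getline resolves

my $usage   = "perl ".basename($0)." seqsi.fas seqsj.fas > seqs_out.fas";
my $infilei = shift or die $usage;
my $infilej = shift or die $usage;

# Index the first file: sequence => ID.
my %hash;
open my $ini, '<', $infilei or die $!;
while (my ($id, $seq) = fasta_it($ini)) {
    $hash{$seq} = $id;
}
close $ini;

# Print every record of the second file whose sequence is in the index,
# labelled with the ID from the first file.
open my $inj, '<', $infilej or die $!;
while (my ($id, $seq) = fasta_it($inj)) {
    if (exists $hash{$seq}) {
        print ">$hash{$seq}\n$seq\n";
    }
}
close $inj;

# Return the next (id, sequence) pair from a FASTA filehandle,
# or an empty list at end of file.
sub fasta_it {
    my ($fh) = @_;

    local $/ = "\n>";
    return unless defined(my $entry = $fh->getline);
    chomp $entry;

    my ($id, $seq) = split /\n/, $entry, 2;
    $id  =~ s/^>// if defined $id;   # only the first record still starts with ">"
    $seq = '' unless defined $seq;
    $seq =~ s/\s+//g;                # join wrapped lines, drop the last record's trailing newline
    return ($id, $seq);
}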

Here is the gist for easier download: https://gist.github.com/sestaton/889cba88b5279a58d997

The output:

Code:

perl fetch_by_seq.pl i.fas j.fas 
>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

Depending on the size of the original file you may want to think about using an SQLite database but this should work fine for most uses.

GenoMax · 12-18-2015, 10:29 AM

@entomology: Try the following

Use the original file of sequences (i.e. not the fasta format but just sequence, one on each line).

Code:

$  while read -r i ; do grep -B 1 -x -F -- "$i" original.fas ; done < sequence_file > out.fas
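For larger inputs, the loop above could perhaps be replaced by a single awk pass. This is only a rough sketch, not something tested on real data: `fetch_by_seq` is a made-up helper name, and it assumes every sequence in original.fas sits on one line directly under its header and that the sequence file is not empty.

```shell
#!/usr/bin/env bash
# fetch_by_seq is a hypothetical helper name, not part of any tool in this thread.
# Usage: fetch_by_seq sequence_file original.fas > out.fas
fetch_by_seq() {
    awk '
        # First file: remember each non-empty query sequence.
        NR == FNR { if ($0 != "") want[$0] = 1; next }
        # Second file: keep the most recent FASTA header.
        /^>/ { header = $0; next }
        # Print header and sequence when the whole line equals a query.
        ($0 in want) { print header; print $0 }
    ' "$1" "$2"
}

# Small demo using records from the example in this thread.
demo_dir=$(mktemp -d)
printf '%s\n' aaaaaa tttttt > "$demo_dir/sequence_file"
printf '%s\n' '>1123-11234' aaaaaa '>wer' atgcca '>232-23424' tttttt > "$demo_dir/original.fas"
fetch_by_seq "$demo_dir/sequence_file" "$demo_dir/original.fas"
# prints:
# >1123-11234
# aaaaaa
# >232-23424
# tttttt
rm -r "$demo_dir"
```

Records come out in the order they appear in original.fas, and only whole-line matches count, so a short query such as aaaaaa does not pull in longer sequences that merely contain it.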

Brian Bushnell · 12-18-2015, 10:27 AM

Oh, I did not realize your sequences were so tiny; I assumed they were much longer. BBDuk is probably not appropriate for this situation, as it makes the implicit assumption that long kmers are relatively unique, which is not the case with 6-mers.

gsgs · 12-18-2015, 10:14 AM

I usually write a small basic program for such problems.

Post or send the file and I can send you the result?

entomology · 12-17-2015, 03:03 PM

Thank you for the code. It converted my sequences to a FASTA file. I tried bbduk.sh again, but the result was not what I expected. An example will make this easier to understand. There are two FASTA files:

original.fas
>1123-11234
aaaaaa
>wer
atgcca
>ad
ctaacg
>232-23424
tttttt
>323-342
cacaaa
>416-2
gggggg
>13424241234-23423
cccccc
>5-234
cggcgtcacgttggttgttga

ref.fas (made with your awk script)
>1
aaaaaa
>2
tttttt
>3
gggggg
>4
cccccc

I ran "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

out.fas is like this
>5-234
cggcgtcacgttggttgttga

What I actually want is a FASTA file like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

In other words, I want to fetch the IDs from original.fas.

Originally posted by GenoMax

If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

Code:

$ awk '{print ">" NR "\n" $0}' your_file > new_file_as_fasta

Then use the file with BBDuk.

GenoMax · 12-17-2015, 02:23 PM

If your sequences are one on each line then use the following command to convert them to a fasta format file (change file names as needed)

Code:

$ awk '{print ">" NR "\n" $0}' your_file > new_file_as_fasta

Then use the file with BBDuk.
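To show what the conversion produces, here is a small sketch of the same idea wrapped in a function. `seqs_to_fasta` is a made-up name, and unlike the one-liner above it skips blank lines and trims surrounding spaces or tabs, which is an assumption about what you would want for a hand-edited list.

```shell
#!/usr/bin/env bash
# seqs_to_fasta is a hypothetical helper name.
# Usage: seqs_to_fasta your_file > new_file_as_fasta
seqs_to_fasta() {
    # NF is 0 on blank lines, so they are skipped; n numbers the kept sequences.
    awk 'NF { n++; print ">" n; print $1 }' "$1"
}

# Demo: two sequences separated by a blank line.
demo_dir=$(mktemp -d)
printf '%s\n' aaaaaa '' tttttt > "$demo_dir/your_file"
seqs_to_fasta "$demo_dir/your_file"
# prints:
# >1
# aaaaaa
# >2
# tttttt
rm -r "$demo_dir"
```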

entomology · 12-17-2015, 02:00 PM

Yes, I've tried bbduk.sh.

bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

But in my case b.fa is not a FASTA file; it contains one sequence per line. I just want to pull the sequences listed in b.fa out of a.fa and write them to a new FASTA file (c.fa).

Since b.fa is not a FASTA file, bbduk.sh gives an error:

Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.

Originally posted by GenoMax

Brian's solution should work. Did you try it?

While I like grep and its variants, it may not always work for something as intricate as deciphering nucleotide patterns, especially if your sequences wrap around onto multiple lines.

GenoMax · 12-17-2015, 01:08 PM

Brian's solution should work. Did you try it?

While I like grep and its variants, it may not always work for something as intricate as deciphering nucleotide patterns, especially if your sequences wrap around onto multiple lines.
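If wrapped sequences are the worry, one possible workaround is to unwrap the FASTA file first so that each sequence sits on a single line, and only then use line-based tools. A rough sketch, with `unwrap_fasta` as a made-up helper name and the assumption that header lines start with ">":

```shell
#!/usr/bin/env bash
# unwrap_fasta is a hypothetical helper name.
# Usage: unwrap_fasta wrapped.fas > unwrapped.fas
unwrap_fasta() {
    awk '
        # On a header: flush the sequence collected so far, then print the header.
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
        # On any other line: append it to the current sequence.
        { seq = seq $0 }
        # At end of file: flush the last sequence.
        END { if (seq != "") print seq }
    ' "$1"
}

# Demo: the first record is wrapped over two lines.
demo_dir=$(mktemp -d)
printf '%s\n' '>x' acgt ttga '>y' cc > "$demo_dir/wrapped.fas"
unwrap_fasta "$demo_dir/wrapped.fas"
# prints:
# >x
# acgtttga
# >y
# cc
rm -r "$demo_dir"
```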

entomology · 12-17-2015, 12:39 PM

No worries, I'll try to use grep to deal with the problem.

Originally posted by maubp

Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

maubp · 12-17-2015, 12:28 PM

Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.
