
**ShellfishGene** · 06-04-2010, 10:35 PM

Personally I would find your program even more useful if there was an option to pipe sequence IDs and get the output on stdout!
I've been using BioPerl in a simple script to do that, but a C++ program with more features would be nice, too!

Code:

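#!/usr/bin/perl
# Read sequence IDs from stdin, optionally with a 1-based range
# (e.g. seq1:5..15), and print the matching FASTA records to stdout.
use strict;
use warnings;
use Bio::DB::Fasta;

my $file = shift;

unless ( $file && -e $file ) {
  print STDERR "Usage: echo 'seq1:5..15' | get_seq.pl sequences.fasta\n",
               "       echo 'seq1' | get_seq.pl sequences.fasta\n";
  exit 1;
}

my $db = Bio::DB::Fasta->new( $file );

# @ARGV is empty after the shift above, so <> reads from stdin.
while ( my $query = <> ) {
  chomp $query;
  next unless length $query;    # skip blank lines

  my $sequence;
  if ( $query =~ /:/ ) {
    # Capture into named variables so a failed match cannot reuse stale $1..$3.
    my ( $id, $start, $end ) = $query =~ /^(.+):(\d+)\.\.(\d+)$/
      or die "problem parsing request string '$query'.\n";

    $sequence = $db->seq( $id, $start => $end );
  }
  else {
    $sequence = $db->seq($query);
  }

  unless ( defined $sequence && length $sequence ) { die "Sequence $query not found.\n" }
  print ">$query\n", "$sequence\n";
}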

**ekg** · 06-05-2010, 06:12 AM

Originally posted by ShellfishGene

Personally I would find your program even more useful if there was an option to pipe sequence IDs and get the output on stdout!
I've been using BioPerl in a simple script to do that, but a C++ program with more features would be nice, too!

That's a great idea. Presently you can get the same effect by using xargs, but it would be quite a bit more cumbersome. I've thought of using a BED file as input to similar effect.

I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0-based or 1-based coordinates?

**ShellfishGene** · 06-05-2010, 06:26 AM

Originally posted by ekg

I like the sequence and subsequence specification syntax that the utility you posted uses. I will probably adapt that into FastaHack. Does it use 0-based or 1-based coordinates?

The syntax is actually quite widespread, I think; Ensembl uses it, for example. It is 1-based. One variation, used on the UCSC browser pages, is to use '-' instead of '..'. You might want to support both.
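
A minimal C++ sketch of parsing both forms of this syntax, assuming 1-based inclusive coordinates as described above. The `Region` struct and the `parseRegion` and `extract` functions are hypothetical names used for illustration; they are not part of FastaHack or BioPerl.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for a parsed request such as "seq1:5..15" or "seq1:5-15".
struct Region {
    std::string name;
    bool hasRange = false;
    long long start = 0;  // 1-based, inclusive
    long long end = 0;    // 1-based, inclusive
};

// True if s is a non-empty run of at most 18 decimal digits (fits in long long).
static bool allDigits(const std::string& s) {
    if (s.empty() || s.size() > 18) return false;
    for (char c : s)
        if (c < '0' || c > '9') return false;
    return true;
}

// Parse "name", "name:start..end" or "name:start-end" into out.
// Returns false (leaving out untouched) when the request is malformed.
bool parseRegion(const std::string& query, Region& out) {
    Region r;
    // Split on the last ':' so a name containing ':' can still carry a range.
    std::size_t colon = query.rfind(':');
    if (colon == std::string::npos) {
        if (query.empty()) return false;
        r.name = query;
        out = r;
        return true;
    }
    r.name = query.substr(0, colon);
    std::string range = query.substr(colon + 1);

    // Accept either ".." or "-" as the separator between start and end.
    std::size_t sep = range.find("..");
    std::size_t sepLen = 2;
    if (sep == std::string::npos) {
        sep = range.find('-');
        sepLen = 1;
    }
    if (r.name.empty() || sep == std::string::npos) return false;

    std::string a = range.substr(0, sep);
    std::string b = range.substr(sep + sepLen);
    if (!allDigits(a) || !allDigits(b)) return false;

    r.start = std::stoll(a);
    r.end = std::stoll(b);
    if (r.start < 1 || r.end < r.start) return false;
    r.hasRange = true;
    out = r;
    return true;
}

// Extract the requested bases from an in-memory sequence.
// std::string::substr is 0-based, so the 1-based start is shifted by one.
std::string extract(const std::string& seq, const Region& r) {
    if (!r.hasRange) return seq;
    if (static_cast<std::size_t>(r.start) > seq.size()) return "";
    return seq.substr(static_cast<std::size_t>(r.start - 1),
                      static_cast<std::size_t>(r.end - r.start + 1));
}
```

With this sketch, "seq1:5..15" and "seq1:5-15" parse to the same region, and a bare "seq1" requests the whole sequence.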

**mslider** · 09-02-2010, 12:06 PM

another option

It may be useful to add a parameter that only counts the number of sequences.
When you have millions of sequences, grep -c "^>" is very slow!

**SES** · 09-03-2010, 05:46 AM

Originally posted by mslider

It may be useful to add a parameter that only counts the number of sequences.
When you have millions of sequences, grep -c "^>" is very slow!

I disagree with part of this statement. There are myriad ways to index a FASTA file, and these usually take a few seconds to a few minutes for millions of sequences. Then counting can take seconds. I just used grep to count 2.2 million 454 sequences; it took 13 seconds and did not create any huge index files. I would argue that grep is probably faster than creating an index and then counting (in terms of overall time spent), but others may not agree. I agree that returning sequence stats from an index seems natural if you already have the index, and it looks like this is on the author's to-do list.

**Lee Sam** · 09-03-2010, 08:57 AM

Thanks for contributing this! It's literally exactly what I need.

**avilella** · 09-05-2010, 08:26 AM

There are two utilities that I am missing in current methods: I am using cdbfasta/cdbyank to index fastq files, but I would like to be able to compress the fastq file so that it takes up less space, even if it means a slower retrieval time. I would also like to be able to send a large number of ids, and retrieve the complement from them: the list of ids in the fastq file but not in the id list.

**mslider** · 09-06-2010, 12:36 AM

extract just subsequence

If you just want to extract a subsequence from a big sequence like a chromosome,
the program below is faster and does not create an index file:

Code:

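#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

/* Steps:
1- Download the FASTA file and remove the header line (>asdasfdasfassa).
2- Remove newlines from the FASTA file (using sed or perl).
3- Then run the C++ program like this on Linux:

./ExtractSequence inputfilename start stop

start is a 0-based offset and stop is exclusive, so stop - start bases are printed.
*/

// Convert a command-line argument to an integer.
int GetIntVal(const string& strConvert) {
    return atoi(strConvert.c_str());
}

int main(int argc, char* argv[]) {
    if (argc != 4) {
        cerr << "Usage: " << argv[0] << " inputfilename start stop" << endl;
        return 1;
    }

    ifstream myFile(argv[1]);
    if (!myFile) {
        cerr << "Error opening file" << endl;
        return 1;
    }

    int range1 = GetIntVal(argv[2]);
    int range2 = GetIntVal(argv[3]) - range1;
    if (range1 < 0 || range2 <= 0) {
        cerr << "Invalid range" << endl;
        return 1;
    }

    // After step 2 the whole sequence is on one line. Looping on getline
    // instead of eof() avoids printing an empty trailing record.
    string line1;
    while (getline(myFile, line1)) {
        if (static_cast<size_t>(range1) >= line1.size()) continue;  // substr would throw
        cout << ">Sample Sequence" << endl;
        cout << line1.substr(range1, range2) << endl;
    }
    return 0;
}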

**Thomas Doktor** · 09-06-2010, 05:14 AM

BEDTools' fastaFromBed utility allows you to extract (sub)sequences from a FASTA file using a BED/GFF/VCF file with intervals as input. It also supports strand specific sequence queries so you can extract strand specific features, such as exons.
BEDTools: http://code.google.com/p/bedtools/

**shaldenby** · 01-28-2013, 05:10 AM

This is a really useful little tool. Thanks very much!

**mattanswers** · 01-29-2013, 04:02 PM

Thank you, ekg, for your FastaHack tool. The tool seems to extract the sequence by its position in the FASTA file. I was wondering if it can extract a provided subsequence from the FASTA file, and if so, what happens if the provided subsequence occurs multiple times in the FASTA file?

**ekg** · 01-29-2013, 04:07 PM

@mattanswers This sounds like a job for your favorite aligner. For short sequences, you can use Smith-Waterman (https://github.com/ekg/smithwaterman), but for bigger stuff I'd use something like BLAT, or encode your sequences in FASTA and align them.

As for multiple mappings, you'll have to find a mapper that generates them. MOSAIK does, and I believe mrsFAST does too.
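
When the provided subsequence must match exactly, a plain string search is enough to list every occurrence. The C++ sketch below is a simpler substitute for alignment, not an aligner: it ignores mismatches, gaps, and the reverse strand, and `findAll` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the 1-based start position of every exact occurrence of `pattern`
// in `seq`, including overlapping ones.
std::vector<std::size_t> findAll(const std::string& seq, const std::string& pattern) {
    std::vector<std::size_t> hits;
    if (pattern.empty()) return hits;
    std::size_t pos = seq.find(pattern);
    while (pos != std::string::npos) {
        hits.push_back(pos + 1);           // convert 0-based offset to 1-based
        pos = seq.find(pattern, pos + 1);  // advance by one to keep overlaps
    }
    return hits;
}
```

Each reported position can then be turned into a start and end coordinate for a position-based extraction.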

**mattanswers** · 01-30-2013, 11:25 AM

Thank you for your help, and also for writing and sharing FastaHack.


FastaHack - FASTA file manipulation and subsequence extraction utilities
