**A1_UltiMA** · 10-09-2010, 06:38 AM

Thanks for reading so far.

**krobison** · 10-09-2010, 01:36 PM

First, expecting a reply within 12 hours on a holiday weekend is a bit of a stretch, no?

I'm not sure I understand exactly what you are having trouble doing. It seems you want to separate out the reads which align to certain genes of interest -- is that correct?

What platform were the reads generated on (a search of SRA with "Eskimo" yields no hits)?

**Thomas Doktor** · 10-10-2010, 04:26 PM

Looks like the genome could be this one: http://www.nature.com/nature/journal...100211-02.html

SRA number: SRA01010

If I understand you correctly, you want to examine the Eskimo genome at a certain position? You would have to realign the reads in the SRA archive against a modern human genome (possibly just the fraction of it that you are interested in), then look at the mapping result in a genome viewer such as IGV from the Broad Institute.

**maubp** · 10-11-2010, 12:42 AM

I'm unclear exactly what you are trying to do. However, you can install and run BLAST locally - this is a possible solution if you are hitting the NCBI usage limits:

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

**A1_UltiMA** · 10-12-2010, 04:49 AM

Hi,

Thanks for the replies.

Sorry if I didn't make it clear. We have downloaded the Eskimo genome and have already run a local BLAST using the Eskimo genome as the 'subject'. We got a great many results, but among them are 2000 or so hits we are interested in. These hits are identified by accession numbers as 'spots' (short reads of sequence produced by the sequencer, which collectively make up the genome).

The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?

**maubp** · 10-12-2010, 04:55 AM

Originally posted by A1_UltiMA:

> The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?

Assuming I have understood correctly, these original sequences you want are NOT the ones you have already downloaded and used to build the BLAST database. Right?

I would use the NCBI Entrez API, via your scripting language of choice. Remember to follow their usage guidelines.

<advert>Personally I would use Python with the Bio.Entrez module from Biopython</advert>
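For readers who want to see what scripted access to Entrez might look like, here is a minimal sketch using only the Python standard library (Biopython's Bio.Entrez wraps the same EFetch service). The accession list, the `nucleotide` database name, the tool name, the e-mail address, and the batch size are all placeholders, and it is an assumption that the identifiers in question resolve through EFetch at all. The pause between requests is there to stay under NCBI's published limit of three requests per second for clients without an API key.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids, size):
    """Yield successive lists of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_url(ids, db="nucleotide", email="you@example.org"):
    """Build one EFetch URL requesting FASTA text for a batch of ids."""
    params = {
        "db": db,                    # assumption: adjust to the right database
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
        "tool": "pull_hits_sketch",  # hypothetical tool name
        "email": email,              # placeholder contact address
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(ids, out, batch_size=200, pause=0.4):
    """Download each batch in turn and write the FASTA text to `out`."""
    for batch in batches(ids, batch_size):
        with urlopen(efetch_url(batch)) as handle:
            out.write(handle.read().decode())
        time.sleep(pause)  # keep well under three requests per second
```

Calling `fetch_fasta(my_ids, sys.stdout)` with a real list of identifiers would print the records; nothing is downloaded until that call is made.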

**A1_UltiMA** · 10-12-2010, 05:44 AM

Thanks Maubp,

Actually the accession numbers I am talking about ARE part of the databases created from the downloaded Genome.

**maubp** · 10-12-2010, 06:03 AM

Originally posted by A1_UltiMA:

> Thanks Maubp,
>
> Actually the accession numbers I am talking about ARE part of the databases created from the downloaded Genome.

OK - so if you have the downloaded Genome FASTA files you can get the sequence directly from the FASTA file.

If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequences: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.
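As a rough illustration of the blastdbcmd route, the sketch below builds and runs a batch extraction from Python. The database name `eskimo` and the file names are hypothetical. It uses blastdbcmd's `-db`, `-entry_batch`, and `-out` options, and it assumes the database was built with `makeblastdb -parse_seqids`, since lookup by identifier generally depends on that index; in that case the identifier is the first token after the ">" in the original FASTA header.

```python
import subprocess


def blastdbcmd_batch_args(db, id_file, out_fasta):
    """Argument list asking blastdbcmd for every id listed in id_file."""
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file, "-out", out_fasta]


def extract(db, id_file, out_fasta):
    """Run blastdbcmd; raises CalledProcessError if it exits non-zero."""
    subprocess.run(blastdbcmd_batch_args(db, id_file, out_fasta), check=True)
```

With a text file holding one identifier per line, `extract("eskimo", "ids.txt", "hits.fasta")` would write the matching records to `hits.fasta`. A single sequence can be requested with `-entry` in place of `-entry_batch`.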

**A1_UltiMA** · 10-13-2010, 06:52 AM

Originally posted by maubp:

> OK - so if you have the downloaded Genome FASTA files you can get the sequence directly from the FASTA file.
>
> If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequences: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.

Hi, thanks again maubp!

We have tried to use MEGA to read the sequences, but the files are too large for our hardware to handle.

With regard to blastdbcmd, how does this tool identify sequences? By accession number, or by the name that follows the ">" sign?

Also, I am sorry for perhaps asking too much, but could you type out an example of the command we would need to run to extract a hypothetical sequence?

Thank you again.

**krobison** · 10-13-2010, 08:40 AM

It's very easy to write code to pull out your target sequences. A basic Perl framework is below; save it as pullseqs.pl

Code:

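#!/usr/bin/perl
use strict;
use warnings;

if (scalar(@ARGV)<2)
{
   die "
usage: pullseqs.pl file.with.targets.txt sequences.fastq
          will work with FASTA or FASTQ
";
}
# Load target ids: the first run of non-whitespace characters on each line
my %targets=();
open(my $ids, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";
while (my $line = <$ids>)
{
  $targets{$1}=1 if ($line =~ /^(\S+)/);
}
close($ids);
# Stream the sequence file, printing only while inside a wanted record.
# Caveat: a FASTQ quality line can itself start with @ or >, and this
# simple test would mistake such a line for a header.
my $on=0;
open(my $seqs, '<', $ARGV[1]) or die "cannot open $ARGV[1]: $!";
while (my $line = <$seqs>)
{
  if ($line =~ /^[>\@](\S+)/)  # pull id composed of non-whitespace characters
  {
     $on=$targets{$1};
  }
  print $line if ($on);
}
close($seqs);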

The first input to this program is a file with your target ids, one per line. Anything after the first whitespace character will be ignored.

I've written the above to work on FASTA or FASTQ. I haven't actually run it -- debugging is left as an exercise for the student.
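For comparison, here is a sketch of the same streaming idea in Python, written so that files too large to load into memory are read one record at a time. It is not krobison's script. The function names are made up for illustration, and the FASTQ reader assumes strict four-line records (no wrapped sequence lines), which lets it ignore quality strings that happen to begin with "@".

```python
import sys


def load_targets(handle):
    """Collect the first whitespace-delimited token of each non-blank line."""
    targets = set()
    for line in handle:
        fields = line.split()
        if fields:
            targets.add(fields[0].lstrip(">@"))  # tolerate pasted header marks
    return targets


def pull_fasta(handle, targets, out):
    """Copy FASTA records whose id (first token after '>') is in targets."""
    keep = False
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in targets
        if keep:
            out.write(line)


def pull_fastq(handle, targets, out):
    """Copy four-line FASTQ records whose id is in targets."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        fields = record[0][1:].split()
        if fields and fields[0] in targets:
            out.writelines(record)


def main(argv):
    """Usage: script.py targets.txt sequences.fasta_or_fastq"""
    with open(argv[1]) as id_handle:
        targets = load_targets(id_handle)
    with open(argv[2]) as seq_handle:
        first = seq_handle.read(1)  # '@' suggests FASTQ, '>' suggests FASTA
        seq_handle.seek(0)
        pull = pull_fastq if first == "@" else pull_fasta
        pull(seq_handle, targets, sys.stdout)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Keeping the ids in a set makes each lookup constant time, so 2000 targets against millions of reads costs a single pass over the sequence file.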

Help pls! Picking out specific 'spots' from SRA FASTA files
