  • A1_UltiMA
    Member
    • Aug 2010
    • 11

    Help pls! Picking out specific 'spots' from SRA FASTA files

    Hello and thanks for reading,

    We have the entire genome of an ancient Eskimo, downloaded in FASTQ format from NCBI's Short Read Archive (SRA) and since converted to FASTA. We have run stand-alone BLAST against it and have the results as .csv files. Applying particular criteria to these results, we pick out the 'spots' (short reads) that contain the sequence data we are interested in. The sequences are identified by accession numbers and a (presumably arbitrary) 'definition'.

    We would like to know how we can separate the spots we are interested in from the entire collection of downloaded spots that compose the genome. We thought that the accession numbers we wanted could be collected and BLASTed on NCBI, but NCBI would only allow a maximum of 15 accession numbers at a time, yet we have a couple of thousand that need to be BLASTed, so that method will not work. Please suggest to us any methods or tools that can allow us to achieve this, thanks!!!!
  • A1_UltiMA
    Member
    • Aug 2010
    • 11

    #2
    Thanks for reading so far.

    Comment

    • krobison
      Senior Member
      • Nov 2007
      • 734

      #3
      First, expecting a reply within 12 hours on a holiday weekend is a bit of a stretch, no?

      I'm not sure I understand exactly what you are having trouble doing. It seems you want to separate out the reads which align to certain genes of interest -- is that correct?

      What platform were the reads generated on (a search of SRA with "Eskimo" yields no hits)?

      Comment

      • Thomas Doktor
        Senior Member
        • Apr 2009
        • 105

        #4
        Looks like the genome could be this one: http://www.nature.com/nature/journal...100211-02.html

        SRA number: SRA01010

        If I understand you correctly, you want to examine the Eskimo genome at a certain position? You would have to realign the reads from the SRA against a modern human genome (possibly just a fraction of it, the bit that you are interested in), then look at the mapping result in a genome viewer such as IGV from the Broad.

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          I'm unclear exactly what you are trying to do. However, you can install and run BLAST locally - this is a possible solution if you are hitting the NCBI usage limits:

          Comment

          • A1_UltiMA
            Member
            • Aug 2010
            • 11

            #6
            Hi,

            Thanks for the replies.

            Sorry if I didn't make it clear. We have downloaded the Eskimo genome and have already run a local BLAST using the Eskimo as the 'subject'. We got a great many results, and among them are 2000 or so hits we are interested in. These hits are identified by accession numbers as 'spots' (short reads of sequence produced by the sequencer, which collectively make up the genome).

            The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Originally posted by A1_UltiMA
              The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?
              Assuming I have understood correctly, these original sequences you want are NOT the ones you have already downloaded and used to build the BLAST database. Right?

              I would use the NCBI Entrez API, via your scripting language of choice. Remember to follow their usage guidelines.

              <advert>Personally I would use Python with the Bio.Entrez module from Biopython</advert>

              Comment

              • A1_UltiMA
                Member
                • Aug 2010
                • 11

                #8
                Thanks Maubp,

                Actually the accession numbers I am talking about ARE part of the databases created from the downloaded Genome.

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Originally posted by A1_UltiMA
                  Thanks Maubp,

                  Actually the accession numbers I am talking about ARE part of the databases created from the downloaded Genome.
                  OK - so if you have the downloaded Genome FASTA files you can get the sequence directly from the FASTA file.
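Pulling the sequences "directly from the FASTA file" can be done by streaming through the file once and keeping only the records whose ID is in the target list, so the whole genome never has to fit in memory. The following is a minimal Python sketch, not a tested pipeline: the function names and the file names in the usage note are hypothetical, and it assumes plain FASTA where each record starts with `>` followed by the read ID.

```python
def load_ids(lines):
    """Return the set of target IDs: the first whitespace-delimited
    token on each non-empty line (a leading '>' is tolerated)."""
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:
            ids.add(fields[0].lstrip(">"))
    return ids


def filter_fasta(lines, wanted):
    """Yield the lines of every FASTA record whose ID is in `wanted`.

    The file is read line by line, so memory use is bounded by the
    size of the ID set, not by the size of the FASTA file.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


def extract_reads(id_path, fasta_path, out_path):
    """Write the selected records from `fasta_path` to `out_path`."""
    with open(id_path) as id_file:
        wanted = load_ids(id_file)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))
```

A call such as `extract_reads("targets.txt", "eskimo.fasta", "hits.fasta")` would then write only the matching reads, where `targets.txt` holds one ID per line exactly as it appears after the `>` in the FASTA headers.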

                  If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequences: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.
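For the BLAST+ route, blastdbcmd can take a whole file of identifiers at once through its `-entry_batch` option, which avoids any per-query limit. Here is a hedged Python sketch that only builds and runs that command. The database name and file names are hypothetical, and it assumes the database was built with `makeblastdb -parse_seqids`, so that the IDs from the original FASTA headers can be looked up.

```python
import subprocess


def blastdbcmd_args(db, id_file, out_file):
    """Build the blastdbcmd command line for a batch lookup.

    -db           name of the local BLAST database
    -entry_batch  text file with one sequence identifier per line
    -out          file that receives the extracted sequences (FASTA)
    """
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file, "-out", out_file]


def extract_from_db(db, id_file, out_file):
    """Run blastdbcmd; raises CalledProcessError if the tool reports failure."""
    subprocess.run(blastdbcmd_args(db, id_file, out_file), check=True)
```

With a database called `eskimo_reads`, `extract_from_db("eskimo_reads", "targets.txt", "hits.fasta")` is equivalent to typing `blastdbcmd -db eskimo_reads -entry_batch targets.txt -out hits.fasta` at the shell.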

                  Comment

                  • A1_UltiMA
                    Member
                    • Aug 2010
                    • 11

                    #10
                    Originally posted by maubp
                    OK - so if you have the downloaded Genome FASTA files you can get the sequence directly from the FASTA file.

                    If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequences: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.
                    Hi thanks again maubp!

                    We have tried to use MEGA to read the sequences, but the files are too large for our hardware to handle.

                    With regard to blastdbcmd, how does this tool identify sequences? By accession number, or by the name that follows the ">" sign?

                    Also, I am sorry for perhaps asking too much, but could you type out an example of the command we would need to extract a hypothetical sequence?

                    Thank you again.

                    Comment

                    • krobison
                      Senior Member
                      • Nov 2007
                      • 734

                      #11
                      It's very easy to write code to pull out your target sequences. A basic Perl framework is below; save it as pullseqs.pl

                      Code:
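                      #!/usr/bin/perl
                      use strict;
                      use warnings;
                      
                      if (scalar(@ARGV)<2)
                      {
                         die "
                      usage: pullseqs.pl file.with.targets.txt sequences.fastq
                                will work with FASTA or FASTQ
                      ";
                      }
                      # Load the target ids: the first run of non-whitespace characters on each line
                      my %targets=();
                      open(my $ids,'<',$ARGV[0]) or die "cannot open $ARGV[0]: $!";
                      while (my $line = <$ids>)
                      {
                        $targets{$1}=1 if ($line =~ /^(\S+)/);
                      }
                      close($ids);
                      # Print each record whose id is a target. Caveat: a FASTQ quality
                      # line can also start with '@' and would be read as a header.
                      my $on=0;
                      open(my $seqs,'<',$ARGV[1]) or die "cannot open $ARGV[1]: $!";
                      while (my $line = <$seqs>)
                      {
                        if ($line =~ /^[>\@](\S+)/)  # pull id composed of non-whitespace characters
                        {
                           $on=$targets{$1};
                        }
                        print $line if ($on);
                      }
                      close($seqs);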
                      The first input to this program is a file with your target ids, one per line. Anything after the first whitespace character will be ignored.

                      I've written the above to work on FASTA or FASTQ. I haven't actually run it -- debugging is left as an exercise for the student.

                      Comment
