
  • Help pls! Picking out specific 'spots' from SRA FASTA files

    Hello and thanks for reading,

    We have the entire genome of an ancient Eskimo downloaded in FASTQ from NCBI's Short Read Archive (SRA) and now converted into FASTA. We have carried out BLAST (stand alone) and the results as .csv files, with this and particular criteria involving the results we filter out the 'spots'/short reads that contain the sequence data we are interested in (the sequences are identified by accession numbers and a -presumably arbitrary- 'definition').

    We would like to know how we can separate the spots we are interested in from the entire collection of downloaded spots that compose the genome. We thought that the accession numbers we wanted could be collected and BLASTed on NCBI, but NCBI would only allow a maximum of 15 accession numbers at a time, yet we have a couple of thousand that need to be BLASTed, so that method will not work. Please suggest to us any methods or tools that can allow us to achieve this, thanks!!!!

  • #2
    Thanks for reading so far.



    • #3
      First, expecting a reply within 12 hours on a holiday weekend is a bit of a stretch, no?

      I'm not sure I understand exactly what you are having trouble doing. It seems you want to separate out the reads which align to certain genes of interest -- is that correct?

      What platform were the reads generated on (a search of SRA with "Eskimo" yields no hits)?



      • #4
        Looks like the genome could be this one: http://www.nature.com/nature/journal...100211-02.html

        SRA number: SRA01010

        If I understand you correctly, you want to examine the Eskimo genome at a certain position? You would have to realign the reads in the SRA archive against a modern human genome (possibly just the fraction of it that you are interested in), then look at the mapping result in a genome viewer such as IGV from the Broad.



        • #5
          I'm unclear exactly what you are trying to do. However, you can install and run BLAST locally - this is a possible solution if you are hitting the NCBI usage limits:



          • #6
            Hi,

            Thanks for replies.

            Sorry if I didn't make it clear. We have the Eskimo genome downloaded, and we have already done a local BLAST using the Eskimo genome as the 'subject'. We got a great many results, but among them are 2000 or so hits we are interested in. These hits are identified by accession numbers as 'spots' (short reads of sequence produced by the sequencer, which collectively make up the genome).

            The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?



            • #7
              Originally posted by A1_UltiMA
              The question is: how can we use the list of 2000 accession numbers we are interested in to retrieve their original sequences?
              Assuming I have understood correctly, these original sequences you want are NOT the ones you have already downloaded and used to build the BLAST database. Right?

              I would use the NCBI Entrez API, via your scripting language of choice. Remember to follow their usage guidelines.

              <advert>Personally I would use Python with the Bio.Entrez module from Biopython</advert>



              • #8
                Thanks Maubp,

                Actually, the accession numbers I am talking about ARE part of the databases created from the downloaded genome.



                • #9
                  Originally posted by A1_UltiMA
                  Thanks Maubp,

                  Actually, the accession numbers I am talking about ARE part of the databases created from the downloaded genome.
                  OK - so if you have the downloaded genome FASTA files, you can get the sequence directly from the FASTA file.

                  If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequence: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.
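As a rough sketch of the BLAST+ route, the snippet below wraps a batch lookup with blastdbcmd from Python. It assumes BLAST+ is installed and on the PATH, and that the database was built with `makeblastdb -parse_seqids`, so that entries can be looked up by the identifier that follows the ">" in the original FASTA. The names `eskimo_db`, `ids.txt` and `hits.fasta` are hypothetical placeholders, not anything from this thread.

```python
import subprocess


def blastdbcmd_args(db, id_file, out_fasta):
    """Build the blastdbcmd argument list for a batch lookup.

    id_file holds one sequence identifier per line; the matching
    records are written to out_fasta (FASTA is the default format).
    """
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file, "-out", out_fasta]


def extract_from_blastdb(db, id_file, out_fasta):
    """Run blastdbcmd; raises CalledProcessError if it exits non-zero."""
    subprocess.run(blastdbcmd_args(db, id_file, out_fasta), check=True)
```

Calling `extract_from_blastdb("eskimo_db", "ids.txt", "hits.fasta")` would be equivalent to typing `blastdbcmd -db eskimo_db -entry_batch ids.txt -out hits.fasta` at the command line.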



                  • #10
                    Originally posted by maubp
                    OK - so if you have the downloaded genome FASTA files, you can get the sequence directly from the FASTA file.

                    If you just have the BLAST database files but not the FASTA files, then NCBI includes a tool to extract the sequence: fastacmd in 'legacy' BLAST, or blastdbcmd in BLAST+.
                    Hi thanks again maubp!

                    We have tried to use MEGA to read the sequences, but the files are too large for our hardware to handle.

                    With regard to blastdbcmd, how does this tool identify sequences? By accession number, or by the name that follows the ">" sign?

                    Also, I am sorry for perhaps asking too much, but would it be possible to type out an example of the command we would need to extract a hypothetical sequence?

                    Thank you again.



                    • #11
                      It's very easy to write code to pull out your target sequences. A basic Perl framework is below; save it as pullseqs.pl

                      Code:
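                      #!/usr/bin/perl
                      use strict;
                      use warnings;
                      
                      if (scalar(@ARGV)<2)
                      {
                         die "
                      usage: pullseqs.pl file.with.targets.txt sequences.fastq
                                will work with FASTA or FASTQ
                      ";
                      }
                      # load target ids: the first run of non-whitespace characters on each line
                      my %targets=();
                      open(my $ids,'<',$ARGV[0]) or die "cannot open $ARGV[0]: $!\n";
                      while (my $line = <$ids>)
                      {
                        $targets{$1}=1 if ($line =~ /^(\S+)/);
                      }
                      close($ids);
                      my $on=0;     # true while we are inside a wanted record
                      my $fastq;    # set from the first line: '@' means FASTQ, otherwise FASTA
                      my $n=0;      # FASTQ line counter
                      open(my $seqs,'<',$ARGV[1]) or die "cannot open $ARGV[1]: $!\n";
                      while (my $line = <$seqs>)
                      {
                        $fastq = ($line =~ /^\@/) ? 1 : 0 unless defined $fastq;
                        # FASTQ records are assumed to be exactly 4 lines with no blank lines, so only
                        # every 4th line is a header; a quality line may itself start with '@'
                        my $header = $fastq ? ($n++ % 4 == 0) : ($line =~ /^>/);
                        if ($header && $line =~ /^[>\@](\S+)/)  # pull id composed of non-whitespace characters
                        {
                           $on=$targets{$1};
                        }
                        print $line if ($on);
                      }
                      close($seqs);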
                      The first input to this program is a file with your target ids, one per line. Anything after the first whitespace character will be ignored.

                      I've written the above to work on FASTA or FASTQ. I haven't actually run it -- debugging is left as an exercise for the student.
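For anyone who prefers Python, here is a minimal sketch of the same idea for FASTA input only, using just the standard library. It reads the file one line at a time, so large files should not need to fit in memory. The function names `load_targets` and `pull_fasta` are made up for this example, and it assumes the ids in the target file match the first token after ">" exactly.

```python
def load_targets(path):
    """Return the set of target ids: the first whitespace-delimited
    token on each non-blank line of the file."""
    targets = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                targets.add(fields[0])
    return targets


def pull_fasta(fasta_path, targets, out):
    """Copy every FASTA record whose id is in targets to the open
    file object out; return the number of records written."""
    keep = False
    written = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # the id is the first token after '>', as in the Perl version
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in targets
                written += keep
            if keep:
                out.write(line)
    return written
```

A call such as `pull_fasta("sequences.fasta", load_targets("targets.txt"), sys.stdout)` (after `import sys`, with hypothetical file names) would print the selected records, which can then be redirected to a new FASTA file.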

