  • Downloading 'RunInfo Table' from SRA Run Selector

    Hello,

    I would like to download the metadata for a given BioProject from the SRA. I can get exactly what I need by clicking the 'RunInfo Table' download button in the SRA Run Selector web interface (example). It should be relatively straightforward to do the same thing from the command line using "wget".

    Clicking the 'RunInfo Table' button loads the following address, which is a stable link for downloading the information:



    BUT I have no idea where the hash in that address comes from. Can anyone help?

    Alternatively, I've tried a series of efetch commands, but none of them gives me a '.tsv' (a '.csv' would also be fine) of the complete BioProject metadata.

    This command provides only the information about sequencing:
    wget -O PRJNA308986.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=PRJNA308986'

    This command provides the full BioProject information I'm after, but in XML format, which I haven't been able to parse:

    wget -O PRJNA496337.xml 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=bioproject&term=PRJNA496337'

    Thanks in advance,
    Roli

  • #2
    In general, for downloading NCBI data from the Unix command line, I recommend using Entrez Direct.

    Specifically, to download the runinfo table, you can use the following command:
    Code:
    esearch -db sra -query 'PRJNA308986' | efetch -format runinfo
    This will produce a comma-separated table with the following fields (shown here with their column numbers and the values from one example record):
    Code:
                      Run [  1]: SRR3108728
              ReleaseDate [  2]: 2017-02-16 00:00:00
                 LoadDate [  3]: 2016-01-21 03:15:18
                    spots [  4]: 98100
                    bases [  5]: 49246200
         spots_with_mates [  6]: 98100
                avgLength [  7]: 502
                  size_MB [  8]: 28
             AssemblyName [  9]: 
            download_path [ 10]: https://sra-download.ncbi.nlm.nih.gov/traces/sra37/SRR/003035/SRR3108728
               Experiment [ 11]: SRX1537041
              LibraryName [ 12]: mdbk110
          LibraryStrategy [ 13]: AMPLICON
         LibrarySelection [ 14]: PCR
            LibrarySource [ 15]: METAGENOMIC
            LibraryLayout [ 16]: PAIRED
               InsertSize [ 17]: 0
                InsertDev [ 18]: 0
                 Platform [ 19]: ILLUMINA
                    Model [ 20]: Illumina MiSeq
                 SRAStudy [ 21]: SRP068618
               BioProject [ 22]: PRJNA308986
          Study_Pubmed_id [ 23]: 
                ProjectID [ 24]: 308986
                   Sample [ 25]: SRS1253892
                BioSample [ 26]: SAMN04419133
               SampleType [ 27]: simple
                    TaxID [ 28]: 410658
           ScientificName [ 29]: soil metagenome
               SampleName [ 30]: mdbk110
             g1k_pop_code [ 31]: 
                   source [ 32]: 
       g1k_analysis_group [ 33]: 
               Subject_ID [ 34]: 
                      Sex [ 35]: 
                  Disease [ 36]: 
                    Tumor [ 37]: no
         Affection_Status [ 38]: 
             Analyte_Type [ 39]: 
        Histological_Type [ 40]: 
                Body_Site [ 41]: 
               CenterName [ 42]: UNIVERSITY OF MINNESOTA
               Submission [ 43]: SRA336468
    dbgap_study_accession [ 44]: 
                  Consent [ 45]: public
                  RunHash [ 46]: 4B63AAF2295927A2EAEB798FCF9FC7DA
                 ReadHash [ 47]: FB1226CB8B5FEBC85B053718D4C1BBFA
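Since the original question asked for a '.tsv', here is a minimal sketch of how columns from the list above could be selected by name and written out tab-separated. The helper name `runinfo_cols` is hypothetical, and the sketch assumes that no field contains a quoted comma; if one might, a CSV-aware parser would be the safer choice.

```shell
# runinfo_cols is a hypothetical helper: it reads a runinfo CSV on stdin and
# prints the named columns, tab-separated, with the header row first.
# Assumption: no field contains a quoted comma.
runinfo_cols() {
    awk -F',' -v want="$1" '
        NF == 0 { next }                       # skip blank lines
        !seen_header {
            # map header names to column numbers, then resolve the wanted names
            n = split(want, names, ",")
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in idx)) {
                    print "unknown column: " names[j] > "/dev/stderr"
                    exit 1
                }
                col[j] = idx[names[j]]
            }
            seen_header = 1
        }
        {
            line = $(col[1])
            for (j = 2; j <= n; j++) line = line "\t" $(col[j])
            print line
        }
    '
}

# Example (needs Entrez Direct and network access):
#   esearch -db sra -query 'PRJNA308986' | efetch -format runinfo |
#       runinfo_cols Run,BioSample,LibraryLayout > PRJNA308986.tsv
```

Selecting by header name rather than by column number keeps the command readable and avoids depending on the exact column order.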
    You can download the same table in XML format by making a small change as follows:
    Code:
    esearch -db sra -query 'PRJNA308986' | efetch -format runinfo -mode xml
    You can then parse this XML using the command "xtract" that comes with the Entrez Direct tools to extract only specific columns of interest to you.
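
As a rough sketch of the xtract step, the wrapper below pulls a few named fields out of the runinfo XML. It assumes that each record in that XML is a `<Row>` element whose children are named after the fields listed above; `runinfo_fields` is a hypothetical helper name, and the field names in the example are just a selection from that list.

```shell
# runinfo_fields is a hypothetical helper; it needs esearch, efetch and xtract
# (Entrez Direct) on the PATH.
# Assumption: every record in the runinfo XML is wrapped in a <Row> element.
runinfo_fields() {
    project="$1"   # BioProject accession, e.g. PRJNA308986
    shift          # remaining arguments are the field names to extract
    esearch -db sra -query "$project" |
        efetch -format runinfo -mode xml |
        xtract -pattern Row -element "$@"
}

# Example (needs network access):
#   runinfo_fields PRJNA308986 Run BioSample LibraryLayout > PRJNA308986.tsv
```

xtract writes one line per matched pattern with the requested elements separated by tabs, so the output can be used directly as a '.tsv'.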

      • #4
        Using wget to retrieve SRA RunInfo and AccList

        Here's an example of using `wget` to retrieve the SRA RunInfo and AccList from the NCBI Sequence Read Archive.

        Code:
        # wget equivalent to:
        #   esearch -db sra -query "${study_id}" | efetch -format runinfo
        
        study_id=PRJNA308986
        db=sra
        
        # base URL of the NCBI E-utilities
        base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
        
        # esearch for the project; usehistory=y returns a WebEnv/QueryKey pair for efetch
        data=$(wget -qO- "${base}esearch.fcgi?db=${db}&term=${study_id}&usehistory=y")
        # note: grep -P (Perl-style regex) requires GNU grep
        web=$(grep -oPm1 "(?<=<WebEnv>)[^<]+" <<< "${data}")
        key=$(grep -oPm1 "(?<=<QueryKey>)[^<]+" <<< "${data}")
        
        # efetch SRA RunInfo
        wget -qO "SraRunInfo-${study_id}.csv" "${base}efetch.fcgi?db=${db}&query_key=${key}&WebEnv=${web}&retmode=text&rettype=runinfo"
        
        # efetch SRA AccList
        wget -qO "SraAccList-${study_id}.txt" "${base}efetch.fcgi?db=${db}&query_key=${key}&WebEnv=${web}&retmode=text&rettype=acclist"
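
The script above extracts `WebEnv` and `QueryKey` with `grep -P`, which is a GNU grep feature and is not available in the BSD grep shipped with macOS. A minimal sketch of a more portable alternative using `sed` follows; it assumes each element appears on a single line and carries no attributes, and `xml_first` is a hypothetical helper name.

```shell
# xml_first is a hypothetical helper: print the text of the first
# <TAG>...</TAG> element found in the XML read from stdin.
# Assumption: the element has no attributes and does not span lines.
xml_first() {
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Drop-in use with the esearch response stored in ${data}:
#   web=$(printf '%s\n' "${data}" | xml_first WebEnv)
#   key=$(printf '%s\n' "${data}" | xml_first QueryKey)
```

This only swaps the extraction step; the esearch and efetch URLs stay exactly as in the script.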
