  • Searching, parsing and working with large GEO queries

    Ok, so I'm about to start the largest project I've undertaken, in terms of data handling and analysis. In short, the project is about finding public datasets available at the Gene Expression Omnibus (GEO) that pass a set of filters, downloading all the raw data (i.e. FASTQ files) and performing a set of analyses on the data, while maintaining metadata-to-result connectivity. I've done some googling... but I'm at a loss as to how to go about it.

    What I'm looking for is RNA-seq data for human cell lines, but only for a specific set of 1000 cell lines. I want to find the FASTQ files + metadata for these datasets and perform analyses on them. There are several steps, some of which I have ideas for, some of which I have no clue how to approach.

    First, the GEO query itself. I can easily search for human RNA-seq data with ("expression profiling by high throughput sequencing"[DataSet Type] OR "non coding rna profiling by high throughput sequencing"[DataSet Type]) AND Homo sapiens[Organism], which returns 3298 GEO series, but I'm not sure how to search for cell lines only. Just adding "cell line" to [Any Field] seems too simple, and might miss GEO series. There is a field inside the GEO SOFT files called "Sample_characteristics_ch1 = <value>", which can be set to "cell line: <cell line name>". (No, I'm not sure exactly what the "_ch1" part means...) I was thinking of downloading all the SOFT files for the series above and then filtering on sample characteristics that include "cell line". The first step would then be:
    • 1) Get identifiers for all the series in the query, download all the SOFT files and filter them to include cell lines
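The filtering half of step 1 could look roughly like the sketch below. It is plain Python, not an official GEO parser, and it assumes the SOFT text marks each sample with a `^SAMPLE = GSM...` line followed by `!Sample_characteristics_ch1 = key: value` lines. The function name and the accessions in the example are made up for illustration. (As far as I know, the `_ch1` suffix is the channel number; single-channel data such as RNA-seq only has `ch1`.)

```python
def cell_lines_from_soft(soft_text):
    """Map each GSM accession to its 'cell line' characteristic, if it has one."""
    result = {}
    current = None  # GSM accession of the sample block we are inside, if any
    for line in soft_text.splitlines():
        line = line.strip()
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
        elif line.startswith("^"):
            # Any other entity (series, platform, ...) ends the sample block.
            current = None
        elif current and line.startswith("!Sample_characteristics_ch1"):
            value = line.split("=", 1)[1].strip()
            key, sep, val = value.partition(":")
            if sep and key.strip().lower() == "cell line":
                result[current] = val.strip()
    return result


# Hypothetical SOFT fragment with placeholder accessions.
example = """^SERIES = GSE00000
^SAMPLE = GSM0000001
!Sample_characteristics_ch1 = cell line: HeLa
!Sample_characteristics_ch1 = treatment: none
^SAMPLE = GSM0000002
!Sample_characteristics_ch1 = tissue: liver
"""

print(cell_lines_from_soft(example))
```

Samples without a "cell line" characteristic simply drop out of the result, so a series with an empty result can be discarded.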

    The second step would be simple in comparison:
    • 2) Filter the results to only contain cell lines that are included in the list of the specific 1000 cell lines.
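For step 2, submitters spell cell line names inconsistently ("MCF-7", "MCF7", "mcf 7"), so an exact string match would probably miss samples. A minimal sketch, assuming it is acceptable to compare names after lower-casing and stripping punctuation (this is my assumption, and it can merge names that are genuinely different, so the matches are worth a manual check):

```python
import re


def normalize(name):
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def keep_wanted(sample_to_cell_line, wanted_names):
    """Keep only samples whose cell line matches a wanted name after normalization."""
    wanted = {normalize(n) for n in wanted_names}
    return {
        gsm: cell_line
        for gsm, cell_line in sample_to_cell_line.items()
        if normalize(cell_line) in wanted
    }


# Placeholder sample IDs; the wanted list stands in for the 1000 cell lines.
samples = {"GSM1": "MCF-7", "GSM2": "hela", "GSM3": "K562"}
wanted = ["MCF7", "HeLa"]

print(keep_wanted(samples, wanted))
```

The original spelling is kept as the value, so it can still be reported next to the normalized match.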

    Then comes another big question mark for me: how do I go from this list, containing all the info available in the SOFT file, to downloading all the corresponding FASTQ files from SRA? The SRX ID is available in the SOFT file, but I think that fastq-dump requires SRR IDs... So:
    • 3) Find each SRR associated with all the SRXs in each SOFT file from the list above.
    • 4) Read the appropriate metadata to see if the data is paired-end or single-end.
    • 5) Download the data using fastq-dump as appropriate.
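Steps 3 to 5 can be sketched from an SRA "RunInfo" table, which is a CSV that (among many other columns) has `Run`, `Experiment` and `LibraryLayout`. The sketch below assumes such a table has already been fetched for the SRX accessions of interest; the helper names and the accessions are placeholders. It only builds the fastq-dump command lines and does not run them.

```python
import csv
import io


def runs_by_experiment(runinfo_csv):
    """Group (SRR, layout) pairs by their SRX accession."""
    runs = {}
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        runs.setdefault(row["Experiment"], []).append(
            (row["Run"], row["LibraryLayout"])
        )
    return runs


def fastq_dump_command(run, layout):
    """Build a fastq-dump argument list; paired-end runs get --split-files."""
    cmd = ["fastq-dump", "--gzip"]
    if layout.upper() == "PAIRED":
        cmd.append("--split-files")
    cmd.append(run)
    return cmd


# Made-up RunInfo excerpt; a real table has many more columns.
example_csv = """Run,Experiment,LibraryLayout
SRR0000001,SRX0000001,PAIRED
SRR0000002,SRX0000001,PAIRED
SRR0000003,SRX0000002,SINGLE
"""

for srx, runs in runs_by_experiment(example_csv).items():
    for srr, layout in runs:
        print(srx, " ".join(fastq_dump_command(srr, layout)))
```

Note that one SRX can map to several SRRs, so the runs of an experiment may need to be merged or analysed together downstream.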

    Is this feasible? Am I going about the problem the right way? Maybe I'm doing it all wrong and there's a simple solution that I'm not seeing. How would you do this, given the project outline? A big problem I foresee (other than not actually having a good idea how to perform all the steps) is how to keep the metadata properly connected to the raw data... It's (of course) quite important to be able to stratify the end results based on the metadata, as that is a big part of the reason why I want to do this project.
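One possible way to keep metadata attached to the raw data is a single manifest table with one row per run (SRR), carrying the SRX, GSM, GSE and cell line along with it. FASTQ files and downstream results are then named after the SRR and joined back through this table. The sketch below is only an illustration; the field names and accessions are invented.

```python
import csv
import io

FIELDS = ["run", "srx", "gsm", "gse", "cell_line"]


def build_manifest(samples, runs_by_srx):
    """Expand per-sample metadata into one row per run (SRR)."""
    rows = []
    for sample in samples:
        for srr in runs_by_srx.get(sample["srx"], []):
            rows.append({"run": srr, **sample})
    return rows


def manifest_to_csv(rows):
    """Serialize manifest rows as CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder metadata: one sample whose experiment has two runs.
samples = [
    {"srx": "SRX0000001", "gsm": "GSM0000001", "gse": "GSE00000", "cell_line": "HeLa"},
]
runs_by_srx = {"SRX0000001": ["SRR0000001", "SRR0000002"]}

print(manifest_to_csv(build_manifest(samples, runs_by_srx)))
```

Any analysis result written per SRR can then be stratified by cell line (or any other column) with a simple join on `run`.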

    Ideas, suggestions, tips? Fully fleshed-out solutions are also acceptable ;-)

  • #2
    I managed to solve it myself, so I'm posting the solution in case anybody else happens upon the same problem. I first used the NCBI Entrez Direct CLI (http://www.ncbi.nlm.nih.gov/books/NBK179288/) to query GEO and find all the available RNA-seq data and its GSE accession numbers. I parsed this list using the GEOquery R package https://bioconductor.org/packages/re.../GEOquery.html and downloaded all the corresponding SOFT files, on which I performed additional filtering, yielding a list of SRX accessions. I converted these to SRR accessions using SRAdb https://bioconductor.org/packages/re...tml/SRAdb.html, and then downloaded the raw FASTQ data using fastq-dump from sra-tools.
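For readers who would rather stay in one language, the first step (the GEO query) can also be expressed as an E-utilities `esearch` request against the `gds` database. This is not the Entrez Direct/R route described above, just a Python sketch of the same query from the first post. The `gse[Entry Type]` restriction to series is my addition, and the function names are invented. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_rnaseq_term(series_only=True):
    """Search term for human RNA-seq entries in GEO DataSets."""
    term = (
        '("expression profiling by high throughput sequencing"[DataSet Type] '
        'OR "non coding rna profiling by high throughput sequencing"[DataSet Type]) '
        'AND "Homo sapiens"[Organism]'
    )
    if series_only:
        # Assumption: restrict hits to GSE records rather than samples/platforms.
        term += " AND gse[Entry Type]"
    return term


def esearch_url(term, retmax=100):
    """Build an esearch URL for the GEO DataSets ('gds') database."""
    params = {"db": "gds", "term": term, "retmax": retmax, "usehistory": "y"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


print(esearch_url(geo_rnaseq_term()))
```

With `usehistory=y` the result set is kept on the NCBI history server, which helps when paging through a few thousand hits.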

    It took a while to parse all the SOFT files with R, but since I had to do the additional filtering on criteria that are only found in the SOFT files rather than in the GEO query itself, that's what worked in my specific case. It'd be much faster if I could filter more in the starting query itself, rather than after downloading all the additional (and, it turns out, largely unnecessary) metadata.
