Illumina MiSeq file size/downstream analysis question


  • #1

    We are starting to organize the infrastructure our lab will need to bring NGS in-house. We will be running a 15 kb panel on the MiSeq using v3 reagents, generating ~10-15 Gb of sequence per run.

    Our downstream analysis will be in CLC Genomics Workbench. My understanding is that we will demultiplex our MiSeq files, import them into the Workbench software on our custom tower, and process from there.

    Does anyone have experience with the CLC Genomics Workbench workflow for data from Illumina platforms? Our analysis computer will have ~4 TB of storage, and we were thinking of adding ~10-15 TB of network storage.

    In addition to the ~10-15 Gb of MiSeq data per run, is there any way to estimate the size and number of files we will generate in CLC on the way to the final VCF?

    Sorry for the long-winded question. Any information will help greatly.

  • #2
    How many runs do you expect to do each month and over a year? Have you thought about a long-term archival storage solution (or do you not expect to need one)? Will you use the on-board MiSeq software for demultiplexing, or would BaseSpace be in play?



    • #3
      We will probably be performing 4-5 runs a month, with room for growth. We have not given much thought to archival storage yet. The MiSeq itself has a 750 GB hard drive, but I don't feel comfortable keeping a month or more of data on it, so I imagine we will clear it out periodically and move that data to network storage. How much extra data is produced downstream? I imagine we won't duplicate the MiSeq files before moving them to the analysis environment.

      Thanks for the insight.



      • #4
        Depending on how you configure the runs (number of cycles, SE vs. PE, etc.), the size of the original data folder will vary, but you can expect it to fall somewhere between ~12 GB (e.g., a 50x7 run) and ~60 GB (a 300x8x8x300 run). After demultiplexing (bcl2fastq), the size increases by about 50%, so the data folders in the examples above would grow to roughly 18 GB and 80 GB. We don't use the on-board MiSeq software, but if you did, I expect the folder sizes would end up similar to the final sizes above.
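A quick back-of-the-envelope script can turn these figures into a monthly storage estimate. The ~50% demultiplexing overhead and the per-run folder sizes are assumptions taken from this thread, not measured values:

```python
# Rough MiSeq storage estimate. The ~50% demultiplexing overhead and the
# per-run folder sizes are assumptions from this discussion, not benchmarks.

def monthly_storage_gb(runs_per_month, raw_run_gb, demux_overhead=0.5):
    """Estimate GB/month: raw run folder plus demultiplexed FASTQ output."""
    per_run_gb = raw_run_gb * (1 + demux_overhead)
    return runs_per_month * per_run_gb

# 5 runs/month at the ~60 GB worst case (e.g. a 300x8x8x300 run):
print(monthly_storage_gb(5, 60))   # -> 450.0 GB/month before any cleanup
```

At that rate, ~10-15 TB of network storage would cover roughly two to three years of raw plus demultiplexed data before anything is pruned or compressed.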



        • #5
          Originally posted by flyinglotus View Post
          ...I imagine we will clean it out periodically and put this data into network HD storage...

          The MiSeq software can, and SHOULD, be configured to copy its data to a network storage device as it is collected; there is no need to move the data manually. The network storage device you set up to receive the data should be fault-tolerant (i.e., some type of RAID configuration), and ideally a second, archival copy is made from there immediately after the run.



          • #6
            You should also consider what data you actually need to keep. If you set up your analyses well, with a versioned, software-defined pipeline of some sort (along with all software components used in the pipeline), then you can recreate the downstream files. That means you generally keep/archive:

            1) Raw input data (this could be BCL files, but you may reasonably opt to keep just the demultiplexed FASTQ files). This is generally quite a bit smaller than the complete run output from a MiSeq.

            2) Detailed documentation of the workflow applied to the data. Separately archive all your software, pipelines, databases, etc., in a versioned manner.

            3) Your final results (and even these aren't absolutely required, particularly for archiving).

            You should structure everything so that you can recreate your analysis and all downstream result files, exactly, at any time. Granted, this is harder with commercial software, since you have little control over version changes and frequent updates and limited ability to keep old copies around. But you should still strive for reproducibility.
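One lightweight step toward that kind of reproducibility is writing a checksum-plus-version manifest alongside each archived run, so inputs can be verified and the pipeline version looked up later. This is only a sketch; the file names and version string are placeholders, not real CLC identifiers:

```python
# Sketch of a versioned archive manifest: record a SHA-256 checksum for
# each kept file plus the pipeline/software version used, so archived
# inputs can be verified and results recreated later. The paths and the
# version string in the example are hypothetical.
import hashlib
import json
import pathlib

def build_manifest(files, pipeline_version):
    """Return a dict mapping each file name to its SHA-256 digest."""
    manifest = {"pipeline_version": pipeline_version, "files": {}}
    for path in files:
        p = pathlib.Path(path)
        manifest["files"][p.name] = hashlib.sha256(p.read_bytes()).hexdigest()
    return manifest

# Example: json.dumps(build_manifest(["sample1.fastq.gz"], "clc-wb-22.0"))
```

Dropping a `manifest.json` like this into each archive directory costs almost nothing and lets you detect bit rot on the archival copy with a simple re-hash.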

            Otherwise, everything you have set up seems on the right track. The exact specs of your workstation depend on the analyses you will run within CLC Workbench; I would go with at least a few TB of RAIDed storage on the workstation itself. If you haven't already bought one, Qiagen/CLC bio has a collaboration with PSSC Labs: PSSC builds a configurable workstation, and I believe you can still order the whole thing as a turn-key solution from CLC bio.



            • #7
              Thank you all for your responses. We are looking into our options for downstream analysis, and most likely we will keep only FASTQ files and potentially BAM files. We feel the intermediate files (mostly generated by CLC) are probably discardable.

              Will update when we have started generating data.
