  • Long Term Data Storage

    I am in the process of setting up an NGS core facility. I will be starting with a single HiSeq 1000 with an IlluminaCompute Tier0 analysis server. In a past life, I ran an NGS facility in which we had a "medium-term" storage server and a long-term tape backup system. File sizes have gotten so large that I'm not sure how practical it is to back up data on tape, or to deal with the hassle of putting data on tape and retrieving it if it is needed again in the future.

    A few questions for all of you:
    1. What data are you keeping?
    -- keeping BCL = 330 GB
    -- keeping BAM = 330 GB
    -- total = 660 GB per run (paired end, 2 x 101 bp)
    2. What long-term data storage media are you using?
    3. I am a geneticist/biologist, not an IT professional. What would be the easiest solution for me? (At some point, I will be hiring an informaticist/computational biologist.)
    4. Would it be easier to store on external drives?
    5. Do any of you back up data and send it to another facility for storage, such as Iron Mountain?

    Any advice you can give would be appreciated.
    Thank you,
    Michael

  • #2
    In a very few years you will save the DNA libraries only.

    • #3
      It seems a pragmatic solution to cost in a terabyte disk per sequencing run and use that as backup, assuming you have a place to store the disks.

      You might look into BaseSpace (Illumina's cloud solution), which I understand should be available for the HiSeq.

      • #4
        We decided against external drives because of
        a) space
        b) organisation
        c) lack of mirroring (RAID)

        The last point is the most critical because we are required to save data for 10 years at the University. This can (hopefully) be guaranteed by tape and (maintained) RAID backups, but not by off-the-shelf external HDs.

        We also have tape and spatially separate hard drive backups in case the server room burns down.

        • #5
          Two (bare) disks, two separate locations?

          • #6
            Also, these kinds of blanket University data policies don't make sense in the context of sequencing. They should understand the problem first, then make a data retention policy.

            • #7
              The new floppy disk ...

              These are the new keychain USBs for large data: .... >1TB portable hard drives.

              Just buy enough to make 2 or 3 backups. Keep the backups in separate locations and verify them.

              This is labor intensive.
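The "verify" step above can be scripted, which takes some of the labor out of it. The following is a minimal sketch, not a tested production tool: it assumes each backup disk holds a file-for-file copy of the run folder, and the function names (`sha256_of`, `build_manifest`, `verify_copy`) are hypothetical helpers written for this illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(source_root, backup_root):
    """Return a sorted list of files that are missing from the backup or differ from the source."""
    source = build_manifest(source_root)
    backup = build_manifest(backup_root)
    return sorted(name for name, digest in source.items() if backup.get(name) != digest)
```

Calling `verify_copy("/data/run_folder", "/mnt/usb_backup/run_folder")` (both paths are placeholders) returns an empty list when every source file is present on the backup with identical contents. Hashing hundreds of gigabytes takes a while, so this is something to leave running rather than watch.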

              • #8
                Richard - not sure I understood your message. You seem to be suggesting these are USB flash solutions, but you actually linked to regular hard disks with USB interfaces. It is true that there are 1TB flash disks, but they are currently about $2000.

                • #9
                  Yep. The greater than 1TB portable hard drive is the new floppy disk.

                  • #10
                    Ah right, got confused by the term "keychain" which made me think of flash disks. But yes, I agree, and judicious use of USB disks is a very cost-effective storage solution in my opinion. I am certainly never going back to tape backup!

                    • #11
                      The nice thing about USB disks is that if your sequencer dumps out 1TB of data per run, then cost in 2 x 1TB USB disks per run and you have a resilient backup solution. Given that a HiSeq run might be $10,000 of consumables, $200 more for the disks can be absorbed easily.

                      • #12
                        Contrast that with enterprise-grade solutions and you are talking more like $1000/TB plus all the administrative overhead of keeping these solutions going. Amazon S3 is another option but costs can mount up over time.
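To put the figures quoted in this thread side by side, here is a rough back-of-the-envelope sketch. It uses only the numbers mentioned above (1 TB per run, $200 for two 1 TB USB disks, about $1000/TB for enterprise storage, about $10,000 of consumables per run). The function name `per_run_storage_cost` is made up for this illustration, and whether the enterprise figure already includes redundancy is not stated in the thread.

```python
def per_run_storage_cost(run_size_tb, cost_per_tb, copies=1):
    """Cost of storing one run, given its size, a price per TB, and the number of copies kept."""
    return run_size_tb * cost_per_tb * copies


# Figures taken from the posts above; treat them as rough estimates.
RUN_SIZE_TB = 1.0
USB_COST_PER_TB = 100.0          # implied by $200 for 2 x 1 TB disks
ENTERPRISE_COST_PER_TB = 1000.0
CONSUMABLES_PER_RUN = 10000.0

usb_cost = per_run_storage_cost(RUN_SIZE_TB, USB_COST_PER_TB, copies=2)
enterprise_cost = per_run_storage_cost(RUN_SIZE_TB, ENTERPRISE_COST_PER_TB)

print(f"USB disks (2 copies): ${usb_cost:.0f} per run")
print(f"Enterprise storage:   ${enterprise_cost:.0f} per run")
print(f"USB overhead vs consumables: {usb_cost / CONSUMABLES_PER_RUN:.0%}")
```

The sketch leaves out administrative overhead, power, and disk replacement over a multi-year retention period, all of which would change the comparison.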

                        • #13
                          With a 2.5" hard drive as your file backup, storing the samples may almost end up taking more room than storing the data.

                          I agree with the purchase of 2 hard drives for each run. The university then has a visual idea of how their 10-year policy is working out, and the hard drives won't use any power when they're not plugged into anything (unlike a dedicated network backup, which will consume power on the off chance that you'll want a 5kb file from your 7-year-old sequencing data with latency of less than a second).

                          • #14
                            What about the CIFs and corresponding files? There are situations where there is a need to externally re-basecall the data with Bustard. With BCLs alone this is not possible.
                            Storing CIFs plus the corresponding files takes up to 3.5TB per HiSeq flowcell ...

                            IMHO USB disks are not suited for such an amount of data (especially when you are running more than one machine).

                            • #15
                              I've heard that the HiSeq automatically discards the image files, isn't that true?

                              Anyways, I don't think it makes sense to save both .bcl and .fastq files as they can easily be converted (at least from bcl to fastq, don't know the other way round)

                              And 330 GB could easily be saved on a hard disk (that would be 3 runs per TB, right?)
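The disk arithmetic in this thread can be checked with a few lines. This is a minimal sketch assuming decimal units (1 TB = 1000 GB) and ignoring filesystem overhead, so real disks will hold slightly less. The run sizes are the ones quoted in the posts above, and the helper names `runs_per_disk` and `disks_needed` are hypothetical.

```python
import math


def runs_per_disk(disk_gb, run_gb):
    """Number of whole runs that fit on one disk."""
    return int(disk_gb // run_gb)


def disks_needed(data_gb, disk_gb):
    """Number of disks required to hold a given amount of data."""
    return math.ceil(data_gb / disk_gb)


DISK_GB = 1000  # a 1 TB disk, decimal units

print(runs_per_disk(DISK_GB, 330))   # BCL only, 330 GB per run
print(runs_per_disk(DISK_GB, 660))   # BCL + BAM, 660 GB per run
print(disks_needed(3500, DISK_GB))   # CIFs plus corresponding files, up to 3.5 TB per flowcell
```

Under these assumptions, three BCL-only runs fit on a 1 TB disk, only one run fits if both BCL and BAM are kept, and a flowcell stored with its CIFs needs four such disks.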
