Surprisingly, I was not able to find anything on Google about the details of uploading raw PacBio sequence reads to the European Nucleotide Archive (ENA), the EMBL-EBI twin of the Sequence Read Archive (SRA).
http://www.ebi.ac.uk/ena/submit/read...bio_hd5_format just says:
In our case I have a zip file with dozens of files under assorted folders. Thankfully https://github.com/PacificBioscience...rvice-provider explains this. Based on their example, I've marked the files I think I need to upload (update - not correct, see later):
However, when you click through the ENA Webin forms and finally get to the spreadsheet-like view for uploading the files, and pick PacBio:
It really only seems to want a single file... which puzzled me. However, the tool tip says:
So, I think that means you can create a plain text "manifest" using the md5sum command line tool, e.g.:
But that would miss out the *.metadata.xml file which looks useful? (update - yes, they want that XML file in particular - see below). Could anyone who has done this help - or should I email the DataSub teams and report back here? Thanks!
http://www.ebi.ac.uk/ena/submit/read...bio_hd5_format just says:
PacBio format
PacBio data submissions are supported in the platform specific native format.
One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.
Code:
/path/to/secondary/storage/2420294/0011
├── Analysis_Results
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.bax.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.log
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fasta
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.subreads.fastq
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.bas.h5 --> ENA
│   ├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.csv
│   └── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.sts.xml --> ENA
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.1.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.2.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.3.xfer.xml
├── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.mcd.h5
└── m140415_143853_42175_c100635972550000001823121909121417_s1_p0.metadata.xml
PacBio HDF5
One PacBio HDF5 file is submitted for each run.
Please choose one of the following manifest files present in your drop box. A manifest file ( *.all ) contains all files ( *bas.h5, *.bax.h5 and *.xml ) and their MD5 checksums associated with a single PacBio run. The format of the manifest file must correspond to the output of the md5sum command.
If your file is not listed below, it was either not found in your drop box or its extension was not recognized.
Code:
$ cd /path/to/secondary/storage/2420294/0011/Analysis_Results
$ md5sum *bas.h5 *.bax.h5 *.xml > manifest.all