
  • FastQ decrypted from SRA toolkit with warnings: any loss of information?

    Hi,

    Recently we've been trying to decrypt some SRA files from the same project to get the FastQ data. We did get the FastQ files, but we also received some warnings, as shown below:
    Code:
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_START
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_LEN
    For each SRA file we decrypted "successfully" (see below), we get exactly 5 copies of these warnings.

    A "successful" decryption here means that the FastQ files do contain read information and that their sizes seem reasonable. However, we're still not sure whether the decryption has led to any loss of data, especially important information about the reads themselves (e.g. whether we have lost some reads).

    So here are the questions we'd like to ask:
    • Is there any difference with respect to read information between the FastQ files decrypted from SRA files with or without the warnings mentioned above?
    • If yes, what are the differences?


    Here are the details of our decryption:
    • sratoolkit used: version 2.3.2-5-centos_linux64 (the newest version when we downloaded the data and tried to decrypt them)
    • the decryption needs a repository key, and we set it up using the GUI started up by sratoolkit.jar
    • program used to decrypt SRA files: fastq-dump
    • command line used to decrypt SRA files: fastq-dump --outdir $OUTPUT_DIR --bzip2 --split-3 --keep-empty-files --log-level info $SRA_FILE
    • each SRA file contains paired-end RNA-Seq data from one biological sample, produced on an Illumina HiSeq 2000, and the read length is always 76 bp.
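Given the details above (paired-end output from --split-3, bzip2-compressed, reads always 76 bp), one rough sanity check is to confirm that both mate files hold the same number of reads and that every read has the expected length. This is only a sketch under stated assumptions: the function names are made up for illustration, the files are assumed to be plain 4-line FASTQ records, and passing the check shows internal consistency only, not that nothing was dropped relative to the archive.

```python
import bz2
import itertools


def read_lengths(path):
    """Yield the sequence length of each 4-line FASTQ record in a .bz2 file."""
    with bz2.open(path, "rt") as handle:
        # Assumption: strict 4-line records, so line 2 of each record holds the bases.
        for seq in itertools.islice(handle, 1, None, 4):
            yield len(seq.rstrip("\n"))


def summarize(path, expected_len=76):
    """Return (read_count, number_of_reads_whose_length_differs_from_expected_len)."""
    total = 0
    odd = 0
    for length in read_lengths(path):
        total += 1
        if length != expected_len:
            odd += 1
    return total, odd


def mates_agree(path_1, path_2, expected_len=76):
    """True if both mate files have equal read counts and only expected-length reads."""
    n1, odd1 = summarize(path_1, expected_len)
    n2, odd2 = summarize(path_2, expected_len)
    return n1 == n2 and odd1 == 0 and odd2 == 0
```

It could be called as `mates_agree("SAMPLE_1.fastq.bz2", "SAMPLE_2.fastq.bz2")`, where both file names are hypothetical placeholders for the two mate files written to $OUTPUT_DIR.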


    Thanks in advance!

    Yang

  • #2
    I'm a little confused that I received the following reply from GenoMax by e-mail while there's none on the forum. Anyway, here's the reply:

    Originally posted by GenoMax
    SRA toolkit error messages can be benign, data set specific etc. Perhaps there is no problem here.

    It may not hurt to send a message to SRA support. Use the "Write to helpdesk" link at the bottom of the page for the toolkit download tab. Include the dataset you are using. It is the weekend, so you may not hear back until Monday. In the past they have sometimes confirmed whether there was a problem with a specific dataset.
    Thanks for the information. As for the NCBI help desk, we did write to them more than 2 weeks ago, but there was no reply. We suppose that something is wrong with the mail servers, and since we could not find any related topics or threads on the internet, yesterday we sent another message and also decided to ask the question here. However, as you have mentioned, maybe we should have included our dataset IDs to tell NCBI which ones we'd like them to check.


    • #3
      The NCBI Help Desk replied to me a few days ago and helped resolve these issues. I think it would be good to share the solution here with everyone, so here it is:
      • The data will always be valid/complete as long as fastq-dump does not produce any error messages. fastq-dump can produce many warnings when operating on valid data, especially when the log level is set to 5 (the default is 4).
      • The data will also always be valid/complete as long as it passes the vdb-validate program (i.e. all the outputs are "OK").
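The first point could be checked mechanically by tallying the severity label of each line in a saved fastq-dump log. The sketch below is only an illustration: it assumes log lines follow the `timestamp tool.version level: message` shape seen in the warnings quoted in the first post, and that error-or-worse lines use the labels `err`, `int`, `sys`, or `fatal` (an assumption based on the toolkit's log-level names, not something confirmed in this thread). The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, modelled on the warnings quoted in the first post:
#   2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
LOG_LINE = re.compile(r"^\S+T\S+\s+\S+\s+(?P<level>[a-z]+):\s")

# Assumption: these labels mark error-or-worse messages.
ERROR_LEVELS = {"err", "int", "sys", "fatal"}


def tally_levels(log_text):
    """Count log lines per severity label; wrapped continuation lines are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("level")] += 1
    return counts


def has_errors(log_text):
    """Return True if any line carries an error-or-worse label."""
    return any(level in ERROR_LEVELS for level in tally_levels(log_text))


sample = """\
2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
opening table within short read archive module - column LABEL
2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
opening table within short read archive module - column LABEL_START
"""
print(tally_levels(sample)["warn"])  # 2
print(has_errors(sample))            # False
```

A log containing only `warn` lines, like the one in the first post, would pass this check; it does not replace running vdb-validate on the SRA file itself.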



      • #4
        What happens if you try sam-dump on the same SRA files instead?



        • #5
          Originally posted by albireo
          What happens if you try sam-dump on the same SRA files instead?
          Hi albireo,

          Sorry for the late reply. These SRA files are pure FastQ files, not SAM files, and I'm not sure which parameters I should set to use sam-dump to decrypt these SRA files correctly even after I have read the help page of sam-dump. Could you tell me why you're interested in the output of sam-dump?



          • #6
            Thanks for sharing the information!
            May I ask why NCBI favors SRA instead of just keeping FASTQ?
            Last edited by shuoguo; 02-22-2014, 07:47 AM.



            • #7
              Originally posted by shuoguo
              Thanks for sharing the information!
              May I ask why NCBI favors SRA instead of just keeping FASTQ?
              As far as I know, FASTQ is a text-based format, so it is better to compress the files before distributing them to save time. I don't know why NCBI chose SRA instead of another popular compression format, but I guess that by developing its own compression format, NCBI gains full control over files compressed this way, most importantly over security.



              • #8
                Originally posted by Yang Ding
                As far as I know, FASTQ is a text-based format, so it is better to compress the files before distributing them to save time. I don't know why NCBI chose SRA instead of another popular compression format, but I guess that by developing its own compression format, NCBI gains full control over files compressed this way, most importantly over security.
                Thank you!
