
  • ynwh
    replied
    That is really helpful. Thank you, Ray.

    In my case (SRR1284074), running fastq-dump with no options gives:
    #fastq-dump SRR1284074
    Rejected 163480 SPOTS because SPOTLEN < 1
    Read 163482 spots for SRR1284074
    Written 2 spots for SRR1284074

    When I use "--table SEQUENCE" to dump SRR1284074, 3 spots are still rejected:
    #fastq-dump --table SEQUENCE SRR1284074
    Rejected 3 SPOTS because SPOTLEN < 1
    Read 163482 spots for SRR1284074
    Written 163479 spots for SRR1284074

    Any more suggestions or comments on this issue are very welcome.
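A quick way to sanity-check summaries like the two above is to confirm that the spots read equal the spots written plus the spots rejected. The sketch below is only an illustration: it assumes the log contains exactly the three line formats quoted in this post, and the function names `parse_summary` and `is_consistent` are made up for the example rather than part of the SRA toolkit.

```python
import re

# Patterns for the three summary lines quoted in this thread (assumed format).
_PATTERNS = {
    "rejected": re.compile(r"Rejected (\d+) SPOTS because SPOTLEN < 1"),
    "read": re.compile(r"Read (\d+) spots"),
    "written": re.compile(r"Written (\d+) spots"),
}


def parse_summary(log_text):
    """Return the rejected/read/written counts found in log_text.

    A count whose line does not appear in the log is reported as 0.
    """
    counts = {}
    for key, pattern in _PATTERNS.items():
        match = pattern.search(log_text)
        counts[key] = int(match.group(1)) if match else 0
    return counts


def is_consistent(counts):
    """True when every spot read was either written or rejected."""
    return counts["read"] == counts["written"] + counts["rejected"]


default_log = """Rejected 163480 SPOTS because SPOTLEN < 1
Read 163482 spots for SRR1284074
Written 2 spots for SRR1284074"""

table_log = """Rejected 3 SPOTS because SPOTLEN < 1
Read 163482 spots for SRR1284074
Written 163479 spots for SRR1284074"""

for label, log in (("default", default_log), ("--table SEQUENCE", table_log)):
    counts = parse_summary(log)
    print(label, counts, "consistent:", is_consistent(counts))
```

Both summaries add up, so the missing spots are all accounted for by the "SPOTLEN < 1" rejections rather than by a truncated download.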

    Last edited by ynwh; 12-04-2015, 07:00 AM.


  • rwan
    replied
    Dear all,

    Not sure if you have resolved your problem, but I had a similar problem with PacBio reads from a different data set. After reading this thread, I asked NCBI's Helpdesk, and they explained that PacBio data is special in that multiple reads with a lot of errors are used to form consensus reads. It is these consensus reads that are output when fastq-dump is run with no options:

    Code:
    fastq-dump SRR2003880

    If the raw reads are required, you need to supply the --table SEQUENCE option, i.e.,

    Code:
    fastq-dump --table SEQUENCE SRR2003880

    I hope this helps someone!

    Ray
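For anyone scripting this over many accessions, the two invocations in this post can be generated from a single helper. This is a minimal sketch: the function name `fastq_dump_command` and its `raw_reads` parameter are hypothetical, and the only fastq-dump option used is the `--table SEQUENCE` option mentioned in the post. The code only builds and prints the command, so it runs without the SRA toolkit installed.

```python
import shlex


def fastq_dump_command(accession, raw_reads=False):
    """Build the fastq-dump argument list for one accession.

    raw_reads=False -> plain run (consensus reads, per the post above).
    raw_reads=True  -> adds "--table SEQUENCE" to request the raw reads.
    """
    cmd = ["fastq-dump"]
    if raw_reads:
        cmd += ["--table", "SEQUENCE"]
    cmd.append(accession)
    return cmd


for raw in (False, True):
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(fastq_dump_command("SRR2003880", raw_reads=raw)))
```

On a machine where fastq-dump is on the PATH, the returned list could be passed to `subprocess.run(cmd, check=True)`.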


  • Retro
    replied
    We downloaded the ENA FASTQ file. It is exactly what we get from the SRA toolkit, so probably only 46K sequences are usable. What is still unclear is why the NCBI archive website shows the "zero" reads as sequences, e.g. SRA|SRR2003880.1


  • GenoMax
    replied
    ENA record appears to have the same number of spots: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/S...03880.fastq.gz


  • GenoMax
    replied
    It is possible that the download from SRA is corrupt. The best recourse is to wait to hear back from SRA support. In my experience, they generally fix these files.

    In the meantime, the HDF5 files from the Download tab are the original data from the submitter. They do not appear to include the metadata.xml file that SMRTportal requires, so you may not be able to use the original files right away.


  • Retro
    replied
    Thanks. But those reads show up on the NCBI website as not empty.


  • GenoMax
    replied
    Fastq-dump appears to be rejecting reads because of this

    "Rejected 117005 SPOTS because SPOTLEN < 1".

    These reads appear to have no sequence.

    You can confirm this yourself by doing

    Code:
    $ fastq-dump -M 0 -F SRR2003880

    You can download the original HDF5 files for this record (using the "Download" tab) and verify whether there are many zero-length sequences. You will need access to SMRTportal to properly process the raw data files.
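Once a FASTQ file is on disk, the number of empty reads can be counted directly. The sketch below is an illustration under two assumptions: the file uses plain four-line records (header, sequence, '+', quality), and a zero-length read appears as a record with an empty sequence line. The function name `count_read_lengths` and the sample reads are made up for the example.

```python
import io


def count_read_lengths(handle):
    """Count total and zero-length records in a four-line-per-record FASTQ stream."""
    total = 0
    empty = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        total += 1
        if len(sequence) == 0:
            empty += 1
    return total, empty


# Tiny made-up sample: one normal read and one zero-length read.
sample = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\n\n+\n\n"
)
total, empty = count_read_lengths(sample)
print(f"{empty} of {total} reads have no sequence")
```

For a real file, replace the sample with `open("SRR2003880.fastq")`, assuming that is the name of the dumped file.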


  • Retro
    started a topic PacBio data - problem with SRA toolkit

    PacBio data - problem with SRA toolkit

    I have problems getting FASTA from a PacBio SRA file using the SRA toolkit. For example, the file SRR2003880.sra should contain about 163K sequences, but it yields only 46K, and their names do not match those shown on the NCBI SRA website. I can successfully process other PacBio files, and I am using the newest version of the SRA toolkit with the following command line:

    sratoolkit.2.4.5-2-win64/bin/fastq-dump.exe --fasta SRR2003880.sra

    My best guess is that the upload of the data to the NCBI SRA website was incorrect. They have not answered me yet. I would very much appreciate anybody's help or opinion.

    Thank you.
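To compare the number and names of sequences in the FASTA output against what the SRA website lists, the header lines can be collected with a few lines of Python. This is a rough sketch, assuming standard FASTA where each record starts with a '>' line. The function name `fasta_names` and the sample accession `SRR0000000` are placeholders, not real data.

```python
import io


def fasta_names(handle):
    """Return the first word of every '>' header line in a FASTA stream."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


# Made-up two-record sample; sequence lines may wrap over several lines.
sample = io.StringIO(
    ">SRR0000000.1 example\nACGT\nACGT\n"
    ">SRR0000000.2 example\nGG\n"
)
names = fasta_names(sample)
print(len(names), "sequences; first name:", names[0])
```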
