  • syslm01
    Member
    • Apr 2010
    • 16

    about SRA paired datasets

    Hi everyone,

    I have a question about paired-end RNA-seq datasets on SRA. The sequence files of some paired-end datasets are named like SRR0011_1 and SRR0011_2, which indicates that they hold paired sequences. But for some datasets I can't find this information, and the read length in each of those datasets seems to be twice the length of a single RNA-seq read reported in the paper.

    so do these datasets combined two paired sequences ?

    Thank you.
  • john_mu
    Member
    • May 2010
    • 88

    #2
    What do you mean by "so do these datasets combined two paired sequences ? "? That doesn't quite make sense.

    Are you asking how to tell if two files come from paired-end reads, if that information was lost?
    SpliceMap: De novo detection of splice junctions from RNA-seq
    Download SpliceMap Comment here

    • syslm01
      Member
      • Apr 2010
      • 16

      #3
      Hi john,
      I have checked some of the paired-end read files. One read in the file looks like this:
      @SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
      NNNANNNNNNNATCTCTTTAGATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGAAGAAACCTCTGATCCACCTCTAATACATCATTTATTTTTTTTATATTTATATATATGTAAAAAGATATAAAAACAAAGAAG
      +SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
      !!!#!!!!!!!#############################################################################################################################################
      The sequence length is 152 bp, and I know their RNA-seq data is 75 bp, so I wonder whether the two paired-end reads are joined together.

      Yes, I am asking how to find the paired-end information.

      here is an example link: http://www.ncbi.nlm.nih.gov/sra/SRX017794?report=full

      Thank you

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        The SRF file format (which is how Illumina data is submitted to the SRA) stores all bases for a spot (cluster) as a single string. Meta information also stored in the SRF file indicates which portions of that string represent read 1 and read 2 if it is a paired read (as well as which portion is the index if an MID protocol is run, etc.). When a FASTQ file is extracted from the SRF, the user must indicate whether they want the read split into its parts or the entire read as a single string. Your example looks like the FASTQ output you would get when you don't specify splitting the output into reads.

        In the example you provided there are two possibilities. The SRF file may be malformed: it does not properly indicate that the data came from a paired-end method and that the data represents two reads. Alternatively, the NCBI may not be properly splitting the data when it creates the FASTQ files.

        I suggest that you contact the SRA help desk with your question: [email protected]

        • syslm01
          Member
          • Apr 2010
          • 16

          #5
          Hi kmcarr,

          I will send an email to SRA.

          Thanks for your help.

          • pascal
            Junior Member
            • Mar 2010
            • 9

            #6
            syslm01, have you received an answer from SRA? I want to analyze the same dataset...

            • fennan
              Member
              • Apr 2010
              • 19

              #7
              syslm01, I found the same issue in the same datasets.

              I also thought that the two mates of the reads might be concatenated. I ran some quality control on the reads and it confirmed this (if you want, I could send the results to you). What I did was write a script that divides the reads into two files, "*_1.fastq" and "*_2.fastq", in order to be able to use the tophat/cufflinks pipeline.

              However, I still have some concerns with these data, since the quality of the reads presents some strange properties. I also saw that the read length the original authors report in the SAM and GTF files is 75 instead of the 76 I found in the raw data... Any thoughts on that?

              • syslm01
                Member
                • Apr 2010
                • 16

                #8
                Hi pascal and fennan,

                I received a letter from SRA. Here is the reply:

                In the case with SRX017794 and runs SRR037945 and SRR037946 we had a situation when SPOT_DESCRIPTOR has incorrect.
                To reload data - we need to get fixed srf files from original submitter (that may be impossible) or develop internal way to fix such data set, it will take some time as well.
                I recommend to split data by yourself for now.

                I also separated the file into two files myself. I found that some of these reads are 75 bp and some are 76 bp; I have no idea why this happens.
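One quick way to see how read lengths are distributed in a FASTQ file is to tabulate them. The Python sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); `read_fastq` and `length_histogram` are hypothetical helper names introduced only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # the '+' separator line is not needed
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def length_histogram(lines: Iterable[str]) -> Counter:
    """Count how many reads have each sequence length."""
    return Counter(len(seq) for _, seq, _ in read_fastq(lines))


if __name__ == "__main__":
    import sys

    # Usage: python length_histogram.py < reads.fastq
    for length, count in sorted(length_histogram(sys.stdin).items()):
        print(f"{length}\t{count}")
```

Running this on the unsplit file and on each split file shows whether a mix of lengths comes from the input or from where the split point was placed.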

                • kmcarr
                  Senior Member
                  • May 2008
                  • 1181

                  #9
                  Originally posted by fennan
                  However, I still have some concerns with these data, since the quality of the reads presents some strange properties. I also saw that the read length the original authors report in the SAM and GTF files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
                  For Illumina sequencing it is normal to collect one additional cycle of data for each read; that is, if the final read length you want is 75 nt then you will collect 76 cycles of data, but the base from the last cycle is not reported. (This has to do with phasing/prephasing correction. To correct for phasing in cycle n you need data from cycle n+1; thus the last cycle can never have phasing correction applied to it, so standard procedure is to trim it off.) To collect 2 X 75 nt paired-end reads you would want 152 cycles (2 X 76). If the SRF file had been properly formed, the command line option "--use_bases Y75n,Y75n" would have been used. This would signify that within the 152 cycles of raw data, cycles 1-75 are read 1, cycle 76 is to be ignored, cycles 77-151 are read 2 and cycle 152 is ignored. When FASTQ is output from the SRF file (e.g. by the program srf2fastq), the data would be split into separate FASTQ files for reads 1 and 2.
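The cycle arithmetic above can be checked mechanically. The Python sketch below assumes a simplified mask syntax inferred from this post only: `Y` followed by an optional count marks cycles to keep, `n` marks cycles to ignore, and commas separate reads. `parse_use_bases` is a hypothetical helper, not part of any Illumina or SRA tool, and it assumes the kept cycles of each read are contiguous.

```python
import re
from typing import List, Tuple


def parse_use_bases(mask: str) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) kept cycles, one pair per read."""
    ranges = []
    cycle = 0  # number of cycles consumed so far across all reads
    for read_mask in mask.split(","):
        kept = []
        for symbol, count in re.findall(r"([YyNn])(\d*)", read_mask):
            n = int(count) if count else 1  # a bare symbol means one cycle
            if symbol.upper() == "Y":
                kept.extend(range(cycle + 1, cycle + n + 1))
            cycle += n
        if kept:
            ranges.append((kept[0], kept[-1]))
    return ranges


if __name__ == "__main__":
    # Mask quoted in the post: two reads of 75 kept cycles plus 1 ignored cycle.
    print(parse_use_bases("Y75n,Y75n"))
```

For the mask quoted in the post this yields cycles 1-75 for read 1 and 77-151 for read 2, matching the description above.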

                  If you are going to split the 152 nt reads manually do as stated above, nt 1-75 for read 1 and nt 77-151 for read 2.
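A minimal Python sketch of the manual split just described, assuming every record is a 152 nt concatenated spot in four-line FASTQ form. The function names, the file names in the usage comment, and the `/1` and `/2` name suffixes are illustrative assumptions, not something prescribed by the SRA.

```python
def split_record(header, seq, qual, read_len=75, skip=1):
    """Split one concatenated spot into (mate1, mate2) FASTQ records.

    With the defaults, nt 1-75 become read 1, nt 76 is dropped,
    nt 77-151 become read 2, and nt 152 is dropped.
    """
    start2 = read_len + skip      # 0-based index of nt 77
    end2 = start2 + read_len      # exclusive end, i.e. through nt 151
    if len(seq) != len(qual) or len(seq) < end2:
        raise ValueError(f"unexpected record length for {header!r}")
    name = header.split()[0]      # drop the description after the read name
    mate1 = (f"{name}/1", seq[:read_len], qual[:read_len])
    mate2 = (f"{name}/2", seq[start2:end2], qual[start2:end2])
    return mate1, mate2


def split_fastq(in_path, out1_path, out2_path, read_len=75, skip=1):
    """Write the two mates of every record to two separate FASTQ files."""
    with open(in_path) as fin, open(out1_path, "w") as f1, open(out2_path, "w") as f2:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break             # end of file
            seq = fin.readline().rstrip("\n")
            fin.readline()        # '+' separator line
            qual = fin.readline().rstrip("\n")
            mates = split_record(header, seq, qual, read_len, skip)
            for (h, s, q), out in zip(mates, (f1, f2)):
                out.write(f"{h}\n{s}\n+\n{q}\n")


# Example call (hypothetical file names):
# split_fastq("SRR037945.fastq", "SRR037945_1.fastq", "SRR037945_2.fastq")
```

Raising an error on short or inconsistent records, rather than silently truncating, makes it obvious if a file contains spots that are not 152 nt long.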

                  Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?

                  • syslm01
                    Member
                    • Apr 2010
                    • 16

                    #10
                    Originally posted by fennan
                    syslm01, I found the same issue in the same datasets.

                    I also thought that the two mates of the reads might be concatenated. I ran some quality control on the reads and it confirmed this (if you want, I could send the results to you). What I did was write a script that divides the reads into two files, "*_1.fastq" and "*_2.fastq", in order to be able to use the tophat/cufflinks pipeline.

                    However, I still have some concerns with these data, since the quality of the reads presents some strange properties. I also saw that the read length the original authors report in the SAM and GTF files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
                    Hi,

                    Did you use the datasets to run tophat and cufflinks? Were the results the same as their provided SAM files? I gave it a try, but my result is different.

                    • fennan
                      Member
                      • Apr 2010
                      • 19

                      #11
                      @kmcarr
                      Thank you very much for the information. It really is what I was looking for. The thing is that you cannot download the SRF file, only the FASTQ, and that's why I need to split it manually.

                      Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?
                      I have obtained some quality control graphs from the raw data. I could provide them to you if you are interested. The thing that caught my attention the most was the difference in quality between the first and the second read, as well as the low quality of the base T in the second read. You can see an example of this in the attached image. It shows the mean quality of each base per position (T is the blue line) and was generated from the file "SRR037945.fastq" of the run "SRX017794" (similar graphs are obtained for most of the other FASTQ files). Do you have any idea why this is happening?
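For anyone who wants to reproduce this kind of graph, here is a rough Python sketch of the underlying calculation: the mean quality score per position, computed separately for each base. It assumes Phred+33 quality encoding, which is consistent with the "!" characters in the example record earlier in the thread but is still an assumption about these files. `mean_quality_by_base` is a hypothetical name, and this is not the script used for the attached image.

```python
from collections import defaultdict


def mean_quality_by_base(records, offset=33):
    """Mean quality per position for each base.

    records: iterable of (sequence, quality_string) pairs.
    Returns {base: [mean at position 0, mean at position 1, ...]},
    with None where that base never occurs at that position.
    """
    sums = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(lambda: defaultdict(int))
    max_len = 0
    for seq, qual in records:
        max_len = max(max_len, len(seq))
        for pos, (base, q) in enumerate(zip(seq, qual)):
            sums[base][pos] += ord(q) - offset  # decode the quality character
            counts[base][pos] += 1
    return {
        base: [
            sums[base][pos] / counts[base][pos] if counts[base][pos] else None
            for pos in range(max_len)
        ]
        for base in sums
    }
```

Plotting the list for each base against position (for example positions 0-74 for read 1 and 76-150 for read 2 of an unsplit 152 nt record) gives one line per base.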

                      Thanks again for your help.
                      Last edited by fennan; 05-27-2010, 04:06 AM.

                      • fennan
                        Member
                        • Apr 2010
                        • 19

                        #12
                        Originally posted by syslm01
                        Hi,

                        Did you use the datasets to run tophat and cufflinks? Were the results the same as their provided SAM files? I gave it a try, but my result is different.
                        That was what I wanted to do at first. I haven't done it yet since I wasn't sure how to deal with the raw data.

                        However, in the header of the SAM file you can find the command used to create the mapping. Take a look at it and maybe it will help you figure out how things should be done. Unfortunately, this is not the case for the cufflinks output. I think it would be very useful if cufflinks stored the command line used to create its outputs (maybe it does already, and I just haven't found where).
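A short Python sketch for pulling the recorded command out of a SAM header. It assumes the command is stored in the `CL` field of an `@PG` header line, which the SAM specification allows but does not require; `program_commands` is a hypothetical helper name.

```python
def program_commands(sam_lines):
    """Return one dict per @PG header line, mapping each tag to its value."""
    programs = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; alignment records follow
        if line.startswith("@PG"):
            fields = line.rstrip("\n").split("\t")[1:]
            # Each field looks like TAG:value; split on the first colon only.
            programs.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return programs


if __name__ == "__main__":
    import sys

    # Usage: python sam_commands.py < alignments.sam
    for program in program_commands(sys.stdin):
        print(program.get("ID", "?"), program.get("CL", "(no CL field)"))
```

Stopping at the first non-header line avoids reading the whole alignment file just to inspect the header.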

                        • syslm01
                          Member
                          • Apr 2010
                          • 16

                          #13
                          Hi fennan,

                          I checked their command line: they use mm9+wold_spikes as the reference and provide tophat with the junction file pooled_200bp_frags.juncs. I'm not sure what these files are; I think that may cause the differences. Do you have any idea?

                          Please tell me if you are sure how to deal with the raw data.

                          Thank you very much.

                          • syslm01
                            Member
                            • Apr 2010
                            • 16

                            #14
                            Hi,

                            I am also not sure about another dataset: ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX019/SRX019275
                            SRR039999_1.fastq.gz and SRR039999_2.fastq.gz are paired reads, but I am not sure about the SRR039999.fastq.gz file. Does it also belong to SRR039999? I can't find any paired-end information for it.

                            Does anyone have experience with this kind of data?

                            Thanks

                            • ychen
                              Junior Member
                              • Feb 2010
                              • 4

                              #15
                              Hi Folks,

                              I feel lucky to have found this thread because I have been struggling with the same problems. After splitting the unusual FASTQ files, my TopHat results are still quite different from what is reported in the recently published paper. Can you tell me where to find the provided SAM file? I want to try the reported command line.

                              Thanks a lot,

                              Yi-Shiou
