**john_mu** · 05-23-2010, 06:29 PM

What do you mean by "so do these datasets combined two paired sequences ?"? That doesn't quite make sense.

Are you asking how to tell if two files come from paired-end reads, if that information was lost?

**syslm01** · 05-23-2010, 06:44 PM

Hi john,
I have checked some of the paired-end read files. One read in the file looks like this:

@SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
NNNANNNNNNNATCTCTTTAGATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGAAGAAACCTCTGATCCACCTCTAATACATCATTTATTTTTTTTATATTTATATATATGTAAAAAGATATAAAAACAAAGAAG
+SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
!!!#!!!!!!!#############################################################################################################################################

The sequence length is 152 bp, but I know their RNA-seq data is 75 bp, so I wonder whether the two paired-end reads have been joined together.

Yes, I am asking how to find the paired-end information.

Here is an example link: http://www.ncbi.nlm.nih.gov/sra/SRX017794?report=full

Thank you

**kmcarr** · 05-24-2010, 04:37 AM

The srf file format (which is how Illumina data is submitted to the SRA) stores all bases for a spot (cluster) as a single string. Meta information also stored in the srf file indicates which portions of that string represent read 1 and read 2 if it is a paired read (as well as which portion is the index if an MID protocol is run, etc.). When a FASTQ file is extracted from the srf, the user must indicate whether they want the read split into its parts or the entire read as a single string. Your example looks like the FASTQ output you would get when you don't specify splitting the output into reads.

In the example you provided there are two possibilities. The srf file may be malformed: it does not properly indicate that the data came from a paired-end method and that each spot represents two reads. Alternatively, the NCBI may not be properly splitting the data when it creates the FASTQ files.

I suggest that you contact the SRA help desk with your question: [email protected]

**syslm01** · 05-24-2010, 04:50 AM

Hi kmcarr,

I will send an email to SRA.

Thanks for your help.

**pascal** · 05-26-2010, 06:36 AM

syslm01, have you received an answer from SRA? I want to analyze the same dataset...

**fennan** · 05-26-2010, 06:54 AM

syslm01, I found the same issue in the same datasets.

I also thought that the two mates of each read might be concatenated. I ran some quality control on the reads and it confirmed this (if you want, I could send you the results). What I did was write a script that divides the reads into two files, "*_1.fastq" and "*_2.fastq", so that I can use the tophat/cufflinks pipeline.

However, I still have some concerns about these data, since the quality of the reads shows some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?

**syslm01** · 05-26-2010, 07:25 AM

Hi pascal and fennan,

I received a letter from SRA. Here is the reply:

In the case of SRX017794 and runs SRR037945 and SRR037946 we had a situation where the SPOT_DESCRIPTOR was incorrect.
To reload the data we need to get fixed srf files from the original submitter (that may be impossible) or develop an internal way to fix such data sets, which will take some time as well.
I recommend splitting the data yourself for now.

I also separated the file into two files myself. I found that some of the reads are 75 bp and some are 76 bp, and I have no idea why this happens.

**kmcarr** · 05-26-2010, 09:01 AM

Originally posted by fennan

However, I still have some concerns about these data, since the quality of the reads shows some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?

For Illumina sequencing it is normal to collect one additional cycle of data for each read; that is, if the final read length you want is 75 nt then you will collect 76 cycles of data, but the base from the last cycle is not reported. (This has to do with phasing/prephasing correction. To correct for phasing in cycle n you need data from cycle n+1; thus the last cycle can never have phasing correction applied to it, so standard procedure is to trim it off.) To collect 2 x 75 nt paired-end reads you would want 152 cycles (2 x 76). If the SRF file had been properly formed, the command line option "--use_bases Y75n,Y75n" would have been used. This would signify that within the 152 cycles of raw data, cycles 1-75 are read 1, cycle 76 is to be ignored, cycles 77-151 are read 2 and cycle 152 is ignored. When FASTQ is output from the SRF file (e.g. by the program srf2fastq) the data would then be split into separate fastq files for reads 1 and 2.
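To make the cycle arithmetic above concrete, here is a minimal Python sketch that turns a mask like "Y75n,Y75n" into cycle ranges. It implements only the reading of the mask described in this post (Y = keep, n = ignore, an optional count after each letter); the function name `mask_to_ranges` is hypothetical and this is not taken from any Illumina or srf tool.

```python
import re


def mask_to_ranges(mask):
    """Return (kept cycle ranges per read, total cycles) for a mask like 'Y75n,Y75n'.

    Ranges are 1-based and inclusive. Assumes only the letters Y/N (any case),
    each optionally followed by a repeat count, as described in the post.
    """
    ranges = []
    cycle = 1  # next cycle number to be assigned
    for read_mask in mask.split(","):
        kept = []
        for flag, count in re.findall(r"([YyNn])(\d*)", read_mask):
            n = int(count) if count else 1
            if flag in "Yy":
                kept.append((cycle, cycle + n - 1))
            cycle += n  # ignored cycles still consume cycle numbers
        ranges.append(kept)
    return ranges, cycle - 1


ranges, total = mask_to_ranges("Y75n,Y75n")
print(ranges)  # [[(1, 75)], [(77, 151)]]
print(total)   # 152
```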

If you are going to split the 152 nt reads manually, do as stated above: nt 1-75 for read 1 and nt 77-151 for read 2.

Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?
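A manual split along those lines could look like the following Python sketch. It assumes plain four-line FASTQ records (no wrapped sequence lines) where every spot is 152 nt; the function names, the "/1" and "/2" name suffixes and the output file layout are illustrative choices, not something the SRA or TopHat prescribes.

```python
def split_record(header, seq, qual, read_len=75, skip=1):
    """Split one concatenated spot into two mates.

    Keeps nt 1-75 as read 1 and nt 77-151 as read 2, dropping the extra
    cycle that follows each read.
    """
    start2 = read_len + skip  # 0-based start of read 2 (nt 77 when 1-based)
    if len(seq) < start2 + read_len or len(seq) != len(qual):
        raise ValueError(f"unexpected record length for {header!r}")
    name = header.split()[0]  # e.g. "@SRR037945.1"
    mate1 = (f"{name}/1", seq[:read_len], qual[:read_len])
    mate2 = (f"{name}/2", seq[start2:start2 + read_len], qual[start2:start2 + read_len])
    return mate1, mate2


def split_fastq(in_path, out1_path, out2_path, read_len=75, skip=1):
    """Write the two mates of every record to two separate FASTQ files."""
    with open(in_path) as fin, open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:  # end of file (or an empty line)
                break
            seq = fin.readline().rstrip("\n")
            fin.readline()  # the '+' separator line is not needed
            qual = fin.readline().rstrip("\n")
            mates = split_record(header, seq, qual, read_len, skip)
            for out, (name, s, q) in zip((out1, out2), mates):
                out.write(f"{name}\n{s}\n+\n{q}\n")
```

For example, `split_fastq("SRR037945.fastq", "SRR037945_1.fastq", "SRR037945_2.fastq")` would produce the two files in the "*_1.fastq" / "*_2.fastq" layout mentioned earlier in the thread.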

**syslm01** · 05-27-2010, 02:59 AM

Originally posted by fennan

syslm01, I found the same issue in the same datasets.

I also thought that the two mates of each read might be concatenated. I ran some quality control on the reads and it confirmed this (if you want, I could send you the results). What I did was write a script that divides the reads into two files, "*_1.fastq" and "*_2.fastq", so that I can use the tophat/cufflinks pipeline.

However, I still have some concerns about these data, since the quality of the reads shows some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?

Hi,

Did you use the datasets to run tophat and cufflinks? Were the results the same as the sam files they provided? I tried, but my results are different.

**fennan** · 05-27-2010, 03:27 AM

@kmcarr
Thank you very much for the information. It really is what I was looking for. The thing is that you cannot download the srf file, only the fastq, and that's why I need to split it manually.

Originally posted by kmcarr

Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?

I have obtained some quality control graphs from the raw data. I could provide them to you if you are interested. What caught my attention the most was the difference in quality between the first and the second read, as well as the low quality of the base T in the second read. You can see an example of this in the attached image. It shows the mean quality of each base per position (T is the blue line), and was generated from the file "SRR037945.fastq" of the run "SRX017794" (similar graphs are obtained for most of the other fastq files). Do you have any idea why this is happening?
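For anyone wanting to reproduce a per-position, per-base mean quality table like the one behind that graph, here is a rough Python sketch. It is not the script used for the attached image. It assumes four-line FASTQ records and a Phred+33 quality encoding (an assumption, though consistent with the '!' and '#' characters in the example read above); the function names are hypothetical.

```python
from collections import defaultdict


def read_fastq(path):
    """Yield (sequence, quality) pairs from a four-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            yield seq, qual


def mean_quality_by_base(records, offset=33):
    """Return {(position, base): mean quality}, with 1-based positions.

    offset=33 assumes Phred+33 encoded quality characters.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, qual in records:
        for pos, (base, q) in enumerate(zip(seq, qual), start=1):
            sums[(pos, base)] += ord(q) - offset
            counts[(pos, base)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Calling `mean_quality_by_base(read_fastq("SRR037945.fastq"))` and plotting one line per base against position would give a graph of the same kind; with the unsplit file, positions 77-151 correspond to the second read.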

Thanks again for your help.

Attached file: basesQualities.jpg

**fennan** · 05-27-2010, 03:35 AM

Originally posted by syslm01

Hi,

Did you use the datasets to run tophat and cufflinks? Were the results the same as the sam files they provided? I tried, but my results are different.

That was what I wanted to do at first. I haven't done it yet since I wasn't sure how to deal with the raw data.

However, in the header of the sam file you can find the command used to create the mapping. Take a look at it and maybe it will help you figure out how things should be done. Unfortunately, this is not the case for the cufflinks output. I think it would be very useful if cufflinks stored the command line used to create its outputs (maybe it already does, and I just haven't found where).

**syslm01** · 05-27-2010, 04:39 AM

Hi fennan,

I checked their command line. They use mm9+wold_spikes as the reference and provide tophat with the junction file pooled_200bp_frags.juncs. I'm not sure what these files are, and I think that may cause the differences. Do you have any idea?

Please tell me once you are sure how to deal with the raw data.

Thank you very much.

**syslm01** · 05-27-2010, 07:36 AM

Hi,

I am also not sure about another dataset: ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX019/SRX019275
SRR039999_1.fastq.gz and SRR039999_2.fastq.gz are paired reads, but I am not sure about SRR039999.fastq.gz. Does it also belong to SRR039999? I can't find any paired-end information for it.

Does anyone have experiences with this kind of data?

Thanks

**ychen** · 05-29-2010, 07:07 AM

Hi Folks,

I feel lucky to have found this thread because I have been struggling with the same problems. After splitting the unusual FASTQ files, my TopHat results are still quite different from what was reported in the recently published paper. Can you tell me where to find the provided SAM file? I want to try the reported command line.

Thanks a lot,

Yi-Shiou


about SRA paired datasets
