**alexdobin** · 05-23-2012, 11:26 AM

ENCODE data

Hi prussiap,

you are talking about ENCODE data, not ENSEMBL, right?

The sample you have chosen is not a good example: it is one of the earliest samples we generated, with an unusual library prep and sub-par sequencing quality. I would strongly recommend other samples, such as whole cell poly-A+/- for K562 and other cell lines. ENCODE RNA-seq data was mapped with STAR: ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.1.1/

If you have questions about ENCODE data, please send me a message.

Alex

**wupengpro** · 07-03-2012, 08:24 AM

Hi Alex,

I have downloaded some ENCODE datasets from SRA at NCBI (http://www.ncbi.nlm.nih.gov/sra/SRX135162?&report=full). Are these ENCODE datasets raw data or clean data? Do I need to do any further quality control? Which method of quality control do you recommend?

Thank you!

**alexdobin** · 07-03-2012, 09:47 AM

Hi @wupengpro,
the ENCODE data deposited in SRA is raw, filtered only by the standard Illumina chastity filters. All of the data is clean and of high quality, judged by high mapping rates (90-95%), high correlation of gene expression between biological replicates (>0.98), and correct clustering of the samples. I think you do not need any additional quality control or filtering of the .fastq files. However, it's always advisable to filter your alignments, for example, to remove multi-mappers, non-concordant mates, and non-canonical junctions.

**Richard Finney** · 07-03-2012, 09:56 AM

... it's always advisable to filter your alignments, for example, remove multi-mappers, non-concordant mates, non-canonical junctions.

This is interesting information to be throwing away.

**alexdobin** · 07-03-2012, 10:51 AM

That is indeed interesting information for some applications, however, it also contains a significantly larger percentage of mis-mappings (i.e. false positives). I guess I need to re-formulate my statement more carefully: if the study does not involve (i) highly similar loci (e.g. paralogs), (ii) fusion/chimeric transcripts, or (iii) non-canonical splicing, it is advisable to remove (i) multi-mappers, (ii) non-concordant mates, (iii) non-canonical junctions.
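
As a rough illustration of the first two filters, here is a minimal Python sketch that works on plain SAM text lines. It assumes the aligner writes the standard `NH:i:` tag (number of reported hits) and sets the "properly paired" flag bit (0x2) for concordant mates; treating a record without an `NH` tag as unique is an assumption of this sketch, and the function names are hypothetical. The non-canonical junction filter is left out because it needs the genome sequence or aligner-specific junction annotations.

```python
def keep_alignment(sam_line, require_proper_pair=True):
    """Return True for a mapped, uniquely aligned and (if paired) concordant record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:                      # read is unmapped
        return False
    nh = 1                              # assumption: no NH tag means a single hit
    for tag in fields[11:]:             # optional tags start at column 12
        if tag.startswith("NH:i:"):
            nh = int(tag[5:])
            break
    if nh > 1:                          # multi-mapper
        return False
    if require_proper_pair and (flag & 0x1) and not (flag & 0x2):
        return False                    # paired read whose mates are not concordant
    return True


def filter_sam(lines):
    """Yield header lines unchanged and only those alignments that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line):
            yield line
```

For a study of paralogs or fusion transcripts, the same function can be called with `require_proper_pair=False`, or the `NH` check can be dropped, so that the relevant alignments are kept.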

**per_ngs** · 09-27-2012, 04:37 AM

Quality scores in fastqc for ENCODE RNASeq data

Hello,
I just downloaded the LHCN RNA-seq data generated at Caltech. I merged the fastq files from the 3 runs into a single file and ran FastQC, and I am a little bit confused about the output I got. The per-base quality graph in FastQC shows quality scores going up to 70 (attached), and the per-sequence graph shows peaks at approximately 38 and 68 (also attached). According to the ENCODE documentation, the quality scores are Phred+33, so why do the quality score graphs look like this?
Apologies if my question is silly and I am not understanding the way FastQC works.

Thanks for help.
NGSnewbie

**Sujani** · 09-27-2012, 05:09 AM

Hello all,

When I try to sequence bacterial 16S rRNA using an ABI 3130, it gives me heterozygous peaks. Since the microbes contain only a haploid set of chromosomes, I am puzzled how two peaks could appear.
Can someone please explain?

**GenoMax** · 09-27-2012, 05:43 AM

Sujani,

You should create a new post for this question rather than this current thread. Perhaps one of the moderators can do it for you.

**Sujani** · 09-27-2012, 05:51 AM

GenoMax,

I'm really sorry for the inconvenience. Unfortunately, I'm finding it hard to post a new thread.

**GenoMax** · 09-27-2012, 06:15 AM

Once you log into SeqAnswers, click on the "Forum" link in the top left quadrant under "site navigation".

Select the appropriate forum to post in by clicking on the main title of the forum (e.g. core facilities).

On the page that opens next there should be a "new thread" button towards top left.

**Sujani** · 09-28-2012, 12:44 AM

Genomax,

Thanks a lot for the help!! I was able to post my issue as a new thread!!!

**cwzkevin** · 10-02-2012, 08:42 AM

Remember that you combined the results of three runs into one file. It is very likely that the three runs do not all have the same Phred offset. The mode at 38 comes from the Phred33 reads, and the mode at 68 comes from Phred64 reads decoded as if they were Phred33 (a true quality of 37 is shifted up by 64 - 33 = 31).
You must have combined the three runs in this order:
Phred33, then Phred??, then Phred64
so that when FastQC tries to guess the offset, all it sees are Phred33 codes, and it concludes that the data is Phred33. FastQC doesn't use all the reads in your data to guess; it only uses 200,000 reads, if I am correct.
Having settled on Phred33, FastQC keeps reading your quality values and maps the ASCII codes using the Phred33 offset. That is why the second mode shows up at 68.
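
To check each run before merging, one could guess the offset from the range of ASCII codes in the quality lines. The sketch below is a heuristic only, not FastQC's actual detection logic: it assumes 4-line FASTQ records, that characters below `;` (code 59) never occur in Phred+64 data, and that characters above `J` (code 74) are unusual for Phred+33 Illumina data. The function names and the `max_reads` parameter are hypothetical.

```python
def quality_lines(fastq_lines):
    """Yield the quality string (4th line) of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def guess_offset(fastq_lines, max_reads=None):
    """Return 33, 64, or None (ambiguous) based on the observed ASCII range."""
    lo, hi = 255, 0
    for n, qual in enumerate(quality_lines(fastq_lines)):
        if max_reads is not None and n >= max_reads:
            break                       # only inspect the first max_reads records
        if not qual:
            continue
        codes = [ord(c) for c in qual]
        lo = min(lo, min(codes))
        hi = max(hi, max(codes))
    if lo < 59:                         # too low to be Phred+64
        return 33
    if hi > 74:                         # too high for typical Phred+33
        return 64
    return None                         # range is compatible with both


# A Phred+64 character for Q37 is 'e' (code 101); read as Phred+33 it decodes to 68.
shifted = ord("e") - 33
```

Run on a merged file that starts with Phred+33 reads, a check like this still reports 33, which is why the offset has to be checked per run rather than on the combined file.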

FASTQ format - Wikipedia

http://en.wikipedia.org/wiki/FASTQ_format#Encoding

**per_ngs** · 10-03-2012, 12:26 AM

Hello Kevin,
Thanks for the response. I did find out from other sources on SeqAnswers that the files I combined had different Phred offsets. I ran FastQC on each of the files individually and noticed this as well. So, for now, I am processing the data separately.
Regards,
NGSnewbie
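
If the runs are to be merged later, one option is to re-encode the Phred+64 files as Phred+33 first. The following is a minimal sketch, assuming 4-line FASTQ records and Illumina-style Phred+64 scores (no characters below `@`); it raises an error rather than guessing if that assumption is violated. The function name is hypothetical.

```python
def phred64_to_phred33(fastq_lines):
    """Yield FASTQ lines with quality strings shifted from offset 64 to offset 33."""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 3:                          # quality line of the record
            codes = [ord(c) - 31 for c in line]  # 64 - 33 = 31
            if codes and min(codes) < 33:
                raise ValueError("quality below '@': input is not plain Phred+64")
            line = "".join(chr(c) for c in codes)
        yield line
```

Only the files confirmed to be Phred+64 should go through this conversion; applying it to Phred+33 data would push most qualities below the valid range and trigger the error.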

Public RAW RNA-seq data Now What!!
