Dear all,
I received raw data from a sequencing facility for an experiment in which cells were either infected with a virus (INF) or left uninfected (UNI). There were three biological replicates for each condition (UNI-1, UNI-2, UNI-3, INF-1, INF-2 and INF-3), and RNA-seq for each sample was done on a separate lane of a GAIIx flowcell (I am aware of the Auer and Doerge Genetics paper – please let’s skip that for now). Libraries were prepared using a Nextera kit.
Since I am a self-taught newbie without formal bioinformatics education, I would really be grateful for your comments and advice on the FastQC results and some read pre-processing that I’ve done.
First, all of the samples had significant amounts of contamination with Nextera transposase sequences. I therefore used BBDuk to remove the adapters and quality-trim the raw reads. To clean the reads further, I then mapped the trimmed reads to a “contaminants metagenome” (containing rRNA, E. coli, S. cerevisiae, UniVec database sequences and others), and I plan to use the unmapped reads from this step for downstream analysis (mapping to the reference genome, DE analysis, etc.).
So, for each replicate there are now three different fastq files: raw, trimmed, and trimmed & decontaminated. FastQC results for each are in the attached pdf file, together with some trimming, decontamination and preliminary mapping statistics (each file was mapped to the mouse reference genome with STAR).
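In case it is useful for comparing the three files per replicate, here is a minimal sketch of how read counts and read-length ranges could be tallied. It assumes plain four-line FASTQ records; the function names and the tiny in-memory example are made up for illustration, and this is not the script behind the attached statistics.

```python
import io
from typing import Dict, Iterator, Optional, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def summarize(handle: TextIO) -> Dict[str, Optional[int]]:
    """Count reads and bases, and track the shortest and longest read."""
    reads = 0
    bases = 0
    min_len: Optional[int] = None
    max_len: Optional[int] = None
    for _header, seq, _qual in read_fastq(handle):
        n = len(seq)
        reads += 1
        bases += n
        min_len = n if min_len is None else min(min_len, n)
        max_len = n if max_len is None else max(max_len, n)
    return {"reads": reads, "bases": bases, "min_len": min_len, "max_len": max_len}


if __name__ == "__main__":
    # Hypothetical two-read example standing in for a real fastq file.
    demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n")
    print(summarize(demo))
```

For real files the handle would come from `open(path)`, or `gzip.open(path, "rt")` for gzipped fastq.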
My questions are:
1) Which reads would be best to use for mapping to the reference and DE analysis? (I’m guessing trimmed or trimmed & decontaminated, but I cannot decide which.)
2) The number of reads in sample UNI-2 is ~3x smaller than in the other samples. Would such a big difference in library size present a problem during DE analysis in DESeq2, edgeR and voom+limma? In other words, should I discard this replicate during DE analysis?
3) What type of problem is most likely responsible for the per tile sequence quality pattern in samples UNI-2, INF-1 and INF-2? (I would like to know this so that I can discuss the issue with the sequencing facility in a more knowledgeable manner, if necessary.)
4) Would you agree that the adapter contamination is the driving force behind the nonstandard appearance of the “Per base sequence content” graphs? (I am referring to the rising %C line, not the beginning of the reads.)
5) Are the “Per sequence GC content” distributions horrible, particularly for samples INF-2 and INF-3? Are the small peaks at low GC content attributable to poly-A tails?
6) Some of the reads have been trimmed to very short lengths. Will STAR appropriately discard reads that are below some reasonable length, or do I have to do it manually? If so, what would be the minimum suggested (allowed) read length to use during mapping for DE analysis?
7) I really don’t know what to make of the “Sequence duplication levels” graph. Neither trimming nor decontamination seemed to have any appreciable effect. Does anyone know what the usual culprit is for such big differences, even between biological replicates? Do they have to be dealt with in some way?
8) Trimming seems to deal nicely with adapter contamination, but the Kmer content graphs still show overrepresentation of certain Kmers at the ends of reads. Did I miss any adapters that need to be removed? Should I trim these out somehow, or can I let STAR soft-clip them during mapping?
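Regarding question 6, if the length filtering does have to be done manually, I imagine something like the following sketch would do. It assumes four-line FASTQ records and handles one file at a time (paired-end files would need both mates dropped together). `filter_by_length` is a hypothetical helper, and the `min_len` value in the example is a placeholder, not a recommended cutoff.

```python
import io
from typing import TextIO, Tuple


def filter_by_length(src: TextIO, dst: TextIO, min_len: int) -> Tuple[int, int]:
    """Copy FASTQ records whose sequence has at least min_len bases.

    Returns (kept, dropped) record counts.
    """
    kept = 0
    dropped = 0
    while True:
        # One FASTQ record = header, sequence, '+' separator, quality.
        record = [src.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of file
            break
        if len(record[1].rstrip("\n")) >= min_len:
            dst.writelines(record)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical reads; min_len=5 is only a placeholder threshold.
    reads = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n")
    out = io.StringIO()
    print(filter_by_length(reads, out, min_len=5))
    print(out.getvalue(), end="")
```

With real data, `src` and `dst` would be file handles opened for reading and writing.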
Big thanks to everyone, if only for reading such a big post.