I recently received my first short-read data set, one lane of 2x100 bp Illumina HiSeq reads. I'm hoping the community can help me identify the source of the duplicate sequences indicated in the FastQC reports on the data.
FastQC showed high duplication (>60%) for both forward and reverse reads. The report for the reverse reads did not turn up any specific over-represented sequences, while the report for the forward reads identified a PCR primer and adapter sequence. However, a Bowtie alignment against Illumina paired-end adapters and primers showed 0% alignment. And when I tried to use Picard to mark and remove duplicate reads, no reads were removed (Picard command below).
My reads are from a single lane of 12 individually barcoded cDNA sub-libraries from a non-model organism (no reference genome). Six of these libraries were normalized (via DSN digestion), six were not. Has anyone seen similar FastQC curves for RNA-seq data?
Is there a way to search the file for the actual sequences that are highly duplicated?
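To make the question concrete, here is a rough sketch of the kind of search I have in mind: tally exact read sequences in a FASTQ file and list the most frequent ones. The function name `top_duplicated_sequences`, the `max_reads` cap, and the file name in the usage note are my own placeholders, and it assumes plain 4-line FASTQ records (optionally gzipped). It only catches exact duplicates, so I don't know whether it would reproduce what FastQC reports.

```python
from collections import Counter
import gzip


def top_duplicated_sequences(fastq_path, n=10, max_reads=None):
    """Return the n most frequent read sequences as (sequence, count) pairs.

    Assumes 4-line FASTQ records. max_reads (hypothetical parameter) limits
    how many reads are tallied, to keep memory use manageable on a full lane.
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    counts = Counter()
    reads_seen = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of each 4-line record.
            if line_number % 4 == 1:
                counts[line.strip()] += 1
                reads_seen += 1
                if max_reads is not None and reads_seen >= max_reads:
                    break
    return counts.most_common(n)
```

Calling `top_duplicated_sequences("forward_reads.fastq", n=20, max_reads=1000000)` would then give the 20 most common sequences among the first million reads, which I could check against the primer and adapter sequences that FastQC flagged.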
Full FastQC reports are attached.
[command run as below, though I have omitted the path for each file]
nohup java -jar MarkDuplicates.jar INPUT=sequence_file.bam OUTPUT=deduplicated_reads.bam METRICS_FILE=deduplicated_reads_metrics.txt REMOVE_DUPLICATES=true &