I want to preface this by saying I'm relatively new to NGS analysis.
I recently received raw WXS data (paired-end 100 bp reads at 100X coverage, 12 Gb of data, captured with the Agilent SureSelect All Human Exon V5 kit). I noticed something was really off from the get-go.
The difference in file size between the normal and tumor samples is enormous. The read 1 and read 2 FASTQ files of the normal sample are about 7 GB each, while those of the tumor sample are about 40 GB each. I've worked with exome data before, and the paired files were usually close in size.
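File size alone can mislead (compression, header length), so it may help to count reads and bases directly. The following is a minimal sketch, assuming standard 4-line FASTQ records with unwrapped sequences; the function name `fastq_stats` is made up for illustration. If the 12 Gb figure applies per sample, 100 bp reads would give roughly 12e9 / 100 = 120 million reads across both files.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, total_bases) for a plain or gzipped FASTQ file.

    Assumes each record is exactly 4 lines (no wrapped sequence lines).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Running this on read 1 of both the normal and the tumor sample would show whether the tumor really contains several times more reads, or whether the size gap comes from something else.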
Anyway, I assumed everything was okay and put the raw data through the "pipeline": the usual alignment, sorting, duplicate marking, indel realignment, mate-information fixing, base recalibration, etc. Almost 45% of the reads failed the DuplicateReadFilter (GATK BaseRecalibrator), and another 45% failed the MappingQualityZeroFilter (GATK BaseRecalibrator)! I have to filter out >90% of my sequence reads!
In my previous runs, I've filtered out at most ~10%.
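To double-check those percentages outside of GATK, the duplicate flag and mapping quality can be tallied straight from the alignments. This is only a rough sketch: it assumes tab-separated SAM text lines as input, uses the SAM duplicate flag bit (0x400) and the MAPQ column, and counts a read as MAPQ 0 only if it is not already a duplicate, which may not match exactly how GATK reports its filters. The name `tally_filtered` is hypothetical.

```python
def tally_filtered(sam_lines):
    """Count duplicate-flagged and MAPQ-0 reads in SAM-formatted text lines."""
    total = 0
    duplicates = 0
    mapq_zero = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        total += 1
        if flag & 0x400:        # 0x400 = PCR or optical duplicate
            duplicates += 1
        elif mapq == 0:
            mapq_zero += 1
    return {"total": total, "duplicates": duplicates, "mapq_zero": mapq_zero}
```

The lines could come from something like `samtools view sample.bam` piped into a script that passes `sys.stdin` to this function (`sample.bam` being a placeholder file name).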
This made me run FastQC on the reads, which I should have done at the beginning. In the sequence duplication levels section, FastQC reports that only 25% of the sequences will remain if deduplicated! I also see double peaks in the "per sequence GC content" plot!
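For intuition about what "25% will remain if deduplicated" means, here is a rough sketch of the idea: count distinct sequences and divide by the total. It is not FastQC's actual algorithm, only an approximation; the function name and the `prefix_len` parameter (comparing only the first 50 bases of each read) are assumptions made for illustration.

```python
from collections import Counter


def percent_remaining_after_dedup(sequences, prefix_len=50):
    """Percentage of reads left if exact duplicates (by prefix) were collapsed."""
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Each distinct prefix survives once after deduplication.
    return 100.0 * len(counts) / total
```

For example, four reads of which three are identical give two distinct sequences, so 50% would remain. A value of 25% means that, on average, each distinct sequence appears about four times.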
All this is new to me. I'm used to seeing yellow bars in the sequence quality section, but I don't see them here for some reason.
Would someone show me the ropes?