Hi all,
I have WGS data from Illumina paired-end libraries made from DNA extracted from FFPE tissue.
For one lane's fastq files, my pipeline is: split each read1 and read2 fastq into smaller fastq files (40,000,000 lines each) -> align with bwa using modest read trimming (q=30) against the whole reference genome (UCSC hg19 from the GATK resource bundle, which has all the contigs, incl. the many chrUn ones) -> sort and index -> merge all the small bams into one large bam.
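A minimal sketch of the splitting step, in case the details matter. This is an illustration under stated assumptions, not necessarily how the files were actually split: `split_fastq` and the chunk naming are hypothetical, and it assumes uncompressed FASTQ with exactly 4 lines per record. Because 40,000,000 is a multiple of 4, no record is cut in half, and read1/read2 chunks stay in sync as long as both files are split with the same chunk size.

```python
def split_fastq(path, out_prefix, lines_per_chunk=40_000_000):
    """Split an uncompressed FASTQ file into numbered chunks.

    Returns the list of chunk paths that were written.
    """
    # A FASTQ record is 4 lines, so the chunk size must be a multiple of 4
    # or a record would be split across two files.
    if lines_per_chunk <= 0 or lines_per_chunk % 4 != 0:
        raise ValueError("lines_per_chunk must be a positive multiple of 4")

    chunk_paths = []
    out = None
    try:
        with open(path) as src:
            for line_number, line in enumerate(src):
                # Start a new chunk file every lines_per_chunk lines.
                if line_number % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    chunk_path = f"{out_prefix}.{len(chunk_paths):04d}.fastq"
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "w")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Calling it once for read1 and once for read2 with the same `lines_per_chunk` gives matching chunk pairs (chunk 0000 of read1 goes with chunk 0000 of read2, and so on).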
I ran GATK's alignment metrics tool and am getting ~17.5% chimeric reads (see output below):
CATEGORY                    PAIR
TOTAL_READS                 1362335072
PF_READS                    1362335072
PCT_PF_READS                1
PF_NOISE_READS              14042
PF_READS_ALIGNED            873692969
PCT_PF_READS_ALIGNED        0.64132
PF_ALIGNED_BASES            72066521143
PF_HQ_ALIGNED_READS         742934974
PF_HQ_ALIGNED_BASES         61451362144
PF_HQ_ALIGNED_Q20_BASES     60529568026
PF_HQ_MEDIAN_MISMATCHES     0
PF_MISMATCH_RATE            0.014693
PF_HQ_ERROR_RATE            0.012886
PF_INDEL_RATE               0.001404
MEAN_READ_LENGTH            101
READS_ALIGNED_IN_PAIRS      749051300
PCT_READS_ALIGNED_IN_PAIRS  0.857339
BAD_CYCLES                  0
STRAND_BALANCE              0.500537
PCT_CHIMERAS                0.175427
PCT_ADAPTER                 0.002857
SAMPLE, LIBRARY, READ_GROUP (blank)
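As a sanity check on the output above, the reported fractions can be recomputed from the raw counts. This is just arithmetic on the numbers in the post; the variable names are chosen for illustration, and the "aligned bases per aligned read" figure is a derived quantity that is not part of the tool's output.

```python
# Raw counts copied from the metrics output in this post.
total_reads = 1362335072
pf_reads_aligned = 873692969
pf_aligned_bases = 72066521143
reads_aligned_in_pairs = 749051300
mean_read_length = 101

# Fraction of reads that aligned at all (reported as PCT_PF_READS_ALIGNED).
pct_reads_aligned = pf_reads_aligned / total_reads

# Fraction of aligned reads whose mate also aligned
# (reported as PCT_READS_ALIGNED_IN_PAIRS).
pct_aligned_in_pairs = reads_aligned_in_pairs / pf_reads_aligned

# Reads that did not align at all.
unaligned_reads = total_reads - pf_reads_aligned

# Derived, not reported: average number of aligned bases per aligned read,
# to compare against the 101-base read length.
aligned_bases_per_read = pf_aligned_bases / pf_reads_aligned

print(f"aligned fraction:         {pct_reads_aligned:.6f}")
print(f"aligned-in-pairs fraction: {pct_aligned_in_pairs:.6f}")
print(f"unaligned reads:          {unaligned_reads}")
print(f"aligned bases per read:   {aligned_bases_per_read:.1f} of {mean_read_length}")
```

The two recomputed fractions agree with the reported values, so the table is internally consistent. Roughly 36% of reads did not align at all, and aligned reads cover noticeably fewer than 101 bases on average, which may partly reflect the q=30 trimming.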
Has anyone run into this type of situation before? I am rerunning the alignment without read trimming to compare, but I doubt that will change the percentage of chimeric reads much.
I'm also wondering whether I should not be aligning against all the contigs supplied in the UCSC fasta reference file. If anyone has any suggestions, I'm open to them.
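One way to test the contig idea would be to count how many mapped pairs have their two ends on different contigs, and how many of those involve one of the extra contigs. The sketch below works on SAM text lines and rests on several assumptions: `tally_cross_contig` and `is_extra_contig` are hypothetical helpers, the name patterns (`chrUn*`, `*_random`, `*_hap*`) are a heuristic for the hg19 UCSC non-primary contigs, and "mates on different contigs" is only one of the conditions a metrics tool may count as chimeric, so the numbers will not match PCT_CHIMERAS exactly.

```python
from collections import Counter


def is_extra_contig(name):
    """Heuristic: True for hg19 UCSC unplaced, random, or haplotype contigs."""
    return name.startswith("chrUn") or name.endswith("_random") or "_hap" in name


def tally_cross_contig(sam_lines):
    """Count mapped paired reads whose mate maps to a different contig."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        # Keep only paired reads (0x1) where the read (0x4) and its mate (0x8)
        # are both mapped, skipping secondary (0x100) and supplementary (0x800).
        if not (flag & 0x1) or (flag & (0x4 | 0x8 | 0x100 | 0x800)):
            continue
        counts["mapped_pair_reads"] += 1
        # RNEXT is "=" when the mate is on the same contig as the read.
        if rnext == "=" or rnext == rname:
            continue
        counts["cross_contig"] += 1
        if is_extra_contig(rname) or is_extra_contig(rnext):
            counts["cross_contig_extra"] += 1
    return counts
```

In practice the input could be the output of `samtools view` on the merged bam, read line by line from `sys.stdin`. If `cross_contig_extra` accounts for most of `cross_contig`, the extra contigs are a plausible contributor; if not, the chimeras are more likely coming from the library itself.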
Thanks,
-L.