Hopefully someone can help me out here as I've been scratching my head now for a couple of days to try and solve the puzzle.
I've downloaded the SRA file for SRR081275, and have tried various different options to generate $file_1.fastq and $file_2.fastq files for mapping. I'll present two examples here:
1) ${SRATK}/fastq-dump --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
2) ${SRATK}/fastq-dump -I --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
And an example of a read from $file_1.fastq in each case:
1)
@SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
+SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9
2)
@SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
+SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9
In each case the reads are mapped using BWA
converted to BAM
cleaned
mates fixed
duplicates marked
and finally sent to GASV
.
In example (1) Picard's MarkDuplicates returns 1371073/2000000 duplicate reads and 0 optical duplicates. Whilst in example (2) it returns 15523/2000000 duplicate reads and 0 optical duplicates.
When sent to GASV, example (1) can have its insert sizes calculated from Lmin and Lmax, and causes no problems. Whilst for example (2) GASV fails to calculate the insert sizes and as a result fails to identify any discordant reads.
It seems I'm failing to generate a BAM file that will work in both Picard's MarkDuplicates AND GASV, and it appears to be due to the read definition line. I've tried to address this by going through all of the fastq-dump options, and attempting to alter the default READ_NAME_REGEX in Picard's MarkDuplicates, but don't appear to be getting anywhere.
Is anyone able to help?
I've downloaded the SRA file for SRR081275, and have tried various different options to generate $file_1.fastq and $file_2.fastq files for mapping. I'll present two examples here:
1) ${SRATK}/fastq-dump --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
2) ${SRATK}/fastq-dump -I --split-3 -O ${DATA}/$i ${DATA}/SRR081275.sra
And an example of a read from $file_1.fastq in each case:
1)
@SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
+SRR081275.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9
2)
@SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
GTGGGTGTGCATTGTCCTTACCTGCTGGCAGACAAT
+SRR081275.1.1 HWI-EA332_307V4AAXX:3:1:1008:1695 length=36
IIIIIIIIIIIIIIII?II;D;A-55%5)&)#'(,9
In each case the reads are mapped using BWA
Code:
bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_1.sai ${REF} ${DATA}/$i/"$i"_1.fastq.gz bwa aln -t 4 -e 5 -d 5 -q 20 -f ${DATA}/$i/"$i"_2.sai ${REF} ${DATA}/$i/"$i"_2.fastq.gz bwa sampe -f ${DATA}/$i/$i.sam ${REF} \ -r @RG"\t"ID:$i"\t"SM:$i"\t"LB:$i"\t"PL:Illumina \ ${DATA}/$i/"$i"_1.sai ${DATA}/$i/"$i"_2.sai \ ${DATA}/$i/"$i"_1.fastq.gz ${DATA}/$i/"$i"_2.fastq.gz
Code:
samtools view -bhT ${REF} ${DATA}/$i/$i.sam > ${DATA}/$i/$i.bam
Code:
java -d64 -Xmx4g -jar ${PICARD}/CleanSam.jar \ INPUT=${DATA}/$i/$i.bam \ OUTPUT=${DATA}/$i/$i.clean.bam \ VALIDATION_STRINGENCY=SILENT
Code:
java -d64 -Xmx4g -jar ${PICARD}/FixMateInformation.jar \ INPUT=${DATA}/$i/$i.clean.bam \ SORT_ORDER=coordinate \ VALIDATION_STRINGENCY=SILENT
Code:
java -d64 -Xmx4g -jar ${PICARD}/MarkDuplicates.jar \ INPUT=${DATA}/$i/$i.clean.bam \ OUTPUT=${DATA}/$i/$i.md.bam \ ASSUME_SORTED=true \ METRICS_FILE=${DATA}/$i/$i.md.metrics \ VALIDATION_STRINGENCY=SILENT
Code:
cd ${GASV} perl ${GASV}/BAM_preprocessor.pl \ ${DATA}/$i/$i.md.bam \ -CHROMOSOME_NAMING ${CHRS} \ -OUTPUT_PREFIX ${DATA}/$i/GASV/gasv
In example (1) Picard's MarkDuplicates returns 1371073/2000000 duplicate reads and 0 optical duplicates. Whilst in example (2) it returns 15523/2000000 duplicate reads and 0 optical duplicates.
When sent to GASV, example (1) can have its insert sizes calculated from Lmin and Lmax, and causes no problems. Whilst for example (2) GASV fails to calculate the insert sizes and as a result fails to identify any discordant reads.
It seems I'm failing to generate a BAM file that will work in both Picard's MarkDuplicates AND GASV, and it appears to be due to the read definition line. I've tried to address this by going through all of the fastq-dump options, and attempting to alter the default READ_NAME_REGEX in Picard's MarkDuplicates, but don't appear to be getting anywhere.
Is anyone able to help?
Comment