I am looking for suggestions for the following issue/task:
I am using a reference genome for which one of my scaffolds is an incomplete (e.g. is a much smaller subset of one of the genomic regions that I am after based on the complete genome of a closely related species). All other scaffolds are relatively complete for my focal species.
What I would like to do is first align my Illumina paired-end reads to my focal species' reference (all scaffolds) and then take the reads that align to my short scaffold PLUS the discarded reads and realign them to the more complete version of this particular region from the closely-related model organism in order to pick up any additional reads that might cover that might align to that region.
I think the approach outlined below will work, but I'm looking for feedback and answers to the questions in my comments before I proceed with my big files!
# After initial trimming, alignment with bwa and removal of duplicates with Piccard, get reads aligning to target scaffold based on bed file
# ??? does removing duplicates with Piccard keep unpaired reads ???
samtools view -L scaffoldName myAlign.bam -b > myScaffold.bam
# get discarded reads (reads that are paired but unmapped or with a mate that is unmapped)
# ??? Are these the right flags for this??? I'm most uncertain about the 0x0001 flag ???
samtools view -F -0x0001 -f 0x0004 -f 0x0008 -b myAlign.bam -b > discarded.bam
# merge target aligned and discarded
samtools merge targetDiscarded.bam
# use Picard's SamToFastq to convert merged bam to fastq
# ??? based on the above the unpaired fastq should be empty right ???
java -Xmx2g SamToFastq.jar INPUT=targetDiscarded.bam FASTQ=r1.fastq SECOND_END_FASTQ=r2.fastq UNPAIRED_FASTQ=unpaired.fastq
# align to second species ref with bwa mem
bwa mem -M modelSpeciesRef r1.fastq r2.fastq unpaired.fastq > finalAlignment.sam
??? I know there is also a way to do this using the aln and sampe function in samtools...how does this compare?
Thanks in advance for any advice/help
I am using a reference genome for which one of my scaffolds is an incomplete (e.g. is a much smaller subset of one of the genomic regions that I am after based on the complete genome of a closely related species). All other scaffolds are relatively complete for my focal species.
What I would like to do is first align my Illumina paired-end reads to my focal species' reference (all scaffolds) and then take the reads that align to my short scaffold PLUS the discarded reads and realign them to the more complete version of this particular region from the closely-related model organism in order to pick up any additional reads that might cover that might align to that region.
I think the approach outlined below will work, but I'm looking for feedback and answers to the questions in my comments before I proceed with my big files!
# After initial trimming, alignment with bwa and removal of duplicates with Piccard, get reads aligning to target scaffold based on bed file
# ??? does removing duplicates with Piccard keep unpaired reads ???
samtools view -L scaffoldName myAlign.bam -b > myScaffold.bam
# get discarded reads (reads that are paired but unmapped or with a mate that is unmapped)
# ??? Are these the right flags for this??? I'm most uncertain about the 0x0001 flag ???
samtools view -F -0x0001 -f 0x0004 -f 0x0008 -b myAlign.bam -b > discarded.bam
# merge target aligned and discarded
samtools merge targetDiscarded.bam
# use Picard's SamToFastq to convert merged bam to fastq
# ??? based on the above the unpaired fastq should be empty right ???
java -Xmx2g SamToFastq.jar INPUT=targetDiscarded.bam FASTQ=r1.fastq SECOND_END_FASTQ=r2.fastq UNPAIRED_FASTQ=unpaired.fastq
# align to second species ref with bwa mem
bwa mem -M modelSpeciesRef r1.fastq r2.fastq unpaired.fastq > finalAlignment.sam
??? I know there is also a way to do this using the aln and sampe function in samtools...how does this compare?
Thanks in advance for any advice/help
Comment