Seqanswers Leaderboard Ad

**mbayer** · 10-15-2012, 02:35 AM

Hi mht,

we work on plants here too and I tend to use Bowtie for my mapping (= reference assembly). Of all the mappers I have tried it gives me the greatest degree of control. It has this great switch that allows you to reduce cross-mapping of reads between pairs of paralogs (--best --strata). This really helps reduce the number of false positive SNPs later.

I also do a lot of SNP detection and I currently use bcftools (part of samtools) to do this. If I have multiple samples and I want to find SNPs between them I tend to map the quality trimmed reads of each sample separately first, then merge the BAM files with samtools merge, and then do the SNP discovery on that. This gives me a full audit trail of where each read comes from as you can add read group tags during the merge, and this is also handy for visualisation afterwards (we use Tablet for this).

I have written a shell script that iterates over a set of FASTQ files and maps them as described above. I still use my own trimming code but you can substitute that with anything you like (e.g. trimmomatic or similar). It goes like this:

Code:

#GBS specific pipeline for mapping single end reads to a reference using Bowtie
#Assumptions:
#1. the Bowtie reference index has already been generated with bowtie-build and all samples use the same reference
#2. all sample reads are packaged in compressed files with a .txt.gz extension and are located in the working directory

#params
referenceIndex=$1
numMismatches=$2
mapToAllLocations=$3
numProcessors=$4
#don't include an extension for this file's name
mergedFileName=$5
baseQualityOffset=$6

#directories
undedupedBamFilesDir=undedupedBamFiles
mergedDir=merged

#check for the correct number of args
if [ $# -ne 6 ]
then
    echo "Error in $0 - Invalid Argument Count"
    echo "Syntax: $0  <referenceIndex> <numMismatches> <mapToAllLocations> <numProcessors> <mergedFileName> <baseQualityOffset>"
    exit
fi

#make the directory for the merged files
mkdir $mergedDir

#make a directory for the undeduped files
mkdir $undedupedBamFilesDir

#make a file for the merged bam file header
headerFile=header.txt
rm $headerFile
touch $headerFile

#extract all the filenames to an array
files=`ls -l *.txt.gz | awk '{ print $9 }'`

#iterate over these
for i in $files
do
	echo "=================processing file $i"

	#extract filename without extension
	fileNameNoExt=`echo $i | basename $i .txt.gz`
	
	#gunzip the file
	echo "unzipping file"
	gunzip -c $i > $fileNameNoExt".fastq"

	# Trim the ends of the raw reads to a quality of 20 and prefixes the read names at the same time
	echo "quality trimming reads"
	java utils.ngs.FastqQualityTrimmer $baseQualityOffset 20 false false true $fileNameNoExt".fastq"

	# Map the reads with Bowtie
	echo "mapping reads with Bowtie"
	if [ "$mapToAllLocations" == "true" ]
	then
	#using the 'all' mapping mode -- this maps reads to all of their mappable locations
		if [ "$baseQualityOffset" == "64" ]
		then
			#offset 64
			bowtie --phred64-quals --best --strata -v $numMismatches --sam -p $numProcessors -a $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
		elif [ "$baseQualityOffset" == "33" ]
		then
			#offset 33
			bowtie --phred33-quals --best --strata -v $numMismatches --sam -p $numProcessors -a $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
		fi
	else
	#using the 'unique' mapping mode -- this only maps reads that have only a single mappable location		
		if [ "$baseQualityOffset" == "64" ]
		then
			#offset 64
			bowtie --phred64-quals --best --strata -v $numMismatches --sam -p $numProcessors -m 1 $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
		elif [ "$baseQualityOffset" == "33" ]
		then
			#offset 33
			bowtie --phred33-quals --best --strata -v $numMismatches --sam -p $numProcessors -m 1 $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
		fi
	fi
	
	# Convert to BAM
	echo "converting SAM to BAM"
	samtools view -S -b -o $fileNameNoExt.unsorted.bam $fileNameNoExt.sam	

	#sort bam file
	echo "sorting BAM file with dups"
	samtools sort $fileNameNoExt.unsorted.bam $fileNameNoExt.duplicatesRetained
		
	echo "indexing BAM file with dups "
	samtools index $fileNameNoExt.duplicatesRetained.bam
				
	echo "removing duplicates"
	samtools rmdup -s $fileNameNoExt.duplicatesRetained.bam $fileNameNoExt.bam	
	
	echo "indexing deduped BAM file "
	samtools index $fileNameNoExt.bam
	
	#append an RG header to the joint header file
	#should look like this:
	# @RG     ID:FC069_s5_sequence_Waggon     SM:FC069_s5_sequence_Waggon
	echo -ne "@RG\tID:$fileNameNoExt\tSM:$fileNameNoExt\n" >> $headerFile

	#delete all redundant files 
	echo "cleaning up redundant files"
	rm $fileNameNoExt.unsorted.bam
	rm $fileNameNoExt.sam
	rm $fileNameNoExt".fastq.qualTrimmed"
	rm $fileNameNoExt".fastq"
	
	#move the undeduped files
	mv $fileNameNoExt.duplicatesRetained.bam $undedupedBamFilesDir
	mv $fileNameNoExt.duplicatesRetained.bam.bai $undedupedBamFilesDir
	
done



#then merge the remaining individual bam files into a single one, using the header file we have been building up gradually 

echo "===========merging individual bam files==============="
samtools merge -frh $headerFile $mergedFileName.bam *.bam
echo "indexing merged file"
samtools index $mergedFileName.bam
#move these to the merged directory
mv $mergedFileName.bam* $mergedDir

echo "done"

Then I do the SNP calling on the merged file like this:

Code:

samtools  mpileup  -D -u -f <referenceFasta> <BAM file> | bcftools view -v -s <file with sample names> - | tee SNPs.vcf

The tee command pipes the output both to stdout and to a file -- quite handy.

Hopes this helps.

cheers

Micha

**mht** · 10-16-2012, 05:47 AM

Hi Micha,

Thanks for your reply. My pipeline is almost similar to yours, but using BWA as my mapper and a few other different programs. I was wondering if you've ever observed read mapping to regions with ambiguous reads (Ns). My reference genome are actually scaffolds with Ns in between, and some of my reads are mapping to these regions, resulting in what I presume are false SNP and indel calls.

Any idea what I can do to resolve this?

**mbayer** · 10-22-2012, 01:10 AM

Hi mht,

I have little experience of how Bowtie would handle mapping reads onto scaffolded sequences with Ns in them, but does this still happen if you force the reads to have, say, no more than 1 or 2 mismatches? If the mapper counts mapping onto Ns in the referene as a mismatch then the problem you have in the image there should go away.

cheers
Micha

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 15 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 21 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 18 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Some help needed for plant reference assembly

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News