  • Some help needed for plant reference assembly

    Hi all,

    I am currently looking into methods for doing a reference assembly of a plant genome, and later calling SNVs based on this assembly.

    I see a lot of threads on de novo assemblers but not many on reference-based ones. So far I notice a lot of discussion of Velvet, MIRA and SOAP, but I'm not sure which is the best tool for eventual variant calling. Any suggestions would be very helpful.

    My fastq files are from 3 separate plants of the same species. Should I pool all the reads together and build a consensus assembly and call variants from the consensus?

    I'm a newbie at assembly, so any feedback would be much appreciated.

  • #2
    Hi mht,

    We work on plants here too, and I tend to use Bowtie for my mapping (= reference assembly). Of all the mappers I have tried, it gives me the greatest degree of control. It has a great pair of switches (--best --strata) that lets you reduce cross-mapping of reads between pairs of paralogs, which really helps cut down the number of false positive SNPs later.
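
    To make that concrete, the core call is something like this (the index and read file names are just placeholders; the full script is further down):

    Code:
    # minimal sketch of the point above -- index and read file names are placeholders
    # --best --strata: only report alignments from the best stratum (fewest mismatches)
    # -m 1: discard reads that still map equally well to more than one place
    bowtie --best --strata -v 2 -m 1 --sam myReferenceIndex myReads.fastq > myReads.sam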

    I also do a lot of SNP detection, and I currently use bcftools (part of samtools) for this. If I have multiple samples and want to find SNPs between them, I tend to map the quality-trimmed reads of each sample separately first, then merge the BAM files with samtools merge, and then do the SNP discovery on the merged file. This gives me a full audit trail of where each read came from, since you can add read group tags during the merge, and it is also handy for visualisation afterwards (we use Tablet for this).
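
    If it helps to see the merge step on its own, it is essentially this (the BAM and header file names are placeholders; the script below builds the header file for you):

    Code:
    # -r attaches an RG tag to each read, inferred from the file it came from
    # -h copies the given header file (which holds the matching @RG lines) into the output
    samtools merge -r -h rg_header.txt merged.bam sampleA.bam sampleB.bam sampleC.bam
    samtools index merged.bam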

    I have written a shell script that iterates over a set of FASTQ files and maps them as described above. I still use my own trimming code, but you can substitute anything you like (e.g. Trimmomatic). It goes like this:

    Code:
    #GBS specific pipeline for mapping single end reads to a reference using Bowtie
    #Assumptions:
    #1. the Bowtie reference index has already been generated with bowtie-build and all samples use the same reference
    #2. all sample reads are packaged in compressed files with a .txt.gz extension and are located in the working directory
    
    #params
    referenceIndex=$1
    numMismatches=$2
    mapToAllLocations=$3
    numProcessors=$4
    #don't include an extension for this file's name
    mergedFileName=$5
    baseQualityOffset=$6
    
    #directories
    undedupedBamFilesDir=undedupedBamFiles
    mergedDir=merged
    
    #check for the correct number of args
    if [ $# -ne 6 ]
    then
        echo "Error in $0 - Invalid Argument Count"
        echo "Syntax: $0  <referenceIndex> <numMismatches> <mapToAllLocations> <numProcessors> <mergedFileName> <baseQualityOffset>"
        exit 1
    fi
    
    #make the directory for the merged files
    mkdir -p $mergedDir
    
    #make a directory for the undeduped files
    mkdir -p $undedupedBamFilesDir
    
    #make a file for the merged bam file header
    headerFile=header.txt
    rm -f $headerFile
    touch $headerFile
    
    #extract all the filenames to an array
    files=`ls *.txt.gz`
    
    #iterate over these
    for i in $files
    do
    	echo "=================processing file $i"
    
    	#extract filename without extension
    	fileNameNoExt=`basename $i .txt.gz`
    	
    	#gunzip the file
    	echo "unzipping file"
    	gunzip -c $i > $fileNameNoExt".fastq"
    
    	# Trim the ends of the raw reads to a quality of 20 and prefix the read names at the same time
    	echo "quality trimming reads"
    	java utils.ngs.FastqQualityTrimmer $baseQualityOffset 20 false false true $fileNameNoExt".fastq"
    
    	# Map the reads with Bowtie
    	echo "mapping reads with Bowtie"
    	if [ "$mapToAllLocations" == "true" ]
    	then
    	#using the 'all' mapping mode -- this maps reads to all of their mappable locations
    		if [ "$baseQualityOffset" == "64" ]
    		then
    			#offset 64
    			bowtie --phred64-quals --best --strata -v $numMismatches --sam -p $numProcessors -a $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
    		elif [ "$baseQualityOffset" == "33" ]
    		then
    			#offset 33
    			bowtie --phred33-quals --best --strata -v $numMismatches --sam -p $numProcessors -a $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
    		fi
    	else
    	#using the 'unique' mapping mode -- this only reports reads that have a single mappable location
    		if [ "$baseQualityOffset" == "64" ]
    		then
    			#offset 64
    			bowtie --phred64-quals --best --strata -v $numMismatches --sam -p $numProcessors -m 1 $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
    		elif [ "$baseQualityOffset" == "33" ]
    		then
    			#offset 33
    			bowtie --phred33-quals --best --strata -v $numMismatches --sam -p $numProcessors -m 1 $referenceIndex $fileNameNoExt".fastq.qualTrimmed" > $fileNameNoExt.sam
    		fi
    	fi
    	
    	# Convert to BAM
    	echo "converting SAM to BAM"
    	samtools view -S -b -o $fileNameNoExt.unsorted.bam $fileNameNoExt.sam	
    
    	#sort bam file
    	echo "sorting BAM file with dups"
    	samtools sort $fileNameNoExt.unsorted.bam $fileNameNoExt.duplicatesRetained
    		
    	echo "indexing BAM file with dups "
    	samtools index $fileNameNoExt.duplicatesRetained.bam
    				
    	echo "removing duplicates"
    	samtools rmdup -s $fileNameNoExt.duplicatesRetained.bam $fileNameNoExt.bam	
    	
    	echo "indexing deduped BAM file "
    	samtools index $fileNameNoExt.bam
    	
    	#append an RG header to the joint header file
    	#should look like this:
    	# @RG     ID:FC069_s5_sequence_Waggon     SM:FC069_s5_sequence_Waggon
    	echo -ne "@RG\tID:$fileNameNoExt\tSM:$fileNameNoExt\n" >> $headerFile
    
    	#delete all redundant files 
    	echo "cleaning up redundant files"
    	rm $fileNameNoExt.unsorted.bam
    	rm $fileNameNoExt.sam
    	rm $fileNameNoExt".fastq.qualTrimmed"
    	rm $fileNameNoExt".fastq"
    	
    	#move the undeduped files
    	mv $fileNameNoExt.duplicatesRetained.bam $undedupedBamFilesDir
    	mv $fileNameNoExt.duplicatesRetained.bam.bai $undedupedBamFilesDir
    	
    done
    
    
    
    #then merge the remaining individual bam files into a single one, using the header file we have been building up gradually 
    
    echo "===========merging individual bam files==============="
    samtools merge -frh $headerFile $mergedFileName.bam *.bam
    echo "indexing merged file"
    samtools index $mergedFileName.bam
    #move these to the merged directory
    mv $mergedFileName.bam* $mergedDir
    
    echo "done"
    Then I do the SNP calling on the merged file like this:

    Code:
    samtools mpileup -D -u -f <referenceFasta> <merged BAM file> | bcftools view -v -s <file with sample names> - | tee SNPs.vcf
    The tee command pipes the output both to stdout and to a file -- quite handy.
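
    If you are on the newer samtools/bcftools (the 1.x series, where mpileup has moved into bcftools and the calling options changed), the rough equivalent would be something like this -- just a sketch, the file names are placeholders:

    Code:
    # bcftools 1.x sketch: multiallelic caller (-m), variant sites only (-v), plain VCF output
    bcftools mpileup -f reference.fasta merged.bam | bcftools call -mv -Ov -o SNPs.vcf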

    Hope this helps.

    cheers

    Micha



    • #3
      Hi Micha,

      Thanks for your reply. My pipeline is quite similar to yours, but I use BWA as my mapper and a few other programs. I was wondering if you've ever observed reads mapping to regions with ambiguous bases (Ns). My reference genome is actually a set of scaffolds with runs of Ns in between, and some of my reads map to these regions, resulting in what I presume are false SNP and indel calls.

      [Attached image: ambiguous_region.png]

      Any idea what I can do to resolve this?



      • #4
        Hi mht,

        I have little experience with how Bowtie handles mapping reads onto scaffolded sequences with Ns in them, but does this still happen if you force the reads to have, say, no more than 1 or 2 mismatches? If the mapper counts an alignment over an N in the reference as a mismatch, then the problem you show in the image should go away.
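
        With my Bowtie setup above that just means passing a small value for -v, for example (the index and file names are placeholders); with bwa aln the analogous knob should be -n, the maximum edit distance:

        Code:
        # strict mapping sketch: at most 1 mismatch per read (-v 1), unique hits only (-m 1)
        bowtie --phred33-quals --best --strata -v 1 --sam -p 4 -m 1 referenceIndex sample.fastq.qualTrimmed > sample.sam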

        cheers
        Micha
