Getting Better with Variables
Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.
Requirements:
1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly
Options:
1) It accepts compressed gzip or bzip2 files or standard read files with any extension
2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.
General Outline:
1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values
create_ngs_directorystructure_v4.sh
RNAseq_MensFormal_Mac.sh
rnaseq_common_parameters.ini
Now that I have the lab (www.keatslab.org) moving forward to some extent I'm back to working on my analysis methods. For a couple of reasons I've been searching for ways to improve the previous methods so that even less user input is required and thus saving me time running things myself or worrying about other users making mistakes. So here is what seems like a good version that relies on very limited user input. It takes advantage of a common parameters file, and a series of basic user queries as the script launches to setup the tophat and cufflinks run. It also uses bwa to find the insert size and standard deviation of the library.
Requirements:
1) Create_Directory_Structure_V4 directory structure, but the pipeline is so flexible this is not really a requirement just simplifies effort.
2) bwa transcriptome index file for genome
3) bowtie index file for genome
4) Annotation files for Picard CollectRnaSeqMetrics.jar to function properly
Options:
1) It accepts compressed gzip or bzip2 files or standard read files with any extension
2) It creates a summary file at the end of each run and you have the option to output a list of genes and associated FPKM values to this summary file. So if you are looking at an siRNA experiment you could drop the gene of interest values for your samples or a list of random genes you would like a quick look at.
General Outline:
1) Query User to define setup, which sets the naming convention for each sample, the tophat run format (two options), the cufflinks run format (six options), the gene list added to the summary file
2) Counts the number of reads in each input file (used to calculate % aligned)
3) Cuts out the first 1 million reads for bwa alignment against a transcriptome reference to define the insert size and standard deviation
4) Runs tophat
5) Runs cufflinks
6) Runs a series of picard metrics programs
7) Summarizes metrics and gene values
create_ngs_directorystructure_v4.sh
Code:
#!/bin/sh # Create_NGS_DirectoryStructure_V4.sh # # # Created by Jonathan Keats on 09/25/11 # Translational Genomics Research Institute ######################################################################### # CREATES A DIRECTORY STURCTURE TO SUPPORT A VARIETY OF NGS PIPELINES # ######################################################################### # Designed for a Mac OS enviroment and requires initiation from your home folder (/User/You/) # Check to confirm current location is $HOME/ (ie. /User/You/) echo "Confirming Script Initiation Directory" var1=$HOME if [ "`pwd`" != "$var1" ] then echo " The script must be launched from your home directory " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi echo "1) Launch Location is Correct ($HOME/)" # Create required directories to support pipelines (RNAseq_MensFormal_) echo ***Creating Pipeline Directory Structure*** mkdir -p ngs/{applications,bwa,run_parameters,run_parameters,scripts,temp,tophat,tophat_fusion} mkdir -p ngs/analyzed_read_files/{chipseq,exomes,genomes,matepair,rnaseq} mkdir -p ngs/finaloutputs/{chipseq,exomes,genomes,matepair,rnaseq} mkdir -p ngs/refgenomes/{bfast_indexed,bowtie_indexed,bwa_indexed,downloads} mkdir -p ngs/refgenomes/downloads/{ncbi36_hg18,grch37_hg19} mkdir -p ngs/refgenomes/downloads/ncbi36_hg18/{annotation_files,reference_sequences} mkdir -p ngs/refgenomes/downloads/grch37_hg19/{annotation_files,reference_sequences} mv create_ngs_directorystructure_v4.sh ngs/scripts/ echo ***Pipeline Directory Structure Created***
Code:
#!/bin/sh # RNAseq_MensFormal_Mac.sh # # # Created by Jonathan Keats on 9/21/11. # Copyright 2011 Translational Genomics Research Institute. All rights reserved. # Outline - Automated pipeline requiring limited input for multiple RNAseq samples using tophat and cufflinks # Usage = RNAseq_MensFormal_Mac.sh <rnaseq_common_parameters.ini> ###################################################################################################### ## ## ## Dependencies - Unix (mv, cd, find, basename, mkdir, echo, wc, head, tail, cut, awk, sed) ## ## - BWA (version with -I option) (Tested with v0.5.9) ## ## - SAMTOOLS (Tested with v0.1.12a) ## ## - PICARD Tools (Tested with v1.52) ## ## - CollectInsertSizeMetrics.jar (WATCH VERSION AS COLUMN OUTPUT DIFFERS) ## ## - CollectRNAseqMetrics.jar ## ## - MarkDuplicates.jar ## ## - CollectAlignmentSummaryMetrics.jar ## ## - MeanQualityByCycle.jar ## ## - TOPHAT (Tested with v1.3.2) ## ## - CUFFLINKS (Tested with v1.1.0) ## ## ## ## Outline - Place uniformly named read file pairs in TOPHAT folder ## ## - This method is dependent on XXXX folder structure ## ## - Update "tophat_run_parameters.ini" as appropriate to your samples ## ## - Launch script which queries the user to define methods used and renaming convention ## ## then does the following steps automatically: ## ## - Create custom directory for each sample ## ## - Create custom .ini file for each sample ## ## - Cut out first 1 million reads from each read file ## ## - Align to transcriptome reference using BWA ## ## - Calculate inner-mate-distance and standard deviation ## ## ** inner-mate-distance = (insert size) - (combined read length) ## ## - Run TOPHAT ## ## - Run CUFFLINKS ## ## - Calculate metrics ## ## ## ## Notes - Extensive use of common variables is used to facilitate automation ## ## - There is some naming convention requirements but they are flexible ## ## - Paired samples must follow the same convention: ## ## --> SampleA_1.fastq and SampleA_2.fastq ## ## OR ## ## --> SampleA_Time1_1.txt and SampleA_Time1_2.txt ## ## - You can mix different basenames but not common extensions ## ## --> SampleB_1.fa / SampleB_2.fa ## ## --> SampleC_Time1_1.fa / SampleC_Time1_2.fa ## ## ** This is fine: Common Extension = .fa ; Read Extension = _1.fa and _2.fa ** ## ## --> SampleB_1.fastq / SampleB_2.fastq ## ## --> SampleC_1.fa / SampleC_2.fa ## ## --> SampleD_R1.fa / SampleD_R2.fa ## ## ** These can not be mixed, no "Common Extension" or common "Read Extension" ** ## ## ## ###################################################################################################### # Read in the tophat_common_parameters file . $1 #Starting Directory=ngs/tophat/ echo echo "*** STARTING SCRIPT ***" date '+%m/%d/%y %H:%M:%S' echo ############################################################## ## Check that the script was launched from ngs/tophat dir ## ############################################################## echo Checking Launch Location date '+%m/%d/%y %H:%M:%S' echo # Checking if launch location is correct if [ "`pwd`" != "${LAUNCH_DIR}" ] then echo " The script must be launched from the ${LAUNCH_DIR} directory " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi echo "1) Launch Location is Correct ($HOME/ngs/tophat)" # Checking if launch directory contains any .ini files (No such files are expected) if [ `ls *.ini | wc -l` != 0 ] then echo " The tophat directory contains unexpected .ini files - These must be removed before the script starts " echo " ERROR - The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi echo "2) No .ini Files Found in Launch Directory as Expected" echo echo Directory Check Completed date '+%m/%d/%y %H:%M:%S' echo ############################################################## ## Query User for tophat and cufflinks configurations ## ############################################################## echo Starting User Configuration Query Step date '+%m/%d/%y %H:%M:%S' echo # Ask user if the input read files are compressed or not echo "Are the input read files compressed? (ie. gzip or bzip2) [Yes/No]" read COMPRESSION_QUERY # Test to see if the "COMPRESSION QUERY" Query Response was appropriate ## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o) if [[ ${COMPRESSION_QUERY} = "Yes" || ${COMPRESSION_QUERY} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Ask user for the compression type if they indicated that the read files are compressed if [ ${COMPRESSION_QUERY} = "Yes" ] then echo Please indicate the type of compression. [gzip/bzip2] read COMPRESSION_TYPE if [[ ${COMPRESSION_TYPE} = "gzip" || ${COMPRESSION_QUERY} = "bzip2" ]] then echo else echo echo " Incorrect Response Option, Please use gzip or bzip2 (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi fi ###################### echo Would you like to run tophat with a gtf reference file? [Yes/No] read TOPHAT_QUERY_GTF # Test to see if the Query Response to the use of a GTF file during tophat alignment was correct or not ## NOTE - In the IF statement use && as AND and || as OR in new shells (old OR is -o) if [[ ${TOPHAT_QUERY_GTF} = "Yes" || ${TOPHAT_QUERY_GTF} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Ask user for a short GTF Identifier for renaming purposes if [ ${TOPHAT_QUERY_GTF} = "Yes" ] then echo "Please provide a short descriptor for the GTF file used during tophat alignment" echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs" echo Please Enter Response Now: read TOPHAT_GTF_SHORT echo fi echo Notes regarding the following questions pertain to how cufflinks estimates abundances echo ----------------------------------------------------------------------------------------- echo echo A GTF file can be used in two mutually exclusive fashions -G/--GTF or -g/--GTF-guide echo echo "1) To limit abundance estimates to just the isoforms in the GTF file use -G/--GTF" echo "*** In this script this is called LIMIT ABUNDANCE ESTIMATION to GTF reference ***" echo NOTE: This limits the estimates to a consistent set of known isoforms echo echo "2) To guide BUT NOT limit the abundance estimates to the GTF file use -g/--GTF-guide" echo "*** In this script this is called GUIDE ABUNDANCE ESTIMATION to GTF reference ***" echo NOTE: This allows identification of known and novel isoforms echo ----------------------------------------------------------------------------------------- echo echo Would you like to LIMIT ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No] read CUFFLINKS_QUERY_LIMIT_GTF # Test to see if the Query Response to the use of a LIMIT ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" || ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi echo Would you like to GUIDE ABUNDANCE ESTIMATION with cufflinks to a gtf reference file? [Yes/No] read CUFFLINKS_QUERY_GUIDE_GTF # Test to see if the Query Response to the use of a GUIDE ABUNDANCE ESTIMATION GTF file during cufflinks isoform assembly was correct or not if [[ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" || ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Test to ensure the user did not say "Yes" to the use both the LIMIT and GUIDE GTF options if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ]] then echo " Incorrect Response Combination, You Can Not Say Yes to both the LIMIT and GUIDE ABUNDANCE ESTIMATION Options!! " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Ask user for a short GTF Identifier for renaming purposes if [ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" ] then echo "Please provide a short descriptor for the GTF file used to LIMIT ABUNDANCE ESTIMATION" echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs" echo Please Enter Response Now: read LIMIT_GTF_SHORT echo fi if [ ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" ] then echo "Please provide a short descriptor for the GTF file used to GUIDE ABUNDANCE ESTIMATION" echo "Such as ensembl v60 (E60), this is not standardized but for file sorting purposes you should be consistent between runs" echo Please Enter Response Now: read GUIDE_GTF_SHORT echo fi echo Would you like to run cufflinks with an abundance estimate mask file? [Yes/No] read CUFFLINKS_QUERY_MASK # Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not if [[ ${CUFFLINKS_QUERY_MASK} = "Yes" || ${CUFFLINKS_QUERY_MASK} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Ask user for a short mask identifier for renaming purposes if [ ${CUFFLINKS_QUERY_MASK} = "Yes" ] then echo "Please provide a short descriptor for the mask file used to prevent certain transscripts from being used during abundance estimation" echo "Such as MT for mitochorndia, IG for immunoglobulin, this is not standardized but for file sorting purposes you should be consistent between runs" echo Please Enter Response Now: read MASK_SHORT echo fi # Ask the user if they want to extract a list of Genes and associated FPKM vales from the cufflinks output into the run report echo Would you like to extract a gene list, defined by GENES_OF_INTEREST in .ini file into the run report? [Yes/No] read GENESOFINTEREST_QUERY # Test to see if the Query Response to the use of a mask file during cufflinks isoform assembly was correct or not if [[ ${GENESOFINTEREST_QUERY} = "Yes" || ${GENESOFINTEREST_QUERY} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi #Final User Query to ensure the variables in the Tophat_Run_Parameters.ini file match the users responses echo FINAL CHECK: Are you sure the Tophat_Run_Parameters.ini file is updated to match the user responses you provided? [Yes/No] read FINAL_CHECK # Test to see if the Query Response was correct or not if [[ ${FINAL_CHECK} = "Yes" || ${FINAL_CHECK} = "No" ]] then echo else echo echo " Incorrect Response Option, Please use Yes or No (Case-Sensitive) " echo " The script was automatically killed due to a launch error - See Above Error Message" exit 2 fi # Test to see if the user wants to stop the script (ie. they entered "No" as a response to the FINAL CHECK if [ ${FINAL_CHECK} = "Yes" ] then echo "Thanks... Your User Configured Run Will Start in a Moment" echo else echo echo You Enter No to the query... The Script Will Stop Now exit 2 fi echo Finished User Configuration Query Step date '+%m/%d/%y %H:%M:%S' echo ############################################################## ## Define the renaming tags based on the users responses ## ############################################################## echo Defining Renaming Convention Based on User Configuration Inputs date '+%m/%d/%y %H:%M:%S' echo # TOPHAT_TAG options are TophatPE_DeNovo or TophatPE_GTF if [ ${TOPHAT_QUERY_GTF} = "Yes" ] then TOPHAT_TAG="TophatPE_${TOPHAT_GTF_SHORT}GTF" else TOPHAT_TAG="TophatPE_DeNovo" fi echo The tophat tag has been set to ${TOPHAT_TAG} # CUFFLINKS_TAG options are Cufflinks_DeNovo_NoMask, Cufflinks_DeNovo_Masked, Cufflinks_Limited_NoMask, Cufflinks_Limited_Masked, Cufflinks_Guided_NoMask, Cufflinks_Guided_Masked if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_TAG="Cufflinks_DeNovo_NoMask" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_TAG="Cufflinks_DeNovo_${MASK_SHORT}Mask" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_NoMask" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_TAG="Cufflinks_${LIMIT_GTF_SHORT}Limited_${MASK_SHORT}Mask" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_NoMask" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_TAG="Cufflinks_${GUIDE_GTF_SHORT}Guided_${MASK_SHORT}Mask" fi echo The cufflinks tag has been set to ${CUFFLINKS_TAG} echo ############################################################## ## Create a custom analysis directory for each sample ## ############################################################## ## Lots of tricks in this step, find function, basename, learning how \, ", and ' quotes differ allowing you to write special characters and use variables echo Creating Custom Directories and Splitting Files date '+%m/%d/%y %H:%M:%S' for file in `find . -type f -iname "*${READ1_ID}"` do filename=`basename $file "${READ1_ID}"` mkdir $filename mv $filename${READ1_ID} $filename mv $filename${READ2_ID} $filename echo "SAMPLE_NAME=\"$filename\"" >> $filename.ini done echo Directory Constuction and File Splitting Complete date '+%m/%d/%y %H:%M:%S' echo ############################################################## ## Analysis Pipeline - BWA, TOPHAT, CUFFLINKS, QC metrics ## ############################################################## # Initiate analysis of each identified sample one at a time echo Starting Individual Sample Analysis date '+%m/%d/%y %H:%M:%S' echo for ini in `ls *.ini` do #Read in the .ini file created during the directory construction period . $ini mv $ini ${SAMPLE_NAME} cd ${SAMPLE_NAME} #Current Directory=ngs/tophat/SampleX/tophat_out echo Analyzing ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Extract and Count reads in input files, 3 parse version depending on compression and type of compression if [ ${COMPRESSION_QUERY} = "No" ] then #Calculate the number of reads present in each file echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ1=`wc -l ${SAMPLE_NAME}${READ1_ID} | awk '{print $1/4}'` echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ2=`wc -l ${SAMPLE_NAME}${READ2_ID} | awk '{print $1/4}'` echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log head -n 4000000 ${SAMPLE_NAME}${READ1_ID} > ${SAMPLE_NAME}${READ1_BWA_FASTQ} echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log head -n 4000000 ${SAMPLE_NAME}${READ2_ID} > ${SAMPLE_NAME}${READ2_BWA_FASTQ} echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log else if [ ${COMPRESSION_TYPE} = "gzip" ] then echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ1=`gunzip -c ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'` echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ2=`gunzip -c ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'` echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log gunzip -c ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ} echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log gunzip -c ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ} echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log fi if [ ${COMPRESSION_TYPE} = "bzip2" ] then echo Starting to count the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ1=`bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | wc -l | awk '{print $1/4}'` echo The forward or read1 fastq contains ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log COUNT_READ2=`bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | wc -l | awk '{print $1/4}'` echo The reverse or read2 fastq contains ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo Finished identifying the number of reads | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Cut out first 1 million reads from forward and reverse to estimate inner mate distance and STD echo Creating Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log bunzip2 -k ${SAMPLE_NAME}${READ1_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ1_BWA_FASTQ} echo Finished Creating Read1 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log bunzip2 -k ${SAMPLE_NAME}${READ2_ID} | head -n 4000000 > ${SAMPLE_NAME}${READ2_BWA_FASTQ} echo Finished Creating Read2 Subset | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log fi fi #Align subset reads to transcriptome reference using BWA aln echo Starting Alignment of Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_FASTQ} > ${SAMPLE_NAME}${READ1_BWA_SAI} echo Finished Aligning Read1 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo Starting Alignment of Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log bwa aln -t ${CPU_CORES} -I ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${READ2_BWA_SAI} echo Finished Aligning Read2 Subset with BWA | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Generate .sam file using BWA sampe echo Starting BWA SAMPE Assembly of Read Subsets | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log bwa sampe ${BWA_GENOME_FILE} ${SAMPLE_NAME}${READ1_BWA_SAI} ${SAMPLE_NAME}${READ2_BWA_SAI} ${SAMPLE_NAME}${READ1_BWA_FASTQ} ${SAMPLE_NAME}${READ2_BWA_FASTQ} > ${SAMPLE_NAME}${BWA_SAMPE}.sam echo Finished BWA SAMPE Assembly Phase | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Convert BWA sam file to bam file using SAMTOOLS echo Starting SAM to BAM conversion with Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log samtools view -bS -o ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}.sam echo Finished SAM to BAM conversion wth Samtools | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Sort BWA bam file using SAMTOOLS echo Starting BAM file sorting step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log samtools sort ${SAMPLE_NAME}${BWA_SAMPE}.bam ${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID} echo Finished sorting BAM file | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Determine insert metrics using PICARD echo Starting Picard CollectInsertSizeMetrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log java -Xmx2g -jar $HOME/local/bin/CollectInsertSizeMetrics.jar INPUT=${SAMPLE_NAME}${BWA_SAMPE}${SAMTOOLS_SORT_ID}.bam OUTPUT=${SAMPLE_NAME}_Picard_InsertMetrics.txt HISTOGRAM_FILE=${SAMPLE_NAME}_Picard_InsertMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY} echo Finished collecting insert metrics step | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Calculate "inner mate distance" and standard deviation echo Starting to calculate insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log INNER_DISTANCE=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f5 | awk -v var1="$COMBINED_READ_LENGTH" '{printf "%.0f\n", $1-var1}'` echo The mean inner-mate-distance was identified as being ${INNER_DISTANCE}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log STDEV=`head -n 8 ${SAMPLE_NAME}_Picard_InsertMetrics.txt | tail -n 1 | cut -f6 | awk '{printf "%.0f\n", $1}'` echo The standard deviation of the inner-mate-distance was identified as being ${STDEV}bp | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo Finished calculating insert metrics | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log #Run tophat analysis echo Starting Tophat Alignment of ${SAMPLE_NAME} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log if [ ${TOPHAT_QUERY_GTF} = "Yes" ] then TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} -G ${GTF_FILE} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}" fi if [ ${TOPHAT_QUERY_GTF} = "No" ] then TOPHAT_COMMAND="tophat --num-threads ${CPU_CORES} -r ${INNER_DISTANCE} --mate-std-dev ${STDEV} --max-insertion-length ${MAX_INSERT_SIZE} --max-deletion-length ${MAX_DELETION_SIZE} ${READ_QUALITY} --library-type ${LIBRARY_FORMAT} ${BOWTIE_INDEX} ${SAMPLE_NAME}${READ1_ID} ${SAMPLE_NAME}${READ2_ID}" fi echo ${TOPHAT_COMMAND} | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log ${TOPHAT_COMMAND} ## This creates a subfolder called "tophat_out/" ## This folder will contain: accepted_hits.bam (is it sorted?) ## deletions.bed ## insertions.bed ## junctions.bed cd tophat_out/ #Current Directory=ngs/tophat/SampleX/tophat_out mv accepted_hits.bam ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam mv deletions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_deletions.bed mv insertions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_insertions.bed mv junctions.bed ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_junctions.bed echo Finished Tophat Alignment | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Run cufflinks analysis echo Starting Cufflinks Abundance Estimation of ${SAMPLE_NAME} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "Yes" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "No" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF ${GTF_FILE} --compatible-hits-norm -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "No" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi if [[ ${CUFFLINKS_QUERY_LIMIT_GTF} = "No" && ${CUFFLINKS_QUERY_GUIDE_GTF} = "Yes" && ${CUFFLINKS_QUERY_MASK} = "Yes" ]] then CUFFLINKS_COMMAND="cufflinks --quiet --output-dir ../cufflinks_out/ --num-threads ${CPU_CORES} --GTF-guide ${GTF_FILE} -M ${MASK_FILE} --frag-bias-correct ${REFERENCE_FILE} --multi-read-correct --library-type ${LIBRARY_FORMAT} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam" fi echo ${CUFFLINKS_COMMAND} | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log ${CUFFLINKS_COMMAND} ## This creates a subfolder called "cufflinks_out/" in the sample directory ## This folder will contain: genes.expr ## transcripts.expr ## transcripts.gtf #In the next step we will rename the expression estimate files cd ../cufflinks_out/ #Current Directory=ngs/tophat/SampleX/cufflinks_out mv genes.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr mv isoforms.fpkm_tracking ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.expr mv transcripts.gtf ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_transcripts.gtf echo Finished Cufflinks Abundance Estimation | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log ############################################################## ## Calculate metrics using Samtools and Picard ## ############################################################## cd ../tophat_out/ #Current Directory=ngs/tophat/SampleX/tophat_out #Determine the RNA Seq Metrics using Picard "CollectRNASeqMetrics.jar" echo Starting Picard CollectRnaSeqMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log java -Xmx2g -jar $HOME/local/bin/CollectRnaSeqMetrics.jar REF_FLAT=${REFFLAT_FILE} RIBOSOMAL_INTERVALS=${RIBOSOME_LIST} STRAND_SPECIFICITY=NONE REFERENCE_SEQUENCE=${REFERENCE_FILE} INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY} echo Finished collecting RNAseq metrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Mark the duplicate reads using Picard "MarkDuplicates.jar" echo Starting Picard MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log ## NOTE: "-Dsnappy.disable=true" is passed to this command to prevent the use of snappy-java by MarkDuplicates as I can not figure out how to add it to the JAVA CLASSPATH java -Dsnappy.disable=true -Xmx2g -jar $HOME/local/bin/MarkDuplicates.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam METRICS_FILE=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY} echo Finished MarkDuplicates step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Create an index file for the bam file generated from the MarkDuplicates step using SAMTOOLS INDEX echo Starting bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log samtools index ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam echo Finished bam file indexing step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Calculate mapping statistics using SAMTOOLS FlAGSTAT echo Starting Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log samtools flagstat ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam > ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt echo Finished Samtools Flagstat step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Calculate mapping statistics using Picard "CollectAlignmentSummaryMetrics.jar" echo Starting Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log java -Xmx2g -jar $HOME/local/bin/CollectAlignmentSummaryMetrics.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=${REFERENCE_FILE} ASSUME_SORTED=true IS_BISULFITE_SEQUENCED=false TMP_DIR=${TEMP_DIRECTORY} echo Finished Picard CollectAlignmentSummaryMetrics step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Calculate read quality stats using Picard "MeanQualityByCycle.jar" echo Starting Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log java -Xmx2g -jar $HOME/local/bin/MeanQualityByCycle.jar INPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt CHART_OUTPUT=${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.pdf VALIDATION_STRINGENCY=SILENT TMP_DIR=${TEMP_DIRECTORY} echo Finished Picard MeanQualityByCycle step | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log ############################################################## ## Write out summary file (Stats/Metrics/Indexes?) ## ############################################################## #Create a summary text file echo Starting Creation of Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo "*** Analysis Run Report ***" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo Sample ${SAMPLE_NAME} was Analyzed | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt # Test to see if the user wants a to extract FPKM values for a list of genes into the run report if [ ${GENESOFINTEREST_QUERY} = "Yes" ] then echo FPKM Values for Common genes of Interests: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt grep -wf ${GENES_OF_INTEREST} ../cufflinks_out/${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_genes.expr | awk '{print $1, $10}' | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt fi echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo Sample Summary: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "The two read files processed were ${SAMPLE_NAME}${READ1_ID} and ${SAMPLE_NAME}${READ2_ID}" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${SAMPLE_NAME}${READ1_ID} contained ${COUNT_READ1} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${SAMPLE_NAME}${READ2_ID} contained ${COUNT_READ2} reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo After aligning the first one million reads to a transcriptome reference using BWA | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo the inner-mate-distance was determined to be ${INNER_DISTANCE}bp with a standard deviation of ${STDEV} | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "The alignment metrics for this sample were as follows:" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt #Extract Alignment Metrics UNIQUE_READ1_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1` UNIQUE_READ2_READS_ALIGNED=`cut -f3 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 9 | tail -n 1` TOTAL_READ1_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 5 | tail -n 1` TOTAL_READ2_ALIGNMENTS=`awk '{print $1}' ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_SamtoolsMappingStats.txt | head -n 6 | tail -n 1` PERCENT_READ1_ALIGNED=`echo ${COUNT_READ1} | awk -v var2="${UNIQUE_READ1_READS_ALIGNED}" '{print var2/$1}'` PERCENT_READ2_ALIGNED=`echo ${COUNT_READ2} | awk -v var3="${UNIQUE_READ2_READS_ALIGNED}" '{print var3/$1}'` PERCENT_READ1_Q20_BASES=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'` PERCENT_READ2_Q20_BASES=`head -n 9 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | tail -n 1 | awk '{print $11/$8}'` PERCENT_PAIRED_READ_ALIGNMENT=`cut -f17 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardMappingStats.txt | head -n 8 | tail -n 1 | awk -v var4="${COUNT_READ1}" '{print $1/var4}'` echo Alignment Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${UNIQUE_READ1_READS_ALIGNED} - Number of Unique Reads from Read1 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${UNIQUE_READ2_READS_ALIGNED} - Number of Unique Reads from Read2 Aligned | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${TOTAL_READ1_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${TOTAL_READ2_ALIGNMENTS} - Total Number of Alignment Events from the Unique Reads in Read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_READ1_ALIGNED} - Percentage of Unique Read1 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_READ2_ALIGNED} - Percentage of Unique Read2 Reads Aligned to at least one Position | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_READ1_Q20_BASES} - Percentage of Aligned Bases from Read1, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_READ2_Q20_BASES} - Percentage of Aligned Bases from Read2, which are greater than Q20 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_PAIRED_READ_ALIGNMENT} - Percentage of Reads Aligned as Pairs | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt #Extract RNAseq Metrics PERCENT_RIBOSOME_BASES=`cut -f11 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` PERCENT_CODING_BASES=`cut -f12 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` PERCENT_UTR_BASES=`cut -f13 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` PERCENT_INTRONIC_BASES=`cut -f14 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` PERCENT_INTERGENIC_BASES=`cut -f15 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` PERCENT_MRNA_BASES=`cut -f16 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` MEDIAN_5PRIME_BIAS=`cut -f20 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` MEDIAN_3PRIME_BIAS=`cut -f21 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` MEDIAN_5PRIME_3PRIME_BIAS_RATIO=`cut -f22 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_RNAseqMetrics.txt | head -n 8 | tail -n 1` echo Transcriptome Sequencing Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_RIBOSOME_BASES} - Percent Ribosomal Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_CODING_BASES} - Percent Coding Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_UTR_BASES} - Percent UTR Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_INTRONIC_BASES} - Percent Intronic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_INTERGENIC_BASES} - Percent Intergenic Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_MRNA_BASES} - Percent mRNA Bases | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${MEDIAN_5PRIME_BIAS} - Median 5 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${MEDIAN_3PRIME_BIAS} - Median 3 Prime Coverage Bias | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${MEDIAN_5PRIME_3PRIME_BIAS_RATIO} - Median 5/3 Coverage Ratio | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt #Extract Duplication Metrics PERCENT_DUPLICATION=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f8` LIBRARY_SIZE=`head -n 8 ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_Picard_MarkDuplicates_Metrics.txt | tail -n 1 | cut -f9` echo Duplication Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${LIBRARY_SIZE} - Estimated fragments in Library | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${PERCENT_DUPLICATION} - Percent Duplicate Reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt #Calculate Quality Metrics (first/last cycle and mean per read) QUALITY_CAPTURE_LINES=`echo ${COMBINED_READ_LENGTH} | awk '{print $1+8}'` QUALITY_CAPTURE_READS=`echo ${COMBINED_READ_LENGTH} | awk '{print $1/2}'` head -n ${QUALITY_CAPTURE_LINES} ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardQualByCycle.txt | tail -n ${COMBINED_READ_LENGTH} > run_qualities.txt head -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read1_qualities.txt tail -n ${QUALITY_CAPTURE_READS} run_qualities.txt > read2_qualities.txt QUALITY_READ1_FIRST_CYCLE1=`cut -f2 read1_qualities.txt | head -n 1` QUALITY_READ1_LAST_CYCLE1=`cut -f2 read1_qualities.txt | tail -n 1` QUALITY_READ1_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read1_qualities.txt` QUALITY_READ2_FIRST_CYCLE1=`cut -f2 read2_qualities.txt | head -n 1` QUALITY_READ2_LAST_CYCLE1=`cut -f2 read2_qualities.txt | tail -n 1` QUALITY_READ2_AVERAGE=`awk '{s+=$2} END {print s/NR}' < read2_qualities.txt` echo Quality Metrics: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo This run consists of paired ${QUALITY_CAPTURE_READS}x${QUALITY_CAPTURE_READS}bp reads | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ1_AVERAGE} - Mean phred score for read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ1_FIRST_CYCLE1} - Mean phred score for the first cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ1_LAST_CYCLE1} - Mean phred score for the last cycle of read1 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ2_AVERAGE} - Mean phred score for read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ2_FIRST_CYCLE1} - Mean phred score for the first cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ${QUALITY_READ2_LAST_CYCLE1} - Mean phred score for the last cycle of read2 | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt #Output complete list of variables used in the run echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo THE FOLLOWING VARIABLES WERE GENERATED DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COMPRESSION_QUERY=\"${COMPRESSION_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COMPRESSION_TYPE=\"${COMPRESSION_TYPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "TOPHAT_QUERY_GTF=\"${TOPHAT_QUERY_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "TOPHAT_GTF_SHORT=\"${TOPHAT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CUFFLINKS_QUERY_LIMIT_GTF=\"${CUFFLINKS_QUERY_LIMIT_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CUFFLINKS_QUERY_GUIDE_GTF=\"${CUFFLINKS_QUERY_GUIDE_GTF}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "LIMIT_GTF_SHORT=\"${LIMIT_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "GUIDE_GTF_SHORT=\"${GUIDE_GTF_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CUFFLINKS_QUERY_MASK=\"${CUFFLINKS_QUERY_MASK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "MASK_SHORT=\"${MASK_SHORT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "GENESOFINTEREST_QUERY=\"${GENESOFINTEREST_QUERY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "FINAL_CHECK=\"${FINAL_CHECK}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "TOPHAT_TAG=\"${TOPHAT_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CUFFLINKS_TAG=\"${CUFFLINKS_TAG}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "SAMPLE_NAME=\"${SAMPLE_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COUNT_READ1=\"${COUNT_READ1}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COUNT_READ2=\"${COUNT_READ2}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "INNER_DISTANCE=\"${INNER_DISTANCE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "STDEV=\"${STDEV}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "TOPHAT_COMMAND=\"${TOPHAT_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CUFFLINKS_COMMAND=\"${CUFFLINKS_COMMAND}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo THE FOLLOWING VARIABLES WERE LOADED FROM THE RNAseq_MensFormalWear_Parameters.ini DURING THE ANALYSIS OF THIS SAMPLE: | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo They may or may not have been used depending on your query respones echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "LAUNCH_DIR=\"${LAUNCH_DIR}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ_EXT=\"${READ_EXT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ1_ID=\"${READ1_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ2_ID=\"${READ2_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COMBINED_READ_LENGTH=\"${COMBINED_READ_LENGTH}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "CPU_CORES=\"${CPU_CORES}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "GENOME_NAME=\"${GENOME_NAME}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "REFERENCE_FILE=\"${REFERENCE_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "BWA_GENOME_FILE=\"${BWA_GENOME_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ1_BWA_FASTQ=\"${READ1_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ2_BWA_FASTQ=\"${READ2_BWA_FASTQ}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ1_BWA_SAI=\"${READ1_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ2_BWA_SAI=\"${READ2_BWA_SAI}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "BWA_SAMPE=\"${BWA_SAMPE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "SAMTOOLS_SORT_ID=\"${SAMTOOLS_SORT_ID}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "BOWTIE_INDEX=\"${BOWTIE_INDEX}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "GTF_FILE=\"${GTF_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "READ_QUALITY=\"${READ_QUALITY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "LIBRARY_FORMAT=\"${LIBRARY_FORMAT}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "MAX_INSERT_SIZE=\"${MAX_INSERT_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "MAX_DELETION_SIZE=\"${MAX_DELETION_SIZE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "MASK_FILE=\"${MASK_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "TEMP_DIRECTORY=\"${TEMP_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "REFFLAT_FILE=\"${REFFLAT_FILE}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "RIBOSOME_LIST=\"${RIBOSOME_LIST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "GENES_OF_INTEREST=\"${GENES_OF_INTEREST}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COMPLETED_RNASEQ_READ_FILE_DIRECTORY=\"${COMPLETED_RNASEQ_READ_FILE_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo "COMPLETED_RNASEQ_RESULTS_DIRECTORY=\"${COMPLETED_RNASEQ_RESULTS_DIRECTORY}\"" | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo ----------------------------------------------------------------------------- | tee -a ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_${CUFFLINKS_TAG}_RunReport.txt echo Finished Creating the Summary File | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log ############################################################## ## Clean up analysis directory and Move files to Archive ## ############################################################## echo Starting to Clean up Analysis Directory | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ../${SAMPLE_NAME}_RNAseq_Analysis.log #Remove temp files created during calculations steps rm run_qualities.txt rm read1_qualities.txt rm read2_qualities.txt #Move tophat outputs and metric files to archive directory mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} mv ${SAMPLE_NAME}_${GENOME_NAME}_${TOPHAT_TAG}_PicardDupsMarked.bam.bai ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} mv *.bed ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} mv *.pdf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} mv *.txt ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} #Move cufflinks outputs to archive directory cd ../cufflinks_out #Current Directory=ngs/tophat/SampleX/cufflinks_out mv *.expr ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} mv *.gtf ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} #Move fastq, ini, and log file to archive directories cd ../ #Current Directory=ngs/tophat/SampleX mv ${SAMPLE_NAME}${READ1_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY} mv ${SAMPLE_NAME}${READ2_ID} ${COMPLETED_RNASEQ_READ_FILE_DIRECTORY} echo Finished Cleaning up Analysis Directory | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log date '+%m/%d/%y %H:%M:%S' | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log echo | tee -a ${SAMPLE_NAME}_RNAseq_Analysis.log mv *.log ${COMPLETED_RNASEQ_RESULTS_DIRECTORY} #Move back to launch directory to start next sample in batch cd ../ #Current Directory=ngs/tophat/ #Remove Custom analysis directory and none required files rm -rf ${SAMPLE_NAME} done
Code:
## RNAseq Common parameters file ## # Use this variable to indicate the expected launch directory # (Update as appropriate) LAUNCH_DIR=/Users/jkeats/ngs/tophat ################################################################################## ## ## ## Common Parameters used at multiple steps or used to name output files ## ## (Update as appropriate) ## ## ## ################################################################################## #sequence read file extension READ_EXT=".fastq.gz" #forward or read1 identifier READ1_ID="_1.fastq.gz" #reverse or read2 identifier READ2_ID="_2.fastq.gz" #What is the combined read length of the two read pairs (ie. 50x50=100, 100x100=200) COMBINED_READ_LENGTH="100" #How many cores/processors are available for the multi-threaded phases of BWA, TOPHAT, CUFFLINKS CPU_CORES="1" #This is used to rename result files to include the genome version GENOME_NAME=hg19 #Provide the path and full file name to the genome reference file used to create the bowtie index REFERENCE_FILE=/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome.fa ################################################################################## ## ## ## Common BWA parameters (Only the BWA_GENOME_FILE needs to be updated) ## ## ## ################################################################################## BWA_GENOME_FILE="/Users/jkeats/ngs/refgenomes/bwa_indexed/GRCh37_E60cDNA.fa" READ1_BWA_FASTQ="_1_bwa.fastq" READ2_BWA_FASTQ="_2_bwa.fastq" READ1_BWA_SAI="_1_bwa.sai" READ2_BWA_SAI="_2_bwa.sai" BWA_SAMPE="_bwa_sampe" ################################################################################## ## ## ## Common Samtools parameters (Shouldn't need to be modified) ## ## ## ################################################################################## SAMTOOLS_SORT_ID="_sorted" ################################################################################## ## ## ## Common Tophat Parameters (Update as appropriate) ## ## ## ################################################################################## #Bowtie index to align data against [ <filename> ] BOWTIE_INDEX="/Users/jkeats/ngs/refgenomes/bowtie_indexed/igenomes_hg19/genome" #GTF file to use for LIMITED ABUNDANCE ESTIMATION (-G/--GTF) or GUIDED ABUNDANCE ESTIMATION ( -g/--GTF-guide) GTF_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_genes.gtf" #Quality type [ --solexa-quals OR --solexa1.3-quals (same as phred64-quals) OR --phred64-quals (same as solexa1.3-quals) ] READ_QUALITY="--solexa1.3-quals" #Library type [ fr-unstranded OR fr-firststrand OR fr-secondstrand) LIBRARY_FORMAT="fr-unstranded" #Maximum insertion size allow IF allowing indels [ --max-insertion-length <int> default: 3 ] MAX_INSERT_SIZE="3" #Maximum deletion size allowed IF allowing indels [ --max-deletion-length <int> default: 3 ] MAX_DELETION_SIZE="3" ################################################################################## ## ## ## Common Cufflinks Parameters (Update as appropriate) ## ## ## ################################################################################## #This should be used if using an exclusion file for cufflinks abundance estimates MASK_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/GRCh37_E60_CufflinksExcludedTranscripts.gtf" ################################################################################## ## ## ## Common Picard Parameters (Update as appropriate) ## ## ## ################################################################################## #Provide a directory for temporary files (Not always needed but on multi-harddrive systems JAVA may use an unexpected directory) TEMP_DIRECTORY="/Users/jkeats/ngs/refgenomes/temp" #This is a refFlat format table outlining the transcript space of the genome analyzed REFFLAT_FILE="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_refFlat.txt" #This is the ribosome exclusion file, a list of ribosome gene locations to report the percentage of reads related to ribosomal transcripts RIBOSOME_LIST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/hg19_ribosome_gene_locations.txt" ################################################################################## ## ## ## Gene List of Interest (Update as appropriate) ## ## ## ################################################################################## #List of genes to extract GENES_OF_INTEREST="/Users/jkeats/ngs/refgenomes/downloads/grch37_hg19/annotation_files/TC_Class_Genes.txt" ################################################################################## ## ## ## Final Directories (Update as appropriate) ## ## ## ################################################################################## #Provide the path to the folder were you want to archive the read files COMPLETED_RNASEQ_READ_FILE_DIRECTORY="/Users/jkeats/ngs/analyzed_read_files/rnaseq/" #Provide the path to the folder were you want to archive the read files COMPLETED_RNASEQ_RESULTS_DIRECTORY="/Users/jkeats/ngs/finaloutputs/rnaseq/"
Comment