
**nilshomer** · 04-02-2010, 08:58 PM

Originally posted by Jon_Keats View Post

I wonder how many people are in the same boat as me.

1) Institute bought a couple of GAIIs
2) No one has money to use them
3) Institute has internal competition to pay for a couple of runs (makes the donors feel better about their donation if someone uses the machines), and you are lucky enough to get funded
4) You send a couple of samples off to never-never land and someone sends back a terabyte drive or two with "next-gen sequencing data"
5) You quickly realize the people who used to do survival curves in your bioinformatics core don't really know that Illumina fastq is different from Sanger fastq, and the analysis they provide is limited at best
6) Now what do you do?
7) Google > seqanswers > let the misery begin

So what have I learned this week,

A) My boss should have made me read and do the "Unix and Perl for Biologists" tutorial years ago. Google it if you are new and a bench/gene jockey (Sanger sequencing/microarray person) like me with no Unix experience; it is an excellent use of a day
B) A place called SourceForge exists
C) If I had a MAQ for my TOPHAT and a BOWTIE to go with my BWA I'd be better off than GERALD and his SAMTOOLS
D) just type "make" to compile...Oops, that doesn't work if Xcode is not installed yet
E) No Mac OS comes with Xcode installed and if you have a Leopard machine, you better know where the OS install disks are as you can only install the new version for snow leopard that is not compatible...One would think that pancreatic cancer survivor Steve Jobs would try to make my life easier not harder
F) The genome is not the genome, ensembl is the place to get chromosomes but 1000 genomes is the place to get the genome.
G) BWA can align on my laptop...cool...next-gen/2nd gen alignment on a laptop and I thought I needed a super computer

One step forward, one backward

PS - I generally believe in the KISS principle, so I'll try to come back and list my solutions as I bumble my way to something. But in a week I've learned enough Unix to actually like it and got a couple of lanes of PE data into IGV so I can take a look see

Can I nominate this as the best post on seqanswers? You really deserve a prize.

**mangrove** · 04-03-2010, 05:21 AM

I agree... it is prizeworthy. Even at the price, the promise of all that sequence data is alluring, but no one I have talked to has gotten the data and NOT been overwhelmed. I hope this goes away as we get larger, faster computers, a guru to install all the programs (fortunately we have that), and eventually, the realization from NSF that lots of $ and many months will be required to actually use all those TBs. Good luck to us all (as Tiny Tim would say).

**ECO** · 04-04-2010, 06:33 AM

Nice post again Jonathan,

This is actually a great post/series of posts to kick off the "Basics in Bioinformatics" subforum that we've discussed in the Site Feedback forum. Don't be alarmed if I do some rearranging/forum creating later today.

**RockChalkJayhawk** · 04-04-2010, 08:11 AM

Jon,

I was in your exact situation 6 months ago. It gets better (slowly). The UNIX and Perl for Biologists was really helpful for me, as was this forum. You're headed in the right direction, just keep it up!

And I too agree that this is the best post on the site!

**drio** · 04-05-2010, 08:16 AM

Just catching up with SA posts.

This post is brilliant. Printing right now and posting in my cubicle. Instant classic.

**MQ-BCBB** · 04-05-2010, 11:21 AM

loved your post

I felt exactly like that not too long ago. And yes, I still remember the great feeling when I successfully loaded my data into IGV.

**Jon_Keats** · 04-05-2010, 04:28 PM

Thanks I'm glad to see people find my attempt at a bit of science geek humor funny, even with the typical spelling mistakes

**Jon_Keats** · 04-06-2010, 12:59 PM

Getting Started: Unix and Xcode

As I said in the first post I'm going to drop in a couple of posts over the next couple of days to outline my experiences to date.

In clinical training they have a mantra of "See one, Do one, Teach one" but on the research side it seems to be more "Need to do one, Figure one out, Maybe Teach one", so this will be my lame teaching attempt or at the very least a place others in our research group can get some basic instructions to replicate the pipeline I'm starting to put together. Hopefully this will be relevant to a number of people and will make some people's lives easier.

I've tested most of the following steps on both my laptop and the workstation we have in the lab (still waiting for Apple to release new Mac Pros… come on Steve, I'll buy an iPad if you release them in April). Obviously, I'm a Mac guy so these instructions are Mac oriented; they should carry over to any Unix/Linux environment, but that is only a guess.

*Workstation = Mac Pro with two dual core Intel Xeon 5150 CPUs at 2.66GHz and 8GB of 667MHz DDR2 RAM running Mac OSX Leopard 10.5.8*
*Laptop = MacBook Pro with an Intel Core 2 Duo CPU at 2.66GHz and 4GB of 1067MHz DDR3 RAM running Mac OSX Snow Leopard 10.6.3*

Okay, so today you got some terabyte drives with Illumina data and you want to do something with it. The following instructions should get you ready to do something:

First thing to do is to familiarize yourself with Unix and the Terminal (Applications>Utilities>Terminal) application on your Mac. I would highly recommend working through at least the Unix portion of the "Unix and Perl for Biologists" course made public by Keith Bradnam and Ian Korf at UC Davis (http://groups.google.com/group/unix-...for-biologists). I'd recommend going to their website and getting the entire course package (http://korflab.ucdavis.edu/Unix_and_Perl/index.html); it is well worth a night or two of your time, I promise.
If you are not going to do that you need to understand one or two commands to get going:

To get a manual on any command type "man command". Hit "space" to page down, "b" to back-up, and "q" to quit
To see what folder you are in currently type "pwd"
To see what folders and files exist in the current directory type "ls"
To move into a folder in the current directory type "cd myfolder" Note: you can move multiple levels downstream with "cd myfolder/myfolder2"
To go back one directory type "cd .." Note: you can move back multiple levels upstream with "cd ../.."
To copy a file from the current directory to a downstream folder type "cp myfile myfolder/" or type "cp myfile ../" to copy a file up a directory
To move a file from the current directory use "mv" in place of cp
A folder immediately downstream of the root directory (ie. absolute top of the tree) is always defined by "command /folder" (ie. if you type "cd /something" it looks for the folder "something" downstream of the root directory)
The current directory can always be noted by "./"
You will need to change the permissions of the compiled applications with "chmod 755 myfile". This makes the file readable and executable by everyone but only you can write, alternatively use 777 so anyone can do everything.
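
The commands above can be tried safely in a throwaway folder before touching real data. The following is only a practice sketch: the names myfolder, myfolder2 and myfile are placeholders, and the scratch folder created with mktemp is an assumption made so that nothing in your own folders is changed.

```shell
#!/bin/sh
# Practice run of the basic commands in a scratch directory.
# All names below (myfolder, myfolder2, myfile) are placeholders.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/unix_practice.XXXXXX")
cd "$scratch" || exit 1

pwd                          # show the folder you are in
mkdir -p myfolder/myfolder2  # create two nested folders
echo "hello" > myfile        # create a small test file
ls                           # list files and folders here

cp myfile myfolder/          # copy the file down one level
cd myfolder/myfolder2        # move down two levels at once
cp ../myfile ./              # copy from one level up into the current folder (./)
cd ../..                     # move back up two levels
mv myfile myfolder/myfile_moved   # move (and rename) the original file

chmod 755 myfolder/myfile    # you: read/write/execute; everyone else: read/execute
ls -l myfolder               # the first column shows the permissions
```

After it runs, "ls -l" should show "-rwxr-xr-x" in front of myfile, which is what 755 means.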

To run many of the applications (Maq, BWA, Samtools, etc.) you will need to either place the applications in the PATH, define additional PATH locations, or note the location of the application each time you call it. To find the current PATH directories used by Unix type "echo $PATH" and you should get a print out similar to the following:
/sw/bin:/sw/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin

NOTE: These folders are the places Unix looks for applications to run. To run an application you have downloaded and compiled, such as BWA, from anywhere else you need to type "./bwa", and the current directory must contain the bwa application. Assuming you have administrator rights to your machine, the way I initially got around this was to place the applications in one of the PATH directories as follows:

1) copy application to a PATH directory, type "sudo cp myfile /usr/bin" and you will be prompted for your password (sudo = superuser … yes, today you are SUPERMAN)
2) make the file executable, type "sudo chmod 755 /usr/bin/myfile"

NOTE: After running into an issue installing BFAST I'd suggest the following NOT the previous (Actually suggestion from Nils Homer...thanks)
1) create a directory in your home directory for the applications "mkdir -p $HOME/local/bin"
2) edit your .profile file so this directory is in your PATH directories
>open terminal
>type "ls -a" (You should see a file called .profile)
>open with nano "nano .profile"
*** Add the following lines to your .profile file, DO NOT remove things in the current version ***

export PATH=$HOME/local/bin:$PATH

> To save edits "control-O"
> To exit nano "control-X"

# Subsequently when you install things place the executables in this directory so they are in a $PATH directory
# Either copy application to the directory $HOME/local/bin
# If using install script "./configure --prefix=$HOME/local"

This no longer requires sudo (Guess we shouldn't always be Superman)
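
For anyone who would rather script the .profile edit than do it in nano, here is a minimal sketch. It assumes a stand-in home folder (DEMO_HOME, a made-up variable) so it can be tried without touching your real .profile, and the helper name add_local_bin is also just for illustration.

```shell
#!/bin/sh
# Sketch: add $HOME/local/bin to PATH via .profile without adding the line twice.
# DEMO_HOME is a hypothetical stand-in for your home directory so the sketch
# can be tried safely; set DEMO_HOME="$HOME" before running to apply it for real.
DEMO_HOME="${DEMO_HOME:-$(mktemp -d "${TMPDIR:-/tmp}/ngs_home.XXXXXX")}"
PROFILE="$DEMO_HOME/.profile"
PATH_LINE='export PATH=$HOME/local/bin:$PATH'   # single quotes: written literally

add_local_bin() {
    mkdir -p "$DEMO_HOME/local/bin"
    touch "$PROFILE"
    # grep -F = fixed string, -x = match the whole line, -q = quiet
    if ! grep -Fxq "$PATH_LINE" "$PROFILE"; then
        printf '%s\n' "$PATH_LINE" >> "$PROFILE"
    fi
}

add_local_bin
add_local_bin    # a second call changes nothing

cat "$PROFILE"
# Print the current PATH one directory per line, which is easier to read
echo "$PATH" | tr ':' '\n'
```

The new PATH only takes effect in Terminal windows opened after the edit, or after typing ". ~/.profile" in the current one.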

Second you need to install Xcode on your Mac System so you can compile the various applications
- Download the current version, Xcode3.2, at (http://developer.apple.com/technolog...ols/xcode.html). You will have to become a member; otherwise find your OS install discs and install it from the disc.
NOTE: This version is only compatible with Mac OSX Snow Leopard 10.6.x
- If you have a Leopard system go find the OS install discs (you need Disc 2) and install the package
> Mac OS X Install Disc 2 > open Xcode Tools folder > double click XcodeTools.mpkg

Third, for some applications like BFAST it will help to install "Fink" (...Another Nils suggestion) or "MacPorts" (seems more up to date)
- Download and install the current version from (http://www.finkproject.org/) or (http://www.macports.org/)
- You should install the package md5deep at least to install BFAST "fink install md5deep" or "port install md5deep"

Next step get the applications you need…

See the next post,

Jonathan

**KevinLam** · 04-07-2010, 01:10 AM

rofl I shld do a version with SOLiD data with the rainbow assorted myriad of problems with colorspace. Good Post!

**nilshomer** · 04-07-2010, 07:41 AM

Originally posted by KevinLam View Post

rofl I shld do a version with SOLiD data with the rainbow assorted myriad of problems with colorspace. Good Post!

I don't understand why you say there are problems with colorspace? Aligners (like BFAST) will convert your csfasta/qual files into FASTQ, will align your data sensitively, output to the SAM format, and then any SNP caller can be used without modification. Beyond specifying one command line option (to say the data is color space) during alignment there is no difference between Illumina/SOLiD (basespace/colorspace) data in terms of processing. It's the same workflow. Also the theoretical and practical benefits of colorspace (low false discovery rate) are rarely mentioned.

Sorry, just a slight pet peeve from a very happy SOLiD user.

Nils

**KevinLam** · 04-07-2010, 06:33 PM

Originally posted by nilshomer View Post

I don't understand why you say there are problems with colorspace? Aligners (like BFAST) will convert your csfasta/qual files into FASTQ, will align your data sensitively, output to the SAM format, and then any SNP caller can be used without modification. Beyond specifying one command line option (to say the data is color space) during alignment there is no difference between Illumina/SOLiD (basespace/colorspace) data in terms of processing. It's the same workflow. Also the theoretical and practical benefits of colorspace (low false discovery rate) are rarely mentioned.

Sorry, just a slight pet peeve from a very happy SOLiD user.

Nils

Hi Nils,
No offence meant! It's all in good fun.
by problems I think I meant it more as caveats that you should watch for.

firstly it seems terribly important to understand dual base encoding but actually you just need an overview.

2ndly you are stuck with color space aware progs unless you wanna throw the benefits of colorspace away by direct conversion to base space and risk 3' ends being wrongly converted.
for de novo assembly with velvet you have to double encode your file into a format that looks exactly like 25 bp basespace fasta files. which can be misleading if someone else comes across the file and doesn't read the documentation you left there.

and it doesn't help that ABI's documentation for their software rarely exceeds 3 pages in pdf.

Other than that I am nearly as happy a SOLiD user as you

**Jon_Keats** · 04-07-2010, 08:09 PM

Getting setup and compiling the applications

As promised here is my next installment on getting a working environment going or at least my poor excuse of one. The first step I setup was a series of folders to manage the data off my terabyte drives and move it around as each step is completed. To make my examples more clear I've setup the following folders and subfolders:

Main working directory called "ngs" in my $HOME directory (/Users/MeOrYou/) from which all steps and scripts will be called.
With primary subfolders:
/ngs/analysisnotes
/ngs/applications
/ngs/bwase
/ngs/bwape
/ngs/finaloutputs
/ngs/scripts
With a number of secondary subfolders in each primary directory (See create_ngs_directorystructure_v3.sh script for full details)

I'm slowly building pipeline scripts to feed data from the input folders to final outputs that I'll try and post when complete.

The basic idea is that I have some raw reads from our Illumina GAIIs (exon capture and RNAseq PE data with each sample on two flowcell lanes) and I want to process them with BWA and view the alignments in the IGV browser. So I need to do a couple of things to the best of my understanding. Step one is to convert the Illumina raw data files, which should look like "s_1_sequence.txt", to Sanger fastq format files. To understand the differences please see the following references (http://en.wikipedia.org/wiki/FASTQ_format) or (http://maq.sourceforge.net/fastq.shtml) or (Cock, PJA et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nuc. Acids Res. 2010 38(6):1767-1771). The nice thing with the conversion is that the Sanger format does not list the read info for both the read sequence and read quality values, so the files are significantly smaller.
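
To make the format difference concrete: Illumina 1.3+ files store each quality value as a character with an offset of 64, while Sanger fastq uses an offset of 33, so every quality character shifts down by 31. The following is a rough sketch of that shift in awk, not the Maq ill2sanger code; the function name ill2sanger_sketch is made up, and it assumes plain four-line records. Older Solexa files (before pipeline 1.3) use a different quality scale and are not handled here.

```shell
#!/bin/sh
# Sketch: convert Illumina 1.3+ FASTQ (Phred+64) to Sanger FASTQ (Phred+33).
# Assumes plain four-line records (no wrapped sequence or quality lines).
# ill2sanger_sketch is a hypothetical name, not the Maq command.
ill2sanger_sketch() {
    awk '
    BEGIN {
        # Map each Phred+64 character (ASCII 64..126) to the character 31 lower
        for (i = 64; i <= 126; i++)
            map[sprintf("%c", i)] = sprintf("%c", i - 31)
    }
    NR % 4 == 3 { print "+"; next }    # drop the repeated read name
    NR % 4 == 0 {                      # quality line: shift every character
        out = ""
        n = length($0)
        for (j = 1; j <= n; j++) {
            c = substr($0, j, 1)
            out = out ((c in map) ? map[c] : c)
        }
        print out
        next
    }
    { print }                          # header and sequence lines pass through
    ' "$@"
}

# Tiny made-up record to show the effect
printf '@read1/1\nACGTAC\n+read1/1\nhhhh@B\n' | ill2sanger_sketch
```

For the made-up record, the quality line "hhhh@B" comes out as "IIII!#" and the third line shrinks to a bare "+", which is where the file size saving comes from. On a real file the call would look like "ill2sanger_sketch s_1_sequence.txt > s_1_sanger.fastq" (the output name is only an example).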

Step 1 - Download the following source files and patches

UPDATE - This step is no longer necessary as recent versions of BWA allow you to do this on the fly during alignment using the -I option. I'll leave this here in case people need a way to process illumina 1.3-1.7 fastq into sanger format. Watch the version of casava used by your core in the coming year as the new illumina 1.8 pipeline will output files in sanger format so conversion will no longer be needed.

A) Maq (http://sourceforge.net/projects/maq/) (TO REALLY CONFUSE YOU, THE MAIN DOWNLOAD IS ACTUALLY BWA, argh...)

NOTE: I only downloaded this to use the ill2sanger command to convert Illumina 1.3+ fastq files (ie. s_1_sequence.txt) to Sanger fastq format. Other options exist (BioPerl, Biopython) but I couldn't figure them out

- click on "View all files"
- click on "maq" folder
- click on newest version "0.7.1" (ASSUMPTION: This should be a long standing version since it appears that the development of Maq is dead with Heng Li releasing BWA)
- click on the download file "maq-0.7.1.tar.bz2"

Now download the ill2sanger patch that is needed to convert illumina 1.3+ fastq files to Sanger fastq

- click on "Develop" tab
- click on "Tracker" tab and select the "Patches" dropdown menu

NOTE: There are two versions to download: the historic one by "daweonline" and a new alternative by "joelmartin". I used the original patch as I could find install instructions on Seqanswers

- click on ID 2841164 "illumina to sanger conversion"
- click download to download "maq-ill2sanger.patch" (current version submitted 2009-08-20)

Now Compile, patch, and re-compile the application

- move "maq-0.7.1.tar.bz2" to your "NGS/ApplicationDownloads" folder
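- move "maq-ill2sanger.patch" to your "NGS/ApplicationDownloads" folder (it must sit next to the decompressed "maq-0.7.1" folder for the patch command below to find it)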
- double click "maq-0.7.1.tar.bz2" to decompress the file
- open up "Terminal"
- navigate to the decompressed folder cd Documents/NGS/ApplicationDownloads/maq-0.7.1
- compile as per option 2 in Maq Manual (Release 0.5.0) (http://maq.sourceforge.net/maq-man.shtml)
> enter the following command "make -f Makefile.generic"
- a bunch of "Stuff" will come up, check to ensure no errors are listed!
- apply the patch as follows:
> step back one directory "cd .."
> run "ls" command to ensure the current directory contains folder "maq-0.7.1" and the patch "maq-ill2sanger.patch"
> install the patch with the following command "cd maq-0.7.1; patch -p1 < ../maq-ill2sanger.patch" found on seqanswers (http://seqanswers.com/forums/showthread.php?t=2499)
* You should get the following messages : patching file fastq2bfq.c
patching file main.c
patching file main.h
- recompile the maq application
> "make -f Makefile.generic"

Now check to see if the conversion patch was successful
> enter "./maq"

***This should print the maq command options in the Terminal window; check that ill2sanger is available under the "Format Converting" section***
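
The Maq steps above could be collected into one script. This is a minimal sketch under the folder layout used in this post; the run helper, the build_maq name, and the DRY_RUN and APPS variables are made up for illustration. It uses "patch -p1 -i file" in place of the "<" redirect, which should do the same thing. By default it only prints the commands so you can read them before committing.

```shell
#!/bin/sh
# Sketch of the Maq download/compile/patch/recompile steps as one function.
# By default it only prints the commands (dry run); set DRY_RUN=0 to execute them.
# APPS is the download folder used in this post; adjust it to your own layout.
APPS="${APPS:-$HOME/Documents/NGS/ApplicationDownloads}"

run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    fi
}

build_maq() {
    run cd "$APPS"
    run tar -xjf maq-0.7.1.tar.bz2              # same as double clicking the file
    run cd maq-0.7.1
    run make -f Makefile.generic                # first compile
    run patch -p1 -i ../maq-ill2sanger.patch    # patch sits next to the maq-0.7.1 folder
    run make -f Makefile.generic                # recompile with ill2sanger included
    # Afterwards type "./maq" by hand and look for ill2sanger in the list
}

build_maq
```

Saved as a file, it would be run for real with something like "DRY_RUN=0 sh build_maq.sh" (the file name is only an example).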

B) BWA (http://sourceforge.net/projects/bio-bwa/files/)

- click on the download link for the newest version "bwa-0.5.9"
- download the file "bwa-0.5.9.tar.bz2"
- move "bwa-0.5.9.tar.bz2" to your "NGS/ApplicationDownloads" folder
- double click "bwa-0.5.9.tar.bz2" to decompress the file
- open up "Terminal"
- navigate to the decompressed folder "cd Documents/NGS/ApplicationDownloads/bwa-0.5.9"
- compile with make command
> enter the following command "make" (Really it's that simple, this command line stuff's not that scary)
- a bunch of "Stuff" will come up, check to ensure no errors are listed!

Now check to see if the install was successful
> enter "./bwa"

***This should print the bwa command options in the Terminal window***

C) SAMtools (http://sourceforge.net/projects/samtools/)

- click on the download link for the newest version "samtools-0.1.12a"
- download the file "samtools-0.1.12a.tar.bz2"
- move "samtools-0.1.12a.tar.bz2" to your "NGS/ApplicationDownloads" folder
- double click "samtools-0.1.12a.tar.bz2" to decompress the file
- open up "Terminal"
- navigate to the decompressed folder "cd Documents/NGS/ApplicationDownloads/samtools-0.1.12a"
- compile with make command
> enter the following command "make" (Really it's that simple)
- a bunch of "Stuff" will come up, check to ensure no errors are listed!

Now check to see if the install was successful
> enter "./samtools"

***This should print the samtools command options in the Terminal window***

The next step is to make each application executable as per the previous post options and we are just about ready to go.
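
As a rough sketch of that step, assuming the $HOME/local/bin approach from the previous post: the helper name install_tool and the BIN_DIR variable are made up for illustration, and the example paths simply follow the folder names used above.

```shell
#!/bin/sh
# Sketch: copy freshly compiled programs into a personal bin folder
# and make them executable.
# install_tool is a hypothetical helper; BIN_DIR defaults to the
# $HOME/local/bin folder suggested in the previous post.
BIN_DIR="${BIN_DIR:-$HOME/local/bin}"

install_tool() {
    # $1 = path to a compiled program, e.g. bwa-0.5.9/bwa
    if [ ! -f "$1" ]; then
        echo "not found: $1" >&2
        return 1
    fi
    mkdir -p "$BIN_DIR" || return 1
    cp "$1" "$BIN_DIR/" || return 1
    chmod 755 "$BIN_DIR/$(basename "$1")"
}

# Example calls (uncomment after compiling):
# cd "$HOME/Documents/NGS/ApplicationDownloads"
# install_tool maq-0.7.1/maq
# install_tool bwa-0.5.9/bwa
# install_tool samtools-0.1.12a/samtools
```

Once the folder is in your PATH, typing "bwa" from any directory should then work without the "./".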

Since we will use BWA we need to download the reference genomes to align against. The simplest place to get the data seems to be ensembl (http://www.ensembl.org/info/data/ftp/index.html) but we run into a problem with the full human genome file (Homo_sapiens.GRCh37.57.dna.toplevel.fa.gz) as it exceeds the maximum character length allowed by the bwa index command. To get around this problem if you want to use a GRCh37/hg19 genome version the best option seems to be the 1000 genomes version (ftp://ftp.sanger.ac.uk/pub/1000genom...ect_reference/) file (human_g1k_v37.fasta.gz). Copy the human_g1k_v37.fasta.gz file to the NGS/RefGenomes folder and then decompress it by double clicking on the file.
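
Before indexing it can be worth checking what the sequences in a reference file are actually called, since the names end up in every alignment. The following is a small sketch; fasta_names is a made-up helper name, and the bwa line in the comments is left commented out because indexing a human genome takes a long time.

```shell
#!/bin/sh
# Sketch: list the sequence names in a FASTA reference (gzipped or not).
# fasta_names is a hypothetical helper name.
fasta_names() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;   # decompress to the screen, the .gz file is kept
        *)    cat "$1" ;;
    esac | awk '/^>/ { sub(/^>/, ""); print $1 }'
}

# Example (uncomment once the file is downloaded):
# fasta_names human_g1k_v37.fasta.gz | head
#
# Indexing is then a single command; "-a bwtsw" is the algorithm the BWA
# manual describes for large genomes such as human:
# bwa index -a bwtsw human_g1k_v37.fasta
```

The 1000 genomes file names its chromosomes "1", "2", ... "X", "Y", "MT", whereas UCSC files use "chr1", "chr2" and so on, so mixing files from the two sources is a common way to get empty results downstream.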

Test Question: Do you know the difference between UCSC mapping versus NCBI/Ensembl....Hint: It's a difference of 0 and 1 but it can really ruin your day when the commercial software manufacturer doesn't know the difference!!
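
Assuming the hint refers to coordinate conventions (UCSC tables and BED files count from 0 with the end position excluded, while NCBI/Ensembl count from 1 with the end included), the conversion is a one-liner. This is only a sketch; bed_to_one_based is a made-up name and the input is assumed to be tab separated chromosome, start, end.

```shell
#!/bin/sh
# Sketch: convert 0-based, half-open intervals (UCSC/BED style)
# to 1-based, inclusive intervals (NCBI/Ensembl style).
# Input columns: chromosome, start, end (tab separated).
bed_to_one_based() {
    # Only the start moves; the end stays the same number
    awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 + 1; print }' "$@"
}

# The first 100 bases of chromosome 1 in BED style are "chr1 0 100"
printf 'chr1\t0\t100\n' | bed_to_one_based
```

The example prints chr1, 1 and 100: the same 100 bases, just counted the other way.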

Next step, making the computer do some of the work

See next post,

Jonathan

**Jon_Keats** · 04-07-2010, 08:37 PM

The following script will create all the directories noted in the previous post if you want to replicate the pipeline I'm putting together...

NOTE - THIS SCRIPT HAS BEEN UPDATED TO VERSION 3 IN A LATER POST

Code:

#!/bin/sh

# Create_NGS_DirectoryStructureV1.sh
# Created by Jonathan Keats on 4/5/10.
# This file will create the directory structure needed for the subsequent pipeline
# To get this script working do one of the following:
# Option 1 - Open Terminal
#		Navigate to directory of interest, "Documents" in my case (cd Documents/)
#		Type "nano Create_NGS_DirectoryStructureV1.sh"
#                        (This opens the unix nano text editor)
#		Paste from "#!/bin/sh" to "echo Pipeline Directory Structure Created"
#		Control-O to save, Control-X to exit
#		Make executable, Type "chmod 755 Create_NGS_DirectoryStructureV1.sh"
# Option 2 -  Open Xcode
#		Click "File" and select "New File"
#		In "Choose a template for your new file" select "Shell Script"
#		In new file dialogue enter File Name:"Create_NGS_DirectoryStructureV1.sh"
#		Change location to directory of interest, "Documents" in my case
#		Click "Finish"
#		Paste from "#!/bin/sh" to "echo Pipeline Directory Structure Created"
#		Click "File" and select "Save"
#		Close the file, which should already be executable
# Regardless of Option used - type "./Create_NGS_DirectoryStructureV1.sh" to launch

echo "***Creating Pipeline Directory Structure***"
pwd
ls
mkdir NGS
cd NGS/
mkdir AnalysisNotes
mkdir ApplicationDownloads
mkdir BAMfiles
mkdir FinalOutputs
mkdir InputSequence
mkdir RefGenomes
mkdir SAMfiles
mkdir Scripts
cd BAMfiles/
mkdir Merged
mkdir Original
mkdir Sorted
cd ../FinalOutputs/
mkdir AlignmentResults
mkdir Illumina
mkdir SangerFastq
mkdir SortedBAMfiles
mkdir MergedBAMfiles
cd Illumina/
mkdir Read1
mkdir Read2
cd ../../InputSequence/
mkdir Illumina
mkdir SangerFastq
cd Illumina/
mkdir Read1
mkdir Read2
cd ../../RefGenomes
mkdir BFAST_Indexed
mkdir BOWTIE_Indexed
mkdir BWA_Indexed
mkdir GenomeDownloads
cd ../Scripts
mkdir ScriptBackups
cd ../..
cp Create_NGS_DirectoryStructureV1.sh NGS/Scripts/ScriptBackups/
cd NGS/
pwd
ls
echo Pipeline Directory Structure Created
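
The same folder tree could also be built more compactly with "mkdir -p", which creates any missing parent folders and does not complain if a folder already exists, so the script can be re-run safely. This is only a sketch of an alternative, not the version 3 script mentioned above; the function name make_ngs_tree is made up.

```shell
#!/bin/sh
# Sketch: create the same NGS folder tree with mkdir -p.
# make_ngs_tree is a hypothetical name; $1 is the folder that should contain NGS.
make_ngs_tree() {
    parent="${1:-.}"
    while read -r dir; do
        mkdir -p "$parent/NGS/$dir" || return 1
    done <<'EOF'
AnalysisNotes
ApplicationDownloads
BAMfiles/Merged
BAMfiles/Original
BAMfiles/Sorted
FinalOutputs/AlignmentResults
FinalOutputs/Illumina/Read1
FinalOutputs/Illumina/Read2
FinalOutputs/SangerFastq
FinalOutputs/SortedBAMfiles
FinalOutputs/MergedBAMfiles
InputSequence/Illumina/Read1
InputSequence/Illumina/Read2
InputSequence/SangerFastq
RefGenomes/BFAST_Indexed
RefGenomes/BOWTIE_Indexed
RefGenomes/BWA_Indexed
RefGenomes/GenomeDownloads
SAMfiles
Scripts/ScriptBackups
EOF
}

# Pass the parent folder as the first argument, e.g. "sh make_tree.sh Documents"
if [ -n "$1" ]; then
    make_ngs_tree "$1" && echo "Pipeline Directory Structure Created"
fi
```

Unlike the original, this sketch does not copy itself into Scripts/ScriptBackups; that step would still need the "cp" line.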

**Michael.James.Clark** · 04-09-2010, 09:30 AM

I went from being a pipette jockey who did qPCR for a living to writing an algorithm for SV detection in SOLiD data and publishing a whole genome sequence in a major journal.

YOU CAN DO IT TOO!

And here's how:

Attached Files

Giant-Coffee-Cup.jpg (27.8 KB, 316 views)

Hello - I use to think I was good with a computer