Genome assembly -hiseq illumina data

koadman replied

12-27-2011, 08:58 AM
Great to hear that A5 assembled your genomes. Yes, sounds like you've got a coculture on your hands. I would be surprised to if this is a Hiseq vs. GA2 issue, unless your samples were multiplexed with other organisms and there were barcode read failures or the person who prepped the library and loaded the lane accidentally mixed libraries. Probably unlikely, and not really a Hiseq issue but a possible sequencing issue to rule out.

Mapping contigs back to a high quality reference is a good approach for separating the "contaminant" from the desired genome. You might still miss some small scaffolds that belong in your isolate. The best way to minimize that problem is to have the best possible scaffolds!

Might be interesting to find out what else is living in your sample! I've had pretty good luck in the past picking out the dominant organisms in a mixed sample using EMIRGE, a 16s-based short read taxonomy analysis pipeline. It might tell you whether the contaminants are things you expect to be living in the environment with your desired organism or things that came from somewhere else in the lab. You might also try metAMOS on the raw reads or even a simple BLAST of the A5 scaffolds against nr and subsequent MEGAN.
Leave a comment:
Bgansw replied

12-27-2011, 02:17 AM
Tried A5 pipeline on Hiseq raw data:

Hi Koadman,

My apologies for the delay in getting back to you. Had to go overseas!

I tried the new A5 pipeline, and although it worked, I'm having the same problem, that I have with the other assembly methods. The genome size at the end of A5 is 11.3mb when it should be 4mb. I'm attaching the log file that qsub (cluster) generated.

Our samples are very hard to culture axenically.. so I guess thats what the probem is. Though I didnt have this problem for the last set of samples of the same species (used the GA2 for genome sequencing, and not the Hiseq).Do you have any suggestions on what approach I can take now? I was thinking of mapping the fasta output of A5 onto my reference genome using Bowtie or Soapaligner.
Attached Files

8RS_A5pipeline_26dec.o21938.txt.zip (5.4 KB, 31 views)
Leave a comment:
Rockx replied

12-07-2011, 10:43 PM
For a denovo assembly, I guess you could use the GC filter to assemble a draft genome. You could then use this draft as a reference for future assemblies using contigs filtered using other methods, as you mentioned jvanleuven.

I'm new at all this too, so it's a good learning experience to try these different methods.
Leave a comment:
jvanleuven replied

12-07-2011, 09:59 PM
I will remove the GC bin in my future assemblies, thank you for the input. While I do not have to worry about accidental cocultures (none of the bacteria I work with are culturable), I do sometimes have a difficult time separating genomes by coverage because the variance in the contigs belonging to one genome can overlap the with the contigs belonging to another.
Leave a comment:
koadman replied

12-07-2011, 09:35 PM
well, the largest currently known bacterial genomes are around 10Mbp, and most seem to be under 5Mbp so there's a good chance that 30Mbp is way too large. One simple method to eliminate material not part of your isolate is a coverage-based filter on contigs. For contigs that are sufficiently large, the variance in coverage due to library PCR bias and bridge PCR bias can be averaged out, and sequences that are part of the isolate will show much higher average coverage than contaminant or noisy contigs. This is the approach employed by IDBA and many other tools (the a5 pipeline includes this). I have seen this approach fail to produce the desired isolate genome in the past, and every time it turned out the lab had accidentally cocultured one or more other organisms with the target isolate.

I would be extremely wary of separating on the basis of GC (or higher order k-mer) content since as the previous poster pointed out, lateral transfer can introduce regions with unusual GC and you don't want to lose these!
Leave a comment:
jvanleuven replied

12-07-2011, 09:34 PM
I do inclusive searches. First take all contigs with similar coverage, then add all contigs with close GC, then add contigs that hit by blast, ect.

Although I am also new at this and still learning, my goal is to not exclude contigs. However, I work with samples that have multiple bacterial genomes and a large amount of non-bacterial sequences.

Does this sound reasonable? As I mentioned, I am still learning.
Leave a comment:
Rockx replied

12-07-2011, 09:02 PM
How would you deal with gene transfer in these instances? Horizontally acquired genes would have a different GC content, which would be eliminated using GC and blast binning.
Leave a comment:
jvanleuven replied

12-07-2011, 07:25 AM
How do you calculate the 30MB? In every assembly you could have excess contigs that you don't want to use. If the genome sequence is similar to an already published genome, you can use nucmer or promer to pick out big contigs that map to your genome. If they are much different, then I use a combination of binning my GC content, coverage, and tblastx hits.

jt
Leave a comment:
koadman replied

12-06-2011, 09:29 PM
Indeed you are correct! I must have been distracted or written that too hastily. It's not BWA that has that requirement, but rather SGA which shares a little bit of algorithm with BWA. My deepest apologies for the confusion. SGA's preprocess step, at least the version of the code embedded in the a5 pipeline, when run in paired-end mode, insists on having identifiers of the form /1 /2 if any are used. And lo, I just discovered it will happily run without any identifiers. So the little code fragment which mangled your file by trying to add an identifier does appear to be unnecessary. Deleting code is a win in my book, thank you.

I've posted a revised a5_pipeline.pl to the source code repository. You should be able to drop this in place of the existing a5_pipeline.pl and hopefully there won't be any more hiccups on your data.

Error 404 (Not Found)!!1

http://ngopt.googlecode.com/svn/trunk/bin/a5_pipeline.pl
Leave a comment:
swbarnes2 replied

12-06-2011, 03:41 PM
Originally posted by koadman View Post

Thanks for sending the nohup.out, really helpful!
Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file.

Umm, that's not true at all. I run bwa all the time, and my reads don't have read1 or read2 identifiers on them at all. In fact, some programs, like Picard's MarkDupliactes won't realize that the two reads are paired together unless their names are identical.

bwa just assumes that the first read of fastq1 is paried with the first read of fastq2.
Leave a comment:
koadman replied

12-06-2011, 02:43 PM
Thanks for sending the nohup.out, really helpful!
Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file. a bit further down we see:

Warning, FASTQ quality string is not the same length as the sequence string for read @HWI-ST705:113:C026NACXX:5:1208:5169:188022/2

My guess is that a5's perl code to add the /1 and /2 identifiers is broken, and added a /2 to that read's quality line. The problem will probably go away if you are able to add the /1 and /2 identifiers to the reads yourself and rerun. In the meantime will go fix this part of a5.
Leave a comment:
Bgansw replied

12-05-2011, 07:52 PM
Hi Koadman,

Am getting this error message, with the pipeline. I suspect that its because the new Hiseq data has been processed by casava 1.8, and generates reads in sanger format (also known as illumina 1.9 format). Most of our scripts were designed for reads in illumina 1.3 format (cause we got the 3 genomes sequenced using the GA2 and not the hiseq).

I had to convert the files to illumina 1.3 to get the pipeline to work on them. Have you used the A5_pipeline on Hiseq data before? If yes, do you know what format your hiseq data is in?

This is the error I'm getting. I've only pasted the last bit of the file. I've attached the file, "nohup.out" as well, so you can see what ran and where it failed. I would really really appreciate your help.

nohup.out
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ head nohup.out
[a5] Found the following libraries:
raw1:
id=raw1
p1=8RS_CAGATC_L005_R1_illuminaformat_sed.fastq
p2=8RS_CAGATC_L005_R2_illumina_sed.fastq
[a5] Cleaning reads with SGA
148311540
[a5] Found 1 libraries
[a5] Starting pipeline at step 1
[a5] Cleaning reads with SGA
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ tail nohup.out
at java.security.SecureClassLoader.defineClass(libgcj.so.7rh)
at java.net.URLClassLoader.findClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at org.halophiles.assembly.InsertSizeExporter.main(InsertSizeExporter.java:36)
Use of uninitialized value in numeric gt (>) at ./bin/a5_pipeline.pl line 1205, <IN> line 55454.
Use of uninitialized value in concatenation (.) or string at ./bin/a5_pipeline.pl line 1211, <IN> line 55454.
[a5] Discarding estimate. Not enough data points:
Use of uninitialized value in multiplication (*) at ./bin/a5_pipeline.pl line 1212, <IN> line 55454.
[a5_preproc] Insert estimate for raw1 not reliable.[a5_preproc] No insert estimate for raw1. Please provide one in a library file. Exiting
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$
Attached Files

nohup.out.txt (13.1 KB, 34 views)
Leave a comment:
Bgansw replied

12-04-2011, 09:07 PM
Hi Koadman,

The A5 pipeline is running on our cluster right now. So far all good!

I'll keep you updated with the results. Hopefully, it'll all run smoothly..

Thanx a ton.
Leave a comment:
koadman replied

11-30-2011, 04:59 PM
blinking and highlighted in red? I don't know where to begin.

Are you using a mac? linux? is it blinking in your terminal window when you run `ls`? Or in some GUI file manager?

To start, it's not necessary to move any of the files in the ngopt pipeline, it can and should be run in place. If you move everything in a5's bin to a bin directory somewhere else (but still in your $PATH) it probably will run but will not be able to find the adapters fasta file used for adapter trimming. We should probably change this to look for adapters in $prefix/share/a5 to be more unixy.

To be specific, you should only have to do the following to get an assembly (instructions for linux):

Code:

wget http://ngopt.googlecode.com/files/ngopt_a5pipeline_linux-x64_20111128.tar.gz tar xvzf ngopt_a5pipeline_linux-x64_20111128.tar.gz ngopt_a5pipeline_linux-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> <my_assembly>

where <read 1 fastq> and <read 2 fastq> are the filesystem paths to your paired read fastq files (one file per read in a pair, do not include the <>! it is just a convention to denote a required argument) and my_assembly is the base name or path you want to give your assembly files.

Instructions for mac (which doesn't ship with wget's holy awesomeness):

Code:

curl http://ngopt.googlecode.com/files/ngopt_a5pipeline_macOS-x64_20111128.tar.gz > ngopt_a5pipeline_macOS-x64_20111128.tar.gz tar xvzf ngopt_a5pipeline_macOS-x64_20111128.tar.gz ngopt_a5pipeline_macOS-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> my_assembly
Leave a comment:
Bgansw replied

11-30-2011, 04:22 PM
Hi Koadman,

I've installed the A5 pipeline in my bin folder n have tried to install it.

The a5_pipeline.pl is blinking, and its highlighted in red n not working.

How do I rectify this?
Leave a comment:

Previous 1 2 template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News