Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • koadman
    replied
    Great to hear that A5 assembled your genomes. Yes, sounds like you've got a coculture on your hands. I would be surprised to if this is a Hiseq vs. GA2 issue, unless your samples were multiplexed with other organisms and there were barcode read failures or the person who prepped the library and loaded the lane accidentally mixed libraries. Probably unlikely, and not really a Hiseq issue but a possible sequencing issue to rule out.

    Mapping contigs back to a high quality reference is a good approach for separating the "contaminant" from the desired genome. You might still miss some small scaffolds that belong in your isolate. The best way to minimize that problem is to have the best possible scaffolds!

    Might be interesting to find out what else is living in your sample! I've had pretty good luck in the past picking out the dominant organisms in a mixed sample using EMIRGE, a 16s-based short read taxonomy analysis pipeline. It might tell you whether the contaminants are things you expect to be living in the environment with your desired organism or things that came from somewhere else in the lab. You might also try metAMOS on the raw reads or even a simple BLAST of the A5 scaffolds against nr and subsequent MEGAN.

    Leave a comment:


  • Bgansw
    replied
    Tried A5 pipeline on Hiseq raw data:

    Hi Koadman,

    My apologies for the delay in getting back to you. Had to go overseas!

    I tried the new A5 pipeline, and although it worked, I'm having the same problem, that I have with the other assembly methods. The genome size at the end of A5 is 11.3mb when it should be 4mb. I'm attaching the log file that qsub (cluster) generated.

    Our samples are very hard to culture axenically.. so I guess thats what the probem is. Though I didnt have this problem for the last set of samples of the same species (used the GA2 for genome sequencing, and not the Hiseq).Do you have any suggestions on what approach I can take now? I was thinking of mapping the fasta output of A5 onto my reference genome using Bowtie or Soapaligner.
    Attached Files

    Leave a comment:


  • Rockx
    replied
    For a denovo assembly, I guess you could use the GC filter to assemble a draft genome. You could then use this draft as a reference for future assemblies using contigs filtered using other methods, as you mentioned jvanleuven.

    I'm new at all this too, so it's a good learning experience to try these different methods.

    Leave a comment:


  • jvanleuven
    replied
    I will remove the GC bin in my future assemblies, thank you for the input. While I do not have to worry about accidental cocultures (none of the bacteria I work with are culturable), I do sometimes have a difficult time separating genomes by coverage because the variance in the contigs belonging to one genome can overlap the with the contigs belonging to another.

    Leave a comment:


  • koadman
    replied
    well, the largest currently known bacterial genomes are around 10Mbp, and most seem to be under 5Mbp so there's a good chance that 30Mbp is way too large. One simple method to eliminate material not part of your isolate is a coverage-based filter on contigs. For contigs that are sufficiently large, the variance in coverage due to library PCR bias and bridge PCR bias can be averaged out, and sequences that are part of the isolate will show much higher average coverage than contaminant or noisy contigs. This is the approach employed by IDBA and many other tools (the a5 pipeline includes this). I have seen this approach fail to produce the desired isolate genome in the past, and every time it turned out the lab had accidentally cocultured one or more other organisms with the target isolate.

    I would be extremely wary of separating on the basis of GC (or higher order k-mer) content since as the previous poster pointed out, lateral transfer can introduce regions with unusual GC and you don't want to lose these!

    Leave a comment:


  • jvanleuven
    replied
    I do inclusive searches. First take all contigs with similar coverage, then add all contigs with close GC, then add contigs that hit by blast, ect.

    Although I am also new at this and still learning, my goal is to not exclude contigs. However, I work with samples that have multiple bacterial genomes and a large amount of non-bacterial sequences.

    Does this sound reasonable? As I mentioned, I am still learning.

    Leave a comment:


  • Rockx
    replied
    How would you deal with gene transfer in these instances? Horizontally acquired genes would have a different GC content, which would be eliminated using GC and blast binning.

    Leave a comment:


  • jvanleuven
    replied
    How do you calculate the 30MB? In every assembly you could have excess contigs that you don't want to use. If the genome sequence is similar to an already published genome, you can use nucmer or promer to pick out big contigs that map to your genome. If they are much different, then I use a combination of binning my GC content, coverage, and tblastx hits.

    jt

    Leave a comment:


  • koadman
    replied
    Indeed you are correct! I must have been distracted or written that too hastily. It's not BWA that has that requirement, but rather SGA which shares a little bit of algorithm with BWA. My deepest apologies for the confusion. SGA's preprocess step, at least the version of the code embedded in the a5 pipeline, when run in paired-end mode, insists on having identifiers of the form /1 /2 if any are used. And lo, I just discovered it will happily run without any identifiers. So the little code fragment which mangled your file by trying to add an identifier does appear to be unnecessary. Deleting code is a win in my book, thank you.

    I've posted a revised a5_pipeline.pl to the source code repository. You should be able to drop this in place of the existing a5_pipeline.pl and hopefully there won't be any more hiccups on your data.

    Leave a comment:


  • swbarnes2
    replied
    Originally posted by koadman View Post
    Thanks for sending the nohup.out, really helpful!
    Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file.
    Umm, that's not true at all. I run bwa all the time, and my reads don't have read1 or read2 identifiers on them at all. In fact, some programs, like Picard's MarkDupliactes won't realize that the two reads are paired together unless their names are identical.

    bwa just assumes that the first read of fastq1 is paried with the first read of fastq2.

    Leave a comment:


  • koadman
    replied
    Thanks for sending the nohup.out, really helpful!
    Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file. a bit further down we see:

    Warning, FASTQ quality string is not the same length as the sequence string for read @HWI-ST705:113:C026NACXX:5:1208:5169:188022/2

    My guess is that a5's perl code to add the /1 and /2 identifiers is broken, and added a /2 to that read's quality line. The problem will probably go away if you are able to add the /1 and /2 identifiers to the reads yourself and rerun. In the meantime will go fix this part of a5.

    Leave a comment:


  • Bgansw
    replied
    Hi Koadman,

    Am getting this error message, with the pipeline. I suspect that its because the new Hiseq data has been processed by casava 1.8, and generates reads in sanger format (also known as illumina 1.9 format). Most of our scripts were designed for reads in illumina 1.3 format (cause we got the 3 genomes sequenced using the GA2 and not the hiseq).

    I had to convert the files to illumina 1.3 to get the pipeline to work on them. Have you used the A5_pipeline on Hiseq data before? If yes, do you know what format your hiseq data is in?

    This is the error I'm getting. I've only pasted the last bit of the file. I've attached the file, "nohup.out" as well, so you can see what ran and where it failed. I would really really appreciate your help.


    nohup.out
    [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ head nohup.out
    [a5] Found the following libraries:
    raw1:
    id=raw1
    p1=8RS_CAGATC_L005_R1_illuminaformat_sed.fastq
    p2=8RS_CAGATC_L005_R2_illumina_sed.fastq
    [a5] Cleaning reads with SGA
    148311540
    [a5] Found 1 libraries
    [a5] Starting pipeline at step 1
    [a5] Cleaning reads with SGA
    [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ tail nohup.out
    at java.security.SecureClassLoader.defineClass(libgcj.so.7rh)
    at java.net.URLClassLoader.findClass(libgcj.so.7rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
    at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
    at org.halophiles.assembly.InsertSizeExporter.main(InsertSizeExporter.java:36)
    Use of uninitialized value in numeric gt (>) at ./bin/a5_pipeline.pl line 1205, <IN> line 55454.
    Use of uninitialized value in concatenation (.) or string at ./bin/a5_pipeline.pl line 1211, <IN> line 55454.
    [a5] Discarding estimate. Not enough data points:
    Use of uninitialized value in multiplication (*) at ./bin/a5_pipeline.pl line 1212, <IN> line 55454.
    [a5_preproc] Insert estimate for raw1 not reliable.[a5_preproc] No insert estimate for raw1. Please provide one in a library file. Exiting
    [bgansw@bioinfo ngopt_a5pipeline_linux-x64]$
    Attached Files

    Leave a comment:


  • Bgansw
    replied
    Hi Koadman,

    The A5 pipeline is running on our cluster right now. So far all good!

    I'll keep you updated with the results. Hopefully, it'll all run smoothly..

    Thanx a ton.

    Leave a comment:


  • koadman
    replied
    blinking and highlighted in red? I don't know where to begin.

    Are you using a mac? linux? is it blinking in your terminal window when you run `ls`? Or in some GUI file manager?

    To start, it's not necessary to move any of the files in the ngopt pipeline, it can and should be run in place. If you move everything in a5's bin to a bin directory somewhere else (but still in your $PATH) it probably will run but will not be able to find the adapters fasta file used for adapter trimming. We should probably change this to look for adapters in $prefix/share/a5 to be more unixy.

    To be specific, you should only have to do the following to get an assembly (instructions for linux):

    Code:
    wget http://ngopt.googlecode.com/files/ngopt_a5pipeline_linux-x64_20111128.tar.gz
    tar xvzf ngopt_a5pipeline_linux-x64_20111128.tar.gz
    ngopt_a5pipeline_linux-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> <my_assembly>
    where <read 1 fastq> and <read 2 fastq> are the filesystem paths to your paired read fastq files (one file per read in a pair, do not include the <>! it is just a convention to denote a required argument) and my_assembly is the base name or path you want to give your assembly files.

    Instructions for mac (which doesn't ship with wget's holy awesomeness):

    Code:
    curl http://ngopt.googlecode.com/files/ngopt_a5pipeline_macOS-x64_20111128.tar.gz > ngopt_a5pipeline_macOS-x64_20111128.tar.gz
    tar xvzf ngopt_a5pipeline_macOS-x64_20111128.tar.gz
    ngopt_a5pipeline_macOS-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> my_assembly

    Leave a comment:


  • Bgansw
    replied
    Hi Koadman,

    I've installed the A5 pipeline in my bin folder n have tried to install it.

    The a5_pipeline.pl is blinking, and its highlighted in red n not working.

    How do I rectify this?

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM
  • seqadmin
    Recent Advances in Sequencing Analysis Tools
    by seqadmin


    The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
    05-06-2024, 07:48 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 01:32 PM
0 responses
4 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-24-2024, 07:15 AM
0 responses
198 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 10:28 AM
0 responses
220 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 07:35 AM
0 responses
229 views
0 likes
Last Post seqadmin  
Working...
X