Newbler Illumina Paired End Reads - Interleave or not to interleave?

Zapages

Member

Join Date: Oct 2012

Posts: 98
#1


05-12-2015, 04:49 PM

Hi Everyone,

The Newbler documentation is very bare-bones, from what I have been able to gather. I have successfully installed Newbler with the GUI on our Ubuntu (Bio-Linux 8) workstation.

For any future researchers, here is a summary of the commands to use. A huge thanks to Jeff Wintersinger and dsenalik for their posts.

Code:

# Install 32-bit version of libs needed for JRE packaged with Newbler - do this as root
apt-get install libxi6:i386 libxtst6:i386

# Extract assembler archive downloaded from 454
tar xvzf DataAnalysis_2.8_All_20120731_2108.tgz
cd DataAnalysis_2.8_All/packages/

# Extract RPMs - Do not do this as root.
for foo in *.rpm; do rpm2cpio "$foo" | cpio -idmv; done
cd opt/454/apps

# Run assembler
assembly/bin/gsAssembler

# Optional, if you have trouble with importing your FASTQ, SFF, or FASTA files into the GUI of Newbler
cd /opt/454/apps/assembly/config
for file in ../../gsSeqTools/config/* ; do sudo ln -s "$file" ; done

Credit: http://jeff.wintersinger.org/posts/2...n-ubuntu-1204/ and http://seqanswers.com/forums/showpos...1&postcount=22

I am working with three types of data sets: two sets of Illumina paired-end reads (275 bp), two sets of Ion PGM data (both SFF and FASTQ; the longest read is about 600 bp), and a FASTA (Sanger) data set.

In the future, I hope to do a hybrid assembly with Newbler.

I was wondering: do I have to interleave the FASTQ files for the Illumina data sets before adding them to the Newbler GUI, or do I leave them as they are? I have been interleaving the files for the Ray and Velvet assemblers (via the command line).

I know the FASTQ format is based on Sanger quality scores (Illumina 1.8+). Also, do I have to modify the FASTQ files to make them more acceptable to Newbler?

Should I play around with the settings? Should I leave default settings for the minimum overlap length (40) and minimum overlap identity (90)? Also any suggestions for the all contig threshold and longest contig threshold for bacterial and viral genomes? Does enabling low end coverage help? I was thinking of 50 for all contigs and 65K for the largest contig.

Thank you in advance.

-Zapages

Last edited by Zapages; 05-12-2015, 04:51 PM.
flxlex

Moderator

Join Date: Nov 2008

Posts: 414
#2

05-13-2015, 05:22 AM

Originally posted by Zapages

I was wondering: do I have to interleave the FASTQ files for the Illumina data sets before adding them to the Newbler GUI, or do I leave them as they are? I have been interleaving the files for the Ray and Velvet assemblers (via the command line).

Newbler determines pairs based on the sequence IDs, so I think it does not matter whether the reads are interleaved or not. But you need to check the 454PairStatus file(s) to be sure.
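If you do want an interleaved file anyway (as for Ray or Velvet), or just want to confirm that the mates in the two files line up by ID before loading them, something like the Python sketch below may help. It assumes the two common naming conventions (`/1` and `/2` suffixes, or the Illumina 1.8+ style where the mate number follows a space); the exact ID-matching rules Newbler applies are not described in this thread, so treat `base_id` as an illustrative helper, not a description of Newbler's behaviour.

```python
import itertools
import re


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines."""
    it = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(itertools.islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def base_id(header):
    """Strip the mate marker so both reads of a pair share one ID.

    Handles '@name/1' and '@name/2' as well as '@name 1:N:0:...'
    (mate number after a space). Hypothetical helper for illustration.
    """
    name = header[1:].split()[0]  # drop '@' and any comment field
    return re.sub(r"/[12]$", "", name)


def interleave(forward_lines, reverse_lines):
    """Yield interleaved FASTQ lines, checking that mates agree on ID."""
    pairs = itertools.zip_longest(read_fastq(forward_lines),
                                  read_fastq(reverse_lines))
    for fwd, rev in pairs:
        if fwd is None or rev is None:
            raise ValueError("files contain different numbers of reads")
        if base_id(fwd[0]) != base_id(rev[0]):
            raise ValueError(f"mate mismatch: {fwd[0]} vs {rev[0]}")
        yield from fwd
        yield from rev
```

To use it on real files, open both FASTQ files, pass the file objects to `interleave`, and write each yielded line followed by a newline to the output file. A `ValueError` means the two files are out of sync, which would be worth fixing before any assembly.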

Originally posted by Zapages

I know the FASTQ format is based on Sanger quality scores (Illumina 1.8+). Also, do I have to modify the FASTQ files to make them more acceptable to Newbler?

Maybe. Get them into the standard Sanger format (Phred+33 quality encoding).
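As a rough way to check which encoding a file uses, and to convert older Phred+64 qualities to Sanger (Phred+33), here is a minimal Python sketch. The function names are made up for illustration, and the offset guess is only a heuristic: a file containing nothing but very high qualities can look like Phred+64 even when it is Phred+33, so inspect a large sample of reads.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    older +64 encodings, so their presence implies Phred+33.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def to_sanger(quality, offset):
    """Re-encode one quality string to Phred+33 given its current offset."""
    if offset == 33:
        return quality
    return "".join(chr(ord(ch) - offset + 33) for ch in quality)
```

Illumina 1.8+ output is already Phred+33, so for data like that described in the first post the conversion step should be a no-op.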

Originally posted by Zapages

Should I play around with the settings? Should I leave default settings for the minimum overlap length (40) and minimum overlap identity (90)? Also any suggestions for the all contig threshold and longest contig threshold for bacterial and viral genomes? Does enabling low end coverage help? I was thinking of 50 for all contigs and 65K for the largest contig.

With enough coverage (>20x) the default settings should be OK.
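To see whether a data set clears that 20x mark, coverage can be estimated as total sequenced bases divided by genome size. A small Python sketch follows; the read count and genome size in the usage lines are invented numbers for illustration, not values from this thread.

```python
def total_bases_paired(num_pairs, read_length):
    """Total bases in a paired-end run: two reads per pair."""
    return 2 * num_pairs * read_length


def estimate_coverage(total_bases, genome_size):
    """Average depth of coverage = sequenced bases / genome length."""
    return total_bases / genome_size


# Hypothetical example: 200,000 pairs of 275 bp reads on a 5 Mb genome.
bases = total_bases_paired(200_000, 275)
coverage = estimate_coverage(bases, 5_000_000)
print(f"approximate coverage: {coverage:.1f}x")
```

This ignores trimming and unequal read lengths, so the real figure after quality filtering will be somewhat lower.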

Note, though, that Newbler does not really understand the Ion Torrent error model, although it is very close to the 454 one.

Good luck!