  • Newbler Illumina Paired End Reads - Interleave or not to interleave?

    Hi Everyone,

    The Newbler documentation is very bare-bones from what I have been able to gather. I have successfully installed Newbler with the GUI on our Ubuntu (Bio-Linux 8) workstation.

    As a summary for any future researchers, the following script worked for me. A huge thanks to Jeff Wintersinger and dsenalik for their posts.

    Code:
    # Install 32-bit version of libs needed for JRE packaged with Newbler - do this as root
    apt-get install libxi6:i386 libxtst6:i386
    
    # Extract assembler archive downloaded from 454
    tar xvzf DataAnalysis_2.8_All_20120731_2108.tgz
    cd DataAnalysis_2.8_All/packages/
    
    # Extract RPMs - Do not do this as root.
    for foo in *.rpm; do rpm2cpio "$foo" | cpio -idmv; done
    cd opt/454/apps
    
    # Run assembler
    assembly/bin/gsAssembler
    
    # Optional: if you have trouble importing your FASTQ, SFF, or FASTA files into the Newbler GUI
    cd /opt/454/apps/assembly/config
    for file in ../../gsSeqTools/config/* ; do sudo ln -s "$file" ; done

    Credit: http://jeff.wintersinger.org/posts/2...n-ubuntu-1204/ and http://seqanswers.com/forums/showpos...1&postcount=22

    I am working with 3 types of data sets:

    I have two sets of Illumina paired-end reads (275 bp). On top of this, I have two Ion PGM data sets (both SFF and FASTQ; the longest read is about 600 bp). Finally, I have a FASTA (Sanger) data set.

    In the future, I hope to do a hybrid assembly with Newbler.

    I was wondering: do I have to interleave the FASTQ files for the Illumina data sets before adding them to the Newbler GUI, or do I leave them as they are? I have been interleaving the files for the Ray and Velvet assemblers (via the command line).
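
For reference, the kind of interleaving mentioned above for Ray and Velvet can be sketched as follows. This is a minimal sketch, not something Newbler is known to require: the function name `interleave_fastq` and the file names are hypothetical, and it assumes plain 4-line FASTQ records in the same order and number in both files.

```shell
# Interleave two FASTQ files record by record (forward record, then its mate).
# Assumption: 4-line records, same read order and count in both files.
# Hypothetical usage:
#   interleave_fastq sample_R1.fastq sample_R2.fastq > sample_interleaved.fastq
interleave_fastq() {
    awk -v mates="$2" '
        { print }                          # pass through each forward-file line
        NR % 4 == 0 {                      # a forward record just ended
            for (i = 0; i < 4; i++) {      # emit the matching 4-line mate record
                if ((getline line < mates) > 0) print line
            }
        }
    ' "$1"
}
```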

    I know the FASTQ files use Sanger quality encoding (Illumina 1.8+). Also, do I have to modify the FASTQ files to make them more acceptable to Newbler?

    Should I play around with the settings, or leave the defaults for the minimum overlap length (40) and minimum overlap identity (90)? Also, any suggestions for the all contig threshold and longest contig threshold for bacterial and viral genomes? Does enabling low end coverage help? I was thinking of 50 for all contigs and 65K for the largest contig.

    Thank you in advance.

    -Zapages
    Last edited by Zapages; 05-12-2015, 04:51 PM.

  • #2
    Originally posted by Zapages
    I was wondering: do I have to interleave the FASTQ files for the Illumina data sets before adding them to the Newbler GUI, or do I leave them as they are? I have been interleaving the files for the Ray and Velvet assemblers (via the command line).
    Newbler determines pairs based on the sequence IDs. I think it does not matter whether the reads are interleaved or not, but you should check the 454PairStatus file(s) to be sure.
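
Since pairing depends on the sequence IDs, one quick sanity check before loading the files is to confirm that both files list the same read names in the same order. A minimal sketch, assuming 4-line FASTQ records and mates named either `name/1` and `name/2` or `name 1:...` and `name 2:...` (common Illumina conventions; which naming schemes Newbler itself recognizes is not confirmed here). The function name `check_pair_ids` is hypothetical.

```shell
# Exit 0 only if both FASTQ files carry the same read names in the same order.
# Assumption: 4-line records; mate suffixes are "/1", "/2" or a " 1:..." comment.
# Hypothetical usage:
#   check_pair_ids sample_R1.fastq sample_R2.fastq && echo "IDs pair up"
check_pair_ids() {
    awk -v mates="$2" '
        function stem(header,    parts) {
            split(header, parts, " ")        # drop the " 1:N:0:..." comment, if any
            sub(/\/[12]$/, "", parts[1])     # drop a trailing /1 or /2
            return parts[1]
        }
        NR % 4 == 1 {                        # header line of each forward record
            if ((getline other < mates) <= 0) {
                print "missing mate for " $0; bad = 1; exit
            }
            if (stem($0) != stem(other)) {
                print "ID mismatch: " $0 " vs " other; bad = 1; exit
            }
            for (i = 0; i < 3; i++) getline other < mates   # skip rest of mate record
        }
        END {
            if (!bad && (getline other < mates) > 0) {
                print "extra reads in " mates; bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}
```

This only checks the input naming; the 454PairStatus output is still the place to confirm how the assembler actually treated the pairs.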

    Originally posted by Zapages
    I know the FASTQ files use Sanger quality encoding (Illumina 1.8+). Also, do I have to modify the FASTQ files to make them more acceptable to Newbler?
    Maybe. Get them into the standard Sanger format.
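
One rough way to confirm that a file already uses Sanger (Phred+33) quality encoding is to look for quality characters that cannot occur in the older Phred+64 encoding. A minimal sketch, assuming 4-line FASTQ records; the function name `guess_phred_offset` is hypothetical, and a file containing only mid-range quality characters is genuinely ambiguous, so treat the result as a hint rather than proof.

```shell
# Print 33, 64, or "ambiguous" based on the quality characters seen.
# Assumption: 4-line FASTQ records (quality string on every 4th line).
# Hypothetical usage:
#   guess_phred_offset sample_R1.fastq
guess_phred_offset() {
    LC_ALL=C awk '
        NR % 4 == 0 {                               # quality line of each record
            if ($0 ~ /[!-:]/) { low = 1; exit }     # ASCII 33-58: only possible with offset 33
            if ($0 ~ /[K-h]/) high = 1              # ASCII 75-104: suggests offset 64
        }
        END {
            if (low)       print 33
            else if (high) print 64
            else           print "ambiguous"
        }
    ' "$1"
}
```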

    Originally posted by Zapages
    Should I play around with the settings, or leave the defaults for the minimum overlap length (40) and minimum overlap identity (90)? Also, any suggestions for the all contig threshold and longest contig threshold for bacterial and viral genomes? Does enabling low end coverage help? I was thinking of 50 for all contigs and 65K for the largest contig.
    With enough coverage (>20x), the default settings should be OK.
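
Coverage here can be estimated as the total number of sequenced bases divided by the expected genome size. A minimal sketch, assuming 4-line FASTQ records; the function name `estimate_coverage`, the file names, and the genome size in the usage line are hypothetical.

```shell
# Print estimated fold coverage: total bases in the reads / genome size.
# Assumption: 4-line FASTQ records (sequence on the 2nd line of each record).
# Hypothetical usage, for an assumed 5 Mb bacterial genome:
#   estimate_coverage 5000000 sample_R1.fastq sample_R2.fastq
estimate_coverage() {
    genome_size=$1
    shift
    awk -v g="$genome_size" '
        FNR % 4 == 2 { bases += length($0) }    # sequence line of each record
        END { printf "%.1f\n", bases / g }
    ' "$@"
}
```

If the printed value is well above 20, the advice above suggests the default overlap settings are a reasonable starting point.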

    Note, though, that Newbler does not really understand the Ion Torrent error model, although it is very close to the 454 one.

    Good luck!
