  • PBJelly errors in setup, extraction, support stages

    I've been trying to construct a de novo assembly of a mammalian genome for some time now. Currently I have an incomplete genome assembled from Illumina data with ALLPATHS-LG, and I would like to use PBJelly to fill in the gaps using PacBio reads.

    I ran the test data successfully, but the pipeline doesn't seem to work on my real data. I'm seeing essentially no improvement in my assembly quality after running PBJelly on my PacBio reads. I'm getting a lot of errors in the assembly, especially at the setup and mapping stages. About twenty percent of my scaffold references are giving me this error in setup:

    Code:
    2015-03-19 09:48:26,814 [DEBUG] Scaffold scaffold_40566|ref0053720 is empty
    I'm not seeing any other errors in setup, though. In extraction, I get this kind of output:

    Code:
    2015-03-24 11:25:12,545 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.1.mod.fastq
    2015-03-24 11:25:18,887 [INFO] Loaded 53626 Reads
    2015-03-24 11:25:21,197 [INFO] Parsed 12357 Reads
    2015-03-24 11:25:21,197 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.2.mod.fastq
    2015-03-24 11:25:24,073 [INFO] Loaded 48605 Reads
    2015-03-24 11:25:28,346 [INFO] Parsed 11056 Reads
    And so forth for the rest of my data. Again, it appears to be throwing out another 20% of the data. Support is where I start to see even more issues, with both of these flags coming up in large numbers:

    Code:
    2015-03-20 14:02:14,425 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2576/2848_6155 has mapq 0 - below threshold 200
    2015-03-20 14:02:14,429 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2782/17335_18304 has mapq 0 - below threshold 200
    
    2015-03-20 14:02:30,989 [DEBUG] gapSup
    2015-03-20 14:02:30,989 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,989 [DEBUG] RightDist 202 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -4938 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG]
    2015-03-20 14:02:30,990 [DEBUG] gapSup
    2015-03-20 14:02:30,990 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,990 [DEBUG] RightDist -3599 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -1217 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] span support
    2015-03-20 14:02:30,990 [DEBUG]
    I've checked the reads with FastQC and they don't seem to be noticeably lower quality than I would expect, so I'm finding this very confusing. I'm running PBJelly with all the defaults. Is there anything that might be confounding my analysis to produce these results? I'd be happy to post more log data if it would be helpful.

    Does anyone have any advice? Any insight at all would be very welcome.
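
A quick way to put a number on the extraction output above is to tally the Loaded and Parsed counts per file. The sketch below is only that: it parses the two log entries quoted in the post (paths shortened to file names) and reports the ratio. What PBJelly counts as "Loaded" versus "Parsed" is not established in this thread, so treating the ratio as a "retained fraction" is an assumption. The function name `summarize` is made up for illustration.

```python
import re

# Extraction log lines quoted in the post (paths shortened to file names).
LOG = """\
2015-03-24 11:25:12,545 [INFO] Parsing Pacbio_A05_1.1.mod.fastq
2015-03-24 11:25:18,887 [INFO] Loaded 53626 Reads
2015-03-24 11:25:21,197 [INFO] Parsed 12357 Reads
2015-03-24 11:25:21,197 [INFO] Parsing Pacbio_A05_1.2.mod.fastq
2015-03-24 11:25:24,073 [INFO] Loaded 48605 Reads
2015-03-24 11:25:28,346 [INFO] Parsed 11056 Reads
"""


def summarize(log_text):
    """Return a list of (file, loaded, parsed, parsed/loaded) tuples."""
    rows = []
    current = None   # file named on the most recent "Parsing" line
    loaded = None    # count from the most recent "Loaded" line
    for line in log_text.splitlines():
        m = re.search(r"\[INFO\] Parsing (\S+)", line)
        if m:
            current, loaded = m.group(1), None
            continue
        m = re.search(r"\[INFO\] Loaded (\d+) Reads", line)
        if m:
            loaded = int(m.group(1))
            continue
        m = re.search(r"\[INFO\] Parsed (\d+) Reads", line)
        if m and current is not None and loaded:
            parsed = int(m.group(1))
            rows.append((current, loaded, parsed, parsed / loaded))
    return rows


rows = summarize(LOG)
for name, loaded, parsed, frac in rows:
    print(f"{name}: parsed {parsed} of {loaded} ({frac:.1%})")
```

For the two files shown, the Parsed count is roughly 23% of the Loaded count, which is worth comparing against the "20% thrown out" impression before drawing conclusions.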

  • #2
    Can you share your starting assembly statistics, and PacBio coverage level?


    • #3
      Sure! My starting assembly was done in ALLPATHS-LG using two Illumina libraries: a fragment library with 62x coverage and a mate-pair library with 43x coverage. All of my PacBio libraries together come to about 1x coverage.

      The starting assembly had scaffold N50s of 96,649 bp (with gaps) and 73,555 bp (without gaps). The contig N50 is 6,131 bp.

      Any other metrics that might be useful?


      • #4
        You cannot close gaps using 1x of data. I would recommend 5x at an absolute minimum, and more like 10x. It also really helps if the data is size-selected.
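
To make the coverage gap concrete, here is a rough back-of-the-envelope sketch. Fold coverage is total sequenced bases divided by genome size, so the extra PacBio data needed to reach a target is (target - current) x genome size. The genome size of 2.7 Gb is a placeholder assumption for a mammalian genome and is not stated in this thread; the helper names are hypothetical.

```python
def fold_coverage(total_bases, genome_size):
    """Fold coverage = total sequenced bases / genome size."""
    return total_bases / genome_size


def extra_bases_needed(current_cov, target_cov, genome_size):
    """Additional bases required to go from current_cov to target_cov."""
    return max(0.0, (target_cov - current_cov) * genome_size)


GENOME_SIZE = 2.7e9   # ASSUMPTION: placeholder mammalian genome size in bp
CURRENT_COV = 1.0     # PacBio coverage reported in post #3

for target in (5, 10):  # minimum and preferred coverage suggested in post #4
    extra = extra_bases_needed(CURRENT_COV, target, GENOME_SIZE)
    print(f"{target}x target: about {extra / 1e9:.1f} Gb more PacBio data")
```

Swap in the actual estimated genome size and the total bases in the PacBio FASTQ files to get figures for a specific project.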
