  • PBJelly errors in setup, extraction, support stages

    I've been working on a de novo assembly of a mammalian genome for some time now. I currently have a draft assembly built from Illumina data with ALLPATHS-LG, and I would like to use PBJelly to fill the gaps with PacBio reads.

    The test data ran successfully, but the pipeline doesn't seem to work on my real data: I see essentially no improvement in assembly quality after running PBJelly with my PacBio reads. The logs show a lot of errors, especially at the setup and mapping stages. About 20% of my reference scaffolds give this message in setup:

    Code:
    2015-03-19 09:48:26,814 [DEBUG] Scaffold scaffold_40566|ref0053720 is empty
    I'm not seeing any other errors in setup, though. In extraction, I get this kind of output:

    Code:
    2015-03-24 11:25:12,545 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.1.mod.fastq
    2015-03-24 11:25:18,887 [INFO] Loaded 53626 Reads
    2015-03-24 11:25:21,197 [INFO] Parsed 12357 Reads
    2015-03-24 11:25:21,197 [INFO] Parsing /scratch/02985/emg2497/mouse_genome_project/pbjelly_nojoblimit/pacbioreads/Pacbio_A05_1.2.mod.fastq
    2015-03-24 11:25:24,073 [INFO] Loaded 48605 Reads
    2015-03-24 11:25:28,346 [INFO] Parsed 11056 Reads
    And so forth for the rest of my data. Again, it appears to be throwing out another 20% of the data. Support is where I see even more issues, with both of these messages coming up in large numbers:

    Code:
    2015-03-20 14:02:14,425 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2576/2848_6155 has mapq 0 - below threshold 200
    2015-03-20 14:02:14,429 [DEBUG] Hit for m140207_170145_42153_c100619042550000001823119607181456_s1_p0/2782/17335_18304 has mapq 0 - below threshold 200
    
    2015-03-20 14:02:30,989 [DEBUG] gapSup
    2015-03-20 14:02:30,989 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,989 [DEBUG] RightDist 202 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -4938 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG]
    2015-03-20 14:02:30,990 [DEBUG] gapSup
    2015-03-20 14:02:30,990 [DEBUG] - Strand on m140207_170145_42153_c100619042550000001823119607181456_s1_p0/16349/3190_8591
    2015-03-20 14:02:30,990 [DEBUG] RightDist -3599 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] LeftDist -1217 remainSeq -25
    2015-03-20 14:02:30,990 [DEBUG] span support
    2015-03-20 14:02:30,990 [DEBUG]
    I've checked the reads with tools like FastQC, and their quality doesn't seem noticeably lower than I would expect, so I'm finding this very confusing. I'm running PBJelly with all the defaults. Is there anything that might be confounding my analysis and producing these results? I'd be happy to post more log data if it would be helpful.

    Does anyone have any advice? Any insight at all would be very welcome.
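
One way to put numbers on the extraction output is to pair each "Loaded" line with the following "Parsed" line and compute the ratio. The sketch below is plain Python over the log text quoted above; the function name `parsed_fractions` is made up for illustration and is not part of PBJelly, and the thread does not say what "Parsed" means inside the tool, so the ratio is only descriptive.

```python
import re

# Patterns for the two INFO lines shown in the extraction excerpt.
LOADED = re.compile(r"\[INFO\] Loaded (\d+) Reads")
PARSED = re.compile(r"\[INFO\] Parsed (\d+) Reads")


def parsed_fractions(log_lines):
    """Pair each 'Loaded N Reads' line with the next 'Parsed M Reads' line.

    Returns a list of (loaded, parsed, parsed / loaded) tuples.
    Hypothetical helper for illustration; not a PBJelly function.
    """
    results = []
    loaded = None
    for line in log_lines:
        match = LOADED.search(line)
        if match:
            loaded = int(match.group(1))
            continue
        match = PARSED.search(line)
        if match and loaded:
            parsed = int(match.group(1))
            results.append((loaded, parsed, parsed / loaded))
            loaded = None
    return results


# The four count lines copied from the extraction log in the post.
excerpt = """\
2015-03-24 11:25:18,887 [INFO] Loaded 53626 Reads
2015-03-24 11:25:21,197 [INFO] Parsed 12357 Reads
2015-03-24 11:25:24,073 [INFO] Loaded 48605 Reads
2015-03-24 11:25:28,346 [INFO] Parsed 11056 Reads
""".splitlines()

for loaded, parsed, fraction in parsed_fractions(excerpt):
    print(f"loaded={loaded} parsed={parsed} fraction={fraction:.3f}")
```

For the two files quoted, the parsed count is roughly 23% of the loaded count, which may be worth checking against the "20%" impression before drawing conclusions.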

  • #2
    Can you share your starting assembly statistics and PacBio coverage level?

    • #3
      Sure! My starting assembly was done in ALLPATHS-LG using two Illumina libraries: a fragment library with 62x coverage and a mate-pair library with 43x coverage. All of my PacBio libraries together come to about 1x coverage.

      The starting assembly had scaffold N50s of 96,649 bp (with gaps) and 73,555 bp (without gaps). The contig N50 is 6,131 bp.

      Any other metrics that might be useful?
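
For readers comparing the N50 figures above, here is a minimal sketch of how N50 is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name `n50` and the toy lengths are illustrative only; they are not taken from the poster's assembly or from any assembler's code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, in bp).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Scaffold N50 "with gaps" and "without gaps" differ only in whether the gap bases are counted in each scaffold's length before the calculation.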

      • #4
        You cannot close gaps with 1x of data. I would recommend 5x at an absolute minimum, more like 10x, and it really helps if the data is size-selected.
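
To make the 1x versus 5x to 10x gap concrete, fold coverage is simply total sequenced bases divided by genome size. The sketch below assumes a hypothetical 2.7 Gb genome purely for illustration (the thread only says "mammalian"), and both function names are made up for this example.

```python
def fold_coverage(total_read_bases, genome_size):
    """Average fold coverage: total sequenced bases / genome size."""
    return total_read_bases / genome_size


def extra_bases_needed(current_cov, target_cov, genome_size):
    """Additional bases required to go from current_cov to target_cov."""
    return max(0, (target_cov - current_cov) * genome_size)


# ASSUMPTION: illustrative genome size, not stated in the thread.
GENOME_SIZE = 2_700_000_000

# At about 1x, the PacBio reads total roughly one genome's worth of bases.
current = fold_coverage(2_700_000_000, GENOME_SIZE)

for target in (5, 10):
    extra = extra_bases_needed(current, target, GENOME_SIZE)
    print(f"target {target}x: about {extra / 1e9:.1f} Gb more PacBio sequence")
```

Under that assumed genome size, reaching the suggested minimum of 5x would take roughly four more genome-equivalents of PacBio sequence, and 10x roughly nine.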
