  • Alex Lee
    Member
    • Apr 2014
    • 10

    Estimating Paired insert size with Picard

    Hi all, I have paired-end reads (forward and reverse), each 150 bp.
    I used Picard tools to estimate the insert size.
    The output file reports two different median insert sizes, one labelled FR and one labelled RF:

    MEDIAN_INSERT_SIZE
    185 ... FR
    133 ... RF

    My understanding is that the Mean Inner Distance between Mate Pairs is: mode - 2*read length.

    In this case, however, which number should I use, the FR or the RF, since they give very different results?
    I used tophat2 to generate the BAM file first, then ran Picard with the following command:

    java -Xmx2g -jar pathtopicard\CollectInsertSizeMetrics.jar INPUT=accepted_hits.bam OUTPUT=size.txt

    Thanks.
    Last edited by Alex Lee; 08-30-2014, 08:30 PM.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Unfortunately, both of those are wrong, since it is providing one median for reads with insert size over 150bp and one for reads under 150bp. It looks like your insert size may be pretty close to 150bp, but it's hard to say for sure.

    You can get the correct insert size, and an insert size histogram, with BBMerge (which uses overlap, and thus does not require a reference) or BBMap (which uses a reference). It only takes a few hundred thousand reads to get a good estimate; for example (in Linux/bash):

    bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist_merge.txt reads=400000

    or
    bbmap.sh -Xmx24g in1=r1.fq in2=r2.fq ihist=ihist_map.txt reads=400000 ref=ref.fa

    The command would be different on a different OS like Windows, though, so let me know if you encounter any trouble. As for the equation "median_insert_size - 2*read length", that's for calculating the unsequenced fraction in the middle, which is not the insert size. The insert size includes both reads and the unsequenced part, if any.
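
To make the distinction concrete, here is a minimal bash sketch of the arithmetic described above. The function name inner_distance and the example numbers are hypothetical, chosen only for illustration. A negative result means the insert is shorter than the two reads combined, so the mates overlap.

```shell
#!/usr/bin/env bash
# inner_distance INSERT_SIZE READ_LENGTH
# Prints the unsequenced gap between mates: insert size minus both reads.
# (Hypothetical helper, not part of Picard, TopHat, or BBTools.)
inner_distance() {
    local insert_size=$1 read_length=$2
    echo $(( insert_size - 2 * read_length ))
}

inner_distance 400 150   # gap between two 150 bp reads in a 400 bp insert
inner_distance 127 150   # insert shorter than one read: negative, mates overlap
```

The insert size itself is the first argument here; the inner distance is only derived from it.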

    • Alex Lee
      Member
      • Apr 2014
      • 10

      #3
      Wow, thanks Brian - bbmerge is more than that. Sorry for my mistake; what I meant was to calculate the "Mean Inner Distance between Mate Pairs". A benefit right off is that I see it's written in Java, so it can possibly run on Windows. I tried this on Windows 7 but got an error, so I had to do it on Linux. The result was mode: 127; I'm going to realign with tophat with this setting. Oh, and you were right about it being close to 150 - not sure how you figured that out, but awesome all around. Thanks again.
      Last edited by Alex Lee; 08-30-2014, 08:35 PM.


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Alex,

        Since most of your inserts appear to be shorter than read length, you will have substantial adapter contamination. I recommend removing the adapters before mapping (with, for example, BBDuk), which will greatly increase the mapping rate and accuracy. The BBTools package includes the TruSeq adapters (bbmap/resources/truseq.fa), but it's possible some other kind of adapters were used, so I suggest you find out first.
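
As a rough sketch of what such a trimming step might look like, the bash function below only prints a BBDuk command line so it can be inspected before running. The function name bbduk_trim_cmd and the output file names are hypothetical. The settings (ktrim=r k=23 mink=11 hdist=1 tbo tpe) are commonly used BBDuk adapter-trimming values and are an assumption here, not something specified in this thread; check them against the BBDuk documentation for your version.

```shell
#!/usr/bin/env bash
# bbduk_trim_cmd R1 R2 ADAPTERS
# Dry run: prints an adapter-trimming command instead of executing it.
# ktrim=r trims adapter matches and everything to their right;
# tbo also trims by pair overlap; tpe trims both mates to equal length.
# (Parameter values are assumed defaults, not taken from the thread.)
bbduk_trim_cmd() {
    local r1=$1 r2=$2 adapters=$3
    echo bbduk.sh "in1=$r1" "in2=$r2" out1=clean1.fq out2=clean2.fq \
        "ref=$adapters" ktrim=r k=23 mink=11 hdist=1 tbo tpe
}

bbduk_trim_cmd r1.fq r2.fq bbmap/resources/truseq.fa
```

Once the printed command looks right, it can be pasted into the shell and run, and the cleaned files used as input for mapping.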

        Also, it's possible that the mode at 127bp was an artifact peak; you may want to use the median (reported as 50th percentile) or average instead of the mode. This will be obvious if you graph the data as a scatterplot in Excel - either the peak at 127bp will be super-sharp, or the middle of a broad peak. If it is super-sharp, you should find out what the 127bp reads are and remove them. The difference in Tophat results from a 20bp difference in estimated insert size will probably be very small, though.
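
If you would rather check this on the command line than in Excel, a minimal sketch follows. It assumes the histogram file is two whitespace-separated columns (insert size, count) sorted by insert size, with header lines starting with "#"; verify that against your actual ihist file. The function name ihist_stats is hypothetical.

```shell
#!/usr/bin/env bash
# ihist_stats FILE
# Prints the median, mode, and mean insert size of a (size, count) histogram.
# Assumes rows are sorted by size and header lines begin with '#'.
ihist_stats() {
    awk '
        /^#/ || NF < 2 { next }              # skip header and blank lines
        {
            n++
            size[n] = $1; count[n] = $2
            total += $2; weighted += $1 * $2
            if ($2 > best) { best = $2; mode = $1 }   # tallest bin so far
        }
        END {
            if (total == 0) { print "no data"; exit 1 }
            for (i = 1; i <= n; i++) {
                cum += count[i]
                # first bin where the running count reaches half of all pairs
                if (cum * 2 >= total) { median = size[i]; break }
            }
            printf "median=%d mode=%d mean=%.1f\n", median, mode, weighted / total
        }
    ' "$1"
}

# Usage: ihist_stats ihist_merge.txt
```

A mode that sits well away from the median and mean hints at an artifact spike, while a mode close to both is more consistent with the middle of a broad peak.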

        If you want to run these tools in Windows, you can replace "bbmerge.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBMerge" or "bbduk.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBDukF".
