Header Leaderboard Ad


How to correctly estimate RNA-seq mean insert size and standard distribution



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to correctly estimate RNA-seq mean insert size and standard distribution

    Hi SEQanswers community!

    So I'm not sure if this has been posted - I poked around, but didn't find anything specific enough so I'll just ask:

    I realize that specifying the mean insert size and std. dev. during tophat mapping of RNAseq data is an important thing to do because it can return better mapping results. I also know that the best way to do this is to map your reads and then run a program such as 'getinsertsize.py' or run tools in picard (?) to emperically determine this.

    Before mapping my RNAseq rides, I performed quality trimming using Trimmomatic-0.32.

    So my question is - what order am I supposed to perform all of these steps to determine the insert size and std. dev.? Will quality trimming skew the results of the insert size calculation?? I've got 14 different read sets (each from a different time point or control vs. KO tissue) and I'm also not sure if I need to perform this estimation for every single read set. (i.e. Do I need to find the mean insert size separately for E11.5KO, E11.5Het, E12.5KO, E12.5Het, etc... E17.5KO, E17.5Het??)

    By read set, I mean E11.5KO is one read set, and E11.5Het is another read set, etc.

    What I've done is:

    1. Quality trim the reads
    2. Map E11.5Het (an arbitrary choice)
    3. Run 'getinsertsize.py' on the output (the output was bam then converted to sam)
    4. Remap all of the reads with the new parameters

    Does this seem legitimate?? Or should I go back and remap everything naturally, and then calculate the insert size for each read set? And finally remap everything using the calculated mean insert size/std.dev.?

    I'm happy to clarify anything that doesn't make sense!


  • #2
    For RNA-seq data of a eukaryote, if you have overlapping paired reads, the best way to get the insert size distribution is by merging the reads based on overlap, as that will not be affected by introns. You can do that with BBMerge like this:

    bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist_merging.txt loose

    It is best in this case to NOT quality-trim, as that may eliminate some of the longer inserts. Trimming adapters is fine.

    If the reads are not overlapping, I recommend BBMap, which calculates insert size in such a way that reads spanning introns still yield the correct insert size:

    bbmap.sh ref=reference.fasta in1=read1.fq in2=read2.fq ihist=ihist_mapping.txt out=mapped.sam maxindel=200000

    For either program you can cap it at, say, 1M read pairs with the flag "reads=1000000". This will make it stop early, since 1 million is plenty to calculate an insert size distribution, if you don't want the sam file.


    • #3
      Reply-insert size calculation

      Hi Brian ,

      Sorry, I should have clarified that - I'm working on Mouse tissue.

      So if I'm understanding this correctly - you're suggesting to run BBmap or BBmerge just to estimate the insert size, and then use the estimated insert size in Tophat? (Very clever if so!)

      I'm not at all sure if we have overlapping reads or not. I'm assuming that we do - the output from the RNA-seq was ~30 million pairs of reads per read set, so there is bound to be overlapping reads. What am I using to determine/How do I tell if the reads are overlapping? (i.e. is this something special about the RNA-seq run or anything?)

      Thanks for the clarification and help!

      Last edited by pkstarstorm05; 01-15-2015, 12:16 AM.


      • #4
        Hi Paul,

        BBMerge will tell you the insert size distribution in under a minute if you cap it at 1M reads and they are, in fact, overlapping. You'll have to graph the histogram to determine whether the mode was captured or if the insert size was too long for merging. If you have 2x150bp reads, BBMerge can capture insert sizes up to around 288bp with default settings. If the graph is ascending, then there is a sudden, steep dropoff at 188bp, and/or you only get, say, 20% of your reads to merge, then the inserts were too long (or reads too short) for that approach. You know the mode is captured if the graph goes up, peaks, and then starts going back down well before the sharp dropoff to zero just before 2x(read length).

        BBMap is able to calculate the distribution for arbitrarily long inserts, but takes longer and of course requires a reference. You can use the insert size distribution to feed into Tophat, or you can just use BBMap's sam file directly; it is a splice-aware aligner that does not need insert size as a parameter (since it auto-detects it while running) and is substantially faster and more sensitive than Tophat anyway. To do that, and produce output with cufflinks-compatible tags, you'd also need to add the flags "xstag=unstranded intronlen=10 ambig=random" (unless your data is strand-specific, in which case you could use 'firststrand' or 'secondstrand').
        Last edited by Brian Bushnell; 01-15-2015, 09:03 AM.