Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat insert size of paired-end reads

    I used tophat to align paired-end 76bp RNA-seq.
    The fragments selected at 300bp.
    As the example in tophat manual shows- I used: tophat -r 148 ...
    When I inspected the .bam with:
    samtools view accepted_hits.bam
    or- samtools faststat accepted_hits.bam
    I relized that none of the reads was align with correct insert size.
    1. What to put as insert?
    2.what is the best size of the std. to put in --mate-std-dev ?
    Thanks in advance,
    Oz

  • #2
    Originally posted by ozs2006 View Post
    I used tophat to align paired-end 76bp RNA-seq.
    The fragments selected at 300bp.
    As the example in tophat manual shows- I used: tophat -r 148 ...
    When I inspected the .bam with:
    samtools view accepted_hits.bam
    or- samtools faststat accepted_hits.bam
    I relized that none of the reads was align with correct insert size.
    1. What to put as insert?
    2.what is the best size of the std. to put in --mate-std-dev ?
    Thanks in advance,
    Oz
    Can you reproduce the samtools flagstat output here? That would help diagnose a few things with your run.
    Also, --mate-std-dev could roughly be thought of as the width in bp of the band that was excised after adapter ligation during library prep.

    Comment


    • #3
      Thanks for the quick reply.

      1. here is samtools flagstat output:

      23722180 in total
      0 QC failure
      0 duplicates
      23722180 mapped (100.00%)
      23722180 paired in sequencing
      11861090 read1
      11861090 read2
      0 properly paired (0.00%)
      23722180 with itself and mate mapped
      0 singletons (0.00%)
      0 with mate mapped to a different chr
      0 with mate mapped to a different chr (mapQ>=5)

      2. And about the stdev. of inserts from the manual I saw that the default value is 20bp,
      and it seems very far from the band size in bp we used.
      Do I need to use the band size?

      Comment


      • #4
        Originally posted by ozs2006 View Post
        Thanks for the quick reply.

        1. here is samtools flagstat output:

        23722180 in total
        0 QC failure
        0 duplicates
        23722180 mapped (100.00%)
        23722180 paired in sequencing
        11861090 read1
        11861090 read2
        0 properly paired (0.00%)
        23722180 with itself and mate mapped
        0 singletons (0.00%)
        0 with mate mapped to a different chr
        0 with mate mapped to a different chr (mapQ>=5)

        2. And about the stdev. of inserts from the manual I saw that the default value is 20bp,
        and it seems very far from the band size in bp we used.
        Do I need to use the band size?
        So you actually have zero reads properly paired, which seems very strange. A few very fundamental questions to consider:
        1. Were the input fastq files in the correct orientation (i.e. read2 was in the 5'-3' orientation and the sequence in the fastq file would match the plus strand in the reference genome where it mapped)?
        2. What was your syntax for running tophat?
        3. What value did you use for -r (inner distance between mate pairs) and what was the size range of the band excised during library prep?


        Also, --mate-std-dev can be estimated from the width of the band, not the size. So, if you cut a band at 300+/-25bp on the gel, you have an expected std. dev. of 25.

        Comment


        • #5
          Do you have sizable adapters on your fragments? (Still, even after accounting for adapter sequence, you should get more than zero properly paired read pairs, so I guess there is some more fundamental issue ...)

          Comment


          • #6
            The command I used:
            tophat -r 83 -p 8 --solexa1.3-quals /path_to_bowtie_index/bowtie-0.12.5/indexes/h_sapiens_hg18 s_1_1.fq s_1_2.fq

            I have paired-end reads with 76 cycles for each side.

            1. The other side of the pair is mapped to the reference genome (5' to 3' match to + strand).

            2. I assigned the distance between ends to be 83bp (-r 83).
            Prior to tophat I used eland and aligned these reads. So the summary.htm for this lane detected the insert to be 235bp.
            I will check what the deviation of the cut from the gel.

            3. Another related question: I read tophat manual and paper, and as I understand it,
            tophat shows all results from bowtie alignment in the .bam file + all the reads that align to junctions (its unique property).
            I checked this by running bowtie separately from tophat and compare their output.
            I find that the same reads align to the same place, but have different samtools flag
            (actually, in bowtie I have got "better" flags)
            I can't understand why when I align with bowtie without any inner-distance as input,
            I get results better than tophat. Why do we need this inner-distance in order to run tophat on paired-end?

            From bowtie:
            HWUSI-EAS753_0001:1:4:93:588#0 99 chr2 38562474 255 76M = 38562580 182
            8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? XA:i:0 MD:Z:76 NM:i:0
            HWUSI-EAS753_0001:1:4:93:588#0 147 chr2 38562580 255 76M = 38562474 -182 CATCGTGAGAGCAGACCATGTGGGCCCCAAGCAGATGCAGCAGATCCGCATGTCCCTTCGCGGGAAGGCTGTGGTG ###############################<<:?39=>?=8A<?6=??8+/=?2::4A5>9<;15?=>?96B?>B XA:i:2 MD:Z:7G8A8T42T0C6 NM:i:5

            From tophat:
            HWUSI-EAS753_0001:1:4:93:588#0 129 chr2 38562474 3 76M = 38562474 0 GCAATCCCTGACGCACCGCCGTGATGCCCAGGGAAGACAGGGCGACCTGGAAGTCCAACTACTTCCTTAAGATCAT 8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? NM:i:0 NH:i:2
            HWUSI-EAS753_0001:1:4:93:588#0 65 chr2 38562474 3 76M = 38562474 0 GCAATCCCTGACGCACCGCCGTGATGCCCAGGGAAGACAGGGCGACCTGGAAGTCCAACTACTTCCTTAAGATCAT 8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? NM:i:0 NH:i:2

            Thanks and sorry for this long letter,
            Oz
            Last edited by ozs2006; 11-25-2010, 02:46 PM. Reason: Misspellings

            Comment


            • #7
              1. Size selection (or, more accurately, the intended size selected for) and actual library fragment length distribution are often very different things; you should always check it yourself. Which it seems you did, so so far so good

              2. The reason this parameter has to be specified is that TopHat produces a probabilistic estimate for the best mapping of your reads as a pair based on a number of criteria, one of which is the fragment length distribution. There are situations where it matter very much whether the fragment length is 200 or 600 with respect to proper assignment of read pairs to the genome.

              3. You problem seems to be that your fastq files can't be matched by TopHat, which is extremely odd, even more so given that it seems like both ends are present with the correct IDs in what you posted, and it is not as if your are mixing lanes. Which version of TopHat are you using? Is your data stranded?

              Comment


              • #8
                Thanks for your replay :-)
                I use tophat v1.1.0 .
                Maybe this problem happen because I took the s_1_1_sequence.txt s_1_2_sequence.txt and changed their file extension to s_1_1.fq s_1_2.fq . As far as I know (but maybe I wrong) these file have the same format as any other file in FastQ.
                Do I need to take the data straight from the .qseq files, and convert them to fastq, or to do some text manipulations in order to use the s_1_1_sequence.txt s_1_2_sequence.txt ?
                (When I checked earlier it seems to me that tophat can't handle the _sequence.txt file without changing their file extension).*

                Another curious question about tophat: Is there a simple way to know which reads in the .bam file support the junctions in the bed file, or I need to write a script that check this property (I know this script is also simple).

                *If this is the reason for my odd output I'm sorry I bother all the members of seqanswers with my stupid questions.

                Comment


                • #9
                  It seems very strange, I ran tophat on other lanes of my experiment, and it ran well (more than 50% of properly paired reads according to samtools flagstat). I still don't know why lane 1 running went wrong.
                  1. Do you have any suggestion on this kind of problem?
                  2. Do I need in the downstream analysis take in account only properly paired read?, or in different words: I want to know how tophat assigns this flag in order to make the proper decision.

                  Comment


                  • #10
                    Hello everyone!

                    I also met a similar problem when dealling the insert size using TopHat, I am not sure the reason. Wish you could help.

                    (a). Before I got my solexa paired end insert size is about 300 bp, I mistook it for 500 bp. And my pair-end reads lengths were 80 bp, so at first I chose the -r at 340 (=500-80*2), after running the TopHat with the commend line:
                    tophat -o /output/ -i 30 --solexa1.3-quals -p 10 -r 140 --closure-search --coverage-search --microexon-search --butterfly-search new_scaffold TR_1.fq TR_2.fq >tophat.log

                    I got the bam information there was about 74.82% paried properly :
                    36749126 in total
                    0 QC failure
                    0 duplicates
                    36749126 mapped (100.00%)
                    36749126 paired in sequencing
                    18449324 read1
                    18299802 read2
                    27494744 properly paired (74.82%)
                    32515892 with itself and mate mapped
                    4233234 singletons (11.52%)
                    0 with mate mapped to a different chr
                    0 with mate mapped to a different chr (mapQ>=5)


                    (b) When I realized that I used the wrong insert size parameter, which should be 140 bp (=300-80*2), I run the TopHat with the same other options for the second time,
                    but the information was that only 47.01% reads were properly paired. But other information was similar compared with (a).

                    36520363 in total
                    0 QC failure
                    0 duplicates
                    36520363 mapped (100.00%)
                    36520363 paired in sequencing
                    18334751 read1
                    18185612 read2
                    17167904 properly paired (47.01%)
                    32285318 with itself and mate mapped
                    4235045 singletons (11.60%)
                    0 with mate mapped to a different chr
                    0 with mate mapped to a different chr (mapQ>=5)

                    I don't know why when I give the right insert size I would got the worse mapping result for overall paired reads.

                    Thank you very much !

                    Comment


                    • #11
                      Maybe neither 500 nor 300 are correct. It's better to try to use cufflinks to estimate the distribution of fragment lengths.
                      Last edited by GJin; 04-30-2012, 11:12 AM.

                      Comment


                      • #12
                        how does cufflinks to estimate the distribution of fragment lengths?

                        Originally posted by GJin View Post
                        Maybe neither 500 nor 300 are correct. It's better to try to use cufflinks to estimate the distribution of fragment lengths.
                        Hello, Do you know how can I know the insert size? I see you say can use cufflinks, could you tell that how does cufflinks to estimate the distribution of fragment lengths in detail? Thank you very much!

                        Comment


                        • #13
                          Originally posted by qqtwee View Post
                          Hello, Do you know how can I know the insert size? I see you say can use cufflinks, could you tell that how does cufflinks to estimate the distribution of fragment lengths in detail? Thank you very much!
                          I cannot speak for Gjin, but I assume what she/he meant is that you can use Tophat to align a subset of your reads. Then after running the resulting bam file in Cufflinks you can obtain an estimate of your insert size as Cufflinks provides the following output:

                          > Fragment Length Distribution: Empirical (learned)
                          > Estimated Mean: 202.50
                          > Estimated Std Dev: 69.92

                          Then re-run Tophat using all of your reads and use your estimated mean for -r and your estimated standard deviation for --mate-std-dev

                          Although, I do think it's advisable to read other threads on SEQanswers discussing this topic as there are differing opinions of how much these parameters matter (in terms of successful mapping) as well as how much they make sense (i.e. if you are mapping to the genome vs transcriptome).

                          Comment


                          • #14
                            Hi,
                            i have a question on that topic.
                            I am trying to map some 76x2 Illumina data with tophat-2.0.4. My reads are 76bp and the fragment size is around 280bp. So i choose '-r 128 --mate-std-dev 40' as i have a pdf with graph for the fragment size. I took only 1000 reads and mapped them with tophat. From samtools flagstat i get:

                            1622 + 0 in total (QC-passed reads + QC-failed reads)
                            0 + 0 duplicates
                            1622 + 0 mapped (100.00%:-nan%)
                            1622 + 0 paired in sequencing
                            763 + 0 read1
                            859 + 0 read2
                            656 + 0 properly paired (40.44%:-nan%)
                            676 + 0 with itself and mate mapped
                            946 + 0 singletons (58.32%:-nan%)
                            16 + 0 with mate mapped to a different chr
                            0 + 0 with mate mapped to a different chr (mapQ>=5)

                            properly paired only 40%. i ran cufflinks on the accepted.bam and see this:

                            Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
                            [15:16:58] Inspecting reads and determining fragment length distribution.
                            > Processed 86 loci. [*************************] 100%
                            Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
                            > Map Properties:
                            > Normalized Map Mass: 225.85
                            > Raw Map Mass: 225.85
                            > Fragment Length Distribution: Truncated Gaussian (default)
                            > Default Mean: 200
                            > Default Std Dev: 80
                            [15:16:58] Assembling transcripts and estimating abundances.
                            > Processed 86 loci. [*************************] 100%


                            Fragment length mean 200 and Std.Dev 80.

                            Well i dont get it. Does that mean that my fragment length is actually around 200 and not around 280 as i am told by the graph? And should i use that value for mapping with tophat?

                            Because of that result i tried mapping with tophat with -r 48 (200 - 2*76) but that yeilded 31.7% properly paired reads.

                            Im a bit confused. Can someone please explain to me what is going on?

                            Thank you in advance.

                            Comment


                            • #15
                              Hello, check the appropriate sra file at ncbi, they will give you the length of the left and right reads and the insert mean and std!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-11-2024, 06:55 AM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              110 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              114 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              1 response
                              121 views
                              0 likes
                              Last Post EmiTom
                              by EmiTom
                               
                              Working...
                              X