Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • help for tophat -r

    Hello,

    I am interested in NCBI GEO "GSM765405:CshlLong_RnaSeq_K562_cell_longPolyA", but there is not fragment length.
    a description for "readtype description: Paired 76 n.t. directed reads
    readtype: 2x76D
    replicate description: tier 1
    replicate: 1
    Instrument model: Illumina Genome Analyzer II
    rnaextract description: Poly(A)+ RNA longer than 200 nt
    rnaextract: longPolyA".

    I did not know fragment length for tophat software, "tophat -r ."
    "-r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs."

    Generally speaking, Illumina paired-end fragment is 200bp, I confuse to select -r 50.

    Thanks for any comments in advance.
    I am looking forward to your letter.

    AronaldJ

  • #2
    Hi,

    What people typically do is to map the library (or a subset) to the genome (with Bowtie / BWA), compute an estimate of the fragment size (both mean and Standard Deviation, option --mate-std-dev).
    Don't forget to substract the size of the reads to the mean fragment size and use this as your "mate-inner-dist".
    A possible command-line to calculate this is: (this assume that the read length is constant)

    Code:
    samtools view -F 0x4 File.mapped.bam |  awk '{if ($9 >0) {sum+=$9;sumsq+=$9*$9;N+=1}} END {print "mean = " sum/N " SD=" sqrt(sumsq/N - (sum/N)**2)}'
    Alternatively, if you can estimate the fragment size during the library preparation, that should also work.

    I have no idea how sensitive this parameter is and its effect on the resulting mapping. Anybody has comments on that?

    Hope that helps,

    Nicolas

    Comment


    • #3
      Hi Nicholas,

      When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

      Thanks

      Comment


      • #4
        Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.

        Comment


        • #5
          Originally posted by pinki999 View Post
          Hi Nicholas,

          When I used your one line command to calculate the mean fragment size, I got mean = approx. 158. I am working on PE data with 2x100bp read length. In this case, does the insert size = 158-200 or 200-158?

          Thanks
          Hi,

          Using the command-line above will summarize the values in field 9 ($9), which, in the SAM format indicates the distance between the "left" position of each reads in the pair. So, the inner distance required in the -r option by Tophat should be 158 - 100. Substract only once the size of the reads, not twice!

          [EDIT] I got it wrong, sorry. Field 9 is the oberved Template length, and therefore the inner-distance is equal to "Template length - 2x size of the reads".
          Last edited by Nicolas; 02-11-2012, 02:44 PM.

          Comment


          • #6
            Originally posted by kopi-o View Post
            Remember that for RNA-seq, estimating insert size by mapping to the genome will give misleading results (because of splicing). Perhaps mapping to a transcriptome reference would be better.
            Sure. The point here is to map to the genome to estimate the size of the insert (from full length alignments), and feed this parameter into Tophat, to map to the transcriptome (or at least, to allow splicing inside the reads).

            The first step (mapping to the genome) can usually be done on a small subset of the raw data.

            Comment


            • #7
              Hi Nicholas,

              The following is an alignment from the bowtie output:
              HWI-ST751:9607KCACXX:5:1101:2896:1974 163 chr1 15755557 255 101M = 15755677 221 TGTGCCTGCCGCCTGCCCTCCACACATCCCTGTCCCCCCAACCCGGGAACCCCTGCCCTCCTCCAGCAGGCCGCACCGCCCCTGGGGCCCCCTGCCAGCCC CCBFFFFFHHHHHJJJJJJJJJJJJJJJJJJIHJJJJJJJJJJIJHHFCDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBDDDBBBDDDDDDDBDDDCCDDD XA:i:0 MD:Z:101 NM:i:0
              HWI-ST751:9607KCACXX:5:1101:2896:1974 83 chr1 15755677 255 101M = 15755557 -221 GCAGAAGAGATAGAATCAGGGCTGCCCCCACAGAGTGGGACCCAAGGGGCTAATTGGAGGCACGAGGGGACCCCTCCCCAGGGCCTTTTCCTCCTCTGCGT ###########@:AA9?9BDB<599BDC>(::5,53<DDDDDDDDDEEEFFFFFHHHHHGIHIIIJJJIGJJJJJJJJJJJJIJJJJJGHHHHFDD?FCCB XA:i:0 MD:Z:101 NM:i:0

              If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

              Thank you

              Comment


              • #8
                Originally posted by pinki999 View Post
                Hi Nicholas,

                If 9th column is the distance between the left position of each reads in the pair then the difference between their POS should be equal to the 9th column right? But here the difference between 15755557 and 15755677 is 120. Am I missing something here?

                Thank you

                Oups, my bad. Field 9 is actually the observed template length, and therefore, in your case, the inner-distance is negative... I never had to run Tophat with a negative inner-distance, but you should try ;-)
                If it doesn't work, set it to 0.
                I never did a full test of the sensibility of this parameter, I am not sure how important this is.
                If anybody knows of such a test, or is willing to try, please let us know...

                Comment


                • #9
                  Say you have an expressed transcript containing two exons separated by a 500-bp intron, and you align to the genome. Your paired-end read then aligns, using BWA or Bowtie, to exon 1 with read 1, and to exon 2 with read 2, with no overlapping of the splice junctions. Now you will estimate the fragment size to be 500 bp longer than it actually is.

                  I don't know how common this scenario is, but it would mess up the estimate. This would not happen with a transcriptome reference.

                  Comment


                  • #10
                    This is our standard method. Align the first 5 million reads using bwa against a transcriptome reference (ensembl has easy download). Run Picard CalcualteInsertMetrics on the bam file, extract out the insert size and standard deviation values. Subtract the combined read length from the calculated insert size to get the inner-mate distance. The pass that calculated value and the standard deviation to tophat for the alignment. On our end it uses about 20 minutes of computation time for bwa aln, bwa sampe, samtools sam to bam, samtools sort, picard insertmetrics on 5 million reads. We also use just the first million, not much effect of the numbers but if you want a pretty histogram picture it looks better.

                    Comment


                    • #11
                      unexpectedly small ineset sizes

                      Thanks for the description on estimating insert-size with bwa and picard. I just tried it with some data we have and am a bit surprised by the results. Bioanalyzer results on the sample suggested that mean RNA fragment length was around 300 bp, and we opted for 2x100 bp paired-end Illumina. Data quality is excellent, but when I follow bwa+picard tools aligning to refseq, it reports a mean insert size of ~160 bp. This seems to be significantly smaller than it should be, and implies that the reads overlap in many cases. The estimated inner mate distance would be negative (160-2*100).

                      Am I misinterpreting the picard output ? thanks !

                      Comment


                      • #12
                        Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.

                        I don't think you are misinterpreting the Picard output - if it gives 160 bp as mean insert size, then on average read 1 and read 2 are overlapping.

                        Comment


                        • #13
                          Originally posted by kopi-o View Post
                          Are you sure your fragment sizes from Bioanalyzer don't include adapters? That might explain the discrepancy.
                          I don't think so - the reported length including the adapters is >400 bp. Given that the reads appear to overlap, it's not clear to me what to tell tophat for the mate pair inner distance. Does it handle negative values ?

                          Comment


                          • #14
                            Yes, I believe Tophat does handle negative values.

                            Comment


                            • #15
                              The bioanalyzer trace you would get includes the adaptors so your 300bp bioanalyzer result does include the 120bp of adaptors so the average insert would be around 180bp, pretty close to the sequencer output of around 160bp. This is case and point why I do it this way as lab based estimation is often way off. Why trust the lab tech who always reports some generic value when you have millions of data points from which you can defined the true insert size and thus for tophat calculate the inner-distance.

                              As mention it does take negative values, the caveat being for some applications like hybrid-transcript identification the absence of a gap ruins many of the current methods.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin


                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                [Article Coming Soon!]...
                                Yesterday, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM
                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                09-09-2024, 10:59 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              1 response
                              31 views
                              0 likes
                              Last Post EmiTom
                              by EmiTom
                               
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X