  • Abundance with bowtie, tophat and cufflinks

    Dear colleagues,

    I am building a pipeline to estimate gene abundance (expression) from RNA-seq data, and I am wondering whether my plan is reasonable:

    a) map reads with bowtie using -m 10 (for example), allowing up to 10 hits per read
    a1) here I don't understand how the MAPQ values will be set in the SAM output; my understanding is that when only single hits are allowed (-m 1), all MAPQ values will be 255

    b) take only the unmapped reads from a) and map them with tophat
    b1) the same question again: with -g 40 (default), what are the MAPQ values in the SAM result?

    OK, now I have alignments and I also have a GTF file for my organism (my.gtf)

    c) join alignments from step a) and b) into one sorted SAM file (a_b.sam)

    d) cufflinks -G my.gtf a_b.sam

    * How will cufflinks take the MAPQ values from the SAM file into account and thereby "weight" the multiple hits (giving more weight to single hits, etc.)?

    * Why is cufflinks always mentioned together with tophat, and not with bowtie as well?

    Thank you for your answers,
    Gregor

  • #2
    I am looking at accepted_hits.bam (output from TopHat):

    read_id_0 16 1 2803 0 36M * 0 0 ACACATACACTGCGCTATTAAACAAGACACTTGTAC ffdfffefdfefffffffffffffffffffffffff NM:i:0 NH:i:14 CC:Z:= CP:i:7210

    Does this file contain only alignments that mapped to splice sites? How can I tell how the read was spliced (both mapping locations)?

    Perhaps from the last part (the SAM tags?): NM:i:0 NH:i:14 CC:Z:= CP:i:7210

    tnx,
    Gregor
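
    For reference, a minimal Python sketch of how the fields and optional tags of the record above can be pulled apart. It assumes the line is tab-separated, as in a real SAM file (the forum display shows spaces), and the function name parse_sam_line is made up for illustration. Per the SAM specification, NM is the edit distance to the reference, NH is the number of reported alignments for the read, and CC/CP give the reference name and leftmost position of the next hit ("=" meaning the same reference sequence).

```python
# The record from accepted_hits.bam quoted above, rebuilt with tab separators
# (assumption: the spaces in the post stand in for the tabs of real SAM).
SAM_LINE = (
    "read_id_0\t16\t1\t2803\t0\t36M\t*\t0\t0\t"
    "ACACATACACTGCGCTATTAAACAAGACACTTGTAC\t"
    "ffdfffefdfefffffffffffffffffffffffff\t"
    "NM:i:0\tNH:i:14\tCC:Z:=\tCP:i:7210"
)


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of core fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),    # 1-based leftmost reference position
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }
    tags = {}
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    for item in fields[11:]:
        name, type_code, value = item.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


record = parse_sam_line(SAM_LINE)
is_reverse = bool(record["flag"] & 0x10)  # flag bit 0x10: reverse strand
print(record["cigar"], record["mapq"], record["tags"], is_reverse)
```

    For this record NH is 14 and MAPQ is 0, so the read was reported at 14 locations; flag 16 means this particular hit is on the reverse strand.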



    • #3
      Hi Gregor,
      take a look at this file: http://samtools.sourceforge.net/SAM1.pdf

      TopHat prints both spliced and non-spliced alignments. In this case you do not have a splice (36M).

      You will see a spliced alignment as XXMXXNXXM. Here the XX values are, in order, the number of bases that matched on one exon, the number of bases skipped over the intron, and the number of bases that matched on the other exon.

      I hope it helps.
      Fernando


      In the item 2.2.3 of that file you have:

      2.2.3. Extended CIGAR format
      A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format further allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:

      op  Description
      M   Alignment match (can be a sequence match or mismatch)
      I   Insertion to the reference
      D   Deletion from the reference
      N   Skipped region from the reference
      S   Soft clip on the read (clipped sequence present in <seq>)
      H   Hard clip on the read (clipped sequence NOT present in <seq>)
      P   Padding (silent deletion from the padded reference sequence)
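
      To make the table concrete, here is a small Python sketch that walks a CIGAR string and reports where the skipped (N) regions fall on the reference. The helper names parse_cigar and introns are hypothetical, and the spliced CIGAR 20M500N16M at position 100 is an invented example, not taken from the data in this thread. Only M, D and N advance the reference position; I, S, H and P do not.

```python
import re

# One CIGAR element: a length followed by one of the operations in the table.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")

# Operations that consume reference bases.
REF_CONSUMING = {"M", "D", "N"}


def parse_cigar(cigar):
    """Turn '20M500N16M' into [(20, 'M'), (500, 'N'), (16, 'M')]."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def introns(pos, cigar):
    """Return (start, end) reference coordinates, 1-based and inclusive,
    of every skipped (N) region for an alignment starting at pos."""
    found = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            found.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return found


# The read from post #2: a plain 36M match, so no splice.
print(introns(2803, "36M"))

# Hypothetical spliced read: 20 bases on one exon, a 500-base intron,
# then 16 bases on the next exon.
print(introns(100, "20M500N16M"))
```

      An alignment with an empty result is unspliced; each tuple otherwise marks one intron, so the two exon segments sit immediately before and after it.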



      • #4
        Greg, to answer your last question: tophat uses bowtie as the engine for its read -> genome mapping as part of the algorithm for finding spliced reads. Cufflinks in turn can use the tophat alignments. The programs are modular so that you can run Cufflinks using (spliced) read alignments made with other programs.
