
  • Abundance with bowtie, tophat and cufflinks

    Dear colleagues,

    I am building a pipeline to estimate gene abundance (expression) from RNA-seq data. I am wondering if my plan is reasonable:

    a) map reads with bowtie using -m 10 (for example), allowing 10 multiple hits per read
    a1) here I don't understand how the MAPQ values will be set in the SAM output; I understand that when allowing only single hits (-m 1), all MAPQ values will be 255

    b) take only the unmapped reads from a) for mapping with tophat
    b1) again the same question: with -g 40 (default), what are the MAPQ values in the SAM result?

    OK, now I have alignments and I also have a GTF file for my organism (my.gtf)

    c) join the alignments from steps a) and b) into one sorted SAM file (a_b.sam)

    d) cufflinks -G my.gtf a_b.sam

    * How will cufflinks take into account the MAPQ values from the SAM file and thereby "weight" the multiple hits (giving more weight to single hits, etc.)?

    * Why is cufflinks mentioned with tophat all the time and not with bowtie as well?

    Thank you for your answers,

  • #2
    I am looking at accepted_hits.bam (output from TopHat):

    read_id_0 16 1 2803 0 36M * 0 0 ACACATACACTGCGCTATTAAACAAGACACTTGTAC ffdfffefdfefffffffffffffffffffffffff NM:i:0 NH:i:14 CC:Z:= CP:i:7210

    Does this file contain only alignments that mapped to splice sites? How can I tell how a read was spliced (both mapping locations)?

    Perhaps from the last part (the SAM tags?): NM:i:0 NH:i:14 CC:Z:= CP:i:7210
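
For anyone who wants to inspect such a record programmatically, here is a minimal sketch that splits one SAM line into its 11 mandatory fields and its optional TAG:TYPE:VALUE fields. The function name `parse_sam_line` and the dictionary layout are made up for illustration and are not part of TopHat or samtools. Real SAM is tab-separated; the fallback to whitespace splitting is an assumption added only because the line pasted above uses spaces. Per the SAM specification, NH is the number of reported alignments for the read, and CC/CP give the reference name and leftmost coordinate of the next hit ("=" meaning the same reference).

```python
# Hypothetical helper for illustration; not part of any SAM toolkit.
MANDATORY = ["qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Return a dict of the 11 mandatory SAM fields plus a 'tags' dict."""
    line = line.rstrip("\n")
    # SAM is tab-delimited; fall back to whitespace for pasted forum text.
    cols = line.split("\t") if "\t" in line else line.split()
    if len(cols) < len(MANDATORY):
        raise ValueError("expected at least 11 fields, got %d" % len(cols))

    record = {}
    for name, value in zip(MANDATORY, cols):
        record[name] = int(value) if name in INT_FIELDS else value

    tags = {}
    for field in cols[len(MANDATORY):]:
        tag, typ, value = field.split(":", 2)  # value itself may contain ':'
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        tags[tag] = value
    record["tags"] = tags

    # Bit 0x10 of FLAG means the read aligned to the reverse strand.
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


EXAMPLE = ("read_id_0 16 1 2803 0 36M * 0 0 "
           "ACACATACACTGCGCTATTAAACAAGACACTTGTAC "
           "ffdfffefdfefffffffffffffffffffffffff "
           "NM:i:0 NH:i:14 CC:Z:= CP:i:7210")

rec = parse_sam_line(EXAMPLE)
print(rec["mapq"], rec["cigar"], rec["tags"])
```

For the record above this reports a reverse-strand alignment with NH = 14, i.e. one of 14 reported hits for this read, with the next hit on the same reference at position 7210.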



    • #3
      Hi Gregor,
      take a look at this file:

      TopHat prints both spliced and non-spliced alignments. In this case you do not have a splice (36M).

      You will see a spliced alignment as XXMXXNXXM. Here the XX values are, in order, the number of bases that matched on one exon, the number of bases skipped over the intron, and the number of bases that matched on the other exon.

      I hope it helps.

      In item 2.2.3 of that file you have:

      2.2.3. Extended CIGAR format
      A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows
      for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format
      further allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:
      op  Description
      M   Alignment match (can be a sequence match or mismatch)
      I   Insertion to the reference
      D   Deletion from the reference
      N   Skipped region from the reference
      S   Soft clip on the read (clipped sequence present in <seq>)
      H   Hard clip on the read (clipped sequence NOT present in <seq>)
      P   Padding (silent deletion from the padded reference sequence)
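
As a rough illustration of how the table above translates into "where did the read land", the sketch below walks a CIGAR string and splits the alignment into reference blocks at every N operation. The names `parse_cigar`, `is_spliced` and `exon_blocks` are hypothetical helpers written for this thread, not functions from TopHat or samtools, and the spliced CIGAR `20M500N16M` is an invented example. It assumes POS is the 1-based leftmost reference coordinate, as in SAM.

```python
import re

# One operation = a length followed by an op letter from the table above
# ('=' and 'X' are accepted too, since newer SAM versions define them).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn '20M500N16M' into [(20, 'M'), (500, 'N'), (16, 'M')]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def is_spliced(cigar):
    """True if the alignment skips a reference region (N), e.g. an intron."""
    return any(op == "N" for _, op in parse_cigar(cigar))


def exon_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks, split at N."""
    blocks = []
    start = cursor = pos
    for length, op in parse_cigar(cigar):
        if op in "MD=X":        # these consume reference bases
            cursor += length
        elif op == "N":         # intron: close the block, jump ahead
            blocks.append((start, cursor - 1))
            cursor += length
            start = cursor
        # I, S, H and P do not advance along the reference
    blocks.append((start, cursor - 1))
    return blocks


print(is_spliced("36M"), exon_blocks(2803, "36M"))
print(is_spliced("20M500N16M"), exon_blocks(100, "20M500N16M"))
```

The first call corresponds to the unspliced read from post #2 and yields a single block; the second, invented alignment yields two blocks separated by a 500-base skipped region, which is how both mapping locations of a spliced read can be recovered from POS and CIGAR alone.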


      • #4
        Greg, to answer your last question: TopHat uses Bowtie as the engine for its read-to-genome mapping, as part of its algorithm for finding spliced reads. Cufflinks in turn can use the TopHat alignments. The programs are modular, so you can also run Cufflinks on (spliced) read alignments made with other programs.

