  • Extracting RNAseq read names from Tophat accepted hits

    Hi all,
    I'm trying to find out which gene each of my RNA-seq reads mapped to, using the TopHat accepted_hits.bam file. Ideally I want a list of each RNA-seq read name and the corresponding gene it mapped to. Do any of you know a way to get this information from the BAM file? Any info would be much appreciated.
    Thanks,
    Melissa

  • #2
    That's not usually in the BAM file. Use htseq-count with the -o option on the file.



    • #3
      Thank you Devon.
      Htseq-count shows me how many transcripts mapped to each gene. What I want, however, is the opposite: a list of transcripts and what genes they mapped to. Any idea if Htseq-count is capable of getting that information? Tophat must retain the read and alignment information, I just can't figure out how to get at it.
      -Melissa



      • #4
        Originally posted by mashbaugh:
        Thank you Devon.
        Htseq-count shows me how many transcripts mapped to each gene. What I want, however, is the opposite: a list of transcripts and what genes they mapped to. Any idea if Htseq-count is capable of getting that information? Tophat must retain the read and alignment information, I just can't figure out how to get at it.
        -Melissa

        I don't know whether the reference used in your TopHat mapping was gene-level or chromosome-level. If it was chromosome-level, you will also need a gene annotation file (.gff); the htseq-count command can then do what you want. If your reference was gene-level, a samtools view command can convert the BAM file to a SAM file. That file is plain text, so you can write a text-processing script to count the reads in each gene.
        PS: Since the output BAM file from TopHat contains mapped reads only, you can just count the reads without filtering on flags.
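
For the gene-level case, the text-processing script mentioned above could look roughly like the sketch below. It assumes the SAM text (for example, the output of samtools view) is fed in line by line and that each reference sequence name in column 3 corresponds to one gene. The function name `count_alignments_per_reference` is made up for illustration. Note that it counts alignment records, so a read reported at several locations is counted once per record.

```python
from collections import Counter


def count_alignments_per_reference(sam_lines):
    """Count SAM alignment records per reference name (column 3, RNAME).

    Assumes a gene-level reference, so each RNAME stands for one gene.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        rname = fields[2]
        # '*' means the record has no reference, i.e. it is unmapped.
        if rname != "*":
            counts[rname] += 1
    return counts
```

One way to use it would be to pipe samtools view output into a small script that passes `sys.stdin` to this function and prints the resulting counts.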



        • #5
          You seem to be using "transcript" when you mean "alignments". There's a very fundamental difference between the two concepts. I'll assume that you meant to write "alignments".

          Htseq-count will normally produce a table of how many reads/pairs mapped to each gene. With the -o option, it will also produce a SAM file with each alignment annotated as to which gene (if any) it overlaps (assuming it overlaps only one gene). You can then simply use "grep" to find all alignments for each gene of interest, should you want to do that.



          • #6
            Thank you for pointing that out Devon, I do in fact mean read alignments rather than transcripts. Within the HTseq-count SAM output file my alignments are still being identified on the chromosome level rather than the gene level, even though it does also produce the normal table you described. Is there any way to get the gene information into the SAM file?



            • #7
              It should be adding an XF:Z:some_gene_name auxiliary tag to the output SAM file. The coordinates will still be genomic, of course. If you really want things with transcript-centric coordinates, then just align against the transcriptome with bowtie2 or bwa.
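
To turn that annotated SAM file into the read-name-to-gene list asked for in the first post, a minimal sketch along these lines may help. It assumes the input is the SAM text written by htseq-count with -o, where the gene assignment sits in an XF:Z: optional field. The function name `read_to_gene` is hypothetical. Depending on the HTSeq version, reads that were not assigned to a gene may carry special XF values (such as ones starting with a double underscore) rather than a gene name, so that is worth checking in your own output.

```python
def read_to_gene(sam_lines):
    """Yield (read_name, xf_value) for each SAM alignment record.

    xf_value is the text after 'XF:Z:' in the optional fields,
    or None if the record carries no XF tag.
    """
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        read_name = fields[0]  # column 1 is QNAME
        xf_value = None
        # Optional tags start at column 12 (index 11).
        for tag in fields[11:]:
            if tag.startswith("XF:Z:"):
                xf_value = tag[len("XF:Z:"):]
                break
        yield read_name, xf_value
```

Passing an open file handle for the annotated SAM file to `read_to_gene` and writing each pair out tab-separated would give one line per alignment with the read name and its assigned gene.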
