Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Obtain Spliced Read Names/Mapping with TopHat??

    Hi all. Well, I've finally made my first post.

    Here's my question. I have run Tophat using 27M 76nt Illumina reads mapped against a small (120kb) genome looking for spliced reads. The run appears to complete successfully and I have the typical output files such as junctions.bed which has 582 junctions.

    However, I would really like to look at some of the spliced read alignments. Even when I use the command line option to save temp files I can't seem to find the reads that TopHat believes are spliced. I can see some fastq files but, of course, the read names have been changed. Is there a way to look at finer scale results such as this? Thanks for any help you can give.

  • #2
    bumpbumpbump

    Comment


    • #3
      Originally posted by mendl7 View Post
      bumpbumpbump
      Please do not bump, your question will be answered in time. Keep searching the forums.

      Comment


      • #4
        You should have both a SAM file and a wig file in addition to the junctions.bed you mention ... From the TopHat manual:

        TopHat Output
        1. accepted_hits.sam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is here.

        2. coverage.wig. A UCSC BedGraph wigglegram track, showing the depth of coverage at each position, including the spliced read alignments.

        3. junctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.

        Have you tried converting the SAM file to BAM format, indexing it, and then viewing in IGV?

        TopHat manual
        http://tophat.cbcb.umd.edu/manual.html

        IGV
        http://www.broadinstitute.org/igv/AlignmentData

        Comment


        • #5
          The issue is that the accepted_hits.sam file do not contain reads that are spliced. I believe these are the reads that are simply mapped by Bowtie in the beginning of the TopHat run to determine regions of expression that TopHat tries to link later in the algorithm by spliced reads. I say this because there are no results in accepted_hits.sam that contain the XO: flag which gives the number of gap openings.

          The coverage.wig file does contain results from the spliceds read but only as the depth of coverage or number of reads that mapped to the position.

          If I've got something confused I would greatly appreciate the help.


          What I am seeking is simply a list of reads that TopHat has called spliced and the reference positions that the read positions were mapped to.

          Comment


          • #6
            @mendl7:

            accepted_hits.sam indeed does contain split-read-alignments. At least for me, but i might miss your point. Does the following deliver want you're looking for?

            Code:
            fgrep --invert-match "50M" accepted_hits.sam > spliced_hits.sam
            (you probably have to adjust the search string from 50M to whatever your read length was - you can check it by "head -n 100 accepted_hits.sam" [see column 6])

            For my data this delivers for example:

            Code:
            HWUSI-EAS618_0001:3:63:1663:315#0       147     1       14788   255     42M140N8M       =       14783   0       CTCCGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCCTTGCGC      =1791>/79<==5'>>9<@/8>6AA=6=A<>@AA8AA?B@B?AB@BBA==        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:11:1021:354#0       147     1       14791   255     39M140N11M      =       14720   0       CGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCCTTGCGCCTC      =>13;?=?>>::AA=A=A?;@?A>BBA@ABAB@AA@BBBABA@A@AB<@B        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:34:352:749#0        147     1       14791   255     39M140N11M      =       14786   0       CGGGCCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCCTTGCGCCTC      A;>A4BAABB?B@ABB@BA@@A@AB@BAA?A@AB@BBBB@9BBBB@BBAB        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:113:114:1346#0      163     1       14794   0       36M140N14M      =       14818   0       GTCCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCCTTGCGCCTCATG      B=A@>BB@>B>5@B@A@?A?BAABBAABBA@ABBA===BAA@B@=B?A>6        NM:i:1  XS:A:-  NH:i:6  CC:Z:15 CP:i:102516182
            HWUSI-EAS618_0001:3:91:1318:920#0       73      1       14796   0       34M140N16M      *       0       0       CCCTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCCTTGCGCCTCATGAC      B@@BA@A=AAAAABB@B=BABBAA=A<BABABA<<=BA@=A7>A??BA:7        NM:i:0  XS:A:-  NH:i:6  CC:Z:15 CP:i:102516180
            HWUSI-EAS618_0001:3:51:933:611#0        83      1       14809   255     21M140N29M      =       14743   0       CAGGTCCTTTCCCAGAGATGCCCTTGCGCCTCATGACCAGCTTGTTGAAG      CB@9@BBAA@BAC?@BBBBBBBBABBC=CABCCBBABBC@BCBBACBBCB        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:117:1392:1226#0     147     1       14814   255     16M140N34M      =       14764   0       CCTTTCCCAGAGATGCCCTTGCGCCTCATGACCAGCTTGTTGAAGAGATC      7=<>?==49A9@AA?AAA@AAB?BBB?ABB;B@B;B?A?AB>BA@A=BAA        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:113:114:1346#0      83      1       14818   0       12M140N38M      =       14794   0       TCCCAGAGATGCCCTTGCGCCTCATGACCAGCTTGTTGAAGAGATCCGAC      ;:)?B<A=>A6;-37A==>?A=;A@A<=??=/65A>A=BCAC<B>A>?BB        NM:i:0  XS:A:-  NH:i:6  CC:Z:15 CP:i:102516158
            HWUSI-EAS618_0001:3:100:997:670#0       137     1       15000   0       39M757N11M      *       0       0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      73A?A;=A@?AB=<9=BA<;;@AA?7@AB@BB9B@AA@?BBB@B@9B@@@        NM:i:0  XS:A:-  NH:i:8  CC:Z:12 CP:i:89817
            HWUSI-EAS618_0001:3:101:1284:137#0      83      1       15000   255     39M757N11M      =       15101   0       GATCCGACATCAAGTCCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      ###########==55;=>744A107:=;?>AAA@B?>B@B8@BBABCAAB        NM:i:1  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:110:1622:914#0      83      1       15000   255     39M757N11M      =       14920   0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      #:::==>?A;BAA=>=?==5?=?AA?BB@?BB=BA@@A?ACBBAA?B@BB        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:29:668:1845#0       137     1       15000   0       39M757N11M      *       0       0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      709:93>A?@?>=>>>A??;AA@?=?8?@ABAABB>AA>B?@@BBAABAA        NM:i:0  XS:A:-  NH:i:8  CC:Z:12 CP:i:89817
            HWUSI-EAS618_0001:3:52:794:1776#0       83      1       15000   1       39M757N11M      =       14743   0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      =;B<BBAABB@BBABBABA;BBBBBBBBBBBBBBBB@>;<ACCC>C;BCB        NM:i:0  XS:A:-  NH:i:4  CC:Z:12 CP:i:89817
            HWUSI-EAS618_0001:3:65:849:538#0        73      1       15000   0       39M757N11M      *       0       0       GATCCGACATCAAGTCCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      ?@?7?@@AA+A@BBBB<AB,BB<B<?@A9BA<?BCCBBBC@BBCCBC6@B        NM:i:1  XS:A:-  NH:i:8  CC:Z:12 CP:i:89817
            HWUSI-EAS618_0001:3:69:1365:376#0       83      1       15000   255     39M757N11M      =       15085   0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      ###=:4:=;;=;44=56=;49=7;=;?;?=AA?AAA?A=AAB@BABBABB        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:78:894:857#0        73      1       15000   0       39M757N11M      *       0       0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCAGA      BCBBCBBB>BBBA?<BAABBABBB;@:><>=<?<<<B#############        NM:i:2  XS:A:-  NH:i:8  CC:Z:12 CP:i:89817
            HWUSI-EAS618_0001:3:79:517:552#0        147     1       15000   255     39M757N11M      =       15096   0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      57@6A<:>??BA=84;A=A6BBBBA=BBB@B@BBBCBB=CBC?BBBBBAB        NM:i:0  XS:A:-  NH:i:1
            HWUSI-EAS618_0001:3:97:1036:226#0       73      1       15000   0       39M757N11M      *       0       0       GATCCGACATCAAGTGCCCACCTTGGCTCGTGGCTCTCACTTGCTCCTGC      A9@BBBBB@@BBBBBBBB;4;BB4B@B5B@BB<BCCBCACCCBC@CCCBB        NM:i:0  XS:A:-  NH:i:8  CC:Z:12 CP:i:89817
            Cheers
            Uwe
            Last edited by Uwe Appelt; 09-15-2010, 09:39 AM.

            Comment


            • #7
              yep, that's exactly what I needed. I had FALSELY assumed that XO flag would be used. Thank you very much for your assistance Uwe.

              Comment


              • #8
                Just a quick update in case this is useful to anyone. I trim my reads to get rid of obviously lower quality 3' bases. Thus, fgrep --invert-match is not a useful filter since my read lengths are variable.
                Instead I now use "awk '$6 ~/N/{print $_;}' accepted_hits.sam >spliced-hits.sam" which obviously looks for the presence of "N", inserted bases, in the CIGAR string column of the tophat SAM output.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin


                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                  Yesterday, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                39 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                41 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                35 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                55 views
                0 likes
                Last Post seqadmin  
                Working...
                X