
  • tophat + cufflinks: strand information

    I am using tophat.
    Does anybody know a way to get the strand information in the output sam file?

    I am interested in running Cufflinks on TopHat output. Below is what I found on the Cufflinks web page; it shows what I am after. How do I get that custom tag into the SAM file?

    Here's an example of an alignment Cufflinks will accept:
    s6.25mer.txt-913508 16 chr1 4482736 255 14M431N11M * 0 0 \
    Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with a 'N' operation in the CIGAR string).

  • #2
    Hi there,

    I believe the second column (the flag field; see the SAM format definition for details) contains the strand information. So you could write a script to determine the strand of each alignment and append the XS:A: field to the end of each line.


    • #3
      Yes, the 2nd column of the SAM file could contain this information, but it appears (at least in TopHat 1.0.13) that the output SAM file, accepted_hits.sam, contains only the numbers 0 or 16 in this column. I am not sure what this means.

      For the sam output of bowtie (0.12.5) it's the same story.

      I got around that by running bowtie so that it produces its standard output, where the + / - strand information is preserved, and then using one of the samtools tools.

      I still don't know how to get the strand information out of tophat, if it's possible at all.


      • #4
        The strand is encoded as a single bit in the flag field, which is a bitwise combination of values (usually written in hexadecimal). So to get the strand information, you have to do a bitwise AND on the flag field.

        In the SAM definition for the flag field:
        0x0010 strand of the query (1 for reverse)

        So if you AND the flag with 0x0010 and the result is non-zero (that is, the 0x0010 bit is set), the read is aligned to the negative strand; otherwise it is on the positive strand.
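
A minimal Python sketch of the flag test described above. The function name `read_strand` is made up for illustration; only the 0x10 bit comes from the SAM definition quoted in the post.

```python
REVERSE_STRAND = 0x10  # SAM flag bit: the read is aligned to the reverse strand


def read_strand(flag: int) -> str:
    """Return '-' if the 0x10 bit is set in a SAM flag, else '+'."""
    return "-" if flag & REVERSE_STRAND else "+"


# 0 and 16 are the values seen in accepted_hits.sam; 83 and 99 are
# typical paired-end flags, included to show that other bits are ignored.
for flag in (0, 16, 83, 99):
    print(flag, read_strand(flag))
```

Testing the bit instead of comparing the whole flag to 16 keeps the check valid when other bits (paired, secondary, and so on) are also set.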


        • #5
          That was the info I needed. Thanks.


          • #6
            Hi Haneko,

            If it were so I suppose I shouldn't be getting the following from TopHat:

            HWI-EAS202_170:5:4:655:1891     0       chr1    11102910        255     15M10277N21M    *       0       0       GGATCATCCACCATGTTACCGATAAGCACCAGTTCA    B>@A@BA@>@??BAB@B@@@@??A>@;@>6>@:?>@    NM:i:0  XS:A:+  NS:i:0
            HWI-EAS202_170:5:12:784:1887    16      chr1    11102913        255     12M10277N24M    *       0       0       TCATCCACCATGTTACCGATAAGCACCAGTTCAAAC    @@??AA9@AA@=AA9B@@A@BB?A9@BB@B@BBA@B    NM:i:0  XS:A:+  NS:i:0
            One read has flag 0 and the other 16, meaning one maps to the forward and the other to the reverse, yet both have XS set to +

            Did I get it wrong? I'm using TopHat v1.0.13


            • #7
              Hi Angela,

              I'm using TopHat v1.0.13 too, but these two columns, "XS:A:+ NS:i:0", are not in my output files.

              So I used this command to add an "XS:A:" column to the .sam file for Cufflinks.
              The flag is column 2; when the reverse-strand bit (16) is set the strand is '-', otherwise it is '+'.

              awk -F'\t' 'BEGIN { OFS="\t" } /^@/ { print; next } { print $0, (int($2 / 16) % 2 ? "XS:A:-" : "XS:A:+") }' test.sam > test.sam2


              • #8
                I believe the XS tag is to be understood as the strand of the transcript which was sequenced and not the strand which the read itself was aligned to. That is why you, Angela, see reads aligned to opposite strands with identical XS tag. Because of that, it would be unwise to simply insert the XS tag manually by a script if you wish to run Cufflinks as a downstream analysis.

                xhuister, have you checked if your TopHat output SAM contains the XS tag in the reads covering splice junctions but not in the reads contained within exons? Cufflinks only needs the tag for the reads covering splice junctions so if you just did a quick check of the TopHat output you might have missed it.
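
One way to do the check suggested above is to count XS tags separately for spliced and unspliced alignments. This is only a sketch: the function name `count_xs_by_splicing` and the result keys are invented here, and "spliced" is taken to mean an `N` operation in the CIGAR column, as in the Cufflinks text quoted in the first post.

```python
from collections import Counter


def count_xs_by_splicing(sam_lines):
    """Count alignments by (spliced or unspliced) and (with or without XS:A)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        spliced = "N" in fields[5]  # column 6 is the CIGAR string
        has_xs = any(tag.startswith("XS:A:") for tag in fields[11:])
        key = ("spliced" if spliced else "unspliced") + (
            "_with_xs" if has_xs else "_without_xs"
        )
        counts[key] += 1
    return counts


# Two made-up records: one spliced with XS, one unspliced without.
demo = [
    "\t".join(["r1", "0", "chr1", "100", "255", "15M100N21M", "*", "0", "0",
               "A" * 36, "I" * 36, "NM:i:0", "XS:A:+"]),
    "\t".join(["r2", "16", "chr1", "500", "255", "36M", "*", "0", "0",
               "A" * 36, "I" * 36, "NM:i:0"]),
]
print(dict(count_xs_by_splicing(demo)))
```

On a real file you would pass an open file handle (for example for accepted_hits.sam) instead of the `demo` list. If `spliced_without_xs` is zero, the file already satisfies the requirement that every spliced alignment carries the tag.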


                • #9
                  Hi Thomas,

                  Thanks. You are right, there is an XS tag in some of the reads in the TopHat output, but only in a very small number of them.

                  $ grep -c "XS:A" accepted_hits.sam
                  $ wc -l accepted_hits.sam

                  About what you said, "Cufflinks only needs the tag for the reads covering splice junctions":
                  At first I thought I could use the .sam output of Bowtie, add the "XS" tag to the reads, and then use Cufflinks to find the transcript boundaries. Would that be wrong?


                  • #10
                    Yes, it would be wrong to run Cufflinks on a SAM file produced directly by Bowtie. Bowtie does not identify the strand of the transcript (or rather the splice junctions), so you need to align your reads with TopHat and then run Cufflinks. TopHat runs Bowtie as part of its pipeline, so you don't need to run anything before TopHat. Having said that, I haven't actually run Cufflinks on Bowtie output. It might do reasonably well, but there is simply no reason to try, since you get a much better estimate of transcript abundances by using TopHat and then Cufflinks.

                    The reason TopHat only reports a strand for the transcript when a read spans a splice junction is that it cannot determine the strand of the read if it is contained within an exon. TopHat determines the strand of a read that spans a splice junction by looking at the splice sites to determine what strand contains a valid GT-AG pair (or AT-AC in case of the minor spliceosome).
                    Last edited by Thomas Doktor; 05-09-2010, 06:06 AM.
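
The splice-site reasoning in the post above can be sketched as follows. This is an illustration of the idea only, not TopHat's actual implementation: the function names are invented, the input is assumed to be the intron sequence as it appears on the forward strand of the reference, and only the GT-AG and AT-AC pairs named in the post are checked.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Donor/acceptor dinucleotide pairs mentioned in the post.
SPLICE_PAIRS = {("GT", "AG"), ("AT", "AC")}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def infer_transcript_strand(intron_seq: str):
    """Guess the transcript strand from an intron's forward-strand sequence.

    Returns '+' if the intron starts and ends with a known pair as given,
    '-' if it does so after reverse-complementing, and None otherwise.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        return None
    if (seq[:2], seq[-2:]) in SPLICE_PAIRS:
        return "+"
    rc = reverse_complement(seq)
    if (rc[:2], rc[-2:]) in SPLICE_PAIRS:
        return "-"
    return None


print(infer_transcript_strand("GTAAGTCCCAG"))  # GT...AG as given
print(infer_transcript_strand("CTGGGACTTAC"))  # GT...AG on the other strand
```

A read that lies entirely within an exon spans no intron, so there is nothing to pass to such a function, which is why no transcript strand can be assigned to it this way.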


                    • #11
                      Thanks Thomas, but I'm rather confused about the strand. I thought that when a read maps to the forward strand of the chromosome it is on the + strand, and when it maps to the reverse strand it is on the - strand.
                      But when you say "it cannot determine the strand of the read within an exon", is that strand not the strand on which the read maps to the chromosome?


                      • #12
                        Sorry, I was being unclear. TopHat can't determine what strand the original transcript came from, not the read itself, if the read is entirely contained within an exon.


                        • #13
                          Sorry, I'm still confused.

                          Given the following:
                          chromosome AAGGGG....
                          read1 AAGGGG
                          read2 CCCCTT

                          Read1 will map to strand +, read2 to strand -. If a transcript contains read1, then I think this transcript will be on strand +, so I don't see why the strand of this transcript is undetermined.


                          • #14
                            It depends on the way you prepared your library. If you used a non-strand-specific protocol (which most people still do), you're not actually sequencing transcripts, you're sequencing double-stranded cDNA. So the read could come from either of the two strands of a cDNA, and you have no information about which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junction, because splice sites are highly conserved at the first 2 bases and last 2 bases of an intron.
                            Another way of inferring directionality would be to look at the exon islands of the transcripts and identify valid open reading frames, but TopHat does not, to my knowledge, employ that strategy.
                            Last edited by Thomas Doktor; 05-09-2010, 10:06 AM.
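
To make the point about double-stranded cDNA concrete, here is a small sketch using the read1/read2 example from post #13. The helper names are hypothetical, and the fragment is assumed to contain only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def possible_reads(fragment: str):
    """Reads a non-strand-specific library can yield from one cDNA fragment.

    Either strand of the double-stranded cDNA may be sequenced, so the same
    fragment can produce a read in either orientation.
    """
    return sorted({fragment.upper(), reverse_complement(fragment)})


print(possible_reads("AAGGGG"))  # ['AAGGGG', 'CCCCTT']
```

Both read1 and read2 from post #13 can therefore come from the same cDNA molecule, so the strand a read aligns to says nothing about the strand of the original mRNA.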


                            • #15
                              Thank you Thomas. I think the library I analyzed is strand-specific, but I didn't see any option in TopHat to specify strand-specific or non-strand-specific. Do you know how to set this in TopHat?

