using STAR+Cufflinks for transcript assembly turns unstranded RNA-seq to stranded?


  • using STAR+Cufflinks for transcript assembly turns unstranded RNA-seq to stranded?

    I am trying to use STAR+Cufflinks to do a reference-based transcript assembly using unstranded RNA-seq data.

    As mentioned in the STAR manual "If you have un-stranded RNA-seq data, and wish to run Cufflinks/Cuffdiff on STAR alignments, you will
    need to run STAR with --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions"

    Thus, in the generated SAM file, the strand will be derived from the intron motif. Unstranded RNA-seq data will be assigned a strand, which results in a lot of genes having both sense and antisense transcripts in the merged transcript assembly.
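For reference, here is a minimal sketch of how the STAR call described above could be assembled from Python. The index directory, FASTQ names, thread count and output prefix are placeholders (assumptions, not from this thread); the only option taken from the manual quote is `--outSAMstrandField intronMotif`.

```python
import shlex


def build_star_command(genome_dir, reads, threads=4, out_prefix="star_out/"):
    """Build a STAR argument list for unstranded data destined for Cufflinks.

    genome_dir, reads, threads and out_prefix are illustrative placeholders.
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Adds the XS strand attribute to alignments that contain splice junctions.
        "--outSAMstrandField", "intronMotif",
    ]


if __name__ == "__main__":
    cmd = build_star_command("genome_index", ["sample_R1.fastq", "sample_R2.fastq"])
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if file names contain spaces.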

    My questions are:

    1) How reliable is the strand information derived from the intron motif?
    2) Are the assembled transcripts affected by this?

    Thank you very much!

    Runxuan

  • #2
    Hi,
    Your unstranded data doesn't get "converted to stranded". Unstranded data contains reads from both strands, because PCR amplification during library prep amplifies both strands of the DNA.

    The strand that STAR derives is based on the alignment of each individual read, so for the reason above it does not necessarily reflect the strand of the original transcript.

    As for whether the assembly is affected: Cufflinks won't run without the XS attribute in the SAM/BAM file.
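A quick way to check a SAM file before handing it to Cufflinks is to look for spliced alignments that lack the XS tag. The sketch below assumes, in line with the STAR manual quote above, that spliced alignments (CIGAR containing `N`) are the ones that need `XS:A:`. The function name and the example records are made up for illustration.

```python
def spliced_without_xs(sam_lines):
    """Return read names of spliced alignments that carry no XS:A: tag."""
    missing = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        cigar = fields[5]
        if "N" not in cigar:
            continue  # unspliced alignment: no intron, no motif to derive a strand from
        optional_tags = fields[11:]
        if not any(tag.startswith("XS:A:") for tag in optional_tags):
            missing.append(fields[0])
    return missing


if __name__ == "__main__":
    # Hypothetical records: one spliced with XS, one spliced without, one unspliced.
    example = [
        "@HD\tVN:1.4",
        "r1\t0\tchr1\t100\t255\t20M500N30M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+",
        "r2\t0\tchr1\t900\t255\t25M300N25M\t*\t0\t0\t*\t*\tNH:i:1",
        "r3\t16\tchr1\t2000\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    ]
    print(spliced_without_xs(example))
```

For a BAM file, the same check can be run on the text output of `samtools view`.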



    • #3
      Thanks a lot, amitm. If the strand attribute that STAR feeds into Cufflinks is not really the strand information, is it going to affect how Cufflinks uses it to assemble the transcripts? How should I deal with the sense and antisense assembled transcripts to reduce false positives?



      • #4
        Hi,
        If you are worried about a scenario where a gene locus has little or no sense transcription but very high antisense transcription, and Cufflinks cannot tell the two apart, then you might need to prepare a stranded library before sequencing.

        If not, there is very little you can do at the data analysis step:
        1) Do you know the sequence of these antisense transcripts? Do they maintain the exon-intron boundaries (introns spliced out), just on the complementary strand? Or do they read through introns? If they read through introns, you can set an arbitrary threshold (depending on your read length):
        If a read extends beyond the exon boundary into the intron sequence for at least 'n' bases, it might come from an unspliced or antisense transcript, so discard the read. Then use only the filtered reads for transcript assembly.

        Doing so genome-wide would be very tricky as there might be genuine transcripts with alternate exon start-ends.
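The threshold rule above can be sketched roughly as follows for a single gene. This is only an illustration under stated assumptions: coordinates are 0-based half-open intervals on one chromosome, introns are taken as the gaps between annotated exons, and the function names and the value of `n` are hypothetical. As noted above, genuine alternative exon starts and ends would also be caught by such a filter.

```python
def introns_from_exons(exons):
    """Merge overlapping exons and return the gaps between them as introns."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(merged, merged[1:])]


def intronic_bases(read_blocks, introns):
    """Count aligned read bases that fall inside any intron."""
    total = 0
    for block_start, block_end in read_blocks:
        for intron_start, intron_end in introns:
            overlap = min(block_end, intron_end) - max(block_start, intron_start)
            if overlap > 0:
                total += overlap
    return total


def keep_read(read_blocks, exons, n):
    """Discard a read (return False) if it extends into introns for at least n bases."""
    return intronic_bases(read_blocks, introns_from_exons(exons)) < n


if __name__ == "__main__":
    exons = [(100, 200), (300, 400)]                      # hypothetical annotation
    print(keep_read([(150, 200), (300, 350)], exons, 5))  # properly spliced read
    print(keep_read([(150, 210)], exons, 5))              # runs 10 bases into the intron
```

Here `read_blocks` are the aligned segments of one read, i.e. what remains after splitting the alignment at each `N` operation in the CIGAR string.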

        I don't know which organism you are working with, but if it has been widely studied, there should be published datasets and PCR validations to cross-check your results against.
        Last edited by amitm; 07-21-2015, 08:45 AM. Reason: Corrected typo



        • #5
          Since it wasn't mentioned yet I'll add that cufflinks determines the strand of the assembled isoforms from the value of the XS attribute in the alignments (generated by STAR with --outSAMstrandField intronMotif set at runtime). The XS attribute is only populated with strand information for spliced reads. The 4-bp motif at the splice site informs STAR what the strand is if the motif is a known one. If it is an unknown motif then there is no strand information. 90+% of splices will have those known motifs in mammalian genomes. The only other way cufflinks can determine strand is if you provide a reference GTF for assembly in which case it will use the strand information from that for matching assembled isoforms from the data.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */
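To make the motif-to-strand idea concrete, here is a small sketch of how a strand could be read off the two bases at each end of an intron. It is not STAR's actual code. The motif pairs are the canonical and semi-canonical ones listed for the intron motif column of SJ.out.tab in the STAR manual; the function name and the toy sequences are made up, and intron coordinates are assumed to be 0-based half-open on the forward genome strand.

```python
# (first two intron bases, last two intron bases) as seen on the forward strand
PLUS_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
# the same motifs reverse-complemented, i.e. introns of minus-strand transcripts
MINUS_MOTIFS = {("CT", "AC"), ("CT", "GC"), ("GT", "AT")}


def strand_from_intron_motif(genome_seq, intron_start, intron_end):
    """Return '+', '-' or None (unknown motif) for one intron."""
    seq = genome_seq.upper()
    motif = (seq[intron_start:intron_start + 2], seq[intron_end - 2:intron_end])
    if motif in PLUS_MOTIFS:
        return "+"
    if motif in MINUS_MOTIFS:
        return "-"
    return None  # non-canonical junction: no strand can be inferred


if __name__ == "__main__":
    # Toy sequences: 4 exonic bases, an 8-base "intron", 4 exonic bases.
    print(strand_from_intron_motif("AAAAGTCCCCAGTTTT", 4, 12))
    print(strand_from_intron_motif("AAAACTCCCCACTTTT", 4, 12))
    print(strand_from_intron_motif("AAAAAACCCCAATTTT", 4, 12))
```

This also shows why unspliced reads get no XS value: without an intron there is no motif to inspect.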



          • #6
            Originally posted by sdriscoll View Post
            The only other way cufflinks can determine strand is if you provide a reference GTF for assembly in which case it will use the strand information from that for matching assembled isoforms from the data.
            But this is not necessarily the correct strand information if I use unstranded RNA-seq data, is it?
