  • lzu
    Junior Member
    • Nov 2013
    • 9

    How to extract assembled transcript sequence from RNA-seq data instead of ref genome?

    Hi All,

    I have RNA-seq data from a 'subspecies' or 'variety' of grape. A grape reference genome is available.

    I want to get a transcript FASTA file for this variety after mapping its RNA-seq reads to the grape reference genome with the TopHat and Cufflinks pipeline.

    Cufflinks only produced a transcript coordinate file (the positions of the transcripts in the grape reference genome). But I need the assembled transcript sequences from the variety's RNA-seq data, not from the reference grape genome, because there are small evolutionary differences between my sample and the reference genome that I want to analyse later.

    So how do I extract a transcript FASTA file from the RNA-seq data of my sample, instead of from the reference genome, after running the TopHat and Cufflinks pipeline?

    Thanks for your help and suggestion!

    lzu
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Cufflinks outputs a GTF assembly that annotates each of the loci that it calls. If you want a multi-fasta file of that, just use gtf_to_fasta (that probably comes with tophat, but if not you can google around for it).


    • sphil
      Senior Member
      • Apr 2010
      • 192

      #3
      Originally posted by dpryan View Post
      Cufflinks outputs a GTF assembly that annotates each of the loci that it calls. If you want a multi-fasta file of that, just use gtf_to_fasta (that probably comes with tophat, but if not you can google around for it).
      Be aware of where the sequences come from. The tool may simply take the provided reference FASTA and extract the sequences at the coordinates given in the GTF, which means you end up with your 'non-variety' grape sequences.
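To make the concern concrete, here is a rough Python sketch of what coordinate-based extraction from a GTF amounts to. It is not the actual gtf_to_fasta code; the function names (`transcripts_from_gtf`, `parse_transcript_id`) and the in-memory `genome` dictionary are made up for illustration, and it assumes a standard 9-column, tab-separated GTF with `exon` features and a `transcript_id` attribute.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_transcript_id(attributes):
    """Pull transcript_id out of a GTF attribute column, or return None."""
    for field in attributes.strip().split(";"):
        field = field.strip()
        if field.startswith("transcript_id"):
            return field.split('"')[1]
    return None


def transcripts_from_gtf(gtf_lines, genome):
    """Build transcript sequences by splicing exon intervals out of `genome`.

    gtf_lines: iterable of GTF text lines.
    genome:    dict mapping chromosome name -> sequence string (hypothetical
               stand-in for a loaded reference FASTA).
    Every base returned is copied from `genome`; the reads are never consulted.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        tid = parse_transcript_id(cols[8])
        if tid is not None:
            # GTF coordinates are 1-based and inclusive.
            exons[tid].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))

    transcripts = {}
    for tid, parts in exons.items():
        parts.sort(key=lambda p: p[1])  # order exons by genomic start
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = reverse_complement(seq)
        transcripts[tid] = seq
    return transcripts
```

Since the only sequence input is the reference, any base that differs in the sequenced variety cannot show up in the output, whatever the RNA-seq reads contained.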


      • lzu
        Junior Member
        • Nov 2013
        • 9

        #4
        Then I will end up getting a fasta file of transcript sequence extracted from the reference genome which is not what I wanted. I want the transcript sequences from my sample (a grape 'variety').


        • lzu
          Junior Member
          • Nov 2013
          • 9

          #5
          You are right, you've got my point. I still don't know how to extract sequences from the grape 'variety' RNA-seq data. Maybe it is hard, or should I assemble RNA-seq de novo by using Trinity?


          • sphil
            Senior Member
            • Apr 2010
            • 192

            #6
            Originally posted by lzu View Post
            You are right, you've got my point. I still don't know how to extract sequences from the grape 'variety' RNA-seq data. Maybe it is hard, or should I assemble RNA-seq de novo by using Trinity?
            Yep, maybe that's the better way of doing it. Assemble the transcripts de novo and map those transcripts to the reference genome.


            • lzu
              Junior Member
              • Nov 2013
              • 9

              #7
              Originally posted by sphil View Post
              Yep, maybe that's the better way of doing it. Assemble the transcripts de novo and map those transcripts to the reference genome.
              Do you know of any paper(s) that first assemble the RNA-seq reads de novo and then map the transcripts to the reference genome?


              • sphil
                Senior Member
                • Apr 2010
                • 192

                #8
                Sorry, I can't think of one off the top of my head, but the 'normal' mapping procedure after de novo assembly of transcripts should do the job pretty well. Just account for the diversity between your strains when you choose the mapping parameters. Use loose mapping criteria after your assembly and it should be fine. If, however, this doesn't give you the desired results, what I normally do is BLAST the transcripts against an in-house database. This is even looser than what most mappers allow. Also, if the transcripts become too long for the mapper, this should be the way to go.

                Hope that helps.

                FWIW, here are some papers on assembly and mapping which might be helpful anyway:
                Garber et al.
                Trinity, used to assemble transcripts
                Oases assembler


                • lzu
                  Junior Member
                  • Nov 2013
                  • 9

                  #9
                  Originally posted by sphil View Post
                  Sorry, I can't think of one off the top of my head, but the 'normal' mapping procedure after de novo assembly of transcripts should do the job pretty well. Just account for the diversity between your strains when you choose the mapping parameters. Use loose mapping criteria after your assembly and it should be fine. If, however, this doesn't give you the desired results, what I normally do is BLAST the transcripts against an in-house database. This is even looser than what most mappers allow. Also, if the transcripts become too long for the mapper, this should be the way to go.

                  Hope that helps.

                  FWIW, here are some papers on assembly and mapping which might be helpful anyway:
                  Garber et al.
                  Trinity, used to assemble transcripts
                  Oases assembler
                  ----
                  Thanks for the suggestion. I have read some papers that use a model reference genome to predict alternative splicing diversity in subspecies or varieties from RNA-seq data. The results might contain errors if some exons or introns are physically absent from the genomes of those subspecies/varieties because of genetic diversity among groups/populations...


                  • Jeremy
                    Senior Member
                    • Nov 2009
                    • 190

                    #10
                    It would probably not be too difficult to get a list of variants between your sample and the reference, substitute the variant bases into the reference genome, and then use gtf_to_fasta to get the variant transcripts. I have done something similar in R using the seqinr package.
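A minimal Python sketch of the "substitute the variant bases into the reference" step, under some loud assumptions: the variants are already available as (1-based position, reference allele, alternate allele) tuples for a single chromosome, they do not overlap, and the function name `apply_variants` is invented for this example. It is not the seqinr approach mentioned above, just one way the idea could look.

```python
def apply_variants(reference, variants):
    """Return `reference` with each variant's ref allele replaced by its alt allele.

    reference: sequence string for one chromosome.
    variants:  iterable of (pos, ref, alt), with pos 1-based as in VCF.
               Assumed to be non-overlapping (hypothetical input format).
    """
    seq = reference
    # Work from the rightmost variant to the leftmost so that an indel
    # never shifts the position of a variant still waiting to be applied.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        observed = seq[start:start + len(ref)]
        if observed.upper() != ref.upper():
            raise ValueError(
                "reference mismatch at position %d: expected %r, found %r"
                % (pos, ref, observed)
            )
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


if __name__ == "__main__":
    # One SNP (C>T at position 2) and one insertion (A>AGG at position 5).
    print(apply_variants("ACGTACGT", [(2, "C", "T"), (5, "A", "AGG")]))
```

One caveat with this route: SNPs leave the coordinates untouched, so the Cufflinks GTF can be reused as-is on the modified genome. Indels change the sequence length, so every GTF coordinate downstream of an indel would have to be shifted accordingly before extracting transcripts.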


                    • sindrle
                      Senior Member
                      • Aug 2013
                      • 266

                      #11
                      Let's say you have called indels and SNPs with GATK. Would that work? Could you please share some more details?

                      I have never done this before.

