Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract assembled transcripts

    Hi All,

    I use Bowtie/TopHat/cufflink to analyze the RNA-Seq. I want to extract the rebuild transcripts by these softwares. However, I can only the BED file. Can anyone give idea to extract these assembled transcript sequence?

    Thanks,
    ZC

  • #2
    If you want to assemble the transcript with cufflinks you need to use the SAM file in the output of Tophat. If Tophat is not outputting a SAM file.. then maybe you need to email the authors for help.

    Cufflinks takes SAM files as input.

    Alternatively, you could try SpliceMap as another option, followed by Cufflinks
    SpliceMap: De novo detection of splice junctions from RNA-seq
    Download SpliceMap Comment here

    Comment


    • #3
      Thanks. I used tophat/cufflink to rebuild the transcripts. tophat generates sam format results and then, cufflink rebuilds transcripts. However, cufflink can only give the rebuild transcript name. If we want to analyze infurther based on these rebuild transcripts, we have to get rebuilt transcript sequence. Is there tool to get seq from gtf file, except write script by ourself?

      Comment


      • #4
        Originally posted by zchou View Post
        Thanks. I used tophat/cufflink to rebuild the transcripts. tophat generates sam format results and then, cufflink rebuilds transcripts. However, cufflink can only give the rebuild transcript name. If we want to analyze infurther based on these rebuild transcripts, we have to get rebuilt transcript sequence. Is there tool to get seq from gtf file, except write script by ourself?
        So for example, if cufflinks outputs the following assembled transcript

        chr21 Cufflinks transcript 39473730 39474150 1000 - . gene_id "CUFF.171"; transcript_id "CUFF.171.1"; FPKM "15397.8812515398"; frac "1.000000"; conf_lo "0.000000"; conf_hi "33177.823023"; cov "3.252033";

        chr21 Cufflinks exon 39473730 39473782 1000 - . gene_id "CUFF.171"; transcript_id "CUFF.171.1"; exon_number "1"; FPKM "15397.8812515398"; frac "1.000000"; conf_lo "0.000000"; conf_hi "33177.823023"; cov "3.252033";

        chr21 Cufflinks exon 39474081 39474150 1000 - . gene_id "CUFF.171"; transcript_id "CUFF.171.1"; exon_number "2"; FPKM "15397.8812515398"; frac "1.000000"; conf_lo "0.000000"; conf_hi "33177.823023"; cov "3.252033";

        You would like a tool to output some kind of consensus sequence of this transcript with two exons based on all reads that mapped there?

        I'm not aware of such a tool, but I see how it could be useful.
        SpliceMap: De novo detection of splice junctions from RNA-seq
        Download SpliceMap Comment here

        Comment


        • #5
          Exactly, that's I want. It is necessary to get consensus transcript sequence for each rebuild transcript for further analysis.

          Comment


          • #6
            Hello,

            One option could be to use samtools to generate a pileup file of the mapped reads (output of tophat). Then from the pileup format you could extract the column of the consensus base (the 4th column I think) at the genomic positions corresponding to your exons/transcripts.

            Something like (pseudocode):

            Code:
            ## Use -c option to output consensus sequence, not just the reference sequence
            samtools pilup -c tophat_output_accepted_hits.sam/bam -f myreference_genome.fa
            Pileup file should like:
            Code:
            MT      1       C       C       36      0       60      3       ^~.^~.^~.       ABA     ~~~
            MT      2       A       A       12      0       60      6       ...^~T^~.^~.    =BCBAB  ~~~~~~
            etc..
            It would take a little bit of coding to reconstruct the sequences from the pileup file but it's not too bad.

            Hope this helps

            Dario

            Comment


            • #7
              samtools can only give consensus sequence information, but where is the reconstructed sequence information for each transcript? Moreover, cufflink or other RNA-Seq software can get different isoforms for each gene. How can we extract these information?



              Originally posted by dariober View Post
              Hello,

              One option could be to use samtools to generate a pileup file of the mapped reads (output of tophat). Then from the pileup format you could extract the column of the consensus base (the 4th column I think) at the genomic positions corresponding to your exons/transcripts.

              Something like (pseudocode):

              Code:
              ## Use -c option to output consensus sequence, not just the reference sequence
              samtools pilup -c tophat_output_accepted_hits.sam/bam -f myreference_genome.fa
              Pileup file should like:
              Code:
              MT      1       C       C       36      0       60      3       ^~.^~.^~.       ABA     ~~~
              MT      2       A       A       12      0       60      6       ...^~T^~.^~.    =BCBAB  ~~~~~~
              etc..
              It would take a little bit of coding to reconstruct the sequences from the pileup file but it's not too bad.

              Hope this helps

              Dario

              Comment


              • #8
                (Just rephrasing my previous post...)

                "samtools pileup" with -c option gives at each genomic position the base of the reference *and* the base reconstructed from the reads. This latter is what you need, isn't it?

                As I said, to reconstruct the sequences of transcripts and isoforms you would need to write some script that extracts the regions from the pileup file that correspond to the transcripts in the cufflinks gtf file.

                So, if the GTF file looks (vaguely) like this:

                Code:
                chr    start    end    feature          exon_number           transcript_id
                -------------------------------------------------------------------------
                chr1    1         9        transcript                                cuff_1
                chr1    1         3         exon             1                       cuff_1
                chr1    6         9         exon             2                       cuff_1
                etc.
                And the pileup file looks (vaguely) like this:

                Code:
                chr    position        ref_base    consensus
                -------------------------------------------
                chr1     1                 A             T
                chr1     2                 C             T
                chr1     3                 G             T
                chr1     4                 A             A
                chr1     5                 C             A
                chr1     6                 G             G
                chr1     7                 A             G
                chr1     8                 C             G
                chr1     9                 G             G
                etc.
                Than you need a script that extracts the bases in the consensus column corresponding to chr1:1-3 (TTT) and to chr1:6-9 (GGGG) to have the reconstructed sequence of the transcript cuff_1.

                Apologies if I'm misunderstanding your problem

                Dario

                Comment


                • #9
                  Thanks dario,

                  It is very clear to illustrate how to get transcript sequence from sam and gtf file. I want to write scripts to extract sequence from gtf and fasta file before. Thanks a lot,

                  ZC

                  Comment


                  • #10
                    bump

                    I'm looking to do this myself.. anyone have a script handy? If not, I'll get started.

                    Thanks!

                    Comment


                    • #11
                      I believe you can upload the GTF to the UCSC genome browser and then extract the sequences from the Table tab.

                      Comment


                      • #12
                        Can you share those perl codes for the sequence extraction according to the GTF file? Thanks a lot!

                        Comment


                        • #13
                          This software will do what you want.

                          Comment


                          • #14
                            Hi k-gun12,
                            mind do share the script that you have written to extract the rebuild transcripts sequence?
                            Thanks in advance.

                            Comment


                            • #15
                              Hi zchou,
                              Do you already figure out the problem to extract assembled transcript?
                              Can I have a reference on how you do that?
                              I'm facing the same problems as well now

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Non-Coding RNA Research and Technologies
                                by seqadmin




                                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                Nobel Prize for MicroRNA Discovery
                                This week,...
                                10-07-2024, 08:07 AM
                              • seqadmin
                                Recent Developments in Metagenomics
                                by seqadmin





                                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                                09-23-2024, 06:35 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 10-02-2024, 04:51 AM
                              0 responses
                              104 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-01-2024, 07:10 AM
                              0 responses
                              112 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-30-2024, 08:33 AM
                              1 response
                              115 views
                              0 likes
                              Last Post EmiTom
                              by EmiTom
                               
                              Started by seqadmin, 09-26-2024, 12:57 PM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X