Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNA-seq analysis by Tophat, Cufflink Problem facing

    Hi,
    Below is the background of my input file:
    -Two input file named as s_1_1.fq (left) and s_1_2.fq (right);
    -Illumina pair-end read;
    -2X80bp;
    -insert size for the PE library is 300;
    -the read are made with a random priming process and are not stranded;
    -human heart tissue;

    Command used in Tophat:
    /tophat-1.2.0.Linux_x86_64/tophat -r 140 -p 4 --solexa1.3-quals --library-type fr-unstranded human_ref_genome s_1_1.fq s_1_2.fq
    Output file:
    accepted_hits.bam;
    deletions.bed;
    insertions.bed;
    junctions.bed;

    Command used in Cufflink:
    Cufflinks-0.9.2/cufflinks -p 4 --library-type fr-unstranded accepted_hits.bam
    transcripts.gtf
    isoforms.fpkm_tracking
    genes.fpkm_tracking

    Problem 1:
    Based on the background of my input file, the command that I try for tophat and cufflink, are they correct?

    Problem 2:
    Below is the statistics of the transcripts.gtf:
    Number of transcript: 89,005
    Total bases of assembly transcript: 552,350,446
    Number of exon: 198,560
    N50 assembly transcript length: 52,930
    Longest length of assembly transcript: 850,231
    Shortest length of assembly transcirpt: 49

    The above statistic result is reasonable for Cufflink assembly?
    It seems the assembly read is wrong when comparing with the available RNA seq published in NCBI, ftp://ftp.ncbi.nih.gov/refseq/H_sapi...man.rna.fna.gz
    Below is the statistics of the human.rna.fna.gz:
    Number of transcript: 46,296
    Total bases of assembly transcript: 124,026,830
    N50 assembly transcript length: 3,697
    Longest length of assembly transcript: 101,520
    Shortest length of assembly transcirpt: 33

    Problem 3:
    head -1 transcript.gtf:
    gi|224589800|ref|NC_000001.10| Cufflinks transcript 17231 17728 1000 + . gene_id "CUFF.137"; transcript_id "CUFF.137.1"; FPKM "18.1745565587"; frac "1.000000"; conf_lo "9.648231"; conf_hi "26.700882"; cov "122.001851";
    head -2 transcript.expr:
    trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status
    CUFF.137.1 168890 gi|224589800|ref|NC_000001.10| 17230 17728 18.1746 1 1 9.64823 26.7009 122.002 498 449 OK
    head -2 gene.expr:
    gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status
    CUFF.137 168890 gi|224589800|ref|NC_000001.10| 17230 17728 18.1746 9.64823 26.7009 OK

    Why the start position of the transcript shown in transcript.gtf (17231) with transcript.expr (17230) and gene.expr (17230) are different?
    In order to extract out the sequence read in fasta, which info that I should refer to, transcript.gtf or transcript.expr or gene.expr?

    Problem 4:
    transcript.expr and transcript.gtf shown the total number of transcript is 89,005; while the gene.expr only shown 84,950.
    What is the main reason cause the difference number of transcript in both file?
    Is it because individual genes can produce alt-splice isoforms?

    Problem 5:
    Based on the info shown as Cufflink user manual, http://cufflinks.cbcb.umd.edu/manual...racking_format
    Cufflink will generate three output file named as: transcripts.gtf, isoforms.fpkm_tracking, genes.fpkm_tracking
    Why the above cufflink command generate transcript.expr and genes.expr instead of isoforms.fpkm_tracking, genes.fpkm_tracking?
    How to generate isoforms.fpkm_tracking and genes.fpkm_tracing in Cufflink?

    Problem 6:
    How to extract out the transcript sequence that assembly by Cufflink?
    Which output file in Cufflink that I should refer in order to get the info of transcript sequence region assembly by Cufflink?
    I would like to extract out the transcript sequence that assembly by Cufflink in FASTA format for downstream analysis.

    Kindly correctly me if I misunderstanding about the proper command that I should key in for Tophat and cufflink based on my input file.
    Thanks for any advice.
    Last edited by edge; 05-11-2011, 11:23 PM.

  • #2
    This post is a duplicate of one in the Bioinformatics forum. Please do not duplicate your question in multiple forums.

    Comment


    • #3
      First, you should get the newest version of Cufflinks (1.0.1).
      I don't know what your coverage is like, but with human RNA, you should supply Cufflinks with a gtf file containing known transcripts. If you don't do this and do not have sufficient coverage to reconstruct transcripts completely, cufflinks might not link all the reads mapping to a gene together if there is no evidence supporting a link (split reads, pairs mapping to neighboring exons). This might explain the higher number of transcripts you are seeing since it could cause one gene to be split up into multiple smaller transcripts that do not represent reality.

      Comment


      • #4
        Thanks for remind, nilshomer.
        I will take note of it.
        Sorry for my mistakes

        Comment


        • #5
          Hi pbluescript,

          Thanks for your advice.
          I just download the latest version of Cufflinks (1.0.1) and re-run it again.
          Do you know what is the main difference between ff-unstranded and fr-unstranded?
          If my input file is non-stranded-specific RNA seq, what option should I choose my library type in Cufflink and Tophat?
          Is it correct I set it as "fr-unstranded" in both Cufflink and Tophat?
          Thanks first for your advice

          Comment


          • #6
            Originally posted by edge View Post
            Hi pbluescript,

            Thanks for your advice.
            I just download the latest version of Cufflinks (1.0.1) and re-run it again.
            Do you know what is the main difference between ff-unstranded and fr-unstranded?
            If my input file is non-stranded-specific RNA seq, what option should I choose my library type in Cufflink and Tophat?
            Is it correct I set it as "fr-unstranded" in both Cufflink and Tophat?
            Thanks first for your advice
            I don't use strand specific libraries for my RNA-Seq based on how I have to prepare my libraries, so I haven't played around with those settings. Perhaps someone else can offer some input?
            They aren't required by Tophat, so you can do the mapping without specifying a library type.

            Comment


            • #7
              Based on the Tophat manual, http://tophat.cbcb.umd.edu/manual.html
              Tophat will treat the read as strand specific.
              Since my input raw file is non-stranded specific RNA seq.
              I realized Tophat got an option to specific the "--library-type".
              I just not sure whether which option I should use in "--library-type" for non-stranded specific RNA seq.
              Below is the option for "--library-type" in Tophat:
              --fr-unstranded; --fr-firststrand; --fr-secondstrand; --ff-unstranded; --ff-firststrand; --ff-secondstrand

              Comment


              • #8
                Hi pbluescript,
                Do you have experience running Cufflink and Tophat before?
                Are you familiar with it?
                Do you know that which output file in Cufflink is refer to the transcript assembly by Cufflink?
                I'm interesting to extract out the transcript sequence in FASTA format for downstream analysis.
                Apart from that, since my input sample is only heart tissue and only one SAM file will generate by Cufflink at the end.
                Is it impossible for me to run Cuffdiff with my sample set?
                "Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples."

                Comment


                • #9
                  I have asked one of the developers of Tophat about this library type option:

                  "If you don't specify --library-type, TopHat just treats your reads as
                  unstranded. (The default is *unstranded*). Actually, library-type is only intended for paired-end, so if you specify the library-type fr-unstranded option, should be the same as non-specified. "

                  HTH.

                  Originally posted by edge View Post
                  Based on the Tophat manual, http://tophat.cbcb.umd.edu/manual.html
                  Tophat will treat the read as strand specific.
                  Since my input raw file is non-stranded specific RNA seq.
                  I realized Tophat got an option to specific the "--library-type".
                  I just not sure whether which option I should use in "--library-type" for non-stranded specific RNA seq.
                  Below is the option for "--library-type" in Tophat:
                  --fr-unstranded; --fr-firststrand; --fr-secondstrand; --ff-unstranded; --ff-firststrand; --ff-secondstrand

                  Comment


                  • #10
                    Originally posted by edge View Post
                    Hi pbluescript,
                    Do you have experience running Cufflink and Tophat before?
                    Are you familiar with it?
                    Do you know that which output file in Cufflink is refer to the transcript assembly by Cufflink?
                    I'm interesting to extract out the transcript sequence in FASTA format for downstream analysis.
                    Apart from that, since my input sample is only heart tissue and only one SAM file will generate by Cufflink at the end.
                    Is it impossible for me to run Cuffdiff with my sample set?
                    "Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples."
                    Sorry, I missed your post.
                    I'm not entirely clear on your question, but the output of Cufflinks includes tables with transcript information and expression values, all of which you can read about on the Cufflinks manual page.
                    You can get all the sequences using the UCSC table browser.
                    Cuffdiff looks for differential expression between multiple samples, so if you don't have at least two samples to compare, there is no point in running Cuffdiff.

                    Comment


                    • #11
                      Hi Wei-HD,

                      Thanks a lot for your verification
                      Which mean the command that I type for library-type is correct ^^

                      Comment


                      • #12
                        Hi pbluescript,

                        Thanks for your reply.
                        I'm understand more about Cuffdiff right now
                        Apart from that, do you mind to share what is the main difference between gene and transcript in general?
                        I just quite confusing because some journal mention gene/transcript while some journal mention gene and transcript separately

                        Comment


                        • #13
                          Hi Wei-HD,

                          Do you familiar with Tophat software?
                          I'm quite confusing regarding the "-r" option in Tophat.
                          My input file size are Illumina pair-end read, 2X80bp, insert library size is 300.
                          I should set the -r as 300 or 140 (300-2X80)?
                          Which of the following command are correct?

                          Command 1:
                          /tophat-1.2.0.Linux_x86_64/tophat -r 300 -p 4 --solexa1.3-quals --library-type fr-unstranded human_ref_genome s_1_1.fq s_1_2.fq

                          or

                          Command 2:
                          /tophat-1.2.0.Linux_x86_64/tophat -r 140 -p 4 --solexa1.3-quals --library-type fr-unstranded human_ref_genome s_1_1.fq s_1_2.fq

                          Thanks.

                          Comment


                          • #14
                            In your example the proper setting for -r is 140.

                            Comment


                            • #15
                              Originally posted by edge View Post
                              Hi pbluescript,

                              Thanks for your reply.
                              I'm understand more about Cuffdiff right now
                              Apart from that, do you mind to share what is the main difference between gene and transcript in general?
                              I just quite confusing because some journal mention gene/transcript while some journal mention gene and transcript separately
                              Generally, one gene can have multiple transcripts due to alternative start sites, spliced exons, etc. Assigning reads to specific known transcripts can be a tricky process, especially when an exon is shared across multiple transcripts. There is a good description of how Cufflinks does this here:

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Best Practices for Single-Cell Sequencing Analysis
                                by seqadmin



                                While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                                06-06-2024, 07:15 AM
                              • seqadmin
                                Latest Developments in Precision Medicine
                                by seqadmin



                                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                Somatic Genomics
                                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                05-24-2024, 01:16 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 07:49 AM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 07:23 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-17-2024, 06:54 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-14-2024, 07:24 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X