Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

cufflinks - having difficulty reading reference GTF

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Idong,

    Did you use the --no-novel-juncs command?

    Comment


    • #17
      No, I didn't. Would that help?

      Comment


      • #18
        I'm sorry I actually should have addressed that to Pejman.

        The idea is that if you're supplying a GTF in TopHat and you don't know want it find any splice junctions that are not in the GTF, you should use the --no-novel-juncs option.

        Comment


        • #19
          Hi, thanks for the hint. Yeah I know that, but apparently TopHat+colorspace has just been released in too much rush. It crashes on SOLiD data:
          http://seqanswers.com/forums/showthread.php?t=7118
          http://seqanswers.com/forums/showthread.php?p=26692

          Comment


          • #20
            Hi,

            In the tophat wiki page "https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/cufflinks-and-tophat#read", I found some very helpful perl scripts, for example, find_known_gene_and_alternative_splicing_expression_from_cuffcompare_result.pl. Do you konw where I can download them?

            Cheers,
            Jun

            Comment


            • #21
              Hi, Jun, Here is the script. Any feedback is welcome. Best, ldong

              #!/gpfs/runtime/bin/perl

              use Spreadsheet::WriteExcel;
              use strict;

              my %gene;
              my %alter;

              my $file = $ARGV[0];
              if ($file eq '') {
              die "Usage: find_gene_and_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare>\n";
              }

              open I, $file or die "can not open file $file.";

              while (<I>) {
              my @p = split /\t/;
              if ($p[2] eq '=') {
              $gene{$p[0]} = $gene{$p[0]} + $p[6];
              $alter{$p[1]} = $alter{$p[1]} + $p[6];
              }
              }
              close I;

              # Create a new workbook called simple.xls and add a worksheet.
              my $workbook = Spreadsheet::WriteExcel->new($ENV{'LANE'}.'_expression_of_known_gene_&_alternative_splicing.xls');
              my $worksheet = $workbook->add_worksheet("AceView Genes");

              my $count = 1;
              $worksheet->write(0, 0, "gene");
              $worksheet->write(0, 1, "expression");
              for my $k (sort keys %gene) {
              $worksheet->write_url($count, 0, "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=mouse&submit=Go&term=$k", $k);
              $worksheet->write($count, 1, $gene{$k});
              $count++;

              }

              $worksheet = $workbook->add_worksheet("AceView Alternative Splicing");

              $count = 1;
              $worksheet->write(0, 0, "alternative splicing variants");
              $worksheet->write(0, 1, "expression");
              for my $k (sort keys %alter) {
              $worksheet->write_url($count, 0, "http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?exdb=AceView&db=mouse&submit=Go&term=$k", $k);
              $worksheet->write($count, 1, $alter{$k});
              $count++;
              }

              Comment


              • #22
                Hi, ldong,

                This is great. Thx a million. I will look into it.

                Cheers,
                Jun

                Comment


                • #23
                  I just read the script, it is very neat and clear. But I notice that you only count the assembled transcripts with "=" mark (match). How do you deal with transcripts with "c" mark (contained).

                  Comment


                  • #24
                    why the FPKM is 0 ??

                    I have the same problem as you guys here: Cufflinks works well without a reference GTF file, while with reference, it gives 0 FPKM values. And this is not always for all genes, only some genes are like that. I am really confused.

                    Here are my input files:

                    htt.sam (output from Tophat, for example purpose, I only contain reads in HTT genes regions, where there are >3000 reads. See the screenshot of these aligned reads in UCSC)
                    htt.gtf (GTF for HTT genes, extracted from ensemble gtf file)

                    Here is my code (cufflinks version: 0.9.3.Linux_x86_64):
                    Code:
                    cufflinks -G htt.gtf htt.sam
                    Here is the output from screen:
                    Code:
                    $ /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks -G htt.gtf htt.sam 
                    /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks: /usr/lib64/libz.so.1: no version information available (required by /home/dongx/bin/cufflinks-0.9.3.Linux_x86_64/cufflinks)
                    [bam_header_read] EOF marker is absent.
                    File htt.sam doesn't appear to be a valid BAM file, trying SAM...
                    [17:33:52] Inspecting reads and determining fragment length distribution.
                    > Processed 1 loci.                            [*************************] 100%
                    > Map Properties:
                    >	Total Map Mass: 3542.44
                    >	Read Type: 40bp single-end
                    >	Fragment Length Distribution: Gaussian (default)
                    >	              Estimated Mean: 204.01
                    >	           Estimated Std Dev: 74.81
                    [17:33:52] Estimating transcript abundances.
                    > Processed 1 loci.                            [*************************] 100%
                    Here is the content in output genes.expr file:
                    Code:
                    $ cat genes.expr
                    gene_id	bundle_id	chr	left	right	FPKM	FPKM_conf_lo	FPKM_conf_hi	status
                    [COLOR="Red"]ENSG00000197386	3	chr4	3076406	3245676	0	0	0	OK[/COLOR]
                    Here is the content of output transcripts.expr file:
                    Code:
                    [[email protected] ~]$ cat transcripts.expr
                    trans_id	bundle_id	chr	left	right	FPKM	FMI	frac	FPKM_conf_lo	FPKM_conf_hi	coverage	length	effective_length	status
                    ENST00000355072	3	chr4	3076406	3245676	0	0	0	0	0	0	13531	13531	OK
                    ENST00000506137	3	chr4	3117054	3123439	0	0	0	0	0	0	784	784	OK
                    ENST00000512909	3	chr4	3123125	3125193	0	0	0	0	0	0	613	613	OK
                    ENST00000510626	3	chr4	3130064	3245667	0	0	0	0	0	0	14491	14491	OK
                    ENST00000509618	3	chr4	3162087	3174954	0	0	0	0	0	0	432	432	OK
                    ENST00000513639	3	chr4	3180072	3182400	0	0	0	0	0	0	261	261	OK
                    ENST00000513326	3	chr4	3180072	3182514	0	0	0	0	0	0	375	375	OK
                    ENST00000509043	3	chr4	3180072	3182521	0	0	0	0	0	0	382	382	OK
                    ENST00000502820	3	chr4	3204731	3208306	0	0	0	0	0	0	290	290	OK
                    ENST00000509751	3	chr4	3213637	3214782	0	0	0	0	0	0	725	725	OK
                    ENST00000512068	3	chr4	3230370	3231649	0	0	0	0	0	0	403	403	OK
                    ENST00000513806	3	chr4	3230438	3237123	0	0	0	0	0	0	436	436	OK
                    ENST00000508321	3	chr4	3237149	3237906	0	0	0	0	0	0	388	388	OK
                    It seems that cufflinks has read the GTF file correctly and the reads are obviously mapped to the gene. But why the FPKM is 0 ??


                    Anyone have clue for this? You can test the above code in your machine. Pls let me know if you got different results. THANKS!!


                    Originally posted by Pejman View Post
                    I have the same problem. I'm using cufflinks-0.9.1.Linux_x86_64 and trying to get it to produce expression levels based on a GTF file. If I run it without a reference GTF file, it works fine, gives some expressions. But with reference:
                    Code:
                    $cufflinks -G hg18_annotation.gtf MyFile.sam
                    it gives:
                    Code:
                    [20:24:17] Inspecting reads and determining fragment length distribution.
                    > Processing Locus chr6:27032750-27099732      [*****************        ]  69%
                    Error: this SAM file doesn't appear to be correctly sorted!
                            current hit is at chr10:272501, last one was at chr9:139953334
                    Cufflinks requires that if your file has SQ records in
                    the SAM header that they appear in the same order as the chromosomes names 
                    in the alignments.
                    If there are no SQ records in the header, or if the header is missing,
                    the alignments must be sorted lexicographically by chromsome
                    name and by position.

                    The GTF file I downloaded from UCSC genomebrowser portal and .sam file is produced by Bowtie, and then sorted by Samtools. Does anybody have any clue?

                    Comment


                    • #25
                      Originally posted by nkwuji View Post
                      I just read the script, it is very neat and clear. But I notice that you only count the assembled transcripts with "=" mark (match). How do you deal with transcripts with "c" mark (contained).
                      Hi, Jun,
                      Sorry I didn't see your post earlier. I treat c and j as novel transcripts. His is my script to find them. Feedbacks are welcome:

                      #!/usr/bin/perl

                      use Spreadsheet::WriteExcel;
                      use Text::ParseWords 'shellwords';
                      use strict;

                      my %exp;
                      my %geneo; #known gene
                      my %genen; #novel gene
                      my %trano; #old transcript
                      my %rel;

                      my %gene;

                      my $file = $ARGV[0];
                      if (not -f $file) {
                      print "Usage: find_novel_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare> <threshold for the novel transcript>\n";
                      die "File not exist: $file";
                      }
                      if ($ARGV[1] eq '') {
                      print "Usage: find_novel_alternative_splicing_expression_from_cuffcompare_result.pl <\'tmap\' file with path from cuffcompare> <threshold for the novel transcript>\n";
                      die "You should provide a threshold.";
                      }

                      print "looking for novel transcipts...\n";

                      #open cufflinks output file and query the novel transcitps. 'j' means novel, 'c' means contain
                      open I, $file;
                      while (<I>) {
                      my @p = split /\t/;
                      if (($p[2] eq 'j' or $p[2] eq 'c') and $p[6] > $ARGV[1]) {

                      #expressoion level
                      $exp{$p[4]} = $p[6];

                      #old gene name
                      $geneo{$p[4]} = $p[0];
                      #old transctipt name

                      $trano{$p[4]} = $p[1];

                      #new gene name
                      $genen{$p[4]} = $p[3];

                      #relationship
                      $rel{$p[4]}= $p[2];

                      #old gene name
                      $gene{$p[0]} =1;

                      #new gene name
                      $gene{$p[3]} =1;
                      }
                      }
                      close I;

                      #open the gtf to query the regions in gtf file which are related to the nevol transcipts
                      open I, "/gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf" or die "can not open gtf file";

                      print "looking for the related regions in gtf file...\n";

                      my (%start, %end, %chr);
                      while (<I>) {
                      my ($seqid,$source,$method,$start,$end,$score,$strand,$phase,$tags) = split "\t";
                      $tags =~ /gene_id\s\"(\S+)?\"/;
                      if($method eq "exon" and defined $gene{$1}) {
                      $start = int($start);
                      $end = int($end);
                      if(not defined $start{$1} or $start{$1} > $start) {
                      $start{$1} = $start;
                      $chr{$1} = $seqid;
                      }
                      if(not defined $end{$1} or $end{$1} < $end) {
                      $end{$1} = $end;
                      }
                      }
                      }
                      close I;

                      #find the file name for transcirpt.gtf
                      $file =~ s/\..+/.gtf/;

                      if(not -f $file) {
                      die "cound not find file $file";
                      }

                      #query the gtf file for the regions which are related to the novel transcripts, and print out the regions into the new file
                      open I, $file or die "can not open file $file";
                      open O, ">".$ENV{'LANE'}."_novel_scripts_threshold_".$ARGV[1].".gtf";
                      while (my $line = <I>) {
                      my ($seqid,$source,$method,$start,$end,$score,$strand,$phase,$tags) = split "\t", $line;
                      $tags =~ /gene_id\s\"(\S+)?\"/;

                      #get all the entries for the novel transcripts
                      if(defined $gene{$1}) {
                      print O $line;
                      }
                      }
                      close I;
                      close O;


                      #find the sam file name
                      $file =~ s/transcripts.gtf/accepted_hits.sam/;

                      if(not -f $file) {
                      die "cound not find file $file";
                      }

                      open I, $file or die "can not open file $file";

                      print "looking for related regions in sam file...\n";

                      #calculate the region we are interested
                      my %range;
                      for my $g (keys %chr) {
                      for my $i($start{$g}..$end{$g}) {
                      $range{$chr{$g}}{$i} = 1;
                      }
                      }

                      $file = $ENV{'LANE'}."_novel_transcripts_related_region_threshold_".$ARGV[1];
                      #query the sam file for the regions which related to the novel transcipts and print out into a new file
                      open O, ">$file.sam";
                      while (my $line = <I>) {
                      my ($a,$b,$refid,$start,$d,$e,$f,$end) = split "\t", $line;
                      #this will be removed
                      #$refid = "chr".$refid;
                      if(defined $range{$refid}{int($start)} or defined $range{$refid}{int($end)}) {
                      print O $line;
                      }
                      }
                      close I;
                      close O;

                      #convert the sam file into bam file for UCSC genome browser
                      my $cmd = "samtools view -b -S -T /gpfs/runtime/bioinfo/bowtie/indexes/mm9_all_folded_with_chrID.fa -o $file.bam $file.sam";
                      #print "$cmd\n";
                      system($cmd);

                      #sort and index the bam file
                      system("samtools sort $file.bam $file.sorted");
                      system("samtools index $file.sorted.bam");

                      #this is part of sam file
                      #ILLUMINA-2F52BD_0002:5:1:2235:9480#0 177 chr1 3047832 255 60M = 173820455 0 TCCTATCTCCACCATGGGCTGCAGCAAGGAGTACCAACCTGGTGCCAATTTAAAATAAAA @D:[email protected]>6C-BDDBCCC?CAA5AAA?CBACC55CF=FFFFFFFD:?FFFFFFFDFFFF F NM:i:2 NH:i:1
                      #ILLUMINA-2F52BD_0002:5:1:1243:4783#0 73 chr1 4122380 0 60M * 0 0 AAGCGAAACCCCGTCCCTATTAAGGAGAATAATCCTTCGCCTAGGACGTGTCACTCCCTG [email protected]??E5EEE?DADFDEDGFGGDFGDFFFDF NM:i:2 NH:i:37 CC:Z:= CP:i:5794552

                      print "print out result into exel file...\n";

                      # Create a new workbook called simple.xls and add a worksheet.
                      my $workbook = Spreadsheet::WriteExcel->new($ENV{'LANE'}.'_expression_novel_transcipts.xls');
                      my $worksheet = $workbook->add_worksheet("Alternative Splicing");

                      my $count = 1;
                      $worksheet->write(0, 0, "novel gene name"); #alternative splicing variants");
                      $worksheet->write(0, 1, "novel transcipt");
                      $worksheet->write(0, 2, "expression");
                      $worksheet->write(0, 3, "related gene");
                      $worksheet->write(0, 4, "related transcript");
                      $worksheet->write(0, 5, "relationship");
                      $worksheet->write(0, 6, "aceview link");

                      for my $k (sort keys %exp) {
                      my $range = $chr{$geneo{$k}}.":".($start{$geneo{$k}}-150)."-".($end{$geneo{$k}}+150);
                      $worksheet->write_url($count, 6, "http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Mouse&db=mm9&Submit=submit&position=".$range, $range);
                      $worksheet->write($count, 0, $genen{$k});
                      $worksheet->write($count, 1, $k);
                      $worksheet->write($count, 2, $exp{$k});
                      $worksheet->write($count, 3, $geneo{$k});
                      $worksheet->write($count, 4, $trano{$k});
                      if ($rel{$k} eq "j") {
                      $worksheet->write($count, 5, "novel");
                      } else {
                      $worksheet->write($count, 5, "contain");
                      }
                      $count++;
                      }
                      print "all done\n";
                      Last edited by ldong; 12-04-2010, 05:01 AM.

                      Comment


                      • #26
                        Hi sterding
                        I have observed the exact same thing and described it in http://seqanswers.com/forums/showthr...9933#post29933
                        I have reproduced your strange results using your files locally.
                        If anyone can shed some light on this, it would be great.

                        Comment


                        • #27
                          Originally posted by vang42 View Post
                          Hi sterding
                          I have observed the exact same thing and described it in http://seqanswers.com/forums/showthr...9933#post29933
                          I have reproduced your strange results using your files locally.
                          If anyone can shed some light on this, it would be great.
                          Hi, Vang42

                          I think my problem was due to the wrong Tophat version I was using (v. 1.1.2). It's fixed after V.1.1.3. Check the release news, pls.

                          Regards,

                          Sterding

                          Comment


                          • #28
                            Hi Sterding,
                            That is strange. I am using Tophat version 1.1.4 (beta, Linux x86_64 binary)
                            Bowtie version: 0.12.7.0
                            Samtools version: 0.1.8.0
                            And I reproduced your zero values.

                            -Vang42

                            Comment


                            • #29
                              The sam file I attached in my previous post was produced by Tophat V1.1.2, which was wrong! I am re-running the mapping using new version of Tophat (v.1.1.4), hoping it's fine. I will let you know.

                              Comment


                              • #30
                                Thanks for the scripts, ldong. The Spreadsheet::WriteExcel perl module creates an excel spread sheet which is 40mb in size and I am not able to open it... Excel reports that it is corrupted.

                                Others people using the same module had the problem and they were suggested to use Spreadsheet::WriteExcel::Big module. I have tried using it but I still have the same problem. Any suggestions?

                                More info is here: http://bit.ly/dZh1iS

                                - Dinesh

                                Comment

                                Working...
                                X