Hi,
Below is the background of my input files:
-Two input files named s_1_1.fq (left) and s_1_2.fq (right);
-Illumina paired-end reads;
-2 x 80 bp;
-insert size for the PE library is 300;
-the reads were made with a random priming process and are not stranded;
-human heart tissue;
Command used for TopHat:
/tophat-1.2.0.Linux_x86_64/tophat -r 140 -p 4 --solexa1.3-quals --library-type fr-unstranded human_ref_genome s_1_1.fq s_1_2.fq
Output files:
accepted_hits.bam;
deletions.bed;
insertions.bed;
junctions.bed;
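For reference, the -r 140 value in the TopHat command above follows from the library description, assuming "insert size 300" means the full fragment length without adapters (that reading is an assumption, not something confirmed by the sequencing provider). A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: fragment length minus both read lengths.

    Assumes fragment_length excludes adapters (an assumption, see above).
    """
    return fragment_length - 2 * read_length


# 300 bp fragments sequenced as 2 x 80 bp leave a 140 bp gap between mates.
inner = mate_inner_distance(fragment_length=300, read_length=80)
print(inner)
```

If the 300 figure actually included adapter sequence, the value passed to -r would need to be smaller.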
Command used for Cufflinks:
Cufflinks-0.9.2/cufflinks -p 4 --library-type fr-unstranded accepted_hits.bam
transcripts.gtf
isoforms.fpkm_tracking
genes.fpkm_tracking
Problem 1:
Based on the background of my input files, are the commands that I used for TopHat and Cufflinks correct?
Problem 2:
Below are the statistics for transcripts.gtf:
Number of transcripts: 89,005
Total bases of assembled transcripts: 552,350,446
Number of exons: 198,560
N50 assembled transcript length: 52,930
Longest assembled transcript length: 850,231
Shortest assembled transcript length: 49
Are the above statistics reasonable for a Cufflinks assembly?
The assembly seems wrong when compared with the RNA sequences published by NCBI, ftp://ftp.ncbi.nih.gov/refseq/H_sapi...man.rna.fna.gz
Below are the statistics for human.rna.fna.gz:
Number of transcripts: 46,296
Total bases of transcripts: 124,026,830
N50 transcript length: 3,697
Longest transcript length: 101,520
Shortest transcript length: 33
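To make the comparison reproducible, here is a minimal sketch of how an N50 can be computed from a list of transcript lengths. It also shows one possible source of inflated numbers: measuring a transcript by its genomic span (introns included) instead of by the sum of its exon lengths. Whether that is what happened with the statistics above is only a guess; the function names and the toy exon coordinates are hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def spliced_length(exons):
    """Sum of exon lengths; exons are (start, end) pairs, 1-based inclusive as in GTF."""
    return sum(end - start + 1 for start, end in exons)


def genomic_span(exons):
    """Distance from first exon start to last exon end, introns included."""
    return max(end for _, end in exons) - min(start for start, _ in exons) + 1


# Toy transcript with two 100 bp exons separated by an 800 bp intron.
toy_exons = [(100, 199), (1000, 1099)]
print(n50([2, 3, 4, 5, 6]))
print(spliced_length(toy_exons), genomic_span(toy_exons))
```

A RefSeq mRNA FASTA file contains spliced sequences only, so lengths taken from it correspond to spliced_length, not genomic_span.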
Problem 3:
head -1 transcript.gtf:
gi|224589800|ref|NC_000001.10| Cufflinks transcript 17231 17728 1000 + . gene_id "CUFF.137"; transcript_id "CUFF.137.1"; FPKM "18.1745565587"; frac "1.000000"; conf_lo "9.648231"; conf_hi "26.700882"; cov "122.001851";
head -2 transcript.expr:
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status
CUFF.137.1 168890 gi|224589800|ref|NC_000001.10| 17230 17728 18.1746 1 1 9.64823 26.7009 122.002 498 449 OK
head -2 gene.expr:
gene_id bundle_id chr left right FPKM FPKM_conf_lo FPKM_conf_hi status
CUFF.137 168890 gi|224589800|ref|NC_000001.10| 17230 17728 18.1746 9.64823 26.7009 OK
Why does the start position of the transcript differ between transcript.gtf (17231) and transcript.expr / gene.expr (17230)?
To extract the transcript sequences in FASTA format, which file should I refer to: transcript.gtf, transcript.expr or gene.expr?
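One reading that is consistent with the numbers above: GTF coordinates are 1-based and inclusive by specification, while the .expr "left" column looks 0-based (half-open interval). The .expr coordinate convention is an inference from this single record, not something taken from the Cufflinks documentation. A small sketch of the conversion, with a hypothetical helper name:

```python
def gtf_to_half_open(start: int, end: int):
    """Convert a 1-based inclusive GTF interval to a 0-based half-open one."""
    return start - 1, end


# Coordinates of CUFF.137.1 as shown in transcript.gtf.
left, right = gtf_to_half_open(17231, 17728)
length = right - left
print(left, right, length)
```

Under this reading the two files describe the same 498 bases, which agrees with the "length" column of transcript.expr.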
Problem 4:
transcript.expr and transcript.gtf show a total of 89,005 transcripts, while gene.expr shows only 84,950 genes.
What is the main reason for the difference in counts between the files?
Is it because individual genes can produce alternatively spliced isoforms?
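One way to check this directly is to count distinct gene_id and transcript_id values in the GTF file: if several transcript_id values share a gene_id, the transcript count will exceed the gene count. A minimal sketch, assuming a tab-delimited GTF with attributes in the ninth column; apart from CUFF.137, the records below are made up for illustration.

```python
import re

# Matches attribute pairs such as: gene_id "CUFF.137";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def count_ids(gtf_lines):
    """Return (number of distinct gene_id, number of distinct transcript_id)."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs:
            genes.add(attrs["gene_id"])
        if "transcript_id" in attrs:
            transcripts.add(attrs["transcript_id"])
    return len(genes), len(transcripts)


def record(gene, transcript):
    """Build a toy GTF transcript line (coordinates are placeholders)."""
    attrs = f'gene_id "{gene}"; transcript_id "{transcript}";'
    return "\t".join(["chr1", "Cufflinks", "transcript", "1", "100", "1000", "+", ".", attrs])


# CUFF.200 is a hypothetical gene with two isoforms.
toy = [record("CUFF.137", "CUFF.137.1"),
       record("CUFF.200", "CUFF.200.1"),
       record("CUFF.200", "CUFF.200.2")]
print(count_ids(toy))
```

To run it on a real file, pass an open file handle, e.g. count_ids(open("transcripts.gtf")).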
Problem 5:
According to the Cufflinks user manual, http://cufflinks.cbcb.umd.edu/manual...racking_format
Cufflinks generates three output files named transcripts.gtf, isoforms.fpkm_tracking and genes.fpkm_tracking.
Why does the above Cufflinks command generate transcript.expr and genes.expr instead of isoforms.fpkm_tracking and genes.fpkm_tracking?
How can I generate isoforms.fpkm_tracking and genes.fpkm_tracking with Cufflinks?
Problem 6:
How can I extract the transcript sequences assembled by Cufflinks?
Which Cufflinks output file should I refer to in order to get the coordinates of the assembled transcripts?
I would like to extract the transcript sequences assembled by Cufflinks in FASTA format for downstream analysis.
Kindly correct me if I have misunderstood the proper commands to use for TopHat and Cufflinks given my input files.
Thanks for any advice.
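To make clear what I am after, here is a minimal sketch of the extraction: join the exon intervals of one transcript from the reference sequence and reverse-complement the result for minus-strand transcripts. It assumes exon coordinates are 1-based inclusive (as in GTF) and uses a toy chromosome string; all names and values are hypothetical, and a real run would need the genome loaded from FASTA.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq: str, exons, strand: str) -> str:
    """Concatenate exons (1-based inclusive) in genomic order.

    Any strand other than "-" (including "." for unknown) is read as forward.
    """
    seq = "".join(chrom_seq[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(seq) if strand == "-" else seq


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy 12 bp chromosome; two exons covering bases 1-3 and 7-9.
toy_chrom = "AAACCCGGGTTT"
toy_exons = [(1, 3), (7, 9)]
forward = spliced_sequence(toy_chrom, toy_exons, "+")
reverse = spliced_sequence(toy_chrom, toy_exons, "-")
print(to_fasta("toy.1", forward), end="")
print(to_fasta("toy.1_minus", reverse), end="")
```

Since the library is unstranded, single-exon transcripts may have no reliable strand, so their sequence orientation would be arbitrary.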