Is anyone aware of a program or simple script that selects the most abundant transcript for each gene (for example, based on FPKM values) from a Trinity-assembled transcriptome that has been run through RSEM? I have the RSEM output file RSEM.isoform.results, which looks like this:
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
comp1000093_c0_seq1 comp1000093_c0 257 180.57 2.00 0.33 0.31 100.00
comp1000100_c0_seq1 comp1000100_c0 308 231.21 4.00 0.51 0.49 100.00
comp1000106_c0_seq1 comp1000106_c0 279 202.37 2.00 0.29 0.28 100.00
comp135533_c0_seq1 comp135533_c0 233 156.94 0.00 0.00 0.00 0.00
comp135533_c0_seq2 comp135533_c0 288 211.31 4.00 0.56 0.54 48.65
comp135533_c0_seq3 comp135533_c0 235 158.90 0.00 0.00 0.00 0.00
comp135533_c0_seq4 comp135533_c0 426 349.02 7.00 0.60 0.57 51.35
I also have a FASTA file with all the transcripts.
So I would like to end up with a FASTA file containing only a single transcript_id per gene_id.
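To make the request concrete, here is a minimal Python sketch of the selection I have in mind. It is an illustration under assumptions, not a tested pipeline: it assumes the results file is whitespace- or tab-delimited with the header shown above, that each FASTA header starts with the transcript_id as its first token, and that ties are broken by keeping the first isoform listed. The function names and the FASTA file names in the usage comment are hypothetical.

```python
def pick_top_isoforms(results_lines, column="FPKM"):
    """Map each gene_id to the transcript_id with the highest value in `column`."""
    rows = (line.split() for line in results_lines if line.strip())
    header = next(rows)
    tid_i = header.index("transcript_id")
    gene_i = header.index("gene_id")
    val_i = header.index(column)
    best = {}  # gene_id -> (value, transcript_id)
    for fields in rows:
        gene, value = fields[gene_i], float(fields[val_i])
        # Strict ">" keeps the first isoform seen when values are tied.
        if gene not in best or value > best[gene][0]:
            best[gene] = (value, fields[tid_i])
    return {gene: tid for gene, (_, tid) in best.items()}


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the FASTA records whose ID (first header token) is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


def write_top_isoforms(results_path, fasta_path, out_path, column="FPKM"):
    """Write a FASTA file holding one transcript per gene."""
    with open(results_path) as results:
        keep_ids = set(pick_top_isoforms(results, column).values())
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, keep_ids))


# Hypothetical usage (the FASTA file names are placeholders):
# write_top_isoforms("RSEM.isoform.results", "Trinity.fasta", "top_isoforms.fasta")
```

With the example rows above, this would keep comp135533_c0_seq4 for gene comp135533_c0, since its FPKM (0.57) is the highest of the four isoforms. Passing column="TPM" or column="IsoPct" would rank by those columns instead.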