
**ssing** · 02-10-2012, 11:13 AM

Hi LizBent,

I have been working on the exact same problem and have come up with some metrics to estimate the quality of a transcriptome in the absence of a ref genome. Some stats that I have used are:
* N50
* percent annotated to my closest reference
* percent of annotated proteins that have (what seem to be) premature stop codons
* percent of reads used / percent of paired reads used
* contiguity & completeness (see http://www.nature.com/nrg/journal/v1...l/nrg3068.html)
* incidence of chimeric transcripts
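
As a rough illustration of the premature-stop metric in the list above, here is a minimal sketch. It assumes you have already extracted in-frame coding sequences for your annotated transcripts and that the standard genetic code applies; the function names are made up for this example and are not from any tool mentioned in this thread.

```python
# Stop codons of the standard genetic code (assumption: no alternative code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds):
    """Return True if an in-frame CDS has a stop codon before its last codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is allowed to be a stop; anything earlier is premature.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def percent_with_premature_stop(cds_by_id):
    """Percentage of sequences in a {id: cds} dict that have a premature stop."""
    if not cds_by_id:
        return 0.0
    flagged = sum(has_premature_stop(seq) for seq in cds_by_id.values())
    return 100.0 * flagged / len(cds_by_id)


example = {
    "ok_transcript": "ATGAAATAA",      # ATG AAA TAA, stop only at the end
    "bad_transcript": "ATGTAAAAATAA",  # ATG TAA AAA TAA, internal stop
}
print(percent_with_premature_stop(example))  # 50.0
```

Whether a flagged sequence is really a misassembly, a frameshift, or a wrong frame call still needs a manual look.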

As for calculating simple metrics like n50, max contig size, etc, I use the command line program abyss-fac, which is available as part of the general ABySS package.
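
For anyone unsure what N50 actually measures, a minimal sketch under the usual definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly size (L50 is how many contigs that took). This is not abyss-fac's code, and tools may apply their own minimum length cutoffs, so numbers can differ slightly between programs.

```python
def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50, 'count' is the L50
            return length, count
    raise ValueError("no contig lengths given")


# Total is 54, so half is 27; 10 + 9 + 8 reaches 27 at the third contig.
print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # (8, 3)
```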

Good luck!

**nepossiver** · 10-12-2012, 12:03 PM

Originally posted by ssing:

*incidence of chimeric transcripts

hi ssing,

how do you calculate chimeric transcripts? Do you have a reference genome? My problem is, I don't, and I don't know of a good way to find chimeric contigs in my assemblies.

thanks

**student-t** · 05-12-2015, 05:59 PM

There are a few tools for calculating assembly metrics:

1. https://github.com/ajmazurie/velvet-stats
2. Biopieces
3. http://korflab.ucdavis.edu/datasets/...athon_stats.pl
4. abyss-fac

I don't recommend 1-3. The documentation is bad, and I didn't have the time to go through the source code. Biopieces required a multi-stage workflow, which I think is a very stupid idea.

Use abyss-fac, don't waste your time. On a Mac, install it via "brew install abyss"

**Brian Bushnell** · 05-12-2015, 06:24 PM

Old thread, but BBMap has a stats.sh program that will summarize basic assembly stats (N50, L50, distribution of contig sizes, GC%, etc); it's very fast even on assemblies with millions of contigs, and extremely easy to use:

stats.sh contigs.fasta

For more advanced statistics, particularly if you have a reference and are evaluating different assembly methodologies, I recommend Quast because it also does alignment to the reference to calculate the number of misassemblies. Also, even if you don't have a reference, it does neat things like gene prediction. Not sure how that feature would work on a transcriptome, though.

**nepossiver** · 05-13-2015, 07:41 AM

Originally posted by Brian Bushnell:

For more advanced statistics, particularly if you have a reference and are evaluating different assembly methodologies, I recommend Quast because it also does alignment to the reference to calculate the number of misassemblies.

Their group (excellent; I love SPAdes and QUAST) is developing rnaQUAST to evaluate transcriptome assemblies. Version 0.1.1 (the current version at the time of this message) has a bug, though: the reference transcriptome file name has to strictly follow this pattern:

Code:

name.extension

I could not use a reference which had:

Code:

name.middle.extension
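
One possible workaround, sketched under the assumption that only the file name matters: copy the reference to a name with a single dot before handing it to the tool. This is not a documented rnaQUAST feature and has not been checked against it; both helper names below are made up for illustration.

```python
from pathlib import Path
import shutil


def single_extension_name(filename):
    """Turn 'name.middle.extension' into 'name_middle.extension'."""
    path = Path(filename)
    # path.stem keeps everything before the last dot; replace the inner dots.
    return path.stem.replace(".", "_") + path.suffix


def copy_with_single_extension(src, out_dir):
    """Copy src into out_dir under a single-extension name; return the new path."""
    dest = Path(out_dir) / single_extension_name(Path(src).name)
    shutil.copyfile(src, dest)
    return dest


print(single_extension_name("name.middle.extension"))  # name_middle.extension
```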

**bastianwur** · 06-04-2015, 07:03 AM

There are tools like CGAL and RSEM-EVAL, which calculate the likelihood of the reads belonging to the actual assembly. That might help when you have more than one assembly to compare.

Since the size of the assembly can also vary, I like to have an estimate of the genome size beforehand; tools for that are kmerspectrumanalyzer or kmergenie.

And depending on how fragmented you can/want to get with the data: the most likely correct genome (not necessarily contiguous) comes from taking the consensus of all your assemblies and breaking the contigs wherever the assemblies disagree.

<s>If you arrive at a chromosome, and you have a prokaryote, then you need to take a look at the GC skew of the chromosome to detect obvious misassemblies.</s> scratch that, didn't see the transcriptome part.
EDIT: Eh, no strike through tags in this forum?

**maasha** · 06-25-2015, 03:56 AM

I should say Biopieces is pretty nifty for this task:


https://code.google.com/p/biopieces/wiki/HowTo#Howto_analyze_assembled_contigs

You simply do:

Code:

read_fasta -i contigs.fna |
grab -e "SEQ_LEN>=200" |
analyze_assembly -x

and get:

Code:

N50: 9082
MAX: 52038
MIN: 200
MEAN: 4170
TOTAL: 3057214
COUNT: 733
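
If Biopieces is not installed, the same kind of summary can be approximated in plain Python. This is only a sketch of the filter-then-summarize idea (keep contigs of at least 200 bp, then report the same fields); it is not how analyze_assembly is implemented, and the integer MEAN is an assumption based on the output shown above.

```python
def read_fasta_lengths(lines):
    """Yield the sequence length of each FASTA record in an iterable of lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, min_len=200):
    """Summary stats for contigs with length >= min_len (empty dict if none)."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        return {}
    total = sum(kept)
    running = 0
    for n in kept:
        running += n
        if running * 2 >= total:  # first contig reaching half the total
            n50 = n
            break
    return {"N50": n50, "MAX": kept[0], "MIN": kept[-1],
            "MEAN": total // len(kept), "TOTAL": total, "COUNT": len(kept)}


# Usage on a real file (the file name is just an example):
# with open("contigs.fna") as handle:
#     print(summarize(read_fasta_lengths(handle)))
demo = [">a", "A" * 300, ">b", "A" * 150, "A" * 150, ">c", "A" * 100]
print(summarize(read_fasta_lengths(demo)))
```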
---

De novo transcriptome quality metrics?