
**GenoMax** · 02-01-2016, 10:41 AM

As you have discovered first hand, doing annotation is hard, no matter what tool you use. Ultimately, annotation requires careful inspection of results and weighing of evidence before making a final judgement.

Can you tell us what kind of genome you are working with (haploid, diploid, etc.; number of chromosomes; percentage of repeat sequence)? How does your assembly compare to the close relative that you refer to (in terms of number of contigs, N50, etc.)?

If there is a closely related species that has already been sequenced and annotated, then one of the reasons your annotation looks poor could be that your assembly is not very good (unless the closely related genome has theirs wrong). You may want to take a fresh look at redoing the assembly in that case.

**moldach** · 02-01-2016, 12:16 PM

More information

Hi Genomax,

I am working with a diploid eukaryotic transcriptome (not genome) from a coral species in the Acropora spp. complex.

There is another Acropora species for which a genome is available [avg. sequence length ~1700 bp; N50 = ~2200 bp].

However, N50 is often misleading as it measures the continuity of contigs and not their accuracy; in transcriptome assembly the optimal contig length is not known a priori, so N50 carries little information. Similarly, for transcriptome assembly, other reference-free measures (e.g. median contig length and number of contigs) can be misleading, or even meaningless, and should be avoided.
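To make the point about continuity versus accuracy concrete, here is a minimal Python sketch of the usual N50 definition (the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembled bases). The two contig-length lists are made-up illustrative numbers, not data from this thread.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # total assembled bases defines the N50.
        if running >= half_total:
            return length


# Hypothetical example: the same 6000 bp assembled two ways.
separate_transcripts = [2000, 2000, 500, 500, 500, 500]
one_chimeric_contig = [6000]  # everything wrongly joined together

print(n50(separate_transcripts))
print(n50(one_chimeric_contig))
```

The wrongly joined assembly gets the higher N50 even though it is the less accurate of the two, which is why the metric says little about transcriptome quality on its own.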

Therefore, I assessed the quality of my transcriptome assembly using Transrate; Transrate uses a reference genome/transcriptome to compare the quality of assembly. Because the A. digitifera genome is not annotated I used the annotated transcriptome of A. millepora. For my assembly, Transrate showed an initial score of 0.1316, and an optimized score of 0.2336 in Trinity. For comparison, approximately 50% of the de novo assemblies from the NCBI Transcriptome Shotgun Assembly database produce an overall score of 0.22 and optimized score of 0.35.

So my assembly is somewhat sub-optimal.

**GenoMax** · 02-01-2016, 03:20 PM

Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.

Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.

**moldach** · 02-03-2016, 10:16 AM

Originally posted by GenoMax

Have you done searches against the annotated transcriptome from the Matz lab (blastn searches in addition to tblastx perhaps)? That would be your best bet to find quick homologies. You may have already done that though to get to the point where you are at.

I have not done searches against the annotated transcriptome with blastn. What exactly needs to be done when doing a blastn in addition to a blastx search? Would you just concatenate the resulting output file of both blastx and blastn?
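One way to combine the two searches without simply concatenating the files is sketched below in Python. It assumes both searches were written in BLAST's 12-column tabular format (`-outfmt 6`), keeps the highest-bitscore hit per contig from each program, and stores the two side by side; the function names `best_hits` and `merge_annotations` are hypothetical helpers, not part of BLAST. Nucleotide and protein bitscores are treated here as not directly comparable, so the sketch does not try to pick a single winner between programs.

```python
import csv

# Column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(lines, program):
    """Keep the highest-bitscore hit per query from tabular BLAST lines."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["bitscore"] = float(hit["bitscore"])
        hit["program"] = program
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def merge_annotations(blastn_hits, blastx_hits):
    """Pair up the best blastn and blastx hit for every annotated contig."""
    merged = {}
    for query in sorted(set(blastn_hits) | set(blastx_hits)):
        merged[query] = {
            "blastn": blastn_hits.get(query),  # None if no nucleotide hit
            "blastx": blastx_hits.get(query),  # None if no protein hit
        }
    return merged
```

Typical use would be to open each result file and pass the file handle as `lines`, for example `best_hits(open("contigs_vs_ref.blastn.tsv"), "blastn")`, where the file name is a placeholder. Contigs that appear in only one of the two searches are the ones the complementary search has added.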

Originally posted by GenoMax

Depending on how much time you want to spend on this you could try extending the searches to refseq_genomic (and other databases) but it would be a lot of work to pore through the results and make informed decisions. You will only get so far with just searches.

I'm thinking time-wise that for this project I only want to spend enough time extending annotations via a complementary blastn search. However, this is a graduate project that really should have been wrapped up by now.

I've been talking with a lab about potentially providing support for assembling/annotating a number of transcriptomes, and one of their concerns was the poor annotation results. So really, for the sake of making myself more employable, it would be very helpful if you could elaborate a bit on doing extended searches against refseq_genomic.

Do you know how common this is with non-model organism assembly?

What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.

What other databases besides refseq_genomic could be used?

You mentioned poring through the results and making informed decisions. This I don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help.

Thank you very much

**Markiyan** · 02-08-2016, 03:31 AM

Looks like the de novo transcriptome assembly needs to be properly done first...

Dear Moldach,

It looks like the de novo assembly needs to be done properly first.
Assuming you were using Illumina:
For that you really need to start from a cDNA library with 350-600 bp fragment size, then sequence it on the MiSeq or HiSeq in 2x250 or 2x300 bp run mode (read the Illumina cDNA library prep protocol, fragmentation section).
Or do PacBio's Iso-Seq...
(If you did Illumina 1x75 bp or 1x100 bp, it would not cut it very well...)
Then process your data through FLASH or PANDAseq (preassembly), and then do an incremental pure de novo assembly starting from 10k reads and going up.

Check the most abundant transcripts for completeness, and add them to the "vector.seq" database, so they don't interfere with the next round of the assembly for the less abundant transcripts.

You can use MIRA or any other assembler in EST mode (you can also try CLC or DNASTAR's NGen).

Then combine the final edition of the vector.seq database with your final contigs and:
1. use it as reference for mapping reads to it (to get the relative abundance)
2. annotate your reference by blastx

I wouldn't rely on any reference-based methods if the similarity between the beasts is less than 95% at the DNA level.

Markiyan.

**moldach** · 02-08-2016, 03:43 PM

Originally posted by Markiyan

It looks like the de novo assembly needs to be done properly first.
Assuming you were using Illumina:
For that you really need to start from a cDNA library with 350-600 bp fragment size, then sequence it on the MiSeq or HiSeq in 2x250 or 2x300 bp run mode

I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly. I know that good annotation starts with a good assembly (**** in=**** out), I get it. So suggesting > 100 million reads for a de novo assembly is obviously good advice for future experimental design; however, our lab only had so much money, so it is what it is.

I'm really looking for ways to improve this assembly, but thanks for your kind suggestions.

Originally posted by Markiyan

I wouldn't rely on any reference-based methods if the similarity between the beasts is less than 95% at the DNA level.

OK, so only one published genome exists for this genus. How would I know how similar the species are at the DNA level?

**GenoMax** · 02-08-2016, 04:38 PM

Originally posted by moldach

What are we talking about in terms of time spent vs rewards? - obviously an assembly is never 100% complete, but there comes a point at which the returns will not be sufficient to justify the time/cost.

I don't have an informed answer since that would depend on the data. Generally for an annotation project it would be ideal to have some genomic DNA data to give at least low pass coverage. Having genomic DNA would allow you to assemble that data/build gene models and to see how much of the potentially expressible component you are recovering in your transcriptome data.

It is quite possible that, with the assemblies you have, you have reached (or are close to) the point where the return on time invested is not going to be worth it. Since you are not going to generate additional data, what you have is what you have.

What other databases besides refseq_genomic could be used?

You could use GenPept or TrEMBL. You could also try other searches such as PSI-BLAST/DELTA-BLAST.

You mentioned poring through the results and making informed decisions. This I don't quite understand. Do you mean that some annotations will be erroneous? Maybe a hypothetical example would help.

You probably realize that automated BLAST searches (or any similarity searches) are only going to take you part of the way. At some point you may need to do more thorough searches (with the same query) using different parameters (e.g. similarity matrices). Once you have potential hits, you would need to manually look at the regions of homology (which may not be extensive; think of a conserved active site) and construct/edit sequence alignments to see if you can extend the annotation from a well-known protein in a distant species to yours. This type of work is pretty tedious and time consuming, and not something you want to do as a side project. Hope that helps.

**moldach** · 02-08-2016, 06:10 PM

Thank you very much Genomax and Markiyan for your time and valuable feedback.

**Markiyan** · 02-09-2016, 04:37 AM

Originally posted by moldach

I used HiSeq250 Paired-end reads, this was contracted out by BGI Hong Kong (unfortunately the details of their library prep aren't available so I'm not sure what fragment size of cDNA library they used).

I assembled using two libraries to capture time-specific isoforms. Each library had roughly 15 million reads, so a total of 31 million reads were used for transcriptome assembly.
I'm really looking for ways to improve this assembly, but thanks for your kind suggestions.

So the input dataset theoretically looks quite good, assuming the data is clean of adapters (run FastQC on it first!).

I would still try the iterative cDNA assembly approach, because it helps greatly with removing the spurious links that chimeric reads create between highly expressed and low-expressed transcripts. Even if only 2-5% of reads are chimeric, they can still cause a lot of trouble: one would expect a dynamic range of at least 3-4 orders of magnitude, so a template expressed 10^4 times more would have a lot of chimeric links to the low-expressed ones.

If you assemble only 10K or so reads at first, you will get the most highly expressed transcripts; you can then remove them from the next iterations, so the highly expressed part of each chimera is simply masked off instead of confusing the assembler.
Increase your dataset by 5-50X at a time (avoid getting contigs with more than 500X coverage).
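The read-subset schedule described above can be sketched in Python as follows. The starting size of 10,000 reads comes from the post; the 10X growth factor is only one assumed value inside the suggested 5-50X range, and both function names are hypothetical. The coverage helper uses the simple estimate reads × read length / contig length to flag contigs that go over the 500X limit.

```python
def subset_sizes(total_reads, start=10_000, growth=10):
    """Read counts to assemble at each iteration, ending with all reads."""
    if total_reads <= 0 or start <= 0 or growth <= 1:
        raise ValueError("total_reads and start must be > 0, growth > 1")
    sizes = []
    size = start
    while size < total_reads:
        sizes.append(size)
        size *= growth
    sizes.append(total_reads)  # final pass uses the full dataset
    return sizes


def mean_coverage(read_count, read_length, contig_length):
    """Approximate mean coverage of a contig by the reads assigned to it."""
    return read_count * read_length / contig_length


# 31 million reads, as reported earlier in the thread.
print(subset_sizes(31_000_000))

# Hypothetical contig: 4000 reads of 250 bp on a 1700 bp contig.
if mean_coverage(4000, 250, 1700) > 500:
    print("over 500X: mask this transcript before the next iteration")
```

After each iteration, contigs flagged this way would be the candidates to add to the "vector.seq" database before the next, larger subset is assembled.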

One can use nearly any DNA assembler for this. I did exactly this with a snail transcriptome in 2009 (sequenced on 454 FLX), using sff2phd and PHRAP over 3 iterations, and got way better results than from Newbler v2.0 in cDNA mode over a single pass.
PS: results were evaluated in Consed.

Originally posted by moldach

OK, so only one published genome exists for this genus. How would I know how similar the species are at the DNA level?

Get the published genome FASTA file and try 2 things:
1. formatdb it into a BLAST database and blastn / tblastx some de novo transcriptome contigs against it
2. simply try mapping your reads against it using bwa or similar.
PS: pay attention to the FASTA IDs; not all mappers like the default NCBI format!
Also look at the % of mapped reads and the "SNP" density to get a rough idea of similarity.
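For the second option, the percentage of mapped reads can be read off the mapper's SAM output. The Python sketch below is one assumed way to do it: it counts one record per primary alignment (skipping secondary and supplementary records, SAM flags 0x100 and 0x800) and treats a read as mapped when the unmapped flag 0x4 is not set. The function name `mapped_fraction` is hypothetical.

```python
def mapped_fraction(sam_lines):
    """Fraction of primary SAM records that are mapped to the reference."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue  # secondary / supplementary: same read counted already
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    if total == 0:
        raise ValueError("no alignment records found")
    return mapped / total
```

It can be called on an open SAM file, for example `mapped_fraction(open("reads_vs_genome.sam"))`, where the file name is a placeholder. A low mapped percentage against the related genome would suggest the two species are too divergent for reference-based methods.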

Markiyan.

Selecting the best database for blasting