**Mark** · 09-27-2012, 04:09 AM

Can you be more specific about what type of data you have (16S?) and what you have already done?

**chris_bioinfo** · 09-27-2012, 04:14 AM

Dear Mark,

It's genomic data, and I've already done taxonomic classification of the species present in the sample using MEGAN. I mapped the reads to the NCBI database of all bacteria using bowtie, imported the SAM file into MEGAN, and got a nice tree view. But now I want to find the novel species that have already been sequenced, from the reads that have not aligned at all.
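For readers following along, pulling the reads that did not align out of a SAM file can be sketched as below. This is a minimal illustration, not what was actually run in this thread: it relies only on the SAM convention that bit 0x4 of the FLAG column marks an unmapped read, and the function name `unmapped_reads` and the example records are made up for the demonstration. In practice `samtools view -f 4` or bowtie2's `--un` option would likely be the more convenient route.

```python
def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for SAM records flagged as unmapped.

    Hypothetical helper for illustration; it checks bit 0x4 of the FLAG
    column, which the SAM format defines as "segment unmapped".
    """
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4:
            yield name, seq


# Made-up SAM records: one aligned read and two unaligned reads.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tNC_000913.3\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGTAA\tIIIIIIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGCATTCGAT\tIIIIIIIIII",
]

for name, seq in unmapped_reads(example):
    # Print in FASTA form so the output could feed a later search step.
    print(f">{name}\n{seq}")
```

The resulting FASTA records are the candidates for a more sensitive search, since they found no close match in the nucleotide database.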

**Mark** · 09-27-2012, 04:25 AM

Well, given your approach (bowtie vs. a nucleotide database), it seems likely that your hits are very close matches. To see the next tier of taxonomic relatedness, you might try aligning your reads using blast (or another such tool) to do translated searches against a comprehensive protein database. Note that when you do this and examine the taxonomic assignments made by MEGAN, the hits identified are often significant yet still far from exact (much more so than when using bowtie), which implies the presence of potentially novel species.

**chris_bioinfo** · 09-27-2012, 04:47 AM

I'm sorry, Mark, if I'm wrong, since I'm new to metagenomics, but as far as I understand, if it's meta-transcriptome data then I should use tblastx and search against the nr database, right? My feeling is that since this is genomic data, matching against the nt database would serve the purpose.

I also tried standalone blast, but I have a tremendous number of reads: 18 million paired-end Illumina reads, 36 million in total. Blast ran for four days and was still running, so I had to stop it, and then I opted for bowtie2. I am confident that this is not a memory problem, since I am running it on a cluster with more than 210 GB of RAM.

I'm truly thankful for your replies.

Best,
Christopher

**Mark** · 09-27-2012, 09:19 AM

Hi Chris

Actually, you would use blastx vs. a protein database. tblastx is where both the query and the subject are translated and searched in protein space. This might also work but is even more computationally demanding than blastx.
I think you probably do want to search in protein space: it is more sensitive, since amino acid sequence evolves more slowly than nucleotide sequence.
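To make "searching in protein space" concrete, here is a rough sketch of the six-frame translation that a translated search such as blastx applies to each nucleotide query before comparing it with protein sequences. It only illustrates the idea and is not blastx's actual implementation; the function names are invented for the example, and the standard genetic code is assumed.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(read):
    """Return the three forward and three reverse-strand translations."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Because several codons encode the same amino acid, two reads can differ at the nucleotide level and still translate to similar peptides, which is why a translated search can pick up more distant relatives than a nucleotide aligner.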

Yes, running a tool like blast on that much NGS data is burdensome unless you have prolonged access to a large cluster. One alternative that would still allow you to search in protein space is rapsearch2. It achieves 50-100X speedups over blastx with only limited loss in sensitivity. Parallelizing its execution may provide you with the speed you need to get the job done.

Mark
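One simple way to parallelize a search like this is to split the reads into chunks and run each chunk as a separate cluster job. The sketch below shows only the splitting step, under the assumption that the reads are in FASTA format; `read_fasta` and `split_round_robin` are hypothetical helper names, and the command line of the search tool itself is deliberately left out.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists, one record at a time."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Made-up reads; real input would come from a file with millions of records.
fasta = [">r1", "ACGT", ">r2", "GG", "CC", ">r3", "TTTT"]
chunks = split_round_robin(read_fasta(fasta), 2)
for index, chunk in enumerate(chunks):
    print(index, [name for name, _ in chunk])
```

Each chunk would then be written to its own file and searched independently, with the per-chunk results concatenated before import into MEGAN. For tens of millions of reads, writing records straight to the chunk files instead of holding them in memory would be the more practical variant.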

novel species discovery metagenomics
