Seqanswers Leaderboard Ad

**GenoMax** · 05-08-2015, 07:15 AM

Answer below only relates to "how do I make protein blast database".

Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

OR

ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

or

Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

or

You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/

**hlyates** · 05-08-2015, 07:20 AM

Originally posted by GenoMax View Post

Answer below only relates to "how do I make protein blast database".

Go here: http://www.uniprot.org/uniprot/?quer...D+reviewed:yes

Click download button to download 3246 "reviewed" D. melanogaster proteins in fasta format. Build the blast database.

OR

ftp://ftp.ncbi.nlm.nih.gov/genomes/D.../RELEASE_5_48/

Get the *.faa files from the chromosome directories, cat the files into one and make blast database.

or

Get the protein sequence from: ftp://ftp.ensembl.org/pub/release-79....pep.all.fa.gz Build database.

or

You may also be able to subset the drosophila data using the blastdb_alias tool: http://www.ncbi.nlm.nih.gov/books/NBK279693/

In order to further learn, if I wanted to apply this to another species in the future, do all three have a searchable database? I assume so sir. Thank you for your assistance.

Also, is there a way to do this with non-redundant proteins (nr)? That is, would it be possible to make something from All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects? I'm sorry if I am asking stupid question, but I learn best when talking to experts.

**GenoMax** · 05-08-2015, 07:36 AM

No question is stupid. As you said one only learns by asking questions.

blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

Let me point out a subtle distinction.

Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.

**hlyates** · 05-08-2015, 12:22 PM

Originally posted by GenoMax View Post

No question is stupid. As you said one only learns by asking questions.

blastdb_alias tool should be able to do it from refseq_protein blast index by sub-setting (I think that would be the database to use).

Let me point out a subtle distinction.

Links above give you "known" proteins/transcriptome from Drosophila which is non-redundant information. If you were to download protein sequence file from taxonomy browser: http://www.ncbi.nlm.nih.gov/protein?term=txid7227[Organism] then you are likely to get redundant entries.

Hmmm. This is going in a more complicated direction than I originally thought. Which is okay, but just want to be sure we are on the right track.

I talked to my colleague is who isn't knowledgeable about how to do this, but does know what I need to do. He stated that I need to do positive filtering with blastx by doing a search on protein database. He guess I can download them by fruit fly (aka taxid?) proteins.

What would be the easiest way to do this I guess? I thought it would be the nr database blastx option that we see on the ncbi database as seen here, but only for the fruit fly.

I'll rephrase my ramblings more concisely with the following.

1. Do the filtering on the protein database for fruit fly (I have no idea how to do this)

2. Run makeblastdb on the filtered.fasta file?

Code:

makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db

**GenoMax** · 05-08-2015, 02:46 PM

Originally posted by hlyates View Post

I'll rephrase my ramblings more concisely with the following.

Do the filtering on the protein database for fruit fly (I have no idea how to do this)

This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?

If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).

Originally posted by hlyates View Post

Run makeblastdb on the filtered.fasta file?

Code:

makeblastdb filteredproteinfruitfly.fasta -dbtype nucl -hash_index -out myfiltered_db

Is this the set with hits to Drosophila? Then do what?

**GenoMax** · 05-08-2015, 02:57 PM

If all you are interested in is sequences that "hit" Drosophila then I would suggest using BBSplit. Even if it is the other way around (i.e. sequences that are not Drosophila) that should still work.

What is the size distribution of your input fasta?

**hlyates** · 05-10-2015, 06:09 AM

Originally posted by GenoMax View Post

This is becoming a case of blind leading the blind (since I don't exactly know what you are trying to achieve).

I am interpreting this to mean that if your sequence has a hit in the Drosophila protein database (positive filtering, wording from your post) you want to keep it. Is that correct?

That is correct.

If so this is straightforward. If you want *every* Drosophila melanogaster protein then use the tax browser download (link posted before) or you could use nr (but you may not be able to filter on tax id since NCBI does not build the blast indexes to include that info. I remember going over this a few weeks ago with either you or someone else on the forum).

This I do understand how to do, I think. I merely download it as a .fasta file, yes? You did teach me that trick and it was pretty cool.

Is this the set with hits to Drosophila? Then do what?

Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

Again, thanks geno max.

**GenoMax** · 05-10-2015, 07:27 AM

Originally posted by hlyates View Post

Right, that is my question. Once I download the file. Do I use the makedb command first on the downloaded file from the browser? I intend to run blastx on the db file I create. Based on the ncbi documentation, this is the approach I think we must take.

Again, thanks geno max.

Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.

**hlyates** · 05-10-2015, 06:03 PM

Originally posted by GenoMax View Post

Order of operations:

1. Download all Drosophila proteins from tax browser link (since this is what your collaborator seems to want you to do) as multi-fasta format file.
2. Make blast database using the fasta file.
3. Blastx with your sequences using parameters you want (e-value cutoff etc). I would just choose the tabular output format since you can grab the sequence ID's that show a hit from this table.
4. Use faFilter utility (http://hgdownload.soe.ucsc.edu/admin...86_64/faFilter) to get a subset that contains sequences from your list that hit Drosophila proteins.

I follow perfectly well until step #4. Why is this step necessary? Since we are blasting against all fruit fly proteins (step #1-2), then doesn't blastx give us the subset of proteins that contains sequences from our list?

Or am I misunderstanding and faFilter will read my output and only report the significant results?

In the meantime, I'll read up on faFilter and maybe can answer my own question, but super appreciate the opportunity to converse with a professional. I'm a one dog show at the moment, so to speak.

**GenoMax** · 05-11-2015, 01:14 AM

Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.

**hlyates** · 05-11-2015, 06:21 AM

Originally posted by GenoMax View Post

Step 3 will report sequences from your set that hit a Drosophila protein. Some sequences may hit more than one protein record (and vice versa) since you are starting with *every* Drosophila protein based on the tax ID.

I included #4 on assumption that you will want to divide your own squence file into two pools. One that is represented in #3 and other that does not. faFilter is an easy utility to do that.

I have a few wrap up questions Geno Max. Then I will go on my own as much as possible. Your expert advice has already optimized my approach so much. I doubt I can thank you enough.

1. I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?

2. I have one last question (with a followup) this is related to blastn output. Rather than make a another question, I want to ask you it here. Last night my blastn output finally finished. I used a cut off of 1E-42. I am seeing records like this:

Code:

***** No hits found *****



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT

Length=101
                                                                      Score                                                    E
Sequences producing significant alignments:                          (Bits)  Val                                               ue

gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43


>gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
Length=2922

 Score =   182 bits (98),  Expect = 6e-43
 Identities = 100/101 (99%), Gaps = 0/101 (0%)
 Strand=Plus/Minus

Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453

Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
             |||||||||||||||||||||||||||||||||||||||||
Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT

Length=101

Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?

3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?

After this, I plan on reading up on BLOSUM62 again and other things to deepen my understanding. Again, I am very humbled by your help. You are helping me gain an understanding of were my skill set and knowledge should be. I need more experience too. It is my dream to become better.

**kmcarr** · 05-11-2015, 06:28 AM

Originally posted by hlyates View Post

Code:

***** No hits found *****



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT

Length=101
                                                                      Score                                                    E
Sequences producing significant alignments:                          (Bits)  Val                                               ue

gb|AF479582.1|  Boophilus microplus paramyosin mRNA, complete cds       182   6e                                               -43


>gb|AF479582.1| Boophilus microplus paramyosin mRNA, complete cds
Length=2922

 Score =   182 bits (98),  Expect = 6e-43
 Identities = 100/101 (99%), Gaps = 0/101 (0%)
 Strand=Plus/Minus

Query  1     GTTCGCGTTGGAAGCGGCGCACGCGGGTCAAGTTCTGCTGGCTCAGACCCTCCTGTTCGT  60
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct  2512  GTTCGCGTTGGAAGCGGCGCACGCGGGTCAGGTTCTGCTGGCTCAGACCCTCCTGTTCGT  2453

Query  61    TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  101
             |||||||||||||||||||||||||||||||||||||||||
Sbjct  2452  TGAGCTGTCGCTTGTAGACCTTGACCTTCTCGTTGAGCTTC  2412



Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 3617178691534


Query= HISEQ:154:C47MGACXX:8:1101:5762:2190 1:N:0:CGATGT

Length=101

Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?

In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.

**GenoMax** · 05-11-2015, 06:51 AM

Originally posted by hlyates View Post

I think your quote makes sense. Basically, the faFilter will divide sequences that got a hit and others that did not. Yes?

That is correct.

Why is it claiming no hit was found? Since my cutoff value is 1E-42, is it not true that 6E-43 is smaller and hence categorizes as a hit? Am I misunderstanding the output here? I thought it should state ***alignment***?

@kmcarr already addressed this question. See note about e-value below.

3. As an expert, what would you suggest I use to parse the blastn output blast.txt? I did not use tabular output for blastn. I did some googling and there is bioperl and biopython, but would feel comfortable asking an expert such as yourself if there is better one out there? I just want to obtain my sequence, the species it hits with, and e-value. Or, I can write my own python script, but afraid I would be reinventing the wheel?

If you know python then just use whatever code/module that suits your needs. A google search should bring up plenty of options.

By now, hopefully you have read up a bit on the e-value. This comes from Karlin-Altschul statistics (http://www.ncbi.nlm.nih.gov/BLAST/tu...ltschul-1.html) and ultimately depends on the size of the database you are searching against. In general you want to search against the smallest database that makes sense to get most meaningful e-values.

**hlyates** · 05-11-2015, 09:14 AM

Originally posted by kmcarr View Post

In standard text based BLAST outputs like this each report starts with the "Query=" line and ends ends with the "Effective search space…" line.

The "No hits found" is referring to the Query immediately prior to "HISEQ:154:C47MGACXX:8:1101:5855:2180 1:N:0:CGATGT" Look just above the "***** No hits found *****" line and you should see another line that starts with "Query=". That is the query (raw sequence) for which no hit was found.

Fantastic. Many salutations and thanks. Can you please point me to the NCBI docs which talk about this? Dr. Google didn't report a hit back for me. Would like to learn more about the formatting.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to blastx on Drosophila melanogaster (taxid:7227) with raw data?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News