Hello, I am looking for help with running a Reciprocal BLAST. I cannot query any of my transcripts using blastdbcmd because I get this error message: “Error: [blastdbcmd] Skipped TRANSCRIPT_4865”. I appreciate any insight that can help resolve this issue!
I am using high quality FASTA files with compiled transcripts of the subspecies I am working with. For context, I ran LORDEC on my transcript files to reduce redundancy. Now, I am working on Reciprocal Blasting to identify comparable contigs between subspecies.
This is the script I am using for the Reciprocal BLAST (macOS):
#Making a database
makeblastdb -in ~/Desktop/tegula_blast/Hq_transcripts_2.fa -dbtype 'nucl' -out ~/Desktop/tegula_blast/Tfunebralis_DB -parse_seqids
#Output message
Building a new DB, current time: 03/06/2024 11:23:46
New DB name: /Users/lanigleason/Desktop/tegula_blast/Tfunebralis_DB
New DB title: /Users/lanigleason/Desktop/tegula_blast/Hq_transcripts_2.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 99388 sequences in 2.75653 seconds.
#Blast for all possible pairwise comparisons
blastn -query ~/Desktop/tegula_blast/Hq_transcripts_1.fa -db ~/Desktop/tegula_blast/Tfunebralis_DB -out ~/Desktop/tegula_blast/eiseni_to_funebralis.txt -evalue 1E-20 -outfmt 6 -max_target_seqs 1
#Output message
Warning: [blastn] Examining 5 or more matches is recommended
(See attached photo for output file)
#Retrieve subset of assembly using blastdbcmd
Here, I took the queried transcripts with matches, from previous blastn output (eiseni_to_funebralis.txt) and made a list of the transcript names (eiseni_names.txt).
An observation I made here is that there are multiple matches listed in the blastn output file. My original fasta file with transcripts does not include multiple copies, so this has to be a product of the blastn command. I am wondering if this is a possible reason for the error in the next step.
blastdbcmd -db ~/Desktop/tegula_blast/Tfunebralis_DB -dbtype 'nucl' -entry_batch ~/Desktop/tegula_blast/eiseni_names.txt -out ~/Desktop/tegula_blast/eiseni_to_funebralis.subset.reciprocal.fasta
#Output message
Error: [blastdbcmd] Skipped transcript_4866
Error: [blastdbcmd] Skipped transcript_9439
Error: [blastdbcmd] Skipped transcript_9439
Error: [blastdbcmd] Skipped transcript_13586
Error: [blastdbcmd] Skipped transcript_17909
Error: [blastdbcmd] Skipped transcript_38053
Error: [blastdbcmd] Skipped transcript_38053
Error: [blastdbcmd] Skipped transcript_22088
Error: [blastdbcmd] Skipped transcript_34418
Error: [blastdbcmd] Skipped transcript_45393
Error: [blastdbcmd] Skipped transcript_45393
Error: [blastdbcmd] Skipped transcript_13587
Error: [blastdbcmd] Skipped transcript_13587
Error: [blastdbcmd] Skipped transcript_13587
etc…
Most of the transcripts are skipped in this step. As seen, all duplicates of the same transcript are skipped too.
I have tried using a smaller subset of transcripts and removing the duplicate matches to run the reciprocal blast but I encounter the same errors.
I also tried looking up database identifiers that I could be missing but I am not sure how to use these identifiers to query with blastdbcmd.
blastdbcmd -db ~/Desktop/tegula_blast/Tfunebralis_DB -dbtype 'nucl' -entry all -out ~/Desktop/tegula_blast/eiseni_to_funebralis.subset.reciprocal.all.fasta -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
#Example output
OID: 0 GI: N/A ACC: transcript_3756 IDENTIFIER: lcl|transcript_3756
OID: 1 GI: N/A ACC: transcript_9791 IDENTIFIER: lcl|transcript_9791
OID: 2 GI: N/A ACC: transcript_7816 IDENTIFIER: lcl|transcript_7816
OID: 3 GI: N/A ACC: transcript_1853 IDENTIFIER: lcl|transcript_1853
Please let me know any ways to reduce the amount of transcripts skipped by blastdbcmd. Any recommended parameter changes and other explanations are welcomed. Thank you!
I am using high quality FASTA files with compiled transcripts of the subspecies I am working with. For context, I ran LORDEC on my transcript files to reduce redundancy. Now, I am working on Reciprocal Blasting to identify comparable contigs between subspecies.
This is the script I am using for the Reciprocal BLAST (macOS):
#Making a database
makeblastdb -in ~/Desktop/tegula_blast/Hq_transcripts_2.fa -dbtype 'nucl' -out ~/Desktop/tegula_blast/Tfunebralis_DB -parse_seqids
#Output message
Building a new DB, current time: 03/06/2024 11:23:46
New DB name: /Users/lanigleason/Desktop/tegula_blast/Tfunebralis_DB
New DB title: /Users/lanigleason/Desktop/tegula_blast/Hq_transcripts_2.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 99388 sequences in 2.75653 seconds.
#Blast for all possible pairwise comparisons
blastn -query ~/Desktop/tegula_blast/Hq_transcripts_1.fa -db ~/Desktop/tegula_blast/Tfunebralis_DB -out ~/Desktop/tegula_blast/eiseni_to_funebralis.txt -evalue 1E-20 -outfmt 6 -max_target_seqs 1
#Output message
Warning: [blastn] Examining 5 or more matches is recommended
(See attached photo for output file)
#Retrieve subset of assembly using blastdbcmd
Here, I took the queried transcripts with matches, from previous blastn output (eiseni_to_funebralis.txt) and made a list of the transcript names (eiseni_names.txt).
An observation I made here is that there are multiple matches listed in the blastn output file. My original fasta file with transcripts does not include multiple copies, so this has to be a product of the blastn command. I am wondering if this is a possible reason for the error in the next step.
blastdbcmd -db ~/Desktop/tegula_blast/Tfunebralis_DB -dbtype 'nucl' -entry_batch ~/Desktop/tegula_blast/eiseni_names.txt -out ~/Desktop/tegula_blast/eiseni_to_funebralis.subset.reciprocal.fasta
#Output message
Error: [blastdbcmd] Skipped transcript_4866
Error: [blastdbcmd] Skipped transcript_9439
Error: [blastdbcmd] Skipped transcript_9439
Error: [blastdbcmd] Skipped transcript_13586
Error: [blastdbcmd] Skipped transcript_17909
Error: [blastdbcmd] Skipped transcript_38053
Error: [blastdbcmd] Skipped transcript_38053
Error: [blastdbcmd] Skipped transcript_22088
Error: [blastdbcmd] Skipped transcript_34418
Error: [blastdbcmd] Skipped transcript_45393
Error: [blastdbcmd] Skipped transcript_45393
Error: [blastdbcmd] Skipped transcript_13587
Error: [blastdbcmd] Skipped transcript_13587
Error: [blastdbcmd] Skipped transcript_13587
etc…
Most of the transcripts are skipped in this step. As seen, all duplicates of the same transcript are skipped too.
I have tried using a smaller subset of transcripts and removing the duplicate matches to run the reciprocal blast but I encounter the same errors.
I also tried looking up database identifiers that I could be missing but I am not sure how to use these identifiers to query with blastdbcmd.
blastdbcmd -db ~/Desktop/tegula_blast/Tfunebralis_DB -dbtype 'nucl' -entry all -out ~/Desktop/tegula_blast/eiseni_to_funebralis.subset.reciprocal.all.fasta -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
#Example output
OID: 0 GI: N/A ACC: transcript_3756 IDENTIFIER: lcl|transcript_3756
OID: 1 GI: N/A ACC: transcript_9791 IDENTIFIER: lcl|transcript_9791
OID: 2 GI: N/A ACC: transcript_7816 IDENTIFIER: lcl|transcript_7816
OID: 3 GI: N/A ACC: transcript_1853 IDENTIFIER: lcl|transcript_1853
Please let me know any ways to reduce the amount of transcripts skipped by blastdbcmd. Any recommended parameter changes and other explanations are welcomed. Thank you!