Hi there,
I'm hoping I can get some help from this forum, as my programming skills aren't up to par to tackle this.
I am comparing one (bacterial) genome against a custom database of closely related strains, to then look at common point mutations/adaptations in homologous proteins, which should reflect where the isolates came from (this is to give context; I hope it makes sense). In general, the rough workflow is:
1. blastp my genome against the custom database and get the top 5 hits
2. compare parameters such as GC content, arginine content, hydrophobic residues, etc. between an ENTIRE protein in my genome of interest and its top 5 matches.
3. Calculate which proteins are significantly different from their matches and in what way.
I have the ability and scripts to do all these steps separately when running on a test file of just 20 genes. However, my genome has 4000+ genes, so this simply must be automated.
To do this, it seems critical that I be able to:
a) bin my top 5 hits with the query sequence
b) output the fasta files in a manner that can allow for downstream analysis and especially computation of these.
To accomplish this, I think an output file of
>Query_protein1_full sequence
agagagagaga
>Hit1_protein1_full sequence
agagagag
>Hit2_protein1_full sequence
agagagag
>Hit3_protein1_full sequence
agagagag
>Hit4_protein1_full sequence
agagagag
>Hit5_protein1_full sequence
agagagag
>Query_protein2_full sequence
agagagag
>Hit1_protein2_full sequence
agagagag
etc. would work. Does anyone have any ideas on how I could script this? It is above my programming skills, though I am learning.
Cheers
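One way the binning step could be scripted is sketched below in plain Python. It is a rough sketch under several assumptions, not a tested pipeline: it assumes blastp was run with tabular output (`-outfmt 6`), where the first two columns are the query ID and the subject ID, and that those IDs match the first word of the FASTA headers in the query and database files. All function names and the header layout (`Query_<id>`, `Hit<rank>_<query id>_<subject id>`) are made up for illustration.

```python
def parse_fasta(lines):
    """Return {sequence_id: full_sequence} from an iterable of FASTA lines.

    The ID is taken to be the first word after '>' (an assumption).
    """
    records = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def top_hits(blast_lines, n=5):
    """Return {query_id: [subject_id, ...]} with at most n distinct subjects.

    Relies on the tabular output listing hits for each query best-first,
    so the first n distinct subjects seen are kept.
    """
    hits = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, subject_id = fields[0], fields[1]
        kept = hits.setdefault(query_id, [])
        # A subject can appear several times (multiple HSPs); keep it once.
        if subject_id not in kept and len(kept) < n:
            kept.append(subject_id)
    return hits


def group_records(queries, subjects, hits):
    """Return a list of (header, sequence): each query followed by its hits."""
    grouped = []
    for query_id, query_seq in queries.items():
        grouped.append((f"Query_{query_id}", query_seq))
        for rank, subject_id in enumerate(hits.get(query_id, []), start=1):
            if subject_id in subjects:  # skip IDs missing from the database FASTA
                header = f"Hit{rank}_{query_id}_{subject_id}"
                grouped.append((header, subjects[subject_id]))
    return grouped


def build_grouped_fasta(query_path, db_path, blast_path, out_path, n=5):
    """Read the three input files and write the grouped FASTA file."""
    with open(query_path) as handle:
        queries = parse_fasta(handle)
    with open(db_path) as handle:
        subjects = parse_fasta(handle)
    with open(blast_path) as handle:
        hits = top_hits(handle, n=n)
    with open(out_path, "w") as out:
        for header, seq in group_records(queries, subjects, hits):
            out.write(f">{header}\n{seq}\n")
```

With hypothetical file names, this would be called as `build_grouped_fasta("my_genome.faa", "custom_db.faa", "hits.tsv", "grouped.fasta")`. Because each query is written immediately before its hits, a downstream script can read the output in blocks that start at every `Query_` header. For a very large database it may be better to index the FASTA file rather than load it all into memory.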