I'd like to calculate the tetranucleotide frequencies of my contigs and then bin them based on shared tetranucleotide frequencies. Can anyone recommend a program or script to calculate this, and possibly tools for binning?
My data is a low-diversity environmental metagenome. It consists of a 150 bp paired-end Illumina library generated on a HiSeq 2500 and assembled in MIRA.
What I understand so far:
The occurrence of four-base sequences (e.g. ATAT, GGCA, TAAC) is not random within a genome; each organism is biased toward certain tetranucleotides. By calculating the frequency of each of the 256 possible tetranucleotides, one can group contigs that likely came from the same organism, even though the assembler failed to join them. Once each bin is written to its own file, a repeat assembly run on that bin may be able to join these contigs into longer contigs than those produced in the original run.
So far, the only publicly available script I've found is this one; however, I don't know if I can install it on the computer I'm running my computations on:
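For reference, the calculation described above can be sketched in a few lines of plain Python with no third-party packages. This is only a minimal illustration, not the script or the method from the papers cited below. The function names (`tetranucleotide_frequencies`, `euclidean_distance`, `read_fasta`) are made up for this example. Counting both strands and skipping windows that contain N are assumptions about how one might want to do it. The Euclidean distance is just a simple stand-in for a real binning step.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# All 4^4 = 256 possible tetranucleotides, in a fixed order.
ALL_TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def reverse_complement(seq):
    """Return the reverse complement; non-ACGT characters are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def tetranucleotide_frequencies(seq):
    """Return {tetramer: relative frequency}, counted over both strands."""
    seq = seq.upper()
    counts = dict.fromkeys(ALL_TETRAMERS, 0)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:  # skips windows with N or other ambiguity codes
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in ALL_TETRAMERS}
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(freqs_a, freqs_b):
    """Distance between two profiles; smaller means more similar composition."""
    return sqrt(sum((freqs_a[k] - freqs_b[k]) ** 2 for k in ALL_TETRAMERS))


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

To use it, loop over `read_fasta("contigs.fasta")` (a hypothetical file name), build one profile per contig, and pass the 256-value profiles to whatever clustering tool you prefer. Short contigs give noisy profiles, so it is probably worth filtering by a minimum length before binning.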
I've been using the following sources for guidance:
Dick, G. J., Andersson, A. F., Baker, B. J., Simmons, S. L., Thomas, B. C., Yelton, A. P., & Banfield, J. F. (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10(8), R85. doi:10.1186/gb-2009-10-8-r85
Lesniewski, R. A., Jain, S., Anantharaman, K., Schloss, P. D., & Dick, G. J. (2012). The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. The ISME Journal, 6(12), 2257–68. doi:10.1038/ismej.2012.63