Hello All,
I'm a beginner in metaproteomics. I have a very large collection of FASTA files (around 100K) that I want to join into a single FASTA file. Note that some of the files contain a very large number of sequences (~2 million, a whole taxonomic family of organisms).
The total size of the sequences is 140 GB.
I have access to a university high-performance computing cluster. I'm wondering whether a simple command like "cat *.fasta > Joined.faa" will work efficiently for this volume of data, or whether I need a better method.
Eventually I want to run CD-HIT on the concatenated sequence file.
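One concern with the plain glob is that, with around 100K files, the shell may refuse to expand "*.fasta" with an "Argument list too long" error. A rough sketch of an alternative I could try is below. It assumes find, sort, and xargs versions that support null-delimited names (-print0, -z, -0), and the function name join_fasta is just a placeholder of mine, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): concatenate every *.fasta file
# under INPUT_DIR into OUTPUT_FILE.
# Usage: join_fasta INPUT_DIR OUTPUT_FILE
join_fasta() {
    in_dir=$1
    out_file=$2
    # find streams the file names to xargs, which calls cat in batches,
    # so the shell's argument-length limit is never reached.
    # sort -z makes the concatenation order reproducible.
    # Null-delimited names keep paths with spaces intact.
    find "$in_dir" -type f -name '*.fasta' -print0 \
        | sort -z \
        | xargs -0 cat > "$out_file"
}
```

I would call it as join_fasta /path/to/fasta_dir Joined.faa. The output name should not end in .fasta if it is written inside the input directory, otherwise find could pick it up as an input. Also, if any input file lacks a trailing newline, cat will glue its last sequence line to the next file's header; replacing "cat" with "awk 1" should add the missing newline, at some cost in speed.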