I am using CD-HIT to reduce redundancy in a dataset of 20405 peptides. CD-HIT seems to work fine, but it identifies only 18404 peptides, as shown in the output below.
I am sure that the FASTA file is correctly formatted, in the form:
>header
SEQUENCE
Also, the grep command shown further below returns the expected number of peptides (20405).
Does anybody know any way to fix this?
Code:
Program: CD-HIT, V4.8.1 (+OpenMP), Nov 13 2019, 13:22:53
Command: cd-hit -i BD_final_con_nombres.fasta -o BDpos.fa -c 0.9 -g 1 -T 0 -M 0 -n 5
Started: Mon Dec 16 16:51:57 2019
================================================================
Output
----------------------------------------------------------------
total number of CPUs in the system is 12
Actual number of CPUs to be used: 12
total seq: 18404
longest and shortest : 300 and 11
Total letters: 737624
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 3M
Buffer : 12 X 10M = 129M
Table : 2 X 65M = 131M
Miscellaneous : 0M
Total : 263M
Table limit with the given memory limit:
Max number of representatives: 744016
Max number of word counting entries: 14908239
# comparing sequences from 0 to 1314
.---------- new table with 840 representatives
# comparing sequences from 1314 to 2534
---------- 994 remaining sequences to the next cycle
---------- new table with 187 representatives
# comparing sequences from 1540 to 2744
---------- 1023 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 1721 to 2912
---------- 1010 remaining sequences to the next cycle
---------- new table with 110 representatives
# comparing sequences from 1902 to 3080
---------- 996 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2084 to 3249
---------- 962 remaining sequences to the next cycle
---------- new table with 123 representatives
# comparing sequences from 2287 to 3438
---------- 953 remaining sequences to the next cycle
---------- new table with 116 representatives
# comparing sequences from 2485 to 3622
---------- 958 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 2664 to 3788
---------- 935 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2853 to 3963
---------- 932 remaining sequences to the next cycle
---------- new table with 124 representatives
# comparing sequences from 3031 to 4129
---------- 891 remaining sequences to the next cycle
---------- new table with 113 representatives
# comparing sequences from 3238 to 4321
---------- 700 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 3621 to 4676
---------- 844 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 3832 to 4872
---------- 822 remaining sequences to the next cycle
---------- new table with 154 representatives
# comparing sequences from 4050 to 5075
---------- 760 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 4315 to 5321
---------- 768 remaining sequences to the next cycle
---------- new table with 138 representatives
# comparing sequences from 4553 to 5542
---------- 737 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 4805 to 5776
---------- 727 remaining sequences to the next cycle
---------- new table with 111 representatives
# comparing sequences from 5049 to 6002
---------- 707 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5295 to 6231
---------- 651 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 5580 to 6496
---------- 629 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5867 to 6762
---------- 563 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 6199 to 7070
---------- 585 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 6485 to 7336
---------- 521 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 6815 to 7642
---------- 545 remaining sequences to the next cycle
---------- new table with 116 representatives
# comparing sequences from 7097 to 7904
---------- 514 remaining sequences to the next cycle
---------- new table with 127 representatives
# comparing sequences from 7390 to 8176
---------- 550 remaining sequences to the next cycle
---------- new table with 110 representatives
# comparing sequences from 7626 to 8395
---------- 551 remaining sequences to the next cycle
---------- new table with 123 representatives
# comparing sequences from 7844 to 8598
---------- 529 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 8069 to 8807
---------- 465 remaining sequences to the next cycle
---------- new table with 139 representatives
# comparing sequences from 8342 to 9060
---------- 438 remaining sequences to the next cycle
---------- new table with 140 representatives
# comparing sequences from 8622 to 9320
---------- 431 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 8889 to 9568
---------- 392 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 9176 to 9835
---------- 377 remaining sequences to the next cycle
---------- new table with 114 representatives
# comparing sequences from 9458 to 10097
---------- 364 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 9733 to 10352
---------- 373 remaining sequences to the next cycle
---------- new table with 122 representatives
# comparing sequences from 9979 to 10580
.......... 10000 finished 5044 clusters
---------- 326 remaining sequences to the next cycle
---------- new table with 113 representatives
# comparing sequences from 10254 to 10836
---------- 296 remaining sequences to the next cycle
---------- new table with 124 representatives
# comparing sequences from 10540 to 11101
---------- 285 remaining sequences to the next cycle
---------- new table with 107 representatives
# comparing sequences from 10816 to 11358
---------- 260 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 11098 to 11619
---------- 245 remaining sequences to the next cycle
---------- new table with 130 representatives
# comparing sequences from 11374 to 11876
---------- 277 remaining sequences to the next cycle
---------- new table with 157 representatives
# comparing sequences from 11599 to 12085
---------- 246 remaining sequences to the next cycle
---------- new table with 146 representatives
# comparing sequences from 11839 to 12307
---------- 223 remaining sequences to the next cycle
---------- new table with 146 representatives
# comparing sequences from 12084 to 12535
---------- 225 remaining sequences to the next cycle
---------- new table with 128 representatives
# comparing sequences from 12310 to 12745
---------- 225 remaining sequences to the next cycle
---------- new table with 117 representatives
# comparing sequences from 12520 to 12940
---------- 184 remaining sequences to the next cycle
---------- new table with 108 representatives
# comparing sequences from 12756 to 13159
---------- 190 remaining sequences to the next cycle
---------- new table with 131 representatives
# comparing sequences from 12969 to 13357
---------- 180 remaining sequences to the next cycle
---------- new table with 122 representatives
# comparing sequences from 13177 to 13550
---------- 154 remaining sequences to the next cycle
---------- new table with 129 representatives
# comparing sequences from 13396 to 13753
---------- 167 remaining sequences to the next cycle
---------- new table with 102 representatives
# comparing sequences from 13586 to 13930
---------- 149 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 13781 to 14111
---------- 143 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 13968 to 14284
---------- 99 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14185 to 14486
---------- 112 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14374 to 14661
---------- 69 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 14592 to 14864
---------- 78 remaining sequences to the next cycle
---------- new table with 118 representatives
# comparing sequences from 14786 to 15044
---------- 76 remaining sequences to the next cycle
---------- new table with 115 representatives
# comparing sequences from 14968 to 15213
---------- 72 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15141 to 15374
---------- 51 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15323 to 15543
---------- 53 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15490 to 15698
---------- 9 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 15689 to 15882
....................---------- new table with 89 representatives
# comparing sequences from 15882 to 16062
---------- 1 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16061 to 16228
---------- 2 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16226 to 16381
---------- 11 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 16370 to 16515
..................---------- new table with 90 representatives
# comparing sequences from 16515 to 16649
...................---------- new table with 77 representatives
# comparing sequences from 16649 to 16774
..................---------- new table with 73 representatives
# comparing sequences from 16774 to 16890
...................---------- new table with 57 representatives
# comparing sequences from 16890 to 16998
..................---------- new table with 56 representatives
# comparing sequences from 16998 to 17098
..................---------- new table with 59 representatives
# comparing sequences from 17098 to 17191
...................---------- new table with 63 representatives
# comparing sequences from 17191 to 17277
.................---------- new table with 47 representatives
# comparing sequences from 17277 to 17357
................---------- new table with 49 representatives
# comparing sequences from 17357 to 17431
..................---------- new table with 42 representatives
# comparing sequences from 17431 to 18404
.....................---------- new table with 536 representatives
18404 finished 9584 clusters
Approximated maximum memory consumption: 265M
writing new database
writing clustering information
program completed !
Code:
grep -c '>' BD_final_con_nombres.fasta
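The grep count above only counts header lines; it says nothing about sequence lengths. Below is a minimal Python sketch for cross-checking the input file. It rests on an assumption that still needs verifying: that the 2001 missing records (20405 minus 18404) might be peptides at or below CD-HIT's throw-away length (the -l option, documented with a default of 10), which would be consistent with the log reporting a shortest sequence of 11. The function names and the throw_away_length parameter are illustrative and not part of CD-HIT.

```python
import sys


def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)  # sequences may span several lines
    if header is not None:
        yield header, length


def summarize(lines, throw_away_length=10):
    """Return (total_records, records_with_length_at_most_throw_away_length).

    throw_away_length=10 is an assumption mirroring the documented default
    of cd-hit's -l option; change it if a different value was used.
    """
    total = short = 0
    for _header, length in read_fasta_lengths(lines):
        total += 1
        if length <= throw_away_length:
            short += 1
    return total, short


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_short.py BD_final_con_nombres.fasta
    with open(sys.argv[1]) as handle:
        total, short = summarize(handle)
    print(f"total records: {total}")
    print(f"records with length <= 10: {short}")
    print(f"records longer than 10: {total - short}")
```

If the number of short records matches the 2001 missing sequences, the discrepancy would most likely come from the length filter rather than from the file format. In that case, rerunning cd-hit with a smaller -l value may be worth trying, subject to whatever limits the chosen word size (-n) imposes.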