Hello,
I am trying to assemble the genome of Gonioctena intermedia. Flow cytometry estimated its size at approximately 1.6 Gb, and we expect it to be highly polymorphic.
I am now trying to estimate these parameters from the data we received (Illumina HiSeq 2500) with BBTools, using the following command:
Code:
./kmercountexact.sh in1=./R1.fastq in2=./R2.fastq k=${K_PARAM} khist=./${K_PARAM}.khist peaks=./${K_PARAM}.peaks
with K_PARAM values going from 17 to 31.
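Sweeping K_PARAM from 17 to 31 could be scripted roughly as below. This is only a sketch: the step of 2 (odd k values only) is an assumption, since only the range is stated, and the `sweep` function and `DRY_RUN` variable are hypothetical names introduced for illustration. By default the loop only prints the commands.

```shell
# Sketch of the k sweep. Assumption: odd k values only (step of 2).
# Prints each command by default; set DRY_RUN=0 to actually run BBTools.
sweep() {
    k=17
    while [ "$k" -le 31 ]; do
        # Build the same command line as above, with K_PARAM replaced by k.
        set -- ./kmercountexact.sh in1=./R1.fastq in2=./R2.fastq \
            "k=${k}" "khist=./${k}.khist" "peaks=./${k}.peaks"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
        k=$((k + 2))
    done
}

sweep
```

Each run writes its own `.khist` and `.peaks` file, so the results for different k values can be compared side by side.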
First question: I notice that the estimated genome size and the estimated het rate both vary with the value of k. As k increases, the genome size increases and the het rate decreases (see the attached image). So my question is: which k value is the right one for estimating the genome size?
Second question: for different values of k, the estimated ploidy varies between 1 and 2 (see the attached image). I don't understand why this happens. Also, the main peak is estimated at about 63 (again depending on k), and in every run I see the same pattern (for X = 63): a first peak at X/2, a second at X, and then peaks at X*2 and X*3. Is that just a coincidence, or can I conclude something from this observation? I suppose that X corresponds to the homozygous peak and X/2 to the heterozygous peak, but how should X*2 and X*3 be interpreted?
Third question: we expect the genome to be highly polymorphic, and I think this could explain why the ploidy is sometimes estimated as 1. But the program reports a relatively low het rate. Isn't that strange?
I also attach graphs that I obtained for k=17 and k=31, and the output of BBTools for k=29 and k=31 below:
Code:
#k	29
#unique_kmers	7181618041
#main_peak	61
#genome_size	2535937314
#haploid_genome_size	2535937314
#fold_coverage	31
#haploid_fold_coverage	31
#ploidy	1
#percent_repeat	91.476
#start	center	stop	max	volume
14	31	41	13493009	216159759
41	61	102	36090010	856406178
102	123	167	1151778	47402871
167	181	453	287488	30543532
1136	1139	1225	6345	520083
1225	1227	1332	5562	539154
1332	1338	1397	4677	289466
1397	1399	1445	4244	198202
1445	1446	1450	4028	19938
1450	1452	1475	4003	98402
1475	1476	1493	3934	69066
1493	1498	3745	3922	3396667
Code:
#k	31
#unique_kmers	7495157208
#main_peak	61
#genome_size	2619186233
#haploid_genome_size	1309593116
#fold_coverage	30
#haploid_fold_coverage	61
#ploidy	2
#het_rate	0.00574
#percent_repeat	23.602
#start	center	stop	max	volume
14	30	41	14545750	233034146
41	61	101	37606843	883987911
101	121	166	1142508	46978440
166	179	448	285056	29563222
1067	1069	1095	6768	183168
1095	1101	1125	6301	187407
1125	1127	1156	6163	184603
1156	1160	1247	5811	495762
1247	1248	1276	5184	143577
1276	1277	1343	4874	307763
1343	1344	1401	4568	244932
1401	1405	1513	4012	424762
1513	1520	3799	3623	3191287
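To make the X/2, X, X*2, X*3 pattern concrete, the centers of the first four peaks can be divided by the main peak. The sketch below uses the k=31 values from the output above (main peak 61; centers 30, 61, 121, 179). The `peak_ratios` helper is a hypothetical name introduced for illustration, not part of BBTools.

```shell
# Print each peak center and its ratio to the main peak.
# Usage: peak_ratios MAIN_PEAK CENTER [CENTER ...]
peak_ratios() {
    main_peak="$1"
    shift
    for center in "$@"; do
        # awk handles the floating-point division; output is "center<TAB>ratio".
        awk -v c="$center" -v m="$main_peak" \
            'BEGIN { printf "%d\t%.2f\n", c, c / m }'
    done
}

# Values copied from the k=31 peaks output.
peak_ratios 61 30 61 121 179
```

For these values the ratios come out as 0.49, 1.00, 1.98 and 2.93, which is the pattern described in the second question.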
Thank you in advance for the help!