No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Genome size estimation with BBTools: what is the right k value?


    I am trying to assemble the genome of Gonioctena intermedia. Its size was estimated to +/- 1.6 Gb with flow cytometry and we think it should be highly polymorphic.

    I am now trying to estimate these parameters from the data we received (Illumina HiSeq 2500) with BBTools by executing the following command:

    ./ in1=./R1.fastq in2=./R2.fastq k=${K_PARAM} khist=./${K_PARAM}.khist peaks=./${K_PARAM}.peaks
    with K_PARAM values going from 17 to 31.

    First question: I remark that the estimation of the genome size, as well as the estimation of the het rate, varies depending on the value of k. When k increases, the genome size increases and the het rate decreases (see the attached image). So my question is: which is the right k value to estimate the genome size?

    Second question: for different values of k, the ploidy varies between 1 and 2 (see the attached image). I don't understand why it happens. Also, the main peak is estimated to +/- 63 (depends again on k) and for all the executions I see the following pattern (for X = 63): first peak at X/2, the second at X and then at X*2 and X*3. Is that just a coincidence, or can I conclude something from this observation? I suppose that X corresponds to the homozygous peak and X/2 to the heterozygous peak, but what is the interpretation of X*2 and X*3?

    Third question: we expect the genome to be highly polymorphic and I think that the problem related to ploidy sometimes estimated to 1 could be explained by the highly polymorphic content of the genome. But the program indicates that the het rate is relatively low. Isn't that strange?

    I also attach graphs that I obtained for k=17 and k=31, and the output of BBTools for k=29 and k=31 below:

    #k      29
    #unique_kmers   7181618041
    #main_peak      61
    #genome_size    2535937314
    #haploid_genome_size    2535937314
    #fold_coverage  31
    #haploid_fold_coverage  31
    #ploidy 1
    #percent_repeat 91.476
    #start  center  stop    max     volume
    14      31      41      13493009        216159759
    41      61      102     36090010        856406178
    102     123     167     1151778 47402871
    167     181     453     287488  30543532
    1136    1139    1225    6345    520083
    1225    1227    1332    5562    539154
    1332    1338    1397    4677    289466
    1397    1399    1445    4244    198202
    1445    1446    1450    4028    19938
    1450    1452    1475    4003    98402
    1475    1476    1493    3934    69066
    1493    1498    3745    3922    3396667
    #k      31
    #unique_kmers   7495157208
    #main_peak      61
    #genome_size    2619186233
    #haploid_genome_size    1309593116
    #fold_coverage  30
    #haploid_fold_coverage  61
    #ploidy 2
    #het_rate       0.00574
    #percent_repeat 23.602
    #start  center  stop    max     volume
    14      30      41      14545750        233034146
    41      61      101     37606843        883987911
    101     121     166     1142508 46978440
    166     179     448     285056  29563222
    1067    1069    1095    6768    183168
    1095    1101    1125    6301    187407
    1125    1127    1156    6163    184603
    1156    1160    1247    5811    495762
    1247    1248    1276    5184    143577
    1276    1277    1343    4874    307763
    1343    1344    1401    4568    244932
    1401    1405    1513    4012    424762
    1513    1520    3799    3623    3191287
    Thank you in advance for the help!
    Attached Files

  • #2
    The peak-calling and ploidy estimation in KmerCountExact are not very sophisticated, but I certainly expect them to do a better job than that! To my eye this is a very obvious diploid; not sure why the ploidy sometimes is estimated at 1; I'll have to look into that. The X*2 and X*3 peaks are 2-copy and 3-copy repeats, which are very pronounced; looks like the organism is fairly repetitive - they are normally much smaller.

    It does look to me from the picture that the het rate is pretty low, though (assuming that X=63 is indeed the haploid genomic peak).

    So, as for your questions...

    1) None of them are exactly right, but certainly all the kmer lengths that predict ploidy of 1 are wrong! I'd use the estimates from K=31. The model for calculating genome size and het rate assumes all SNPs are at least K bases from each other. As you increase K (or increase the het rate) that starts to become less correct, which is why you can't use a super-long K usefully for these calculations, but K=31 is still pretty short. Also, as you increase K, semi-repetitive sequences are better resolved as being unique (which is why the estimated genome size increases with K), so there's a tradeoff between high and low K.

    2) Answered.

    3) The het rate is based on the assumption that all heterozygous events are SNPs or very short indels. If the polymorphism is more along the lines of rearrangements, the estimate would be pretty incorrect...


    • #3
      Thank you Brian for this complete and very useful answer! So, according to these estimations, the main complexity in assembly of this genome will be its repetitive content rather than its degree of polymorphism. It's a very useful information.


      Latest Articles


      • seqadmin
        Advanced Tools Transforming the Field of Cytogenomics
        by seqadmin

        At the intersection of cytogenetics and genomics lies the exciting field of cytogenomics. It focuses on studying chromosomes at a molecular scale, involving techniques that analyze either the whole genome or particular DNA sequences to examine variations in structure and behavior at the chromosomal or subchromosomal level. By integrating cytogenetic techniques with genomic analysis, researchers can effectively investigate chromosomal abnormalities related to diseases, particularly...
        09-26-2023, 06:26 AM
      • seqadmin
        How RNA-Seq is Transforming Cancer Studies
        by seqadmin

        Cancer research has been transformed through numerous molecular techniques, with RNA sequencing (RNA-seq) playing a crucial role in understanding the complexity of the disease. Maša Ivin, Ph.D., Scientific Writer at Lexogen, and Yvonne Goepel Ph.D., Product Manager at Lexogen, remarked that “The high-throughput nature of RNA-seq allows for rapid profiling and deep exploration of the transcriptome.” They emphasized its indispensable role in cancer research, aiding in biomarker...
        09-07-2023, 11:15 PM





      Topics Statistics Last Post
      Started by seqadmin, 09-29-2023, 09:38 AM
      0 responses
      Last Post seqadmin  
      Started by seqadmin, 09-27-2023, 06:57 AM
      0 responses
      Last Post seqadmin  
      Started by seqadmin, 09-26-2023, 07:53 AM
      1 response
      Last Post seed_phrase_metal_storage  
      Started by seqadmin, 09-25-2023, 07:42 AM
      0 responses
      Last Post seqadmin