  • CpG sites in Bismark

    Hello,

    I'm trying to calculate the percentage of CpG sites covered in my RRBS library and compare it with the total number of CpG sites in the reference genome. I got the splitting report from Bismark (see below).

    q1- Can I say that the number of CpG sites in my RRBS library equals "Total methylated C's in CpG context" + "Total C to T conversions in CpG context" (around 19 million)? If not, how can I find the total number of CpG sites in the RRBS library?

    q2- I downloaded the pig CGI annotation and counted all CpG sites, but the total was only around 2 million, which sounds very low to me. How can I find the actual number of CpG sites in the reference genome?

    q3- Is there a way to determine the number of CpG sites covered per chromosome and compare it with the number of CpG sites on each chromosome of the reference genome?


    Final Cytosine Methylation Report
    =================================
    Total number of C's analysed: 141645338

    Total methylated C's in CpG context: 7904886
    Total methylated C's in CHG context: 50683
    Total methylated C's in CHH context: 107717

    Total C to T conversions in CpG context: 12298571
    Total C to T conversions in CHG context: 35912924
    Total C to T conversions in CHH context: 85370557

  • #2
    Hi Hedi,

    q1- Can I say that the number of CpG sites in my RRBS library equals "Total methylated C's in CpG context" + "Total C to T conversions in CpG context" (around 19 million)? If not, how can I find the total number of CpG sites in the RRBS library?
    No, I'm afraid you can't say that. The numbers reported are the overall numbers of methylation calls performed for the entire run and have nothing to do with the number of genomic positions covered. If you want to find out how many Cs were covered in your experiment, you can generate a coverage file in which each line corresponds to one covered C position. The number of lines in that file (zcat file.cov.gz | wc -l) is then the number of positions covered in your experiment.
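
As a rough sketch of the line-counting idea above, the same count can be taken in Python and broken down per chromosome at the same time, which also touches on q3. This assumes the usual tab-separated Bismark coverage layout (chromosome, start, end, % methylation, count methylated, count unmethylated) and a file that was produced for CpG context only; the function name `count_covered_positions` is made up for illustration.

```python
import gzip
from collections import Counter


def count_covered_positions(cov_path):
    """Count covered positions in a coverage file, overall and per chromosome.

    Assumes one covered cytosine per line, with the chromosome name in the
    first tab-separated column (the usual Bismark .cov layout).
    """
    per_chrom = Counter()
    # Transparently handle both gzipped and plain-text files.
    opener = gzip.open if str(cov_path).endswith(".gz") else open
    with opener(cov_path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            chrom = line.split("\t", 1)[0]
            per_chrom[chrom] += 1
    return sum(per_chrom.values()), per_chrom
```

Calling `count_covered_positions("file.cov.gz")` returns the total number of lines (the same number `zcat file.cov.gz | wc -l` gives for a file without blank lines) together with a per-chromosome breakdown.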

    q2- I downloaded the pig CGI annotation and counted all CpG sites, but the total was only around 2 million, which sounds very low to me. How can I find the actual number of CpG sites in the reference genome?
    You could use bam2nuc (part of Bismark) to find out the number of Cs, or CpGs, in the genome. Here is the output for the Sscrofa11.1 build (genome-wide):

    Code:
    A       717891230
    AA      237125812
    AC      124343360
    AG      171421615
    AT      185000140
    C       517402066
    CA      178358877
    CC      136906913
    CG      30619972
    CT      171516061
    G       517706165
    GA      147162051
    GC      108922386
    GG      136983938
    GT      124637555
    T       719048243
    TA      155244114
    TC      147229152
    TG      178680414
    TT      237894187
    CGIs are only a small, albeit CG-rich, fraction of the genome, so 2M doesn’t sound too bad.
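
To turn a covered-position count into a percentage of genomic CpGs, something like the following sketch could be used. It rests on two assumptions that are not stated in the thread: that the CG count above is a single-strand dinucleotide count (the A and T totals differ, which suggests only one strand was counted), and that each CG dinucleotide therefore contributes two cytosines, one per strand, unless the coverage file has had both strands merged into one entry per CpG. The function name and the `strands_merged` switch are hypothetical.

```python
# CG dinucleotide count from the bam2nuc output above (Sscrofa11.1, genome-wide).
GENOME_CG = 30_619_972


def percent_cpg_covered(covered, strands_merged=False, genome_cg=GENOME_CG):
    """Percentage of genomic CpG positions that were covered.

    covered: number of covered CpG positions (lines in a CpG coverage file).
    strands_merged: False if the top- and bottom-strand cytosines of a CpG
        are listed as separate positions (assumed here to be the unmerged
        case), True if they were merged into one entry per CpG dinucleotide.
    """
    # Unmerged files can contain up to two cytosines per CG dinucleotide.
    denominator = genome_cg if strands_merged else 2 * genome_cg
    return 100.0 * covered / denominator
```

For scale, if the roughly 2 million CpGs counted in the CGI annotation are CG dinucleotides, they correspond to about 6.5% of the 30.6 million CG dinucleotides in the genome.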


    q3- Is there a way to determine the number of CpG sites covered per chromosome and compare it with the number of CpG sites on each chromosome of the reference genome?
    I would suggest you use SeqMonk for this kind of work. Keep in mind, though, that RRBS is only expected to cover ~1-2% of the genome at very specific positions, so the number of CpGs covered per chromosome is almost certainly not a metric you should be interested in.


    • #3
      Thank you for your advice and help. In methylKit you can also get coverage using the following command, but I'm wondering whether it reports CpG coverage or read coverage; both terms are used in their tutorial (https://www.bioconductor.org/package...ics_on_samples). Is it different from the way you suggested for calculating CpG coverage?

      getCoverageStats(my.methRaw[[1]],plot = F,both.strands = FALSE)
      read coverage statistics per base
      summary:
       Min. 1st Qu. Median  Mean 3rd Qu.      Max.
      10.00   12.00  15.00 28.25   20.00 131376.00

      thanks again
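
The summary above is labelled "read coverage statistics per base", which reads as the distribution of read depth across covered bases rather than a count of covered sites. The two quantities can be told apart with a small sketch like the one below, written in Python rather than R and working on a Bismark-style coverage file instead of a methylKit object. It assumes columns 5 and 6 hold the methylated and unmethylated counts; `depth_summary` and its `min_depth` parameter are hypothetical, the latter mimicking a minimum-coverage filter such as the one the minimum of 10 in the summary hints at.

```python
import gzip
import statistics


def depth_summary(cov_path, min_depth=1):
    """Return (number of covered sites, min, median, mean, max read depth).

    Read depth per site is taken as methylated + unmethylated counts
    (columns 5 and 6 of a Bismark-style coverage file).
    """
    depths = []
    opener = gzip.open if str(cov_path).endswith(".gz") else open
    with opener(cov_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            depth = int(fields[4]) + int(fields[5])
            if depth >= min_depth:
                depths.append(depth)
    if not depths:
        return 0, None, None, None, None
    return (len(depths), min(depths), statistics.median(depths),
            statistics.mean(depths), max(depths))
```

The first returned value is the number of covered sites (comparable to the line count of the coverage file after filtering), while the remaining values describe how many reads sit on each of those sites.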
