
  • How to handle duplicate data?

    How do you handle duplicate data (e.g. ChIP-seq) generated from two biological replicates?

    Do you map them individually first to get their mapping locations in the genome, convert the results to a format like BED, and then merge the BED files?

    Or do you merge the two FASTQ files first and then map the single merged file?

    Thanks!

  • #2
    You could consider removing those duplicates using SAMTOOLS (rmdup).

    • #3
      Originally posted by kushald
      You could consider removing those duplicates using SAMTOOLS (rmdup).
      I don't think he/she is referring to PCR duplicates, but rather to biological replicates.

      gene_x, could you explain your experiment in a bit more detail? Specifically: how many samples you've sequenced, how many lanes of data you have for each sample, and how many libraries were generated for each sample (if there is more than one lane of data per sample). Then please briefly describe what you're hoping to do (find peaks in both samples, find peaks present in one set of samples vs. another, etc.).

      • #4
        I just realized my latest post asks roughly the same question.

        To Heisman: I want to reanalyze/replicate some ENCODE ChIP-seq data, and I couldn't be sure from their description how they did it. Basically, I want to find peaks in sequencing data that was generated in duplicate.

        • #5
          I see the similarity between this and your other post I just responded to:

          If the same sample was sequenced multiple times with the exact same library, then the only differences in the data come from the sequencing itself. In this case you could align each lane of data separately and give the lanes separate RG IDs but the same LB and SM (library and sample) IDs.

          If the same sample was sequenced multiple times with different libraries (i.e., you prepped the sample twice), then you can do the above, but make sure the LB ID differs in addition to the RG ID.
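
To make the two cases above concrete, here is a minimal sketch that builds SAM @RG header lines. The helper name read_group_line and the values lane1, lane2, libA, libB, and sample1 are made up for illustration, and ILLUMINA is only an assumed platform; none of them come from the thread.

```python
def read_group_line(rg_id, library, sample, platform="ILLUMINA"):
    """Build a SAM @RG header line from ID, LB, SM and PL values."""
    fields = [("ID", rg_id), ("LB", library), ("SM", sample), ("PL", platform)]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Case 1: one library sequenced on two lanes.
# The RG ID differs per lane; LB and SM stay the same.
same_library = [
    read_group_line("lane1", "libA", "sample1"),
    read_group_line("lane2", "libA", "sample1"),
]

# Case 2: the same sample prepped twice (two libraries).
# Both the RG ID and LB differ; SM stays the same.
two_libraries = [
    read_group_line("lane1", "libA", "sample1"),
    read_group_line("lane2", "libB", "sample1"),
]

for line in same_library + two_libraries:
    print(line)
```

How these lines get attached to the alignments depends on the aligner or post-processing tool you use, so check its documentation for the exact option.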

          If you have completely different samples that are true biological replicates, then you probably don't want to merge the raw or aligned data at all. Instead, you'll probably want to call peaks on the two samples separately and then compare the results in some capacity (e.g., using IDR: https://sites.google.com/site/anshul...e/projects/idr)

          Honestly, I'm not experienced enough with peak calling to give great advice, but the above should be solid.
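
As a rough illustration of "compare the results in some capacity", the sketch below keeps peaks from one replicate that overlap a peak in the other. This is a plain interval-overlap check, not IDR, and it ignores peak scores entirely. The function name overlapping_peaks and the example coordinates are made up; peaks are assumed to be (chrom, start, end) tuples with BED-style half-open coordinates.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return the peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple with half-open coordinates,
    so two peaks overlap when each one starts before the other ends.
    """
    shared = []
    for chrom_a, start_a, end_a in peaks_a:
        for chrom_b, start_b, end_b in peaks_b:
            if chrom_a == chrom_b and start_a < end_b and start_b < end_a:
                shared.append((chrom_a, start_a, end_a))
                break  # one overlapping partner is enough
    return shared


# Hypothetical peak calls from two biological replicates.
replicate1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
replicate2 = [("chr1", 150, 250), ("chr2", 300, 400)]

print(overlapping_peaks(replicate1, replicate2))
```

The nested loop is fine for a toy example but slow for real peak sets; a dedicated interval tool would be the more practical choice there.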

          • #6
            What do RG and LB ID stand for, and what are they?

            • #7
              Originally posted by gene_x
              What does RG, LB ID stand for and what are they?
              RG = read group
              LB = library
              SM = sample
              PL = platform
              ID = ID (identification, haha)

              So RG:ID is shorthand for read group ID, for example.
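
To show where these tags live, here is a small sketch that splits one @RG header line from a SAM file into its tags. The function name parse_read_group and the example values are hypothetical; the tab-separated TAG:VALUE layout is the one described in the SAM specification.

```python
def parse_read_group(header_line):
    """Split a SAM @RG header line into a {tag: value} dictionary."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Split on the first colon only, since values may contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


# Hypothetical header line; the values are invented for illustration.
example = "@RG\tID:lane1\tLB:libA\tSM:sample1\tPL:ILLUMINA"
tags = parse_read_group(example)
print(tags["ID"], tags["LB"], tags["SM"], tags["PL"])
```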

              If you're getting into this stuff for the first time and it's not a one-off, I'd glance/read through this: http://samtools.sourceforge.net/SAM1.pdf

              The library matters when removing duplicate reads. If you sequence the same sample with different libraries, you don't want to remove reads that appear as duplicates between different libraries (because they come from different biological template strands). If you sequence the same library multiple times, though, people typically do want to remove reads that appear as duplicates, as they are more likely due to PCR amplification of the same original biological template strand (with some exceptions, particularly if you have high coverage).
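
The library logic above can be sketched as follows. This is a deliberately simplified illustration, not how samtools or any other tool implements duplicate removal: it assumes each read is a (name, library, chrom, pos, strand) tuple and treats reads as duplicates only when library, chromosome, position, and strand all match. The function name mark_duplicates and the example reads are made up.

```python
def mark_duplicates(reads):
    """Flag reads that repeat an earlier read from the same library.

    reads is an iterable of (name, library, chrom, pos, strand) tuples.
    Returns a list of (name, is_duplicate) pairs in input order.
    """
    seen = set()
    flagged = []
    for name, library, chrom, pos, strand in reads:
        # The library is part of the key, so identical positions in
        # different libraries are never treated as duplicates.
        key = (library, chrom, pos, strand)
        flagged.append((name, key in seen))
        seen.add(key)
    return flagged


reads = [
    ("r1", "libA", "chr1", 100, "+"),
    ("r2", "libA", "chr1", 100, "+"),  # same library, same position
    ("r3", "libB", "chr1", 100, "+"),  # different library, same position
]

for name, is_duplicate in mark_duplicates(reads):
    print(name, "duplicate" if is_duplicate else "kept")
```

Here r2 is flagged because it repeats r1 within libA, while r3 is kept because it comes from libB.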

              Read group can be important independent of library if some of the sequencing runs were of bad quality, and because much of the software in the GATK toolset uses or requires RG to be set.
