  • How to handle duplicate data?

    I wonder how you handle duplicate data (e.g. ChIP-seq) generated from two biological replicates.

    Do you map them individually first to get their mapping locations in the genome, convert them to a format such as BED files, and then merge the BED files?

    Or do you merge the two FASTQ files first and then map the single merged FASTQ file?

    Thanks!

  • #2
    You could consider removing those duplicates using SAMTOOLS (rmdup).

    • #3
      Originally posted by kushald:
      You could consider removing those duplicates using SAMTOOLS (rmdup).
      I don't think he/she's referring to PCR duplicates, but rather to biological replicates.

      gene_x, could you explain your experiment in a bit more detail? More specifically, describe how many samples you've sequenced, how many lanes of data you have for each sample, and how many libraries were generated for each sample (if there is more than one lane of data for each sample), and then please briefly describe what you're hoping to do (find peaks in both samples, find peaks present in one set of samples vs. another, etc.).

      • #4
        I just realized my latest post is asking roughly the same question.

        To Heisman: I wanted to reanalyze/replicate some ENCODE ChIP-seq data, and I couldn't be sure how they did it based on their description. Basically, the goal is to find peaks in replicated sequencing data.

        • #5
          I see the similarity between this and your other post I just responded to:

          If the same sample was sequenced multiple times with the exact same library, then the only differences in the data would be due to the sequencing itself. In this case you could align each lane of data separately and give them separate RG IDs but the same LB and SM (library and sample) IDs.

          If the same sample was sequenced multiple times with different libraries (i.e., you prepped the sample twice), then you can do the above, but make sure the LB ID is different in addition to the RG ID.
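
          The two cases above can be sketched as SAM @RG header lines. This is only an illustration: the helper read_group_line and the names sampleA, libA1, libA2 and the lane IDs are hypothetical, and the platform value is an assumption.

```python
def read_group_line(rg_id, sample, library, platform="ILLUMINA"):
    """Build one SAM @RG header line (tab-separated TAG:VALUE fields).

    All IDs passed in here are made-up examples, not real sample names.
    """
    fields = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Case 1: one library sequenced on two lanes -> different ID, same LB and SM.
same_library = [
    read_group_line("sampleA.lane1", "sampleA", "libA1"),
    read_group_line("sampleA.lane2", "sampleA", "libA1"),
]

# Case 2: the sample was prepped twice -> different ID and LB, same SM.
two_libraries = [
    read_group_line("sampleA.lib1", "sampleA", "libA1"),
    read_group_line("sampleA.lib2", "sampleA", "libA2"),
]

for line in same_library + two_libraries:
    print(line)
```

          Each aligned file would carry its own line of this kind, so downstream tools can tell lanes and libraries apart after the files are merged.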

          If you have completely different samples that are true biological replicates, then you probably don't want to merge the raw or aligned data at all; rather, you'll probably want to call peaks on the two samples separately and then compare the results in some way (e.g., using IDR: https://sites.google.com/site/anshul...e/projects/idr).

          Honestly, I'm not experienced enough with peak calling to give great advice, but the above should be solid.
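
          For the biological-replicate case, one very rough way to compare two separately called peak sets is a plain interval overlap. To be clear, this is not IDR, only a naive stand-in for "compare the results"; the function overlapping_peaks and the peak coordinates are hypothetical, and BED-style half-open coordinates are assumed.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple with BED-style half-open
    coordinates, so two peaks overlap when start_a < end_b and start_b < end_a.
    This is a naive overlap check, not the IDR procedure.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, start, end in peaks_a:
        for other_start, other_end in by_chrom.get(chrom, []):
            if start < other_end and other_start < end:
                shared.append((chrom, start, end))
                break  # one overlapping partner is enough
    return shared


# Made-up peaks for two replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep2 = [("chr1", 150, 250), ("chr2", 300, 400)]
print(overlapping_peaks(rep1, rep2))
```

          Unlike IDR, this ignores peak scores and ranks entirely, so it only answers whether a peak was seen in both replicates.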

          • #6
            What do RG and LB ID stand for, and what are they?

            • #7
              Originally posted by gene_x:
              What does RG, LB ID stand for and what are they?
              RG = read group
              LB = library
              SM = sample
              PL = platform
              ID = ID (identification, haha)

              So RG:ID is shorthand for read group ID, for example.
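
              To see how these tags sit together in a file, here is a small sketch that splits one @RG header line into its tags. The helper parse_rg_line and the example values are hypothetical; only the tab-separated TAG:VALUE layout comes from the SAM format.

```python
def parse_rg_line(line):
    """Split a SAM @RG header line into a {tag: value} dictionary."""
    record_type, *fields = line.rstrip("\n").split("\t")
    if record_type != "@RG":
        raise ValueError(f"not an @RG header line: {line!r}")
    # Split on the first colon only, since values may themselves contain colons.
    return dict(field.split(":", 1) for field in fields)


# Made-up header line for illustration.
example = "@RG\tID:sampleA.lane1\tSM:sampleA\tLB:libA1\tPL:ILLUMINA"
rg = parse_rg_line(example)
print(rg["ID"], rg["LB"], rg["SM"], rg["PL"])
```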

              If you're getting into this stuff for the first time and it's not a one-off, I'd glance/read through this: http://samtools.sourceforge.net/SAM1.pdf

              The library matters when removing duplicate reads. If you sequence the same sample with different libraries, you don't want to remove reads that appear as duplicates between different libraries, because they come from different biological template strands. If you sequence the same library multiple times, though, people typically do want to remove reads that appear as duplicates, as they are more likely due to PCR amplification of the same original biological template strand (with some exceptions, particularly if you have high coverage).
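
              The library-aware logic can be sketched in a few lines. This is a simplified illustration, not how samtools or any other tool actually implements duplicate removal: it assumes single-end reads, keeps the first read seen, and uses hypothetical field names and read records.

```python
def mark_duplicates(reads):
    """Flag reads as duplicates only when an earlier read from the SAME
    library has the same chromosome, start position, and strand.

    Each read is a dict with keys: name, library, chrom, pos, strand.
    Returns the set of read names flagged as duplicates.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        # The library is part of the key, so identical positions in
        # different libraries are never treated as duplicates.
        key = (read["library"], read["chrom"], read["pos"], read["strand"])
        if key in seen:
            duplicates.add(read["name"])
        else:
            seen.add(key)
    return duplicates


# Made-up reads: r2 repeats r1 within libA1; r3 is from another library.
reads = [
    {"name": "r1", "library": "libA1", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"name": "r2", "library": "libA1", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"name": "r3", "library": "libA2", "chrom": "chr1", "pos": 1000, "strand": "+"},
    {"name": "r4", "library": "libA1", "chrom": "chr1", "pos": 1000, "strand": "-"},
]
print(mark_duplicates(reads))
```

              Only r2 is flagged here: r3 sits at the same position but comes from a different library, and r4 is on the opposite strand.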

              Read group can be important independently of library, for example if some of the sequencing runs were of bad quality, and because a lot of the software in the GATK toolset uses or requires RG to be set.
