
  • Unusually high duplicated Reads in Mate Pair Library

    Hi,

    Recently we received one lane of HiSeq 8 kb mate-pair data with 200 million 100 bp reads. The data is intended for de novo assembly/scaffolding of a ~800 Mb genome. An initial FastQC assessment indicates good data quality, except for an UNUSUALLY high level of duplicated reads: 96.77%! Please find attached the relevant FastQC images.

    Searching for relevant posts turned up other reported duplication levels as high as 80-85%, which could be due to PCR bias. The sequencing service provider assured us that this level of duplication is common for Illumina mate-pair libraries. When we used this data with our existing data (Illumina: 2 lanes of 400 bp PE + 1 lane of 700 bp PE), we got either worse results than before (using CLC) or just minimal improvements (using SOAPdenovo) in terms of N50, number of contigs/scaffolds, etc.
    Now we wonder:
    Is it common to have such high duplication level?
    Do we need to discard duplicated reads? If yes, what are the best tools? (rmdup? picard?)
    And finally, what strategy would improve the assembly with the data we have?

    Thanks for your advice.

    Cheers.
    Attached Files
    Last edited by fahmida; 05-21-2013, 05:14 PM. Reason: wrong title

  • #2
    I'm not sure whether I read this right, but it seems that _most_ of your reads appear much more than 10 times. Hence, once you remove all the duplicates, you will get down from 200M reads to maybe 10M unique ones, and this will surely be too little to assemble your genome. Also, a whopping 12% of the reads map to the adapter (and if I understand correctly, this means that you have been sequencing primer dimers rather than your genome).

    So, your sequencing provider needs to come up with a better excuse than claiming that this would be "common".

    Comment


    • #3
      It is common for mate pairs. You are *supposed* to get adapters (mate-pair linkers) due to the library prep. You need to pre-process the reads with something like http://genomes.sdsc.edu/downloads/deloxer/ before using them for assembly.
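
To make the pre-processing step concrete, here is a minimal sketch of the general idea of cutting a read at the linker. It is not DeLoxer's algorithm: it only does exact string matching, and the `LINKER` sequence, the `trim_at_linker` name and the `min_len` cutoff are placeholders made up for illustration. Substitute the junction sequence documented for your own library kit.

```python
# Hypothetical linker sequence, used only for illustration.
# Replace it with the junction/linker sequence of your mate-pair kit.
LINKER = "ACGTTGCAACGT"


def trim_at_linker(read, linker=LINKER, min_len=20):
    """Keep the part of `read` that precedes the linker.

    Reads without the linker are returned unchanged. If the part before
    the linker is shorter than `min_len`, None is returned so the caller
    can drop the read (and, for paired data, its mate).
    """
    idx = read.find(linker)
    if idx == -1:
        return read
    kept = read[:idx]
    return kept if len(kept) >= min_len else None


if __name__ == "__main__":
    example = "A" * 30 + LINKER + "C" * 10
    print(trim_at_linker(example))
```

A real tool also has to tolerate sequencing errors in the linker and partial linkers at the read end, which exact matching does not handle.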

      Comment


      • #4
        Originally posted by fahmida
        Is it common to have such high duplication level?
        Do we need to discard duplicated reads?
        Mate pair libraries are naturally very low diversity, and the larger the initial fragmentation, the lower the final library diversity. For an 8 kbp library I am not terribly surprised by the duplication level you have observed. You have simply reached the saturation depth of this library. It is not common to sequence an entire HiSeq lane for one mate pair library, as you do not need deep coverage from your mate pairs; they are only needed to scaffold contigs built from your deep, paired-end coverage.
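
A rough back-of-the-envelope sketch of why deep mate-pair coverage is not needed. The inputs are assumptions, not measurements: about 200 million pairs (the figure the bowtie2 summary later in this thread reports), the 96.77% FastQC figure treated as the fraction of pairs lost to deduplication (FastQC versions differ in how they define it, and the pair-level rate should be lower), 8 kb inserts and an 800 Mb genome. The function name is made up for illustration.

```python
def mate_pair_coverage(n_pairs, dup_rate, read_len, insert_size, genome_size):
    """Estimate coverage left after removing duplicates.

    Returns (unique_pairs, sequence_coverage, physical_coverage), where
    sequence coverage counts sequenced bases and physical coverage counts
    the genomic span bridged by each pair.
    """
    unique_pairs = n_pairs * (1.0 - dup_rate)
    sequence_cov = unique_pairs * 2 * read_len / genome_size
    physical_cov = unique_pairs * insert_size / genome_size
    return unique_pairs, sequence_cov, physical_cov


if __name__ == "__main__":
    unique_pairs, seq_cov, phys_cov = mate_pair_coverage(
        n_pairs=200_000_000,      # assumed number of read pairs
        dup_rate=0.9677,          # assumed to apply to pairs (pessimistic)
        read_len=100,
        insert_size=8_000,
        genome_size=800_000_000,
    )
    print(f"unique pairs:      {unique_pairs:,.0f}")
    print(f"sequence coverage: {seq_cov:.1f}x")
    print(f"physical coverage: {phys_cov:.1f}x")
```

Under these assumptions only a few million pairs survive, giving sequence coverage of under 2x, but each surviving pair still spans 8 kb, so the span (physical) coverage available for scaffolding stays in the tens.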

        You should also be aware that FastQC is only considering one read of the pair in calculating the duplication rate. When you perform a proper duplicate analysis which considers both members of the read pair the duplication rate will drop.

        Yes, you should remove duplicates. I normally use picard tools.
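
A toy illustration of why the rate drops once both mates are considered. This is only a sequence-based sketch on made-up reads; FastQC and Picard differ in detail (Picard, for instance, works on alignment positions rather than raw sequences), and `duplication_rate` is a helper name invented here.

```python
def duplication_rate(keys):
    """Fraction of items that are extra copies of an already-seen key."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


# Made-up read pairs: (read 1, read 2).
pairs = [
    ("AAAA", "CCCC"),
    ("AAAA", "GGGG"),  # same read 1 as above, different read 2
    ("AAAA", "CCCC"),  # true duplicate of the first pair
    ("TTTT", "CCCC"),
]

read1_only = duplication_rate(r1 for r1, _ in pairs)
both_reads = duplication_rate(pairs)

if __name__ == "__main__":
    print(f"read 1 only: {read1_only:.2f}")
    print(f"both reads:  {both_reads:.2f}")
```

Here three of the four pairs share the same read 1, but only one pair is a duplicate once read 2 is taken into account, so the pair-level rate is half the single-read rate.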

        Originally posted by Simon Anders
        Also, a whopping 12% of the reads map to the adapter...
        Simon, FastQC reports the percentage of the contaminating sequence so it is 0.1164%, or 0.001164 as a fraction.
        Last edited by kmcarr; 05-22-2013, 04:56 AM. Reason: Added comment about paired end duplicates.

        Comment


        • #5
          Okay, then better ignore my post. Seems I know much less about mate-pair libraries than I thought. ;-)

          Comment


          • #6
            Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it'll give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3 kb and 5 kb mate pairs in one lane.

            P.S. Got the MarkDuplicates result, attached here.
            Attached Files

            Comment


            • #7
              Originally posted by fahmida
              Thanks for your comments and suggestions, Simon, kopi-o and kmcarr. I am in the middle of running Picard's MarkDuplicates; hopefully it'll give a realistic estimate of the actual duplication level. Also, if possible, in our next HiSeq run I am planning to have 3 kb and 5 kb mate pairs in one lane.

              P.S. Got the MarkDuplicates result, attached here.
              fahmida,

              The stats you provided show only ~1% of the read pairs were mapped. Why so low?

              Comment


              • #8
                Originally posted by kmcarr
                fahmida,

                The stats you provided show only ~1% of the read pairs were mapped. Why so low?
                I am also puzzled by that and am trying to find an explanation! Using Bowtie's default parameters, the mate-pair reads were mapped to ~500,000 contigs generated from the first round of assembly (using the 3 paired-end lanes).

                bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

                Could it be due to the fragmented nature of the contigs, or to reads having only partial matches?

                Comment


                • #9
                  Originally posted by fahmida
                  I am also puzzled by that and am trying to find an explanation! Using Bowtie's default parameters, the mate-pair reads were mapped to ~500,000 contigs generated from the first round of assembly (using the 3 paired-end lanes).

                  bowtie -t -S -p 20 --chunkmbs 50000 --un unaligned_8kbMatePair_reads.fastq 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq aln-pe.sam

                  Could it be due to the fragmented nature of the contigs, or to reads having only partial matches?
                  Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?

                  Comment


                  • #10
                    Originally posted by Wallysb01
                    Are your reads reverse-forward still, as is typical of mate-pair seqs? Should you add --rf as an option?
                    Thanks for pointing that out. I've repeated the alignment, this time with bowtie2 and the following parameters:
                    bowtie2 -t -p 20 -N 1 -I 4000 -X 9000 --rf --un unaligned_8kbMatePair_reads.fastq -x 741_QFABtrim_denovo -1 M-Int741_1.fastq -2 M-Int741_2.fastq -S bowtie2.aln.sam

                    And got the following output:

                    200340177 reads; of these:
                      200340177 (100.00%) were paired; of these:
                        194552274 (97.11%) aligned concordantly 0 times
                        5639453 (2.81%) aligned concordantly exactly 1 time
                        148450 (0.07%) aligned concordantly >1 times
                        ----
                        194552274 pairs aligned concordantly 0 times; of these:
                          33711784 (17.33%) aligned discordantly 1 time
                        ----
                        160840490 pairs aligned 0 times concordantly or discordantly; of these:
                          321680980 mates make up the pairs; of these:
                            133741497 (41.58%) aligned 0 times
                            82899936 (25.77%) aligned exactly 1 time
                            105039547 (32.65%) aligned >1 times
                    66.62% overall alignment rate
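
For readers trying to reconcile the 66.62% with the much smaller concordant fraction, the sketch below recomputes the overall rate from the counts in the summary above. The overall rate counts individual mates, so every read that aligns anywhere contributes, whether or not its pair is concordant. The variable and function names are made up for illustration.

```python
# Counts copied from the bowtie2 summary above.
total_pairs = 200_340_177
concordant_once = 5_639_453
concordant_multi = 148_450
discordant_once = 33_711_784
mates_aligned_once = 82_899_936    # from pairs with no paired alignment
mates_aligned_multi = 105_039_547  # from pairs with no paired alignment


def overall_alignment_rate():
    """Fraction of individual mates that aligned at least once."""
    aligned_pairs = concordant_once + concordant_multi + discordant_once
    aligned_mates = 2 * aligned_pairs + mates_aligned_once + mates_aligned_multi
    return aligned_mates / (2 * total_pairs)


def concordant_rate():
    """Fraction of pairs aligned with the expected orientation and distance."""
    return (concordant_once + concordant_multi) / total_pairs


if __name__ == "__main__":
    print(f"overall alignment rate: {overall_alignment_rate():.2%}")
    print(f"concordant pairs:       {concordant_rate():.2%}")
```

So about two thirds of the individual reads find a home in the contigs, while fewer than 3% of pairs satisfy the `--rf -I 4000 -X 9000` constraints.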

                    Comment


                    • #11
                      Hmm, I guess the discordant alignments are just the regular PE reads that come along as contamination with the mate-pair prep. Was this also after you trimmed adapter sequences?
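
One way to test that guess is to look at the relative orientation of the discordant pairs: genuine mate pairs should point outward (RF) over a large span, while paired-end contamination should point inward (FR) over a short one. The sketch below classifies a single pair from values you would pull out of the SAM records yourself; the function name and argument layout are assumptions for illustration, and it only makes sense for mates aligned to the same contig.

```python
def pair_orientation(pos1, is_reverse1, pos2, is_reverse2):
    """Classify a pair aligned to the same contig as 'FR', 'RF' or 'tandem'.

    pos1/pos2 are leftmost alignment coordinates; is_reverse1/is_reverse2
    say whether each mate aligned to the reverse strand.
    """
    if is_reverse1 == is_reverse2:
        return "tandem"  # both mates on the same strand
    # Look at the strand of whichever mate starts further left.
    left_is_reverse = is_reverse1 if pos1 <= pos2 else is_reverse2
    # Forward mate on the left: mates point toward each other (FR, "innie").
    # Reverse mate on the left: mates point away from each other (RF, "outie").
    return "RF" if left_is_reverse else "FR"


if __name__ == "__main__":
    print(pair_orientation(100, False, 400, True))   # short inward-facing pair
    print(pair_orientation(100, True, 8000, False))  # long outward-facing pair
```

Tallying the orientation together with the distance between mates for the 33.7 million discordant pairs would show whether they are mostly short FR pairs, as suggested here.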

                      Comment
