  • duplicated reads in fastQC

    Hi, I have some duplication issues, as suggested by FastQC. The duplication levels in my samples average about 60%. I read some old posts. The link below seems to suggest removing the duplicated reads, while the sequencing facility suggested otherwise. It seems to me that the duplication will affect the accurate counts of the transcripts. Any thoughts?

    What software can do this duplicate removal? I checked out FASTX, but it doesn't seem to have that functionality. Suggestions?

    thanks!


    Originally posted by GenoMax View Post

  • #2
    Originally posted by JQL View Post
    It seems to me that the duplication will affect the accurate counts of the transcripts.
    To some extent you would expect duplication in a transcriptome (or even small genome) project. It depends on your sequencing coverage and the size of the transcriptome/genome.

    As a thought experiment, let's say that the size of your transcriptome is 100,000,000 bases. That means that at best you can have 100M unique sequences (one per start position). If you sequence 200M bases (cheap to do!) then you would expect a 2x duplication level.

    All sorts of caveats plus 'and-also's in the above but the general idea is that with modern sequencing it is quite easy to overwhelm the uniqueness of reads and start picking up duplicates.
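
A rough way to put numbers on the thought experiment above. This is only a sketch: it assumes reads start uniformly at random over the transcriptome and that two reads count as duplicates when they share a start position. Both are simplifying assumptions (real expression levels are far from uniform, so real duplication will be higher), and `expected_duplicate_fraction` is a name made up for this example.

```python
def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that repeat an already-seen start position,
    assuming each read starts at a uniformly random position."""
    if n_positions < 1:
        raise ValueError("n_positions must be at least 1")
    if n_reads <= 0:
        return 0.0
    # Probability that a given position is never hit by any of the reads.
    p_never_hit = (1.0 - 1.0 / n_positions) ** n_reads
    # Expected number of distinct start positions seen at least once.
    expected_unique = n_positions * (1.0 - p_never_hit)
    return 1.0 - expected_unique / n_reads


if __name__ == "__main__":
    positions = 100_000_000  # transcriptome size from the thought experiment
    for reads in (50_000_000, 100_000_000, 200_000_000):
        frac = expected_duplicate_fraction(reads, positions)
        print(f"{reads:>11,d} reads: {frac:.1%} expected duplicates")
```

Under this model the duplicate fraction keeps rising as you sequence deeper, even with no PCR artifacts at all.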



    • #3
      Originally posted by JQL View Post

      What software can do this duplicate removal? I checked out FASTX, but it doesn't seem to have that functionality. Suggestions?

      thanks!
      PRINSEQ (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) can do removal of duplicate (or n-plicate) sequences.
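
If you just want to see what exact-duplicate removal amounts to, here is a minimal sketch in Python. It is not how PRINSEQ is implemented; it assumes a well-formed 4-line-per-record FASTQ, treats two reads as duplicates only when their sequences are identical, and keeps every distinct sequence in memory. The function names are made up for this example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # (header, sequence, separator, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into FASTQ records, four lines at a time."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator consumes four lines per record;
    # a trailing incomplete record is silently dropped.
    for header, seq, sep, qual in zip(it, it, it, it):
        yield header, seq, sep, qual


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the first record seen for each distinct sequence."""
    seen = set()
    for record in records:
        seq = record[1]
        if seq in seen:
            continue
        seen.add(seq)
        yield record


if __name__ == "__main__":
    example = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "ACGT", "+", "IIII",
        "@r3", "TTGA", "+", "IIII",
    ]
    for header, seq, sep, qual in drop_exact_duplicates(read_fastq(example)):
        print(header, seq)
```

For a real file you would pass an open file handle to `read_fastq` instead of the list.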



      • #4
        Thanks for your thoughts. I think I would agree with you. I would probably leave the duplicates alone then.

        Originally posted by westerman View Post
        To some extent you would expect duplication in a transcriptome (or even small genome) project. It depends on your sequencing coverage and the size of the transcriptome/genome.

        As a thought experiment, let's say that the size of your transcriptome is 100,000,000 bases. That means that at best you can have 100M unique sequences (one per start position). If you sequence 200M bases (cheap to do!) then you would expect a 2x duplication level.

        All sorts of caveats plus 'and-also's in the above but the general idea is that with modern sequencing it is quite easy to overwhelm the uniqueness of reads and start picking up duplicates.



        • #5
          thanks GenoMax for the link.
          I may experiment a little bit. Remove the duplicates and rerun the fastQC and see what happens.


          Originally posted by GenoMax View Post
          PRINSEQ (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) can do removal of duplicate (or n-plicate) sequences.



          • #6
            I just went through this myself with some recent transcriptome data that FastQC showed to be highly redundant.
            Like westerman said, it depends on what you are trying to do, but if you are going to use the data for an assembly I'd suggest looking into the digital normalization procedure. This will reduce the amount of redundant data you feed into the assembler and make assembly much more efficient. Of course, if you are trying to analyze differential expression you will ultimately need to retain all of the duplicates.
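
For anyone curious what digital normalization does, here is a toy sketch of the idea as I understand it: a read is kept only while the median count of its k-mers, among the reads kept so far, is still below a cutoff. This is an illustration, not the published implementation. It uses a plain in-memory counter, ignores reverse complements, and the `k` and `cutoff` defaults are arbitrary values chosen for the example.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def digital_normalize(reads: Iterable[str], k: int = 20, cutoff: int = 20) -> List[str]:
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts: Counter = Counter()
    kept: List[str] = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            # Read is shorter than k, so it cannot be judged; keep it.
            kept.append(read)
            continue
        # Counter returns 0 for k-mers that have not been seen yet.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)  # only kept reads contribute to the counts
    return kept


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTCCCCGG"]
    print(len(digital_normalize(reads, k=4, cutoff=3)), "reads kept of", len(reads))
```

Because it discards reads from highly covered regions, the output no longer reflects abundance, which is why it suits assembly and not expression counting.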



            • #7
              I am currently only interested in differential expression.

              thanks for sharing your thoughts.

              Originally posted by NRP View Post
              I just went through this myself with some recent transcriptome data that FastQC showed to be highly redundant.
              Like westerman said, it depends on what you are trying to do, but if you are going to use the data for an assembly I'd suggest looking into the digital normalization procedure. This will reduce the amount of redundant data you feed into the assembler and make assembly much more efficient. Of course, if you are trying to analyze differential expression you will ultimately need to retain all of the duplicates.



              • #8
                Another related question:

                While I agree it is probably better to leave the duplicated sequences alone for a differential expression study, there are also some over-represented sequences (ORS) in my samples. In the FastQC report, some of the top ORS are shown to be adapter sequences, while others have no hits. They probably don't account for a large percentage of the duplicated sequences (5% maybe?). Do you guys remove those adapter sequences?



                • #9
                  Yes, I had that issue as well. I think it is best to trim those. I used trim galore for that & it worked quite well.



                  • #10
                    Hi,

                    I just want to add that we also need to consider the potential sources of the duplication. Is it due to high coverage or to PCR amplification during library prep? It is never clear-cut, but you need to assess which one is more dominant, as they have different impacts on certain quantitation studies.

                    Best regards,
                    Douglas



                    • #11
                      I have looked into fastx clipper which is supposed to trim the adapter sequence. But I have also read some earlier posts here that suggested that fastx clipper didn't work well. http://seqanswers.com/forums/showthr...=fastx+clipper

                      In my case, FastQC suggests I have 4.7% (out of 4M sampled) of the adapter sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC". But after running fastx_clipper with option -C to remove the above 51-base adapter sequence, I lost 4,752,644 reads. I have a total of ~23M reads, so that's about 20% of the reads. It seems either I have done something wrong or the program still has bugs. Any suggestions?

                      I haven't tried trim galore yet.


                      Originally posted by NRP View Post
                      Yes, I had that issue as well. I think it is best to trim those. I used trim galore for that & it worked quite well.



                      • #12
                        I've never tried fastx clipper, but in trim galore you can specify the sequence to trim & adjust the match stringency so that might help.



                        • #13
                          grep -c ADAPTER found about 1M reads containing the adapter, which is about 4.4%, consistent with the FastQC report. Not sure how fastx clipper found and removed 4.7M adapter sequences.

                          I guess Trim Galore is the better option.
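
One thing worth checking before blaming the tool: grep only counts reads that contain the full, exact adapter string, whereas a clipper may also act on reads that carry just the first part of the adapter at their 3' end. Whether that explains the gap between 1M and 4.7M here is only a guess on my part, not something confirmed in this thread. The sketch below counts both cases separately; `min_overlap` is a hypothetical parameter for this example, not a fastx_clipper option, and mismatches are not handled.

```python
from collections import Counter
from typing import Iterable


def classify_read(seq: str, adapter: str, min_overlap: int = 5) -> str:
    """Return 'full', 'partial', or 'none' for one read sequence."""
    if adapter in seq:
        return "full"  # what `grep -c ADAPTER` would count
    # Look for the longest adapter prefix that is also a suffix of the read,
    # i.e. an adapter copy cut off by the end of the read.
    longest = min(len(seq), len(adapter) - 1)
    for n in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:n]):
            return "partial"
    return "none"


def count_adapter_hits(seqs: Iterable[str], adapter: str, min_overlap: int = 5) -> Counter:
    """Tally reads by how they match the adapter."""
    return Counter(classify_read(s, adapter, min_overlap) for s in seqs)


if __name__ == "__main__":
    adapter = "GATCGGAAGAGC"  # shortened adapter, for illustration only
    reads = [
        "AAAAGATCGGAAGAGCTT",  # contains the whole adapter
        "ACGTACGTGATCGGA",     # ends with the first 7 adapter bases
        "ACGTACGTACGT",        # no adapter
    ]
    print(count_adapter_hits(reads, adapter))
```

If the "partial" count on the real data turns out to be large, that would account for at least some of the extra reads the clipper touched.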

                          Originally posted by JQL View Post
                          I have looked into fastx clipper which is supposed to trim the adapter sequence. But I have also read some earlier posts here that suggested that fastx clipper didn't work well. http://seqanswers.com/forums/showthr...=fastx+clipper

                          In my case, FastQC suggests I have 4.7% (out of 4M sampled) of the adapter sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC". But after running fastx_clipper with option -C to remove the above 51-base adapter sequence, I lost 4,752,644 reads. I have a total of ~23M reads, so that's about 20% of the reads. It seems either I have done something wrong or the program still has bugs. Any suggestions?

                          I haven't tried trim galore yet.
