  • brentp
    Member
    • Apr 2010
    • 72

    increase in 3' %T after filtering BS-Treated reads

    What I see (and show in the attached images) is that after filtering the first end of a set of paired-end reads (the _1.fastq file), there is an increase in %T at the 3' end. This occurs only on the first-end (_1) reads, not on the second-end reads.

    I noticed this in some of my own data, then pulled a few files down from the Sequence Read Archive and found some (not all) that show the same pattern. I'm using FastQC to produce the images, but I also checked with the fastx toolkit's plotting. I'm using the fastx toolkit to do the filtering, but I've also used a custom script. So the plotting and filtering tools can be ruled out as the cause.

    Here's what I do (the FASTQ file is from a study that used paired-end BS-Seq):

    Code:
    wget ftp://ftp.ncbi.nlm.nih.gov/sra/Submissions/SRA012/SRA012457/SRX019113/SRR039814_1.fastq.bz2
    bunzip2 SRR039814_1.fastq.bz2
    
    /usr/local/src/fastqc/FastQC/fastqc SRR039814_1.fastq
    
    fastq_quality_trimmer -Q 33 -t 20 -l 30 -i SRR039814_1.fastq > SRR039814_1.trim.fastq
    
    /usr/local/src/fastqc/FastQC/fastqc SRR039814_1.trim.fastq

    Before filtering, the per-base sequence content image from FastQC looks like the image labelled as such below. Even before filtering, there is some increase in %T at the final base of the read.

    In the image named post_filter_per_base_sequence_content, you can see that the %T increases greatly at the 3' end of the read.
    Any ideas on why this would happen?
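
For readers who want to check the per-base %T without relying on a plot, here is a minimal sketch that tabulates it directly from a FASTQ file. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name `pct_t_by_position` is made up for illustration and is not part of FastQC or the fastx toolkit.

```shell
# Print, for each cycle: position, number of reads reaching it, and %T.
# Reads a 4-line-per-record FASTQ stream on stdin.
pct_t_by_position() {
  awk 'NR % 4 == 2 {
         n = length($0)
         if (n > maxlen) maxlen = n
         for (i = 1; i <= n; i++) {
           total[i]++                               # reads covering cycle i
           if (substr($0, i, 1) == "T") t[i]++      # of those, how many are T
         }
       }
       END {
         for (i = 1; i <= maxlen; i++)
           printf "%d\t%d\t%.1f\n", i, total[i], 100 * t[i] / total[i]
       }'
}
```

Running `pct_t_by_position < SRR039814_1.fastq` and again on `SRR039814_1.trim.fastq` gives two tables that can be compared cycle by cycle. The second column also shows how many reads still contribute at each position after trimming.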
  • fkrueger
    Senior Member
    • Sep 2009
    • 627

    #2
    We have seen different kinds of artefacts towards the ends of BS-Seq reads (especially long Illumina reads); most often the number of Cs increases drastically, paralleled by a drop in Ts. The imbalanced base composition of BS reads clearly affects how the Illumina pipeline calls bases in later cycles.

    It is difficult to tell exactly what is going on without seeing the rest of the picture, such as the FastQC per base sequence quality plot. I suspect that the overall basecall quality decreases substantially after cycle 60 or so (as it has in every BS-Seq dataset we have seen so far). Your quality trimming script might therefore reduce your sequences to varying lengths, leaving only a few reads at their original 75 bp read length. These few full-length reads would then carry much more weight at the final positions than they do in the original untrimmed dataset, and thus you see the sequence bias increase rather than decrease after your trimming step.

    Alternatively, might the insert size for some reads be too short, so that you start sequencing into the read_2 adapter, which happens to be rich in T and poor in A? (Normally there should be a correlation between T and C, but not between T and A.)

    What we normally do before aligning BS-treated reads with Bismark is trim all sequences to a length that still has good quality scores AND doesn't show any kind of weird sequence bias, normally down to 50 bp to be sure. 50 bp is plenty of sequence for very good bisulfite mapping (normally 60-70% mapping efficiency), and in addition you have paired-end reads, which will further increase mapping efficiency by around 2-4%. (If you do paired-end sequencing and the reads are very long (75+ bp), you might read an overlapping stretch of sequence in the middle from both sides, which doesn't give you any additional methylation information anyway.)

    I hope this helps; if I was unclear, please contact me again.
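
One way to test the "few full-length reads" explanation is to count how many reads survive at each length after quality trimming. The sketch below does that for a 4-line-per-record FASTQ stream; `read_length_hist` is a hypothetical helper name, not an existing tool.

```shell
# Print each observed read length and the number of reads of that length,
# sorted by length. Reads a 4-line-per-record FASTQ stream on stdin.
read_length_hist() {
  awk 'NR % 4 == 2 { count[length($0)]++ }
       END { for (len in count) printf "%d\t%d\n", len, count[len] }' |
    sort -n
}
```

For example, `read_length_hist < SRR039814_1.trim.fastq` lists the length distribution of the trimmed file. If only a small fraction of reads remain at full length, the base composition at the last cycles is estimated from that subset alone.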
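
The fixed-length trimming described above can be sketched with a few lines of awk. This assumes 4-line FASTQ records; `hard_trim_fastq` and its length argument are illustrative names, and the output file name in the usage note is made up. The fastx toolkit's `fastx_trimmer` with its `-l` (last base to keep) option should do the same job.

```shell
# Truncate every sequence line and its quality line to at most $1 bases.
# Header and "+" lines are passed through unchanged.
hard_trim_fastq() {
  awk -v max_len="$1" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, max_len); next }
    { print }'
}
```

Usage would look like `hard_trim_fastq 50 < SRR039814_1.fastq > SRR039814_1.50bp.fastq`. Reads already shorter than the cutoff are left as they are.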

    Kind regards,
    Felix


    • brentp
      Member
      • Apr 2010
      • 72

      #3
      hi felix, thanks for the reply.
      indeed, the quality does drop after 50. but there are still plenty of reads that extend to 76bp, so it's not sampling error. in addition, i see this same pattern for many of the _1 ends from BS-Seq in the sequence read archive. i hadn't thought about the adaptor being the cause, i'll look into it.

      i also trim before using MethylCoder, but just per-read; i haven't tried trimming all reads to a set length. maybe i'll set the max length to 72, which would remove the portion with increased T.
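
A rough way to look into the adapter idea is to count where an adapter prefix first appears in each read. The sketch below takes the prefix as an argument. The sequence `AGATCGGAAGAGC` used in the usage note is the common start of Illumina adapter read-through and is only an assumption here; the libraries discussed in this thread may have used a different adapter. `adapter_start_hist` is a hypothetical helper name.

```shell
# For each read containing the given adapter prefix, record the 1-based
# position of its first occurrence, then print position and read count.
# Reads a 4-line-per-record FASTQ stream on stdin; exact matches only.
adapter_start_hist() {
  awk -v adapter="$1" '
    NR % 4 == 2 {
      pos = index($0, adapter)
      if (pos > 0) hits[pos]++
    }
    END { for (p in hits) printf "%d\t%d\n", p, hits[p] }' |
    sort -n
}
```

For example, `adapter_start_hist AGATCGGAAGAGC < SRR039814_1.fastq`. Many hits starting well before the end of the read would point towards short inserts and adapter read-through. Because only exact matches are counted, sequencing errors in the adapter make this an underestimate.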
