Quality-, adapter- and RRBS-trimming with Trim Galore!

pig_raffles replied

11-12-2020, 04:58 PM
Strange peak at beginning of M-bias plots

Hi,

I have generated RRBS read data, which I have filtered and trimmed using Trim Galore and then aligned to a reference genome using Bismark. Three independent sequencing libraries were created with samples randomly mixed among the three sequencing libraries.

All the individuals from one of the libraries have a characteristic peak in their M-bias plots (see attached Bad_M-bias image): methylation spikes in the first 5 nucleotide positions before settling at around 60% CpG methylation. The M-bias plots of samples from the other libraries show relatively stable CpG methylation levels of around 60% with no spike at the beginning (see attached Good_M-bias image).

This unusual M-bias profile is also accompanied by a drop in quality score at the same nucleotide positions for all the samples from this one specific library.

I had previously ignored this issue as the drop in quality was not severe, but it has led to some problems in downstream analyses (for example, SNP calling from the RRBS reads), which have led me to revisit my read QC. All of these problems concern samples from this particular library.

To this end, does anyone know what could be causing these issues with samples from this specific library and how to go about solving this problem? Would simply trimming the 5' end of the reads in Trim Galore! for all samples regardless of library remedy this problem or would the issues be deeper than this?

Thanks in advance!
Attached Files

Good_M-bias.png (8.0 KB, 0 views)

Bad_M-bias.png (9.9 KB, 0 views)
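
One way to put a number on the spike described above is to compare each leading position with the plateau level. The following is only a rough sketch: the function name, the 5-point tolerance and the example percentages are made up for illustration, and the input is assumed to be a plain list of per-position CpG methylation percentages taken from an M-bias report.

```python
from statistics import median

def biased_leading_positions(pct_by_pos, tolerance=5.0):
    """Count leading read positions whose methylation deviates from the plateau.

    pct_by_pos: CpG methylation percentage per read position (position 1 first).
    tolerance:  hypothetical cutoff, in percentage points, around the plateau.
    """
    plateau = median(pct_by_pos)  # the plateau dominates, so the median approximates it
    n = 0
    for pct in pct_by_pos:
        if abs(pct - plateau) <= tolerance:
            break  # first position that sits on the plateau
        n += 1
    return n

# Invented values that mimic a spike over the first 5 positions
bad = [90, 85, 80, 75, 70, 60, 61, 59, 60, 60, 61, 60, 60, 59, 61]
good = [60, 61, 59, 60, 60, 61]
print(biased_leading_positions(bad), biased_leading_positions(good))
```

A count like this could then inform how many 5' bases to clip in Trim Galore (`--clip_R1`) or to ignore in the Bismark methylation extractor (`--ignore`). Whether clipping alone is enough for this library is a separate question that the sketch does not answer.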
fkrueger replied

01-18-2019, 02:24 AM
Trim Galore tries to identify read-through adapter contamination, which is the kind of contaminant that will prevent your sequences from aligning at all, or might even cause mis-alignments. For the standard Illumina adapter, this sequence always starts with AGATCGGAAGAGC...

It does not attempt to find or remove any other kind of contaminant or over-represented sequence, because they are normally not as harmful. E.g. an over-represented sequence such as ATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGGCCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAA might be present in the library, but it will simply not align to a genome, thereby filtering itself out at the alignment stage.
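
To make the idea of read-through trimming concrete, here is a minimal sketch. It assumes exact matching only, whereas Cutadapt (which Trim Galore wraps) uses error-tolerant alignment; the function name and the `min_overlap` parameter are illustrative, not part of either tool.

```python
ADAPTER = "AGATCGGAAGAGC"  # start of the standard Illumina adapter, as quoted above

def trim_read_through(read, adapter=ADAPTER, min_overlap=1):
    """Cut the read where the adapter, or a prefix of it at the 3' end, begins."""
    # Case 1: the full adapter start occurs somewhere inside the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends in the first k bases of the adapter
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read

insert = "ACGTACGTTGCA"
print(trim_read_through(insert + ADAPTER + "TTTT"))  # full read-through
print(trim_read_through(insert + "AGATC"))           # partial adapter at the 3' end

# The over-represented sequence from this thread begins with ATCGG..., i.e. it
# lacks the leading "AG", so it does not contain the full adapter start.
overrep = "ATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGGCCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAA"
print(ADAPTER in overrep)
```

Both example reads are cut back to the 12 bp insert, while the over-represented sequence is not recognised as read-through contamination under this exact-match simplification.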

Last edited by fkrueger; 01-18-2019, 02:24 AM. Reason: typo
badhik replied

01-17-2019, 04:05 PM
Yes. Thank you.

Also, after trimming my FASTQ files with Trim Galore using default parameters, I continue to see TruSeq adapters (under the overrepresented sequences tab), e.g. ATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGGCCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAA

Does Trim_galore not recognize all adapters? Or should I explicitly provide these adapters to Trim_galore?
fkrueger replied

01-17-2019, 01:47 AM
The quality trimming is performed by a sliding window approach across the read like the one that is used by BWA. Copied below is the text from the Cutadapt --help:

-q 3'CUTOFF Trim low-quality bases from 3' ends of reads before adapter removal. …The algorithm is the same as the one used by BWA (see documentation).

In some cases this may mean that if the quality briefly drops below the quality threshold but then comes back up again, the trimming algorithm decides that it’s not too bad after all.

I hope this clears things up?
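
For readers who want to see the behaviour, the following is a small sketch of the BWA-style 3' quality trimming as I understand it from the Cutadapt documentation: the cutoff is subtracted from each quality, partial sums are accumulated from the 3' end, and the read is cut where that sum peaks. The function name is made up and the quality values are invented.

```python
def quality_trim_index(qualities, cutoff):
    """Return the cut position for 3' quality trimming (qualities[:index] is kept)."""
    running_sum = 0
    best_sum = 0
    cut = len(qualities)  # default: trim nothing
    for i in reversed(range(len(qualities))):
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            break  # good bases now outweigh the bad ones seen so far
        if running_sum > best_sum:
            best_sum = running_sum
            cut = i
    return cut

# A low-quality tail is removed, but the brief dip to Q13 inside the read survives
dip_inside = [35, 35, 13, 35, 35, 35, 20, 20]
kept = dip_inside[:quality_trim_index(dip_inside, 28)]
print(kept)  # [35, 35, 13, 35, 35, 35]
```

With a cutoff of 28, the two Q20 bases at the 3' end are trimmed, yet the isolated Q13 base stays in the read. This is consistent with FastQC whiskers still reaching 13 or 14 after running with `-q 28`.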
badhik replied

01-17-2019, 12:11 AM
Hi all,

I am trimming Illumina 1.9 encoded data with Trim Galore, and when I run FastQC afterwards, the box-plot whiskers in the Per base sequence quality plot go all the way down to a Phred score of 13 or 14.

Here is what I used:
trim_galore --rrbs --paired --length 20 -q 28 --illumina

Why am I getting such a result?

Thanks
fkrueger replied

04-18-2017, 01:04 AM
Originally posted by pig_raffles

I am new to the bioinformatic analysis of RRBS data. I am using Trim Galore! to QC and adapter trim my RRBS read data. I have generated single-end 75bp reads on an Illumina NextSeq.

The default minimum read length parameter in Trim Galore! is 20 bp but I was wondering if there were any practical considerations for alignment/mapping of reads to take into account when choosing a minimum read length and if anyone had any tips on optimizing this parameter?

Very short reads generally don't tend to align uniquely in bisulfite-seq mapping because the three-letter alignment allows more ambiguous alignments. In that sense the shortness of reads sorts itself out in a way. Some programs, however, don't like it (or didn't like it in the past) when the sequence entry is extremely short or even empty, which is why we are introducing a short (but arbitrary) cutoff. I hope this helps.
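
A toy illustration of why the three-letter alphabet makes short reads ambiguous. This is a simplification that only applies a C-to-T conversion to one strand of a made-up 15 bp "genome"; real bisulfite aligners handle both strands and far larger references.

```python
def bisulfite_convert(seq):
    """In-silico C-to-T conversion (forward strand only, for illustration)."""
    return seq.replace("C", "T")

def count_hits(genome, read):
    """Count exact occurrences of read in genome, overlapping ones included."""
    n = len(read)
    return sum(1 for i in range(len(genome) - n + 1) if genome[i:i + n] == read)

genome = "GGACGAGGGATGAGG"  # invented sequence
read = "ACGA"

print(count_hits(genome, read))                                        # 1
print(count_hits(bisulfite_convert(genome), bisulfite_convert(read)))  # 2
```

The 4 bp read is unique in the four-letter genome, but after conversion it also matches the site that originally read ATGA, so it can no longer be placed uniquely.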
pig_raffles replied

04-17-2017, 01:17 PM
Choosing minimum RRBS read length in Trim Galore!

I am new to the bioinformatic analysis of RRBS data. I am using Trim Galore! to QC and adapter trim my RRBS read data. I have generated single-end 75bp reads on an Illumina NextSeq.

The default minimum read length parameter in Trim Galore! is 20 bp but I was wondering if there were any practical considerations for alignment/mapping of reads to take into account when choosing a minimum read length and if anyone had any tips on optimizing this parameter?
Diadema replied

08-05-2016, 01:44 AM
That is indeed how I merged them. Thank you!
fkrueger replied

08-05-2016, 12:55 AM
As long as you merged the R1 and R2 files in the same order (e.g. R1_rep1 R1_rep2, R2_rep1 R2_rep2), it shouldn't matter whether you run Trim Galore on the merged files directly or run it first and then merge. All the best!
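
If you want to double-check the merge order, a quick sanity check is to confirm that the read names in the merged R1 and R2 files still line up. The sketch below works on lists of FASTQ lines and assumes four lines per record and an optional /1 or /2 suffix on the read name; the helper names and the records are invented for the example.

```python
from itertools import zip_longest

def read_ids(lines):
    """Yield the read name of every FASTQ record (4 lines per record)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            name = line.strip().split()[0][1:]  # drop '@' and any comment field
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name

def pairs_in_sync(r1_lines, r2_lines):
    """True if both files list the same read names in the same order."""
    return all(a == b for a, b in zip_longest(read_ids(r1_lines), read_ids(r2_lines)))

def rec(name):
    # Minimal made-up FASTQ record
    return ["@" + name, "ACGT", "+", "IIII"]

r1_rep1, r1_rep2 = rec("a/1") + rec("b/1"), rec("c/1")
r2_rep1, r2_rep2 = rec("a/2") + rec("b/2"), rec("c/2")

print(pairs_in_sync(r1_rep1 + r1_rep2, r2_rep1 + r2_rep2))  # same order
print(pairs_in_sync(r1_rep1 + r1_rep2, r2_rep2 + r2_rep1))  # R2 merged in a different order
```

The first call reports the pairs as in sync; the second does not, which is the situation that would break paired-end trimming.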
Diadema replied

08-04-2016, 02:56 PM
Run Trim Galore! before or after merging technical replicates

I'm quite new to NGS. We just did 4 lanes (2 lanes twice) of Illumina HiSeq Rapid Run 2x51 RNA sequencing of 24 samples. The bcl to fastq conversion was run for us, so every sample has 4 R1 forward fastq files and 4 R2 reverse files. I merged the technical replicates (merged the 4 R1 files, then merged the 4 R2 files) doing a basic command line cat and append. I also ran FastQC on the individual technical replicates, as well as on the merged files. I now plan to upload my files to the Galaxy pipeline for the remainder of the QA/QC and analysis, and was going to start with Trim Galore. But now I'm wondering if Trim Galore needs to work on the original unmerged technical replicates rather than the merged files. E.g., the quality at the beginning of all our reads was spiky, possibly indicating sequencing of the same sequence, and may need to be trimmed; but can trimming the first n bases of each of the 4 files still be done after the files have been merged? So do I upload the unmerged fastq files and run Trim Galore, and then merge them, or upload the merged files and run Trim Galore? Thank you.
fkrueger replied

06-28-2016, 01:37 AM
Hi Guorong,

Great that it is working. My thoughts on your other problem are, as I have outlined above already, that you should absolutely not be doing what you are suggesting here. The sequence you are after is the sequence from the start of the read until you hit the small RNA adapter, which starts with TGGAATTCT... Everything after that is either adapter that binds to the flowcell or something else you don't want to keep. In any case, the sequence on the 3' end should not align to a genome anyway.

Code:

-g ADAPTER, --front=ADAPTER Sequence of an adapter that was ligated to the 5' end.

Illumina sequencing does not add any adapter to the 5' end that ends up being sequenced, hence trimming using the option -a is what you want to do. In my opinion if you just run

Code:

trim_galore --trim-n file

you would get exactly what you are looking for.
xuguorong replied

06-27-2016, 09:04 AM
Hi Felix,

Thank you so much for your new release!
The new features definitely can remove all Ns from the reads! Awesome!

For question 1, I want to try running cutadapt three times to keep the longer reads.
1: cutadapt -a adapter -q 10 -m 17 --trim-n -o $inputFile".trim.3.fastq" $inputFile".fastq"
2: cutadapt -g adapter -q 10 -m 17 --trim-n -o $inputFile".trim.5.fastq" $inputFile".fastq"
3: cat $inputFile".trim.3.fastq" $inputFile".trim.5.fastq" > $inputFile".trim.fastq"
4: cutadapt -b adapter -q 10 -m 17 --trim-n -o $inputFile".trim.final.fastq" $inputFile".trim.fastq"
5: then keep only one read and delete the other read with the same FASTQ ID.

I need three runs because the first cutadapt run trims the 3' adapter and the second run trims the 5' adapter. After these two runs, some reads in $inputFile".trim.3.fastq" may still contain the 5' adapter, and some reads in $inputFile".trim.5.fastq" may still contain the 3' adapter. After merging the two resulting files, I run cutadapt a third time to cut either the 3' or the 5' adapter. Because the merged file contains duplicated reads, I then scan $inputFile".trim.final.fastq" to keep only one read and delete the other one with the same FASTQ ID.

Do you have any suggestions about this solution?

Thanks!
Guorong
fkrueger replied

06-27-2016, 04:11 AM
Hi Guorong,

I have now added the option --trim-n, which should do just what you need. The new version also adds a few other features:

- Added option '--max_n COUNT' to remove all reads (or read pairs) exceeding this limit of tolerated Ns. In a paired-end setting it is sufficient if one read exceeds this limit. Reads (or read pairs) are removed altogether and are not further trimmed or written to the unpaired output.

- Enabled option '--trim-n' to remove Ns from both ends of the reads. Does not currently work in RRBS mode.

- Added new option '--max_length <INT>', which discards reads that are longer than <INT> bp after trimming. This is only advised for smallRNA sequencing to remove non-small RNA sequences.

- Replaced 'zcat' with 'gunzip -c' so that older versions of Mac OSX do not append a .Z to the end of the file and subsequently fail because the file is not present. Dah...

- Fixed a typo in adapter auto-detection warning message.
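
The filtering options listed above can be sketched roughly as follows. This illustrates the described semantics only and is not Trim Galore's code: the function names are invented, and the order in which the real tool applies N trimming, N counting and length filtering is an assumption here.

```python
def trim_n(seq, qual):
    """Remove Ns from both ends of a read, keeping the quality string in step."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return seq[start:end], qual[start:end]  # an all-N read becomes empty

def keep_read(seq, max_n=None, max_length=None):
    """Apply the N-count and maximum-length filters to a single trimmed read."""
    if max_n is not None and seq.count("N") > max_n:
        return False
    if max_length is not None and len(seq) > max_length:
        return False
    return True

def keep_pair(seq1, seq2, max_n=None):
    """A pair is dropped as soon as one of its reads exceeds the N limit."""
    return keep_read(seq1, max_n=max_n) and keep_read(seq2, max_n=max_n)

print(trim_n("NNACGTN", "!!IIII!"))        # ('ACGT', 'IIII')
print(keep_pair("ACNGT", "ACGT", max_n=0))  # False: read 1 still contains an N
print(keep_read("A" * 30, max_length=25))   # False: too long for a small RNA
```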

I have moved Trim Galore to Github where you can clone the latest development version: https://github.com/FelixKrueger/TrimGalore.
fkrueger replied

06-24-2016, 12:07 PM
To 1) The way the sequencing normally works is that you sequence the first base after the 5' adapter, then the fragment of interest, and then you sequence into the adapter on the 3' end. You don't just get to keep the sequence that appears longer and juicier; you need to keep the sequence of the fragment you wanted to sequence, here the 7 bp. Maybe this sequence is just not a very representative example of your entire run, because 7 bp is also not a typical length for a miRNA. I would suggest you run Trim Galore on the file once and then look at the sequence length distribution to see if the majority of the sequences are between 20 and 24 bp long.

To 2) I can add it to my list, not quite sure when I can address it though (we've got a Brexit to stomach right now...)

Cheers, Felix
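
The length-distribution check suggested under 1) could look something like the sketch below. It assumes the trimmed sequences have already been read into a list of strings; the function names and the example reads are invented.

```python
from collections import Counter

def length_distribution(seqs):
    """Map each read length to the number of reads with that length."""
    return Counter(len(s) for s in seqs)

def fraction_in_range(seqs, lo=20, hi=24):
    """Fraction of reads whose length falls within lo..hi bp (inclusive)."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(lo <= len(s) <= hi for s in seqs) / len(seqs)

# Three miRNA-sized reads and one 7 bp fragment
trimmed = ["A" * 22, "C" * 22, "G" * 22, "T" * 7]
print(sorted(length_distribution(trimmed).items()))  # [(7, 1), (22, 3)]
print(fraction_in_range(trimmed))                     # 0.75
```

If most reads fall in the 20 to 24 bp window, an occasional 7 bp fragment is simply an outlier rather than a reason to change the trimming strategy.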
xuguorong replied

06-24-2016, 10:57 AM
Hi Felix,

Thank you so much for your response!

For question 1:
After trimming, the length of the left sequence is only 7 nt but the length of the right sequence is 21 nt. Obviously I want to keep the 21 nt sequence and ignore the 7 nt sequence because it is too short. I am not sure if I can directly run Cutadapt with the -g option to keep the 21 nt sequence instead of the 7 nt sequence.

For question 2:
Sure, a single N cannot make a difference for mapping. But for miRNA-seq alignment, it is better to remove the unknown nucleotides before alignment because of the sensitivity.