Quality-, adapter- and RRBS-trimming with Trim Galore!

fkrueger replied

01-19-2016, 11:45 PM
That would work, too, yes. Let me know if you've got any further questions.
Rob Weeks replied

01-19-2016, 11:41 PM
Never mind, I realised that decompressing my file and then running trim_galore will bypass zcat. It then works.

Cheers

Rob

EDIT: thanks Felix. I wrote this before I saw your response - I was "never minding" my question not your response!

I have a solution which works; I will decompress before running trim_galore

Last edited by Rob Weeks; 01-19-2016, 11:46 PM. Reason: crossed responses
fkrueger replied

01-19-2016, 11:38 PM
Originally posted by Rob Weeks

I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
Now when I run 'trim_galore filename.fastq.gz' it returns the error "zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
This is apparently a problem only in OS X, but it is not clear to me how I can get around this problem.

Any ideas would be appreciated

Cheers

This problem is indeed known and has to do with the version of zcat which is installed on your system. On very old versions it would alter the filename passed to it to append a .Z if there wasn’t one there, so although you’re trying to read a file called Sample1_PE_R1.fastq.gz it’s actually trying to read Sample1_PE_R1.fastq.gz.Z (which doesn’t exist).

I have changed the way Trim Galore reads from files, from using a zcat stream to using gunzip -c, and it seems to work well. If you send me an email I can send you a copy of this tonight, as I am at the Festival of Genomics in London all day. Alternatively, you could try renaming your input file so that it ends in .gz.Z and see whether that works.
Good luck, Felix
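
For anyone who would rather not depend on the system's zcat at all, the same idea (reading the gzipped file directly rather than shelling out) can be sketched in Python. This is only an illustration and not how Trim Galore itself is implemented; the function names open_fastq and count_fastq_reads are made up for this example.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file for text reading, gzip-compressed or not."""
    if path.endswith(".gz"):
        # gzip.open decompresses in-process, so no external zcat/gunzip is needed
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_reads(path):
    """Count reads, assuming the usual 4 lines per FASTQ record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Calling count_fastq_reads on the compressed and on the decompressed copy of a file should give the same number, which is a quick sanity check before trimming.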
Rob Weeks replied

01-19-2016, 08:09 PM
I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
Now when I run 'trim_galore filename.fastq.gz' it returns the error "zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
This is apparently a problem only in OS X, but it is not clear to me how I can get around this problem.

Any ideas would be appreciated

Cheers
fkrueger replied

12-22-2015, 06:02 AM
Hi whargrea, the absence of documentation for parallelization does indeed mean that reads are trimmed by calling a single instance of Cutadapt at a time. Since trimming is a one-off process that doesn't really take that long (a matter of hours) compared to the data collection process (often a matter of several days) or other downstream operations (up to several weeks?), we don't tend to bother about it very much. The easiest solution would probably be to run all 48 of your trims in parallel (even though this might be quite intense on the disk I/O side), or to try to find another trimmer that supports parallel trimming natively.
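
Running several independent trims side by side could look roughly like the Python sketch below. It is a sketch under assumptions: the helper names build_command and trim_all, the -q 20 setting and the max_jobs limit are all hypothetical choices for illustration, and capping the number of simultaneous jobs is one way to keep the disk I/O load manageable.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(fastq_path, quality=20):
    """Command line for one single-end Trim Galore run (hypothetical settings)."""
    return ["trim_galore", "-q", str(quality), fastq_path]


def trim_all(fastq_paths, max_jobs=4, runner=subprocess.run):
    """Run one Trim Galore process per file, at most max_jobs at a time.

    Threads are sufficient here because each worker only waits for an
    external process to finish; the trimming itself happens outside Python.
    """
    def run_one(path):
        result = runner(build_command(path))
        return path, result.returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # Maps each input file to the exit code of its trimming run
        return dict(pool.map(run_one, fastq_paths))
```

With a list of input files, trim_all(files, max_jobs=8) returns a dictionary of exit codes, so any non-zero entry points at a file that needs re-running.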
whargrea replied

12-21-2015, 01:56 PM
Hi,

I've been going through the documentation and searching forum threads etc. to see whether trim_galore can be run in a multi-core or multi-threaded manner. So far the total lack of information in this regard seems to point towards it not having such a capability.

I'm not sure if this is the appropriate place to ask, but I was wondering why this is the case. I have 48 files of ~120 million reads each that I need to trim, and being able to parallelize would greatly boost the speed at which this could be done. It seems to me that, since each read is trimmed independently, trimming software should easily scale to any number of cores. Am I correct in this assumption or am I missing something?

Cheers.
Alex852013 replied

12-04-2015, 06:50 AM
Thanks a lot

Thanks a lot, I guess I can carry on on my own now!
fkrueger replied

12-03-2015, 08:55 AM
Trim Galore derives its output file names from the input file names, so the redirect will only send anything that would otherwise be printed to the screen into a file. That is not overly useful, but it won't do any harm.

The algorithm used to trim qualities is described under the Cutadapt option -q:

Code:

-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
    Trim low-quality bases from 5' and/or 3' ends of reads before
    adapter removal. If one value is given, only the 3' end is trimmed.
    If two comma-separated cutoffs are given, the 5' end is trimmed with
    the first cutoff, the 3' end with the second. The algorithm is the
    same as the one used by BWA (see documentation).
    (default: no trimming)

This means that the qualities are assessed in windows over the read, and the read is trimmed at the position where the score is lowest. If I understand this correctly, a read may temporarily 'dip' below the threshold you have selected, but the sequence is allowed to survive if the quality comes back up afterwards. So occasionally you might get a few scores that are lower than 20, but I personally wouldn't be too worried about it, as most downstream programs have their own means of dealing with low-quality basecalls.
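
To make the 'dip' behaviour concrete, here is a small Python sketch of BWA-style 3' quality trimming as it is usually described: subtract each quality from the cutoff, accumulate the differences from the 3' end, and cut where the running sum peaks. It is a simplified re-implementation for illustration rather than Cutadapt's actual source, and the function name quality_trim_3prime is made up.

```python
def quality_trim_3prime(qualities, cutoff=20):
    """Return how many bases to keep, counted from the 5' end.

    Walk from the 3' end towards the 5' end, summing (cutoff - quality).
    The read is cut where that running sum is largest; the walk stops
    as soon as the sum turns negative.
    """
    running = 0
    best = 0
    keep = len(qualities)
    for i in reversed(range(len(qualities))):
        running += cutoff - qualities[i]
        if running < 0:
            break
        if running > best:
            best = running
            keep = i
    return keep


# A single dip to Q15 in the middle of a good read is not trimmed ...
print(quality_trim_3prime([30, 30, 30, 15, 30, 30, 30]))  # 7
# ... whereas a poor 3' tail is removed, and the earlier dip survives.
print(quality_trim_3prime([30, 30, 15, 30, 30, 5, 5]))  # 5
```

In the second example the Q15 base at position 3 stays in the trimmed read even though it is below the cutoff of 20, which is exactly the situation that can leave a few sub-threshold scores in a FastQC plot after trimming.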
GenoMax replied

12-03-2015, 08:43 AM
Originally posted by Alex852013

Hello everybody,

This is the line I used for trimming on the Unix command line:
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

Therefore I have either misinterpreted something or done something wrong.
Could someone please tell me which it is?
Thanks a lot, Alex

You appear to be running trim_galore incorrectly. Instead of trying to redirect the output (>) to a file, you need to specify an output directory by using -o directory_path.

@felix will confirm. I don't use trim_galore.

Edit: Looking at the trim_galore manual, -o is not strictly needed. The program will use the current directory by default.

Edit2: @felix clarified the effect of output redirect in the post below.

Last edited by GenoMax; 12-03-2015, 08:59 AM.
Alex852013 replied

12-03-2015, 08:34 AM
Understand the quality trimming

Hello everybody,

This is the first time I have tried to use trim_galore for quality trimming of paired-end reads.
I checked the sequencing settings with testformat.sh from BBMap, which gives me:
sanger fastq raw single-ended 150bp
I'm not sure why it reports single-ended, since the data are paired-end.

Before I did the quality trimming, I checked the reads with FastQC.
The program didn't find any adapter sequences (I guess they had already been removed by the sequencing service) and showed the following pictures.

Picture before quality trimming: [image not available]

This is the line I used for trimming on the Unix command line:
trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

Picture after quality trimming: [image not available]

I had expected that everything with a quality below 20 would be cut. Therefore I have either misinterpreted something or done something wrong.
Could someone please tell me which it is?
Thanks a lot, Alex

Last edited by Alex852013; 12-03-2015, 08:43 AM.
fkrueger replied

11-28-2015, 03:52 PM
Oh dear, you should never post such things on the internet... but I'm glad it helped!
bluepoison replied

11-28-2015, 03:48 PM
Hi Felix,

Thanks a lot for the quick response. It was really helpful.

I just performed a short experiment and wanted to share it with you. I randomly sampled 1M reads and made the following 3 versions:
version 1: without any trimming
version 2: trimmed with Trim Galore with default settings
version 3: trimmed with Trim Galore with default settings, plus 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' trimmed with Cutadapt

Mapping efficiency after aligning with Bismark (Bowtie 2):
version 1: 39.7%
version 2: 58.9%
version 3: 58.2%

When I checked the qualities in FastQC, even version 3 showed some very short (less than 10 bp) overrepresented sequences as 'no hit'. So I guess it will always report some overrepresented sequences anyway, but I have to understand very well what I am trimming.

One notable thing here is that the efficiency did not improve from version 2 to version 3. Most of the overrepresented sequences have 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' as the first part and the basic standard Illumina paired-end adapter as the second part. So those sequences are already rejected from the alignment after the version 2 trimming. That's why version 3 hasn't changed much.

By the way, I saw several posts containing 'felix is a great guy!'. Now it's making a lot more sense. Thanks again!

fkrueger replied

11-28-2015, 01:46 PM
Hi bluepoison,

The sequence you are seeing overrepresented is most likely some kind of adapter dimer, because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter dimers won't align.

It would be sufficient for Cutadapt as well as Trim Galore to just specify the first few bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above, I would not bother though because these sequences won't align anywhere.

Just generally, the overrepresented sequences module in FastQC is meant as a quick guide for you to spot sequences that make up more than 0.1% of all reads, but it doesn't mean you should remove all of them from your sequenced library, especially not if you don't actually know what the sequence is. It might be a biological effect after all.

In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix
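
The point about only needing the first few bp of the adapter can be illustrated with a short Python sketch. It uses plain exact string matching, which is a deliberate simplification (real trimmers such as Cutadapt also tolerate mismatches and partial adapter matches at the 3' end), and the function name trim_at_adapter and the example insert are invented for the illustration.

```python
ADAPTER_PREFIX = "GATCGGAAGAGCG"  # first 13 bp of the overrepresented sequence


def trim_at_adapter(read, adapter=ADAPTER_PREFIX):
    """Cut the read at the first exact occurrence of the adapter prefix.

    Simplification: exact matching only, no mismatches or partial matches.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


full_adapter = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"
insert = "ACGTACGTAC"  # hypothetical genomic fragment

# Every truncated form of the adapter (23 to 32 bp) is removed by the same prefix.
for length in range(23, len(full_adapter) + 1):
    read = insert + full_adapter[:length]
    assert trim_at_adapter(read) == insert
```

A read that starts directly with the adapter, as an adapter dimer would, is trimmed to an empty string by this function, so nothing is left to align.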
bluepoison replied

11-28-2015, 09:46 AM
Hi all,

This is my first time analysing sequencing data. I am having difficulties trimming the adapters/contaminants from the reads. I have got 50bp single paired read. I checked in FastQC that there are overrepresented sequences which are part of 'Illumina Paired End Adapter 2'. But if I trim using the whole 'Illumina Paired End Adapter 2', there will still be plenty of overrepresented sequences left!
Q1) In that case, how much should I trim?

I have these overrepresented sequences:
GATCGGAAGAGCGGTTCAGCAGG
GATCGGAAGAGCGGTTCAGCAGGA
GATCGGAAGAGCGGTTCAGCAGGAA
GATCGGAAGAGCGGTTCAGCAGGAAT
GATCGGAAGAGCGGTTCAGCAGGAATG
GATCGGAAGAGCGGTTCAGCAGGAATGC
GATCGGAAGAGCGGTTCAGCAGGAATGCC
GATCGGAAGAGCGGTTCAGCAGGAATGCCG
GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Illumina Paired End Adapter 2)

Also, I have another sequence which all the 'no hit' entries contain! That sequence is 'GTTATTTTTTTGTTTTAGTTTTT'. I looked at the contaminant file and there is no match for it.
Q2) Should I trim this sequence without actually knowing where it comes from?

I planned to trim all the sequences from longest to shortest using Cutadapt, because there is no way to trim multiple adapters at a time in Trim Galore. Later I will also use Trim Galore for quality trimming.
Q3) Is there any way to minimize these steps?

All the scenarios described above are true for all seven samples I analysed. Also, there is no way to know the actual adapters used from the dataset.

Thanks a lot!
fkrueger replied

06-18-2015, 06:16 AM
I think if a single base or a few bases dip but the quality then recovers, the read will actually survive. This is a sliding window model which isn't super harsh on the data.
