cutadapt: guidance on rationale for --overlap=LENGTH values

mgg

Member

Join Date: Nov 2011

Posts: 12
#1


12-29-2011, 05:28 AM

Hi,

I'm getting up to speed with Marcel Martin's cutadapt for removing adapter sequences from Illumina libraries.

I'd appreciate some expert input on what values might be reasonable for the -O (--overlap) parameter. As explained in the excellent help pages, this parameter defines the minimum overlap between the input adapter sequence and read sequences:

Code:

-O LENGTH, --overlap=LENGTH
                      Minimum overlap length. If the overlap between the
                      read and the adapter is shorter than LENGTH, the read
                      is not modified. This reduces the no. of bases
                      trimmed purely due to short random adapter matches
                      (default: 3).

My initial run has used the default. I get trim data like this:

Code:

Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC', length 33, was trimmed 205113 times.

Histogram of adapter lengths
length  count
3       95613
4       12912
5       5869
6       4173
7       3809
8       3323
9       3183
10      2934
11      2938
12      2464
13      2213
14      2041
15      1803
16      1650
17      1489
18      1334
19      1269
20      1147
21      1031
22      909
23      859
24      819
25      701
26      656
27      536
28      528
29      488
30      386
31      368
32      351
33      47317

I can see this is not sensible.

With length=3 I get 95613 reads trimmed in my library; a good proportion of these matches must be spurious (i.e. by chance).
With length=33 I get 47317 reads trimmed. These have a better chance of not being spurious.

Somewhere between the boundaries here (3 .. 33) there must be a 'sensible' value for --overlap. How do I identify it? Using what rationale? I could just opt for 33, the length of the adapter. But that would discount the probability that matches of length 32 are also genuine (this library has an abrupt shift from length 32 with 351 reads to length 33 with 47317 reads, but not all my libraries look like this).

How might I go about this?

TIA
mgg
Tags: cutadapt, illumina, overlap
mmartin

Member

Join Date: Aug 2009

Posts: 75
#2

01-05-2012, 10:23 AM

My strategy so far has been not to worry too much about the bases that get lost due to random matches. It depends on your data, but although 95613 looks large, you lose “only” 95613 x 3 bp (286839 bp), which may not be that bad.
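
To put numbers on that for other candidate values of --overlap, here is a minimal Python sketch. It only does arithmetic on the histogram from post #1; the names HISTOGRAM and bases_spared are illustrative and are not part of cutadapt.

```python
# Histogram copied from the cutadapt report in post #1: {match length: count}.
HISTOGRAM = {
    3: 95613, 4: 12912, 5: 5869, 6: 4173, 7: 3809, 8: 3323, 9: 3183,
    10: 2934, 11: 2938, 12: 2464, 13: 2213, 14: 2041, 15: 1803, 16: 1650,
    17: 1489, 18: 1334, 19: 1269, 20: 1147, 21: 1031, 22: 909, 23: 859,
    24: 819, 25: 701, 26: 656, 27: 536, 28: 528, 29: 488, 30: 386,
    31: 368, 32: 351, 33: 47317,
}


def bases_spared(histogram, min_overlap):
    """Bases that would no longer be trimmed if --overlap were raised to
    min_overlap, i.e. the sum of length * count over all shorter matches."""
    return sum(
        length * count
        for length, count in histogram.items()
        if length < min_overlap
    )


# Total number of bases trimmed with the default setting (--overlap=3).
total_bases = sum(length * count for length, count in HISTOGRAM.items())

for candidate in (4, 10, 20, 33):
    spared = bases_spared(HISTOGRAM, candidate)
    share = 100.0 * spared / total_bases
    print(f"--overlap={candidate}: {spared} bp no longer trimmed ({share:.1f}% of all trimmed bases)")
```

For example, bases_spared(HISTOGRAM, 4) returns 286839, the 95613 x 3 bp mentioned above. The sketch says nothing about which of those matches are genuine; it only shows how many bases are at stake for each threshold.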

However, the “count” column in your histogram decreases almost monotonically from length 3 to 32. This is different from what I see in my data. One explanation is that your adapter almost never appears partially: it's either fully there or not at all, and all matches from length 3 to 32 are, in fact, spurious. In that case, you can safely set --overlap to 33.

I'll probably change the output that cutadapt prints to make this all a bit clearer. It would perhaps be helpful to print the number of bases removed and to give an estimate of how many of those were removed due to chance alone.
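
One simple way such a chance estimate could be made is sketched below. This is only an illustration under stated assumptions, not what cutadapt itself does: it assumes the four bases are independent and equally likely, that no mismatches are allowed, and that a partial match of length k means the last k bases of the read equal the first k bases of the adapter. The total number of reads is not given in this thread, so N_READS is a placeholder, and the function name is hypothetical.

```python
def expected_random_matches(n_reads, length):
    """Expected number of reads whose last `length` bases equal the first
    `length` bases of the adapter purely by chance.

    Assumes independent, uniformly distributed bases and exact matching,
    so each read matches with probability (1/4) ** length.
    """
    return n_reads * 0.25 ** length


# Placeholder only: the real read count is not stated in the thread.
N_READS = 1_000_000

for k in (3, 4, 5, 6, 8, 10):
    print(f"length {k}: about {expected_random_matches(N_READS, k):.1f} chance matches expected")
```

Under this model the expected count drops by a factor of four for every extra base, so comparing it with the observed “count” column gives a rough idea of the length from which matches are mostly genuine. Real libraries have uneven base composition, and full-length matches can occur anywhere in the read, so the figures are a rough guide at best.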