  • mgg
    Member
    • Nov 2011
    • 12

    cutadapt: guidance on rationale for --overlap=LENGTH values

    Hi,

    I'm getting up to speed with Marcel Martin's cutadapt for removing adapter sequences from Illumina libraries.

    I'd appreciate some expert input on what values might be reasonable for the -O (--overlap) parameter. As explained in the excellent help pages, this parameter defines the minimum overlap between the input adapter sequence and read sequences:
    Code:
      -O LENGTH, --overlap=LENGTH
           Minimum overlap length. If the overlap between the read and the adapter is
           shorter than LENGTH, the read is not modified. This reduces the no. of bases
           trimmed purely due to short random adapter matches (default: 3).
    My initial run has used the default. I get trim data like this:

    Code:
    Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC', length 33, was trimmed 205113 times.
    
    Histogram of adapter lengths
    length  count
    3       95613
    4       12912
    5       5869
    6       4173
    7       3809
    8       3323
    9       3183
    10      2934
    11      2938
    12      2464
    13      2213
    14      2041
    15      1803
    16      1650
    17      1489
    18      1334
    19      1269
    20      1147
    21      1031
    22      909
    23      859
    24      819
    25      701
    26      656
    27      536
    28      528
    29      488
    30      386
    31      368
    32      351
    33      47317
    I can see this is not sensible.

    With length=3 I get 95613 reads removed from my library; a good proportion of these must be spurious (i.e. by chance).
    With length=33 I get 47317 reads removed. These have a better chance of not being spurious.

    Somewhere between these boundaries (3 .. 33) there must be a 'sensible' value for --overlap. How do I identify it? Using what rationale? I could just opt for 33, the length of the adapter. But that would discount the possibility that matches of length 32 are also genuine (this library has an abrupt shift from 351 reads at length 32 to 47317 at length 33, but not all my libraries look like this).

    How might I go about this??

    TIA
    mgg
  • mmartin
    Member
    • Aug 2009
    • 73

    #2
    My strategy so far has been not to worry too much about the bases that get lost due to random matches. It depends on your data, but although 95613 looks large, you lose “only” 95613 × 3 bp, which may not be that bad.

    However, the “count” column in your histogram decreases monotonically from length 3 to 32. This is different from what I see in my data. One explanation is that your adapter almost never appears partially – it's either fully there or not at all and all matches from length 3 to 32 are, in fact, spurious. In that case, you can safely set --overlap to 33.
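
The two points above can be checked with a little arithmetic on the histogram from the first post. The following is only a sketch: the helper names are made up for illustration, and `reads_trimmed` assumes that raising --overlap simply leaves reads with shorter matches untouched without changing the longer matches.

```python
# Histogram from the first post: counts for match lengths 3..33, in order.
COUNTS = [95613, 12912, 5869, 4173, 3809, 3323, 3183, 2934, 2938, 2464,
          2213, 2041, 1803, 1650, 1489, 1334, 1269, 1147, 1031, 909,
          859, 819, 701, 656, 536, 528, 488, 386, 368, 351, 47317]
HISTOGRAM = dict(zip(range(3, 34), COUNTS))


def bases_removed(histogram, length):
    """Bases trimmed by all matches of exactly this length."""
    return histogram[length] * length


def reads_trimmed(histogram, min_overlap):
    """Reads still trimmed if only matches >= min_overlap are accepted.

    Assumption: matches shorter than min_overlap are skipped and the
    remaining matches are unaffected.
    """
    return sum(n for length, n in histogram.items() if length >= min_overlap)


if __name__ == "__main__":
    print("bases lost to length-3 matches:", bases_removed(HISTOGRAM, 3))
    for overlap in (3, 10, 20, 33):
        print("--overlap", overlap, "->", reads_trimmed(HISTOGRAM, overlap), "reads trimmed")
```

With these counts, the length-3 matches account for 286839 bp, and --overlap=33 would leave 47317 of the 205113 trimmed reads.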

    I'll probably change the output that cutadapt prints to make this all a bit clearer. It would perhaps be helpful to print the number of bases removed and to give an estimate of how many of those were removed by chance alone.
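
One crude way to get such an estimate, sketched here and not taken from cutadapt itself: if the four bases are equally likely and only exact matches count, a read end matches the first k bases of the adapter by chance with probability 0.25^k. The function names and the read count below are hypothetical; the thread does not give the total number of reads. The model also ignores the mismatches cutadapt may tolerate in longer matches, and any bias in base composition.

```python
def expected_random_matches(n_reads, length):
    """Expected number of reads whose end matches `length` adapter bases by chance.

    Assumes uniform, independent bases and exact matching only.
    """
    return n_reads * 0.25 ** length


def expected_random_bases(n_reads, min_overlap, adapter_length):
    """Expected bases trimmed by chance for matches of min_overlap..adapter_length."""
    return sum(
        k * expected_random_matches(n_reads, k)
        for k in range(min_overlap, adapter_length + 1)
    )


if __name__ == "__main__":
    n_reads = 1_000_000  # placeholder value, not from the original post
    for k in (3, 4, 5, 10):
        print(k, expected_random_matches(n_reads, k))
    print("bases lost by chance at --overlap=3:", expected_random_bases(n_reads, 3, 33))
```

Comparing these expectations with the observed counts at each length gives a rough idea of where genuine partial adapter matches begin to outnumber chance matches.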
