
  • Trimming Nextera XT Sequence Data

    I've got some MiSeq data from Nextera XT prepped libraries. I figure that it will be necessary to trim adapters and/or transposase sequences from the data and I'm hoping someone can assist me. I'd like to use Scythe for trimming.

    Scythe requires an adapters/contaminants file (FASTA format) as input, and I'm confused as to how to construct this file. Illumina provides the following information on the sequences:

    Nextera® transposase sequences
    (a) Read 1 -->
    (d) Read 2 -->
    Nextera® Index Kit - PCR primers
    (c) i5 Index read -->
    <-- i7 Index read (b)

    Could I simply use the following adapter FASTA file content? (I've combined the adapter and transposase sequences into a single string using the overlapping region.) What do I replace the [i5] and [i7] barcode tags with?


    Thanks for any help you can provide!

  • #2
    A list of all Nextera adapter sequences, complete with bar codes, is packaged with BBTools in fasta format, in the resources directory (nextera.fa.gz).

    I recommend using BBDuk rather than Scythe, but of course, I'm biased. Anyway, the adapter sequences will work with any adapter-trimming program.


    • #3
      Very helpful. Thanks Brian!

      Another question. I notice from various reading I have done on this forum and elsewhere that many people do not use the entire adapter/barcode/whatever-contaminant sequence when trimming. Often I see them put just the first 8, 10, 12 bases into their contaminant file. Is there an advantage/disadvantage to using the truncated sequence over the full sequence(s)? Do most trimmers simply look for the specified sequence and trim that and everything after?

      EDIT: middle of the night grammar


      • #4
        I am not really sure what most trimmers do, but when looking for a full-sequence match, in the presence of error, you will trim more adapters with an 8bp sequence than a 12bp sequence. Of course, you will also incur more false-positives.
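To put rough numbers on the sensitivity side of that trade-off, here is a minimal sketch. It assumes independent substitution errors at a fixed per-base rate (the 1% used below is an illustrative assumption, not a measured MiSeq figure), and the function name `detection_probability` is hypothetical.

```python
from math import comb


def detection_probability(length, per_base_error, max_mismatches=0):
    """Probability that a true adapter stretch of `length` bases is read with
    at most `max_mismatches` substitution errors (simple binomial model)."""
    return sum(
        comb(length, m)
        * per_base_error ** m
        * (1.0 - per_base_error) ** (length - m)
        for m in range(max_mismatches + 1)
    )


# Assumed per-base error rate, for illustration only.
error_rate = 0.01

for n in (8, 12):
    exact = detection_probability(n, error_rate)
    one_off = detection_probability(n, error_rate, max_mismatches=1)
    print(f"{n} bp: exact match {exact:.4f}, up to 1 mismatch {one_off:.4f}")
```

Under this model the shorter sequence is more likely to be read without error, which is why it catches more adapters when an exact full-sequence match is required. Allowing a mismatch narrows the gap.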

        Some people trim as little as 1bp, allowing up to 1 mismatch. That will, of course, shorten all reads by a minimum of 1... and I think that's a bad approach unless a single adapter base is devastating - in which case, I think it's better to reorganize your experiment so that a single adapter base will not be devastating.

        BBDuk matches full-length kmers in the middle of the read, and at the very end of the read, when there are fewer than K bases left, it will match kmers from the ends of adapters down to the "mink" setting. So, providing longer adapter sequences is generally advantageous. You can set "mink" to 8bp if you want, which gives sensitivity similar to using an 8-bp adapter sequence, but better specificity.
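The idea described above can be illustrated with a toy right-trimmer. This is only a sketch of the concept, not BBDuk's implementation: it uses exact matching (no hamming distance), the function name `trim_right` is hypothetical, and the adapter and insert sequences are made up for the example rather than taken from any Nextera kit. The small `k=8, mink=4` values are chosen just to keep the example short.

```python
def trim_right(read, adapter, k=23, mink=11):
    """Toy right-trimming: full k-mers anywhere in the read, then shorter
    adapter prefixes (down to mink) at the very end of the read."""
    if mink < 1:
        raise ValueError("mink must be at least 1")
    k = min(k, len(adapter))
    adapter_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}

    # Step 1: look for any full-length adapter k-mer inside the read and
    # trim from the leftmost match to the end.
    for i in range(len(read) - k + 1):
        if read[i:i + k] in adapter_kmers:
            return read[:i]

    # Step 2: fewer than k adapter bases may remain at the read end, so
    # compare the read's tail against adapter prefixes of length k-1 .. mink.
    for length in range(min(k - 1, len(read)), mink - 1, -1):
        if read[-length:] == adapter[:length]:
            return read[:-length]

    return read  # nothing recognisable as adapter


# Made-up sequences for illustration only.
adapter = "ACGTTGCAAGGCTTAC"
insert = "TTTTGGGGCCCCAAAATTGG"

print(trim_right(insert + adapter[:12], adapter, k=8, mink=4) == insert)  # full k-mer hit
print(trim_right(insert + adapter[:5], adapter, k=8, mink=4) == insert)   # end-of-read hit
print(trim_right(insert + adapter[:3], adapter, k=8, mink=4) == insert)   # below mink, untouched
```

The third read keeps its 3 adapter bases because they fall below `mink`, which is the sensitivity/specificity trade-off the setting controls.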

        Generally, though, I recommend 11bp as a minimum for mink - meaning, a match for the last 11bp of a read (with a hamming distance of 0 or 1), and a kmer length of 23 for nonterminal kmers. If you use 8bp adapter sequences and trim wherever you see a match for them, you have a 1/4^8 = 1/2^16 = 1/65536 chance of a spurious match at any given position, even if you require an exact match. That means that for assembly, you will on average not get any contigs longer than 64kbp! Which is terrible.
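The arithmetic behind that estimate can be checked with a few lines. This sketch assumes uniformly random, independent bases, which real genomes only approximate; the function name `spurious_match_probability` is hypothetical.

```python
from math import comb


def spurious_match_probability(k, max_mismatches=0):
    """Chance that a random k-mer lies within `max_mismatches` substitutions
    of one fixed k-mer, assuming uniform independent bases."""
    # Number of k-mers within the allowed hamming distance of the target.
    neighbours = sum(comb(k, m) * 3 ** m for m in range(max_mismatches + 1))
    return neighbours / 4 ** k


p8 = spurious_match_probability(8)
print(4 ** 8)             # 65536
print(round(1 / p8))      # 65536 positions between spurious hits, on average
print(65536 / 1024)       # 64.0, i.e. roughly one false trim every 64 kbp

# An 11-mer, exact and with hamming distance 1, for comparison.
print(spurious_match_probability(11))
print(spurious_match_probability(11, max_mismatches=1))
```

Note that the 8bp figure applies at every position when trimming wherever a match is seen, whereas a terminal match is only tested at the read end.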

        It depends on what your goal is, though. When looking for super-rare 1/100 rate mutations, trimming adapters as much as possible may be wise, even if you lose data in a biased way.

        P.S. I forgot to mention, BBDuk's "tbo" flag will allow you to trim reads with even 1bp of adapter sequence with very little risk of false-positives, by looking for where the reads overlap. It requires paired reads, but it will work even with unknown adapter sequences.
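A rough sketch of why overlap-based trimming needs no adapter sequence: when the insert is shorter than the read length, read 1 starts with the insert and read 2 starts with its reverse complement, so anything past the overlap must be adapter. The code below is a toy illustration with exact matching only, not how BBDuk implements "tbo"; the names `trim_by_overlap` and `min_overlap`, and all sequences, are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_by_overlap(read1, read2, min_overlap=10):
    """If the first n bases of read1 equal the reverse complement of the
    first n bases of read2, treat n as the insert length and trim both
    reads to n. Otherwise return the reads unchanged."""
    longest = min(len(read1), len(read2))
    for n in range(longest, min_overlap - 1, -1):
        if read1[:n] == revcomp(read2[:n]):
            return read1[:n], read2[:n]
    return read1, read2


# Made-up 20 bp insert with made-up 5 bp read-through tails on each read.
insert = "ACGTACGGTTCAGGCTAAGC"
read1 = insert + "CTGTC"
read2 = revcomp(insert) + "AGATG"

trimmed1, trimmed2 = trim_by_overlap(read1, read2)
print(trimmed1 == insert, trimmed2 == revcomp(insert))
```

Because the decision rests on the two reads agreeing over the whole insert, even a single trailing adapter base can be removed without relying on a 1bp adapter match.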
        Last edited by Brian Bushnell; 02-19-2015, 06:34 PM.

