skewer: A fast and sensitive adapter trimmer for paired-end reads

  • skewer: A fast and sensitive adapter trimmer for paired-end reads

    Hello everyone,

    We've implemented a novel tool named skewer for adapter trimming. At the moment it is aimed at preprocessing Illumina paired-end and single-end reads. The main features are as follows:
    * Allows full-length adapter sequence trimming for higher specificity;
    * Allows indel errors when finding the adapter sequence;
    * Very fast: internally it uses a novel local alignment algorithm that has not been published yet. In single-thread mode, it can process a pair of compressed files, each about 12 GB uncompressed, in about 30 minutes. It is even faster in multi-thread mode, but the speedup is limited because only the sequence alignment part is parallelized.
    * Quality-value aware: it evaluates alignments based on sequence qualities.
    * Pairing-information aware: it is more accurate when processing paired-end reads.

    If you are interested in using it, please download it from
    https://sourceforge.net/projects/skewer/

    Any feedback or feature requests are welcome!
    Last edited by relipmoc; 09-23-2013, 04:53 PM. Reason: :)

  • #2
    Hi,

    I have to trim full-length adapter sequences with zero mismatches. I do not want to trim reads on any other criteria at this point.

    I am using the following command line:
    ./skewer-0.1.99-linux-x86_64 -x ACACTCTTTCCCTACACGACGCTCTTCCGATCT -y GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -r 0 -d 0 -o exact_trim_15 -t 8 read_1.fastq paired_read2.fastq

    Log file includes:
    Parameters used:
    -- 3' end adapter sequence (-x): ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    -- paired 3' end adapter sequence (-y): GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
    -- maximum error ratio allowed (-r): 0.000
    -- maximum indel error ratio allowed (-d): 0.000
    -- minimum read length allowed after trimming (-l): 18
    -- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
    -- number of concurrent threads (-t): 8
    Tue Jan 14 02:18:14 2014 >> started

    Tue Jan 14 02:19:33 2014 >> done (78.699s)
    47656840 read pairs processed; of these:
    0 ( 0.00%) short read pairs filtered out after trimming by size control
    0 ( 0.00%) empty read pairs filtered out after trimming by size control
    47656840 (100.00%) read pairs available; of these:
    3202 ( 0.01%) trimmed read pairs available after processing
    47653638 (99.99%) untrimmed read pairs available after processing

    Length distribution of reads after trimming:
    length count percentage
    97 1 0.00%
    98 4 0.00%
    99 3197 0.01%
    100 47653638 99.99%


    My questions are:
    1) Given the input parameter settings, were the 3197 trimmed read pairs really trimmed based only on an exact full-length adapter sequence match? Is there any default parameter that I should be aware of?
    2) What is the overlap length for adapter detection in paired-end mode? Is it something like the first 17 bp of the total length? Is there a way I can change this?
    3) How can I change the number of mismatches allowed when detecting the adapter region in a read? Let's say I want to allow only 2 mismatches (instead of zero) in the full-length adapter sequence.
    4) How can I specify multiple adapter sequences for the read 1 and read 2 data files?

    I would appreciate your help! Thank you!
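
The figures in the log quoted in this post can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the log, the variable names are made up for illustration, and it assumes that length 100 in the distribution corresponds to the untrimmed pairs.

```python
# Numbers copied from the skewer log quoted in post #2.
total_pairs = 47656840
trimmed_pairs = 3202
untrimmed_pairs = 47653638
elapsed_seconds = 78.699
length_counts = {97: 1, 98: 4, 99: 3197, 100: 47653638}


def percent(part, whole):
    """Share of `part` in `whole`, expressed in percent."""
    return 100.0 * part / whole


# Assumption: reads were 100 nt long, so any shorter length means "trimmed".
trimmed_by_length = sum(n for length, n in length_counts.items() if length < 100)

print(f"trimmed:   {percent(trimmed_pairs, total_pairs):.2f}%")
print(f"untrimmed: {percent(untrimmed_pairs, total_pairs):.2f}%")
print(f"pairs with length < 100: {trimmed_by_length}")
print(f"throughput: {total_pairs / elapsed_seconds:,.0f} read pairs per second")
```

The counts are internally consistent: the 3202 trimmed pairs are the 1 + 4 + 3197 entries shorter than 100 nt, and nearly all of them (3197) lost exactly one nucleotide.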


    • #3
      Thank you so much for your feedback!

      Quick answers to your questions:
      1) The search is based on the exact full-length adapter sequence, but for the 3197 read pairs, only the last nucleotides of the reads were identified as the first nucleotides of the corresponding adapter sequences. In current implementation, adapter sequence longer than 64 nt will be cut to 64 nt before processing.

      2) There's no need to specify the overlap length in paired-end mode. The program knows how to do it correctly.

      3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location by a statistical scheme which takes into account the quality values. If you just want to specify the maximum number of mismatches allowed in the full-length adapter sequence, you can use fq2fa.sh to convert the FASTQ files to FASTA files and set the maximum allowed error ratio (-r) to 2/33 ≈ 0.06, since your adapter is 33 nt long. For small RNA adapter trimming, the command looks something like this:
      $ fq2fa.sh srnaReads.fq | skewer -x TCGTATGCCGTCTTCTGCTTGAAAAAAA -L 30 -r 0.06 -o trimmed -

      4) For multiple adapter sequences, you just need to specify two FASTA files which contain adapter sequences, and input something like:
      $ skewer -x adapters1.fa -y adapters2.fa flowcell1_lane7_pair1.fastq.gz flowcell1_lane7_pair2.fastq.gz
      Last edited by relipmoc; 01-14-2014, 08:46 AM.
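
The arithmetic behind the `-r` value in answer 3, together with the FASTQ-to-FASTA step, can be sketched in Python. This is an illustration only: `error_ratio` and `fastq_to_fasta` are hypothetical helper names, and the converter is a guess at what a script like fq2fa.sh does (keep the header and sequence, drop the quality lines), not its actual contents.

```python
import io


def error_ratio(max_mismatches, adapter):
    """Error ratio corresponding to a mismatch budget over the full adapter."""
    return max_mismatches / len(adapter)


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy 4-line FASTQ records to FASTA, dropping the quality values.

    Assumes every record is exactly four lines (no wrapped sequences).
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if header == "":          # end of input
            break
        if not header.strip():    # skip stray blank lines between records
            continue
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()   # '+' separator line
        fastq_handle.readline()   # quality string, discarded
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
        count += 1
    return count


adapter = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # the -x adapter from post #2
ratio = error_ratio(2, adapter)
print(len(adapter), round(ratio, 4))

# Tiny in-memory demonstration of the conversion.
demo_in = io.StringIO("@r1\nACGT\n+\nIIII\n")
demo_out = io.StringIO()
fastq_to_fasta(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Note that 2/33 is slightly above 0.06, so whether a rounded value of 0.06 still admits exactly two mismatches depends on how skewer rounds internally, which is not stated in this thread.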


      • #4
        Thank you for your prompt response!

        I am sorry, I couldn't quite get the "In current implementation, adapter sequence longer than 64 nt will be cut to 64 nt before processing"? I don't think I have adapter more than 62 bp so then why its looking for last few nucleotides (3 I guess here?)?


        • #5
          Also, what base quality value threshold does the tool use to count a base as a mismatch? I am asking about "3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location by a statistical scheme which takes into account the quality values"

          Thanks!


          • #6
            As I said, "there's no need to specify the overlap length in paired-end mode". In fact, there is no parameter, default or otherwise, for the overlap length in paired-end mode.

            The 64 nt statement is irrelevant to your question. I just misunderstood your question "any default parameter that I should be aware of". ^_^

            "why its looking for last few nucleotides (3 I guess here?)". Unfortunately, your guess is not correct. You got this result by chance.

            Originally posted by BhariD View Post
            Thank you for your prompt response!

            I am sorry, I couldn't quite get the "In current implementation, adapter sequence longer than 64 nt will be cut to 64 nt before processing"? I don't think I have adapter more than 62 bp so then why its looking for last few nucleotides (3 I guess here?)?


            • #7
              There's no base quality value threshold. That's all integrated into the statistical scheme. Since we have not published the paper, I can not tell you the details at the moment. Sorry for that!

              Originally posted by BhariD View Post
              Also, what base quality value threshold does the tool use to count a base as a mismatch? I am asking about "3) The program only provides an error ratio parameter (-r) and detects the most likely adapter location by a statistical scheme which takes into account the quality values"

              Thanks!
              Last edited by relipmoc; 01-14-2014, 05:37 PM.


              • #8
                Hi relipmoc,

                I have a couple of questions:

                1) How does skewer handle partial matches? For example if I have a sequence that goes SEQUENCE-ADAPTER-BARCODE, and I just input ADAPTER, will I end up with SEQUENCE?

                2) Why is this sequence not being trimmed? Does skewer only match the entire adapter sequence?

                @test_truseq/1
                CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA
                +
                IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
                @test_truseq/2
                CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA
                +
                IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

                ~/tmp/skewer-0.1.104-linux-x86_64 -x AGATCGGAAGAG -y AGATCGGAAGAG test_cutadapt_1.fastq test_cutadapt_2.fastq

                Thanks! I've been looking around for a faster trimmer and was hoping skewer would be the solution.


                • #9
                  Originally posted by relipmoc View Post
                  There's no base quality value threshold. That's all integrated into the statistical scheme. Since we have not published the paper, I can not tell you the details at the moment. Sorry for that!
                  That's not a good way to get people to use your software!


                  • #10
                    Originally posted by roryk View Post
                    1) How does skewer handle partial matches? For example if I have a sequence that goes SEQUENCE-ADAPTER-BARCODE, and I just input ADAPTER, will I end up with SEQUENCE?
                    The answer is yes. However, for better specificity, you should use ADAPTER-BARCODE as the adapter sequence. Furthermore, if you want to demultiplex the reads, you can specify the --barcode option.

                    Originally posted by roryk View Post
                    2) ... Does skewer only match the entire adapter sequence?
                    The answer is no. skewer can detect a partially matched adapter sequence at the 3' end (or at the 5' end if '-e 5' is specified).

                    Originally posted by roryk View Post
                    2) Why is this sequence not being trimmed? ...
                    @test_truseq/1
                    CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA
                    +
                    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
                    @test_truseq/2
                    CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA
                    +
                    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
                    The sequences are not trimmed because they are not what skewer expects. Is this real data? If so, could you explain why the paired sequences preceding the adapter sequences are not reverse complementary to each other? Are they from mate-pair sequencing rather than paired-end sequencing?

                    Originally posted by roryk View Post
                    ~/tmp/skewer-0.1.104-linux-x86_64 -x AGATCGGAAGAG -y AGATCGGAAGAG test_cutadapt_1.fastq test_cutadapt_2.fastq
                    There's no need to specify -y, if pair1 and pair2 share the same adapter sequence.

                    Originally posted by roryk View Post
                    Thanks! I've been looking around for a faster trimmer and was hoping skewer would be the solution.
                    My pleasure! Hope it will make your work easier.
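
The two points made in this reply, partial adapter matches at the 3' end and the expectation that the mates are reverse complementary before the adapter, can be illustrated with a small Python sketch. It is not skewer's algorithm, which is unpublished: the matching here is exact only, with no quality values and no indels, and all function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T, N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def adapter_overlap_at_3prime(read, adapter):
    """Length of the longest read suffix that exactly equals an adapter prefix."""
    for k in range(min(len(read), len(adapter)), 0, -1):
        if read[-k:] == adapter[:k]:
            return k
    return 0


def inserts_are_reverse_complementary(read1, read2, adapter1, adapter2):
    """Strip the exact 3' adapter overlap from each mate and compare inserts."""
    k1 = adapter_overlap_at_3prime(read1, adapter1)
    k2 = adapter_overlap_at_3prime(read2, adapter2)
    insert1 = read1[:len(read1) - k1]
    insert2 = read2[:len(read2) - k2]
    return insert1 == reverse_complement(insert2)


adapter = "AGATCGGAAGAG"

# The test pair quoted above (test_truseq/1 and test_truseq/2).
read1 = "CGATGATCAAGACCCAAGTGTGAGATTACGGAGATCGGAA"
read2 = "CGATGATCAAGACCCAAGTGTGAGATTACTCAGATCGGAA"
print(adapter_overlap_at_3prime(read1, adapter))
print(inserts_are_reverse_complementary(read1, read2, adapter, adapter))

# A synthetic pair built the way a short paired-end fragment would read:
# mate 1 is insert + adapter start, mate 2 is revcomp(insert) + adapter start.
insert = "ACGGTCATTGCA"
good1 = insert + adapter[:6]
good2 = reverse_complement(insert) + adapter[:6]
print(inserts_are_reverse_complementary(good1, good2, adapter, adapter))
```

In the quoted test pair both mates end with the first nine adapter bases, but the sequences in front of them are nearly identical rather than reverse complementary, which is the inconsistency pointed out in this reply.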


                    • #11
                      Originally posted by frozenlyse View Post
                      That's not a good way to get people to use your software!
                      I have to apologize to those who want to know the technical details. However, you won't have to wait too long. I'll inform you once our submission is accepted. Thanks!


                      • #12
                        Originally posted by relipmoc View Post
                        I have to apologize to those who want to know the technical details. However, you won't have to wait too long. I'll inform you once our submission is accepted. Thanks!
                        Any chance you can release OSX binaries?


                        • #13
                          I have a question:

                          How does skewer handle the situation where only one read of a pair survives and the other one does not? I didn't see singletons in the output files.


                            • #15
                              Originally posted by kidaaaa View Post
                              I have a question:

                              How does skewer handle the situation where only one read of a pair survives and the other one does not? I didn't see singletons in the output files.
                              Thank you for your question! Currently skewer only outputs those pairs that are concordantly trimmed or untouched.
