  • FASTXtoolkit adapter trimming

    Hi All

    I recently downloaded the FASTX toolkit and tried to use it to trim adapter sequences from fastq reads. This did not work: the tool simply discarded any read containing the adapter sequence, although according to the documentation that is not what it is meant to do. I wrote to the help contact for the tool but received no response (see below for details). Has anyone used this tool successfully for this purpose?

    Thanks for your help

    Mark

    #############################################
    Hello

    I recently downloaded the FASTX toolkit (fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2) and attempted to use the fastx_clipper tool. I created a test fastq file (3 of the four sequences contain the default adapter CCTTAAGG):

    @test1
    CCTTAAGGAAAAAAAAAAGGGGGGGGGG
    +test1
    HHHHHHHHHHHHHHHHHHHHHHHHHHHH
    @test2
    CCTTAAGGAAAAAAAAAGGGGGGGGGGG
    +test2
    HHHHHHHHHHHHHHHHHHHHHHHHHHHH
    @test3
    AGAGAGAGAGAGAGAGAGAGAGAGAGAG
    +test3
    HHHHHHHHHHHHHHHHHHHHHHHHHHHH
    @test4
    CCTTAAGGTTGACGTGATCGACACCTGG
    +test4
    [[[[[[[[[[[[[[[[[[[[[[[[[[[[

    And then executed the command (as shown on FASTX toolkit website)

    -bash-3.2$ fastx_clipper -v -i test.fastq -a CCTTAAGG
    @test3
    AGAGAGAGAGAGAGAGAGAGAGAGAGAG
    +test3
    HHHHHHHHHHHHHHHHHHHHHHHHHHHH
    Clipping Adapter: CCTTAAGG
    Min. Length: 5
    Input: 4 reads.
    Output: 1 reads.
    discarded 0 too-short reads.
    discarded 3 adapter-only reads.
    discarded 0 N reads.

    As you can see, the three reads that contain the adapter are discarded as “adapter-only reads”, which (in my way of looking at things) they are not, nor are they too short (default <=5) after any trimming. What is going on here? Does this tool actually trim reads, or does it only discard reads in which the adapter is found? If the former, would you please tell me what I am doing incorrectly? Also if the former, is it possible to supply the tool with multiple adapters to trim?

    Thanks for your help

    Mark

  • #2
    I can't help you with the FASTX toolkit, but here is how to do it with Biopieces (www.biopieces.org).


    Code:
    read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before
    SCORES: HHHHHHHHHHHHHHHHHHHH
    SEQ: AAAAAAAAAAGGGGGGGGGG
    ADAPTOR_POS: 0
    SEQ_LEN: 20
    SEQ_NAME: test1
    ---
    SCORES: HHHHHHHHHHHHHHHHHHHH
    SEQ: AAAAAAAAAGGGGGGGGGGG
    ADAPTOR_POS: 0
    SEQ_LEN: 20
    SEQ_NAME: test2
    ---
    SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
    SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
    ADAPTOR_POS: -1
    SEQ_LEN: 28
    SEQ_NAME: test3
    ---
    SCORES: [[[[[[[[[[[[[[[[[[[[
    SEQ: TTGACGTGATCGACACCTGG
    ADAPTOR_POS: 0
    SEQ_LEN: 20
    SEQ_NAME: test4
    ---

    Use grab to get the entries that were trimmed and finally use write_fastq to create a new file:

    Code:
    read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before | grab -e 'ADAPTOR_POS>=0' | write_fastq -o test_trimmed.fastq -x

    Cheers,


    Martin
    Last edited by maasha; 11-25-2010, 06:24 AM.



    • #3
      Oh, and if you want to trim multiple adaptors either process the fastq file several times or just use remove_adaptor multiple times:

      Code:
      read_fastq -i test.fastq |
      remove_adaptor -a CCTTAAGG -r before |
      remove_adaptor -a GACACCTGG -r after
      
      SCORES: HHHHHHHHHHHHHHHHHHHH
      SEQ: AAAAAAAAAAGGGGGGGGGG
      SEQ_NAME: test1
      SEQ_LEN: 20
      ADAPTOR_POS: -1
      ---
      SCORES: HHHHHHHHHHHHHHHHHHHH
      SEQ: AAAAAAAAAGGGGGGGGGGG
      SEQ_NAME: test2
      SEQ_LEN: 20
      ADAPTOR_POS: -1
      ---
      SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
      SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
      SEQ_NAME: test3
      SEQ_LEN: 28
      ADAPTOR_POS: -1
      ---
      SCORES: [[[[[[[[[[[
      SEQ: TTGACGTGATC
      SEQ_NAME: test4
      SEQ_LEN: 11
      ADAPTOR_POS: 11
      ---
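For readers without Biopieces installed, the chained trimming above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `trim_at_adaptor` and `trim_all` are hypothetical helpers, matching is exact (no mismatches or partial adaptors), and the meaning of "before" and "after" is inferred from the output shown in this post rather than taken from the Biopieces source.

```python
def trim_at_adaptor(seq, qual, adaptor, remove):
    """Trim a read at an exact adaptor match (assumed semantics).

    remove="before": drop the adaptor and everything before it.
    remove="after":  drop the adaptor and everything after it.
    Returns (seq, qual, adaptor_pos); adaptor_pos is -1 if nothing matched.
    """
    pos = seq.find(adaptor)
    if pos == -1:
        return seq, qual, -1
    if remove == "before":
        cut = pos + len(adaptor)
        return seq[cut:], qual[cut:], pos
    if remove == "after":
        return seq[:pos], qual[:pos], pos
    raise ValueError("remove must be 'before' or 'after'")


def trim_all(seq, qual, steps):
    """Apply several (adaptor, remove) steps one after another."""
    pos = -1
    for adaptor, remove in steps:
        seq, qual, pos = trim_at_adaptor(seq, qual, adaptor, remove)
    return seq, qual, pos  # pos reports the last step only


# Same two steps as the pipeline above, applied to read test4.
steps = [("CCTTAAGG", "before"), ("GACACCTGG", "after")]
test4 = "CCTTAAGGTTGACGTGATCGACACCTGG"
seq, qual, pos = trim_all(test4, "[" * len(test4), steps)
print(seq, len(seq), pos)
```

Because each step works on the output of the previous one, the order of the steps matters, just as it does when the fastq file is processed several times.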


      M
      Last edited by maasha; 11-25-2010, 07:09 AM.



      • #4
        Hi Mark,

        Based on my understanding, fastx_clipper first finds the adaptor sequence you give and then trims off the adaptor and any nucleotide sequence after it. I think fastx_clipper is designed for removing an adaptor that follows the insert sequence. This is why, in your test fastq file, reads test1, test2 and test4 were considered adaptor-only reads.

        I think if what you want is to remove a 5' end adaptor in front of the insert sequence, fastx_trimmer might be able to help.
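The difference between the two behaviours can be made concrete with a small Python sketch. It is not fastx_clipper's or fastx_trimmer's code: `clip_3prime` and `strip_5prime` are hypothetical helpers that assume exact matching, and they only mimic the behaviour as described in this post. Cutting the adaptor and everything after it leaves nothing when the read starts with the adaptor, which would explain the "adapter-only" count.

```python
def clip_3prime(seq, qual, adapter):
    """Cut the adapter and everything after it (exact match only)."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual          # no adapter: read is left untouched
    return seq[:pos], qual[:pos]  # keep only what precedes the adapter


def strip_5prime(seq, qual, adapter):
    """Remove an adapter that sits at the very start of the read."""
    if seq.startswith(adapter):
        return seq[len(adapter):], qual[len(adapter):]
    return seq, qual


# The four test reads from the first post (qualities filled in per read).
reads = {
    "test1": "CCTTAAGGAAAAAAAAAAGGGGGGGGGG",
    "test2": "CCTTAAGGAAAAAAAAAGGGGGGGGGGG",
    "test3": "AGAGAGAGAGAGAGAGAGAGAGAGAGAG",
    "test4": "CCTTAAGGTTGACGTGATCGACACCTGG",
}
ADAPTER = "CCTTAAGG"

clipped = {n: clip_3prime(s, "H" * len(s), ADAPTER) for n, s in reads.items()}
stripped = {n: strip_5prime(s, "H" * len(s), ADAPTER) for n, s in reads.items()}

# Reads left empty by 3' clipping are the ones reported as adapter-only.
adapter_only = sorted(n for n, (s, _) in clipped.items() if s == "")

for name in reads:
    print(name, "| 3' clip:", repr(clipped[name][0]), "| 5' strip:", stripped[name][0])
print("adapter-only after 3' clipping:", adapter_only)
```

With the adapter at the 5' end, only the second helper leaves a usable insert; the first one empties test1, test2 and test4 and leaves test3 unchanged.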

        Best wishes,
        gghl



        • #5
          We rewrote a lot of fastx's toolkit stuff, and posted it here: https://code.google.com/p/ea-utils/. It attempts to do things like adapter removal, trimming, etc... without as much configuration by detecting presence of adapters located in a common file.



          • #6
            Thanks I'll give it a try

            I noticed at your site a tool for stitching pe reads called fastq-join. It doesn't appear to be available yet. When will it be?



            • #7
              You can just grab the code... it's POSIX C++ and should compile easily:



              g++ -O3 fastq-join.c -o fastq-join



              • #8
                Note: I made a change recently to properly use the "better quality base" in the overlapping region... there was a bug in it that someone pointed out. If you're using it, you'll want the newer version.



                • #9
                  Question about fastq-mcf

                  Hi,

                  I encountered an issue when using fastq-mcf on my GA2 generated 1x36 reads, and wondering if you could shed some light.

                  So I made my fasta file with all the TruSeq adapter sequences in it and ran fastq-mcf using that file, with the -P Phred scale set to 33 because my files are in Sanger fastq format. All other parameters were left at their defaults.

                  After trimming was completed, the outfile reports removing about 10 million reads out of 24 million.

                  I run the trimmed file through FastQC, and under the "over-represented sequences" tab, I see that partial adapter sequences (e.g. starting from bp #2) are still over-represented in my file, which suggests that they were not trimmed.

                  My question is, does fastq-mcf remove partial matches to adapter sequences provided, as well as full? If so, am I doing something wrong with the way I am using the tool?

                  I am pretty new to bioinformatics, so sorry if this is a stupid question...

                  Thank you!



                  • #10
                    1. It does remove partial matches. It searches only from one "end" of the file. The default settings are very conservative, so if it's removing 10 million reads, that's an enormous number - you may want to change the settings to be more aggressive for that data.

                    2. Can you post the summary output? It should say how many sequences were removed or clipped, and why.

                    3. Until very recently, GAIIs output Phred+64 quality scores by default, not Phred+33, so you may want to double-check that.

                    EXAMPLE OUTPUT:

                    Code:
                    Scale used: 2.2
                    Threshold used: 101 out of 40000
                    Adapter ILMN RT_primer_rc (TCGTATGCCGTCTTCTGCTTG): counted 193 at the 'end' of 'example.fastq', clip set to 6
                    Adapter FLUIDIGM Index-SP (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG): counted 1063 at the 'end' of 'example.fastq', clip set to 4
                    Files: 1
                    Total reads: 250000
                    Too short after clip: 53
                    Clipped 'end' reads: Count: 16612, Mean: 18.12, Sd: 17.44
                    Trimmed 24474 reads by an average of 10.81 bases on quality < 10
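On point 3 above, a quick way to double-check the quality encoding is to look at the lowest ASCII code in the quality lines. The sketch below is a common heuristic, not part of fastq-mcf; `guess_phred_offset` is a hypothetical helper, and it can only rule encodings out, so a file containing nothing but high-quality bases stays ambiguous.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality encoding from the lowest ASCII code seen.

    Codes below 59 (';') cannot occur in +64 data, so they imply Phred+33.
    If every code is 64 ('@') or higher, Phred+64 is the likelier encoding,
    although high-quality Phred+33 data can look the same.
    Returns 33, 64, or None when the input is empty or ambiguous.
    """
    lowest = min((ord(ch) for line in quality_lines for ch in line), default=None)
    if lowest is None:
        return None
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None  # 59..63: could be old Solexa-scaled +64 scores


print(guess_phred_offset(["IIII#5"]))   # '#' is code 35, so Phred+33
print(guess_phred_offset(["hhhhYB"]))   # lowest is 'B' (66), so Phred+64 is likely
```

In practice the check should be run over many reads, since a handful of reads may not contain any low-quality base.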
                    Last edited by earonesty; 06-14-2011, 06:43 AM.



                    • #11
                      Hi earonesty,

                      Thanks for getting back to me!
                      Here is an example of the output I received:

                      Scale used: 2.2
                      Threshold used: 101 out of 40000
                      Adapter TruSeq-Adapter1 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter2 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter3 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter4 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG): counted 10159 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter5 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter6 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter7 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter8 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter9 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter10 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter11 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Adapter TruSeq-Adapter12 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
                      Files: 1
                      Total reads: 21964185
                      Too short after clip: 35672
                      Clipped 'start' reads: Count: 13283064, Mean: 1.58, Sd: 1.15
                      Trimmed 394023 reads by an average of 7.15 bases on quality < 10


                      So far, what I understand is that my samples probably contain a lot of adapter-pair ligations without any genomic insert. As a result, the entirety of my 36bp read is a portion of the index/adapter. And since the adapter is much longer than 36bp, I think those reads are not being removed. e.g.:

                      My sample DNA: ADAPTER1adapter2
                      read: dapte

                      I say this because when I put the cleaned up reads through FastQC again, I see that all the "Over-represented Sequences" that are TruSeq Indexes are still present in my file.

                      I've managed to resolve this particular issue by basically copying in the sequence given by FastQC as the overrepresented sequence, and using those in a fasta file as the adapter sequences. It works well for my case, so maybe there is nothing wrong with the toolkit, and it's just my particular sample?
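The hypothesis above (reads that lie entirely inside an adapter) can be checked directly with a substring test. This is a minimal sketch, not something fastq-mcf does: `containing_adapters` is a hypothetical helper, the two example reads are made up for illustration, and only exact matches are found, so reads with sequencing errors would need approximate matching.

```python
# Adapter sequence copied from the fastq-mcf summary above.
ADAPTERS = {
    "TruSeq-Adapter1": "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG",
}


def containing_adapters(read, adapters):
    """Return the names of all adapters that contain the whole read."""
    return [name for name, seq in adapters.items() if read in seq]


adapter = ADAPTERS["TruSeq-Adapter1"]
dimer_read = adapter[1:37]   # made-up 36 bp read starting at adapter base 2
genomic_read = "ACGT" * 9    # made-up 36 bp read with no adapter content

print(containing_adapters(dimer_read, ADAPTERS))
print(containing_adapters(genomic_read, ADAPTERS))
```

Running the same test over the over-represented sequences reported by FastQC would show which of them are internal adapter fragments.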

                      Yes, I understand that Illumina reads use Phred64, but I always convert directly to Sanger Phred33 as soon as I get my files, which is why I put the -P 33 option in there.

                      Thanks,

                      Angela



                      • #12
                        - Your adapter file seems to have the same sequence over and over? I'm not sure how that will affect things. TruSeq-Adapter2 is the same as TruSeq-Adapter1.... etc. Try just using 1 per unique sequence. This probably won't help.

                        - Out of 40000 reads, 10000 had an exact match for 15 base pairs of adapter sequence. That's a lot. So when it says "clip set to 1" it will clip any matching subsequence.

                        - It only discarded 35672 reads and only a few bases. That's surprising to me considering the number of sequences it found in the subsample with exact matches. I would expect a higher rate of discards, and a higher number of mean bases clipped.

                        - This is a situation where I wish I could see about 100K reads from your sample and just run it a few times to see what happened and why. It should be walking the adapter along the sequence looking for the best match. It seems to be stopping early on... or perhaps the sequences that match the adapter are somewhere else (at the end...?) and it guessed wrong (you can force -e)

                        - There's also an undocumented "-d" option that spits out lots of debug info that I find useful.
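For anyone who wants to reproduce the kind of subsample count mentioned above on their own data, here is a rough Python sketch. It is explicitly not fastq-mcf's algorithm: `adapter_kmers` and `count_exact_hits` are hypothetical helpers, the rule for calling a hit 'start' or 'end' (which half of the read the first hit falls in) is an assumption made for illustration, and the example reads are made up.

```python
def adapter_kmers(adapter, k=15):
    """All k-base substrings of the adapter."""
    return {adapter[i:i + k] for i in range(len(adapter) - k + 1)}


def count_exact_hits(reads, adapter, k=15, sample_size=40000):
    """Count reads in a subsample with an exact k-base adapter match.

    A hit is labelled 'start' or 'end' by where the first matching
    window begins (assumed rule, for illustration only).
    """
    kmers = adapter_kmers(adapter, k)
    counts = {"start": 0, "end": 0}
    for read in reads[:sample_size]:
        windows = len(read) - k + 1
        for i in range(windows):
            if read[i:i + k] in kmers:
                side = "start" if 2 * i < windows else "end"
                counts[side] += 1
                break  # count each read once, at its first hit
    return counts


adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG"
reads = [
    adapter[1:37],              # read made entirely of adapter
    "ACGT" * 9,                 # no adapter content
    "A" * 21 + adapter[:15],    # adapter begins near the 3' end
]
print(count_exact_hits(reads, adapter))
```

A high 'start' count combined with a small mean clip length, as in the summary above, is the kind of mismatch that would be worth inspecting with the debug output.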



                        • #13
                          Oh, the adapter sequences are not identical. If you look closely at the middle portion of the sequences, there is a barcode in the middle that is different for each sequence. But I also do not think this would make any difference...

                          In any case, I think I have a solution to my particular application, so I don't know how much time I want to spend debugging this, but thanks for reminding me of the -d option, which will surely come in handy later on as well. The -e option may be the trick, since the barcode only begins in the middle of the adapter sequence?

                          Thanks once again!



                          • #14
                            I think the barcode in the middle was making it odd. Also, I think your solution is great.



                            • #15
                              I have tried your ea-utils, but it seems to behave the same way as the FASTX toolkit adapter trimming: ea-utils also removes the whole read that contains the adapter.

                              Originally posted by earonesty
                              We rewrote a lot of fastx's toolkit stuff, and posted it here: https://code.google.com/p/ea-utils/. It attempts to do things like adapter removal, trimming, etc... without as much configuration by detecting presence of adapters located in a common file.
