
  • #16
    Great tool...

    Hi,

    This is a great tool, by the way. I was hoping someone had already implemented exactly this, as I was scratching my head over how to do it myself. Thanks a lot.

    May I make a couple of suggestions in terms of functionality? First, would it be possible to add a feature that keeps trimmed reads only if they are above a certain length? For example, if this parameter were set to 20, the original sequence length were 36, and a 17bp adapter were trimmed, then the read would not be included in the output because only 19bp of sequence would remain. The default could be 0.

    Second, it would be useful to track which sequence was trimmed as adapter and where in the original read it was found. An optional dump .fastq file could help here: it would contain the trimmed adapter sequence plus the location of the match in the original sequence and how many mismatches were allowed. For example, if a sequence is 36bp and the adapter is found at 1 to 15bp with 0 mismatches, this information could be appended to the '+' line of the FASTQ entry as 1_15_0, while the rest of the fields for the entry (i.e. the '@' line) would stay the same. With the adapter .fastq output it would then be possible to parse the adapter sequence as required.
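To make the two suggestions concrete, here is a minimal sketch in Python. The function names and parameters are hypothetical and only illustrate the proposed behavior; nothing here is taken from cutadapt itself.

```python
def keep_after_trimming(read_length, adapter_length, minimum_length=0):
    """Return True if enough sequence remains after removing the adapter.

    Hypothetical helper: a read is kept only when the remaining length
    is at least `minimum_length` (0 keeps everything).
    """
    remaining = read_length - adapter_length
    return remaining >= minimum_length


def adapter_tag(start, end, mismatches):
    """Build the proposed '+' line annotation, e.g. 1_15_0."""
    return f"{start}_{end}_{mismatches}"


# The example from the post: 36bp read, 17bp adapter, threshold 20.
# Only 19bp remain, so the read is dropped.
print(keep_after_trimming(36, 17, minimum_length=20))  # False
print(adapter_tag(1, 15, 0))                            # 1_15_0
```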

    These are just suggestions by the way. I can see this tool becoming very useful to me and I have already introduced it to all of the bioinformaticians in my lab.

    Look forward to reading your response.



    • #17
      Error rate...

      Please correct me if I am wrong, but I have played around with cutadapt and this is what I understood about the error rate. The number of permitted mismatches is calculated by multiplying the error rate by the length of the adapter match found and rounding down. For example, if you set the error rate to 0.1 (the default) and a 10bp adapter sequence is found, then 1 mismatch is permitted. However, if the adapter match is 9bp or shorter, then 0 mismatches are permitted at the same error rate, since 0.9 rounds down to 0. I am not sure how this applies to insertions and deletions, but I found it to be the case with mismatches.
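The rule described above can be sketched in a few lines of Python. This assumes that the permitted number of mismatches is simply the error rate times the matched adapter length, rounded down; the function name is made up for illustration and this is not cutadapt's own code.

```python
def allowed_mismatches(match_length, error_rate=0.1):
    """Mismatches permitted for an adapter match of the given length.

    Assumption: floor(error_rate * match_length), as observed in practice.
    """
    return int(match_length * error_rate)


print(allowed_mismatches(10))  # 1
print(allowed_mismatches(9))   # 0, because 0.9 rounds down
```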

      Hope that helps.



      • #18
        Hello and thanks for the feedback! Your suggestions are very reasonable. In fact, discarding reads that are too short after trimming is already on my to-do list. I'll hopefully be able to implement that very soon (it's simple, but I need to find the time). The annotated FASTQ file has a somewhat lower priority for me, but I have added your suggestion to the bugtracker so I won't forget it.

        Your observation regarding the error rate is correct. The error rate is calculated over the part of the adapter that is actually found in the sequence. Also, when there are insertions or deletions, these count as one error, but currently any gap within the adapter increases its length by one. I'm thinking about changing that behavior. I'll also add a better explanation to the README file.



        • #19
          I have added the option "--minimum-length" (or simply -m) to cutadapt. Download version 0.7 to get the feature, or retrieve the source from Subversion.



          • #20
            I have released cutadapt version 0.8. Important changes are:
            • The default behavior now is to assume that an adapter has been ligated to the 3' end. This should be the correct behavior for at least the SOLiD small RNA protocol (SREK) and also for the Illumina protocol. See the README for details.
            • A different scoring function improves trimming: Some reads that should have been trimmed weren't.
            • 20% faster on my test data set



            • #21
              cutadapt 0.8

              Hi Martin,

              I got a chance to use the new version and have a couple of questions regarding the output. I am posting the output below, and my questions are at the bottom.

              OUTPUT:

              Maximum error rate: 12.00%
              Processed reads: 8847400
              Trimmed reads: 5519819 ( 62.4%)
              Too short reads: 0 ( 0.0% of processed reads)
              Total time: 775.97 s
              Time per read: 0.09 ms
              === Adapter 1 ===
              Adapter '330201030313112312', length 18, was trimmed 5519819 times.
              Histogram of adapter lengths
              length count
              3 279189
              4 472804
              5 658292
              6 662309
              7 516419
              8 294151
              9 287150
              10 245650
              11 474506
              12 697873
              13 112506
              14 39956
              15 33765
              16 35982
              17 29564
              18 679703
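As a quick sanity check on the report above, the histogram can be summed and compared with the "Trimmed reads" line. The sketch below copies the numbers from the output. The last step assumes that the permitted number of errors is the error rate times the match length, rounded down, which is an assumption about the behavior and not something the report states.

```python
histogram_text = """\
3 279189
4 472804
5 658292
6 662309
7 516419
8 294151
9 287150
10 245650
11 474506
12 697873
13 112506
14 39956
15 33765
16 35982
17 29564
18 679703
"""

processed_reads = 8847400
trimmed_reads = 5519819
max_error_rate = 0.12

# Parse "length count" pairs into a dictionary.
histogram = {}
for line in histogram_text.splitlines():
    length, count = line.split()
    histogram[int(length)] = int(count)

# The histogram counts should add up to the number of trimmed reads.
total = sum(histogram.values())
print(total == trimmed_reads)

# Percentage of processed reads that were trimmed.
percent_trimmed = round(100 * trimmed_reads / processed_reads, 1)
print(percent_trimmed)

# Assumption: matches with int(rate * length) == 0 had to be exact.
exact_only = sum(count for length, count in histogram.items()
                 if int(length * max_error_rate) == 0)
print(exact_only)
```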

              My questions are
              1. Does the output file contain only trimmed reads (5519819 reads) or all the reads?
              2. Is it possible to write the reads without the adaptor sequence to a file?
              3. Do you normally keep the reads without the adaptor sequence for further analysis?

              Thank you for all your help.

              Happy Holidays,

              Neel



              • #22
                Hello Neel,
                thanks for trying out the new version.

                Originally posted by naluru
                1. Does the output file contain only trimmed reads (5519819 reads) or all the reads?
                The output file contains all reads.

                2. Is it possible to write the reads without the adaptor sequence to a file?
                This is currently not possible, but someone has already sent me a patch implementing this feature. I will add it to the next version.

                3. Do you normally keep the reads without the adaptor sequence for further analysis?
                For small RNA sequencing, we did keep those reads initially. This was mainly done out of curiosity and because we wanted to know what else we had sequenced besides small RNA. For the final expression profile, we counted only reads mapping to small RNA anyway so it did not make a difference.



                • #23
                  cutadapt v0.9 released

                  Hi and thanks to all who use this tool, and especially to those who have given me feedback. I have released cutadapt 0.9, which adds some small but nice-to-have features:

                  * Use --too-short-output and --untrimmed-output to redirect too short or untrimmed reads to a separate file, based on patch by Paul Ryvkin (thanks!).
                  * With --maximum-length, reads longer than a specified length can be discarded.
                  * Added the --length-tag option, which helps to fix read lengths in FASTA/Q comment lines (e.g., 'length=123' becomes 'length=58' after trimming) (requested by Paul Ryvkin)
                  * Added -q/--quality-cutoff option for trimming low-quality ends (uses the same algorithm as BWA, but works also in color space)
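For readers unfamiliar with the BWA-style quality trimming mentioned in the last item, here is a minimal sketch of the idea as it is commonly described: walk from the 3' end, accumulate (cutoff - quality), and cut where that running sum is largest. The function name and the use of plain integer qualities are assumptions for illustration; this is not a copy of the cutadapt or BWA source.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the 3' end of a read.

    `qualities` is a list of integer quality values, `cutoff` the threshold.
    Everything from the returned index onward would be removed.
    """
    running_sum = 0
    best_sum = 0
    cut = len(qualities)  # default: trim nothing
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += cutoff - qualities[i]
        if running_sum < 0:
            # Good bases outweigh the bad ones seen so far: stop.
            break
        if running_sum > best_sum:
            best_sum = running_sum
            cut = i
    return cut


quals = [30, 30, 30, 10, 30, 5, 5]
cut = quality_trim_index(quals, cutoff=20)
print(quals[:cut])  # [30, 30, 30, 10, 30]
```

Note that the single low-quality value in the middle of the read is kept, because the high-quality base after it pulls the running sum back down.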

                  cutadapt is now in the Python Package Index. You should be able to simply install it with "easy_install cutadapt" or (if you prefer pip) with "pip install cutadapt".



                  • #24
                    cutadapt - loss of first base in SOLiD reads

                    Hi Martin,

                    I am not quite sure if my issue is related to cutadapt but I just thought I will check with you as I used it to trim the adapter.

                    I trimmed the adapter with cutadapt version 0.8 and then used fastqfilter (you sent me this a while ago) to trim reads with negative qualities.
                    Then I used BWA to align the reads to the genome.

                    What I noticed is that a lot of reads are missing their first base. I am not sure whether this happened during adapter trimming or during filtering with fastqfilter. I noticed it because when I use other software (CLC Bio Genomics Workbench), I see the first base. This has always been the first base of the mature miRNA.

                    Have you ever noticed it in your analysis or has anyone reported this issue?
                    I didn't use any quality control option while aligning the reads with BWA.

                    Any suggestions or comments will be highly appreciated. If you need any further details, I will be happy to provide them.

                    Thank you for all your help.

                    Neel



                    • #25
                      Hello Neel,

                      this is actually a limitation of BWA (which it has "inherited" from MAQ). BWA cannot make use of the primer base (a "T" at the beginning of all SOLiD reads that I have seen), and it also cannot use the first color. The script solid2fastq.pl that is included with MAQ and BWA therefore removes the first two characters of each read, and cutadapt does the same if you use the --bwa option.

                      We do have some code in our group to "reattach" that missing nucleotide; perhaps I can polish it and make it usable for others. I'll check that next week.
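A tiny sketch of what removing those two characters means for a color-space read. The function name and the example read are made up, and dropping the first quality value along with the first color is an assumption based on SOLiD data having one quality per color and none for the primer base.

```python
def strip_primer_and_first_color(read, qualities):
    """Drop the primer base and the first color from a color-space read.

    `read` looks like 'T0120313' (primer base followed by colors);
    `qualities` holds one value per color.
    """
    return read[2:], qualities[1:]


read, quals = strip_primer_and_first_color("T0120313", [20, 25, 30, 30, 28, 27, 26])
print(read)   # 120313
print(quals)  # [25, 30, 30, 28, 27, 26]
```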

                      Marcel



                      • #26
                        Can the tool be used on data that is not generated with SOLiD machines (such as 454 and Illumina)? If yes, how does it compare to alternatives such as SeqTrim, SeqClean, or TagCleaner?



                        • #27
                          Thank you, Marcel. It would be really helpful if you could provide that code. Do you know how SHRiMP and the Mosaik assembler deal with it?

                          Thank you,
                          Neel



                          • #28
                            robs: cutadapt was developed for SOLiD and 454 data and also works with Illumina reads.

                            cutadapt is focused on command-line users who have a data file from a second-generation sequencing machine and simply want to remove one or more known adapter sequences from that file. There is probably some overlap in functionality with the tools you mention. TagCleaner and SeqClean were published after I had implemented cutadapt, and SeqTrim was unknown to me. Also, SeqClean and SeqTrim seem to be primarily for the analysis of Sanger sequencing data. I cannot say how easy it is to get them to work with second-generation data. SeqTrim, for example, seems unable to cope with FASTQ files.



                            • #29
                              I have no experience with adapter removal and just wonder whether cutadapt is able to predict the adapter sequences from a FASTQ file?



                              • #30
                                Originally posted by ttnguyen
                                I have no experience with adapter removal and just wonder whether cutadapt is able to predict the adapter sequences from a FASTQ file?
                                No, it cannot. I think TagCleaner may have that ability.
