
  • #76
    I wanted to ask a question about the quality trimming of sequences. I have Illumina reads and use Trim Galore to remove adapters and primers with good success.

    Is the quality trimming based on the average Phred score of the read? Or, if I use a cutoff of 25 and some bases are below 25, will the whole read be removed?

    Thanks,

    Fiona

    Comment


    • #77
      Hi Fiona,

      Quality trimming removes the portion of the read where the qualities become minimal, but it does not remove the entire read (pair). This is taken from Cutadapt --help:

      Code:
       --quality-base=QUALITY_BASE
                              Assume that quality values are encoded as
                              ascii(quality + QUALITY_BASE). The default (33) is
                              usually correct, except for reads produced by some
                              versions of the Illumina pipeline, where this should
                              be set to 64. (Default: 33)
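
To make the encoding in the help text concrete, here is a minimal sketch of how a FASTQ quality string maps to Phred scores. The function name `decode_qualities` and the example strings are made up for illustration; only the `ascii(quality + QUALITY_BASE)` rule comes from the help text above.

```python
def decode_qualities(quality_string, quality_base=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Each character encodes one score as chr(score + quality_base).
    """
    return [ord(char) - quality_base for char in quality_string]


# The same four scores written in both encodings (hypothetical strings).
phred33_scores = decode_qualities("II5#")          # default base 33
phred64_scores = decode_qualities("hhTB", 64)      # older Illumina base 64

print(phred33_scores)
print(phred64_scores)
```

If the wrong base is assumed, every score is shifted by 31, which is why the cutoff would then behave very differently from what you intended.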

      Comment


      • #78
        Thank you for the fast reply.

        Apologies if this is a basic question. If there is a single base with Q<25 but the following bases are OK, will the read be cut at that point, or does a set number of bases need to be below the cutoff before the read is cut?

        When I look at my FASTQ files in FastQC, the error bars do dip below 20, but I was thinking this was due to a small number of bases over multiple reads.

        Thanks

        Comment


        • #79
          I think if a single base or a few bases dip but the quality then recovers, the read will actually survive. The algorithm looks at a running sum of qualities from the 3' end rather than at individual bases, so it isn't super harsh on the data.

          Comment


          • #80
            Hi all,

            This is my first time analysing sequencing data. I am having difficulties trimming the adapters/contaminants from the reads. I have got 50bp single paired reads. I checked in FastQC that there are overrepresented sequences which are part of 'Illumina Paired End Adapter 2'. But if I trim using the whole 'Illumina Paired End Adapter 2', there will still be plenty of overrepresented sequences left!
            Q1) In that case, how much should I trim?

            I have these overrepresented sequences:
            GATCGGAAGAGCGGTTCAGCAGG
            GATCGGAAGAGCGGTTCAGCAGGA
            GATCGGAAGAGCGGTTCAGCAGGAA
            GATCGGAAGAGCGGTTCAGCAGGAAT
            GATCGGAAGAGCGGTTCAGCAGGAATG
            GATCGGAAGAGCGGTTCAGCAGGAATGC
            GATCGGAAGAGCGGTTCAGCAGGAATGCC
            GATCGGAAGAGCGGTTCAGCAGGAATGCCG
            GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
            GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Illumina Paired End Adapter 2)

            I also have another sequence which all the 'no hit' entries contain! That sequence is 'GTTATTTTTTTGTTTTAGTTTTT'. I looked at the contaminant file and there is no match for it.
            Q2) Should I trim this sequence without actually knowing where it is coming from?


            I planned to trim all the sequences from longest to shortest using Cutadapt because there is no way to trim multiple adapters at a time in Trim Galore. But later I will also use Trim Galore for quality trimming.
            Q3) Is there any way to reduce the number of steps?


            All the scenarios described above are true for all seven samples I analysed. Also, there is no way to know the actual adapters used from the dataset.

            Thanks a lot!

            Comment


            • #81
              Hi bluepoison,

              The sequence you are seeing overrepresented is most likely some kind of adapter dimer because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter dimers won't align.

              It would be sufficient for Cutadapt as well as Trim Galore to just specify the first couple of bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above, I would not bother though because these sequences won't align anywhere.

              Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that are present in more than 0.1% of reads, but it doesn't mean you should remove all of them from your sequenced library, especially not if you don't actually know what the sequence is. It might be a biological effect after all.

              In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix
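
To illustrate why the short prefix GATCGGAAGAGCG is enough to catch every length of the overrepresented sequence, here is a deliberately simplified sketch. It uses an exact string search, which is an assumption made for brevity; the real Cutadapt matching is error-tolerant and also handles partial adapter matches at the 3' end. The helper name `trim_adapter` and the second example read are hypothetical.

```python
ADAPTER_PREFIX = "GATCGGAAGAGCG"  # first 13 bp of the adapter, as suggested above


def trim_adapter(read, adapter=ADAPTER_PREFIX):
    """Cut the read at the first exact occurrence of the adapter prefix.

    Simplification: no mismatches and no partial matches are considered.
    """
    position = read.find(adapter)
    return read if position == -1 else read[:position]


# Shortest and longest overrepresented sequences from the post above.
overrepresented = [
    "GATCGGAAGAGCGGTTCAGCAGG",
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
]

for sequence in overrepresented:
    # Both start with the prefix, so nothing is left after trimming.
    print(repr(trim_adapter(sequence)))

# A hypothetical read with an insert followed by read-through into the adapter.
print(trim_adapter("ACGTACGT" + "GATCGGAAGAGCGGTTCAG"))
```

Reads that consist of adapter only end up empty (or very short) after trimming, so they drop out either at the length filter or at the alignment step.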

              Comment


              • #82
                Hi Felix,

                Thanks a lot for the quick response. It was really helpful for me.

                I just performed a short experiment and wanted to share it with you. I randomly pooled 1M reads and made the following three versions:
                version 1: without any trimming
                version 2: trimmed with Trim Galore with default settings
                version 3: trimmed with Trim Galore with default settings, then trimmed 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' with Cutadapt.

                Results in terms of mapping efficiency after aligning with Bismark b2:
                version 1: 39.7%
                version 2: 58.9%
                version 3: 58.2%

                When I checked the qualities in FastQC, even version 3 gave some very short (less than 10bp) overrepresented sequences as 'no hit'. So I guess it will always give some overrepresented sequences anyway, but I have to understand very well what I am trimming.

                One notable thing here is that the efficiency did not improve from version 2 to version 3. Most of the overrepresented sequences have 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' as the first part and the standard Illumina paired-end adapter as the second part. So those sequences are already rejected from the alignment after version 2. That's why version 3 hasn't changed that much.

                btw I saw several posts containing 'felix is a great guy!'. Now it's making a lot more sense. Thanks again!

                Comment


                • #83
                  Oh dear, you should never post such things on the internet... but I'm glad it helped!

                  Comment


                  • #84
                    Understand the quality trimming

                    Hello everybody,

                    This is the first time I am trying to use trim_galore for quality trimming of paired-end reads.
                    I checked the sequencing settings with testformat.sh from BBMap, which gives me:
                    sanger fastq raw single-ended 150bp
                    I'm not sure why single-ended comes up in the output, since the run was paired-end.

                    Before I did the quality trimming, I checked with FastQC.
                    The program didn't find adapter sequences any more (I guess they were already cut by the sequencing service) and showed the following pictures.

                    Picture before quality trimming:



                    This is the line I used for trimming on the Unix command line:
                    trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

                    Picture after quality trimming:



                    I had expected that everything with a quality below 20 would be cut. So I am either misinterpreting something or I did something wrong.
                    Could someone please tell me what it is?
                    Thanks a lot, Alex
                    Last edited by Alex852013; 12-03-2015, 08:43 AM.

                    Comment


                    • #85
                      Originally posted by Alex852013 View Post
                      Hello everybody,

                      This is the line i used for trimming on unix command line.
                      trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

                      So I am either misinterpreting something or I did something wrong.
                      Could someone please tell me what it is?
                      Thanks a lot, Alex

                      You appear to be running trim_galore incorrectly. Instead of trying to redirect the output (>) to a file, you need to specify an output directory location by using -o directory_path.

                      @felix will confirm. I don't use trim_galore.

                      Edit: Looking at the trim_galore manual, -o is not strictly needed. The program will use the current directory by default.

                      Edit2: @felix clarified the effect of output redirect in the post below.
                      Last edited by GenoMax; 12-03-2015, 08:59 AM.

                      Comment


                      • #86
                        Trim Galore should derive its output file names from the input file names, so the redirect (>) will only capture whatever else would have been printed to the screen. That is not overly useful, but it won't do any harm.

                        The algorithm used to trim qualities is described under the Cutadapt option -q:

                        Code:
                        -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
                                                Trim low-quality bases from 5' and/or 3' ends of reads
                                                before adapter removal. If one value is given, only
                                                the 3' end is trimmed. If two comma-separated cutoffs
                                                are given, the 5' end is trimmed with the first
                                                cutoff, the 3' end with the second. The algorithm is
                                                the same as the one used by BWA (see documentation).
                                                (default: no trimming)
                        This means that the cutoff is subtracted from every quality score, the results are summed up from the 3' end of the read, and the read is trimmed at the position where this running sum is lowest. If I understand this correctly, a read may temporarily 'dip' below the threshold you have selected, but the sequence survives if the quality comes back up afterwards. So occasionally you might get a few scores that are lower than 20, but I personally wouldn't be too worried about it as most downstream programs have their own means of dealing with low-quality basecalls.
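
For anyone who wants to see the behaviour in numbers, here is a rough sketch of the BWA-style 3' trimming rule as I understand it from the Cutadapt documentation: subtract the cutoff from each quality, add the results up starting from the 3' end, and cut where that running sum is smallest. The function name `quality_trim_index` and the example reads are illustrative, and this is not the actual Cutadapt source.

```python
def quality_trim_index(qualities, cutoff):
    """Return the index at which the 3' end is cut (BWA-style running sum).

    Bases from the returned index onwards are removed. A return value of
    len(qualities) means that nothing is trimmed.
    """
    running_sum = 0
    min_sum = 0
    cut_index = len(qualities)
    # Walk from the 3' end towards the 5' end.
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            break  # good bases outweigh the bad ones seen so far: stop here
        if running_sum < min_sum:
            min_sum = running_sum
            cut_index = i
    return cut_index


# A dip in the middle of an otherwise good read: nothing is trimmed.
dip = [30, 30, 12, 30, 30, 30]
print(dip[:quality_trim_index(dip, 20)])

# A bad 3' end is removed, but the earlier Q12 base is kept.
bad_tail = [30, 30, 12, 30, 8, 5]
print(bad_tail[:quality_trim_index(bad_tail, 20)])
```

The second example is the situation from the FastQC plots: the read is cut where the tail goes bad, yet a single base below the cutoff further upstream stays in the trimmed read.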

                        Comment


                        • #87
                          Thanks a lot

                          Thanks a lot, I guess I can carry on on my own now!

                          Comment


                          • #88
                            Hi,

                            I've been going through the documentation and searching forum threads etc. looking to see if trim_galore can be run in a multi-core multi-thread manner. So far the total lack of information in this regard seems to point towards it not having such a capability.

                            I'm not sure if this is the appropriate place to ask but I was wondering why this is the case? I have 48 files of ~120mil reads each that I need to perform trimming on and being able to parallelize would greatly boost the speed at which this could be done. It seems to me that since each read is trimmed independently trimming software should easily scale to any number of cores. Am I correct in this assumption or am I missing something?

                            Cheers.

                            Comment


                            • #89
                              Hi whargrea, the absence of documentation for parallelization does indeed mean that reads are trimmed by calling a single instance of Cutadapt at a time. Since trimming is a one-off process that doesn't really take that long (a matter of hours) compared to the data collection process (often a matter of several days) or other downstream operations (up to several weeks?), we don't tend to bother about it very much. The easiest solution would probably be to run all your 48 trims in parallel (even though this might be quite intense on the disk I/O side), or to try to find another trimmer that supports parallel trimming natively.
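
One way to run several trims side by side is a small wrapper script like the sketch below. It assumes `trim_galore` is on your PATH and that each file is an independent single-end sample; the helper names, the file names and the worker count are placeholders, and `-q` is the only Trim Galore option used. Keep the number of workers modest, since disk I/O is likely to be the bottleneck.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(fastq_path, quality=20):
    """Build the trim_galore command line for one FASTQ file."""
    return ["trim_galore", "-q", str(quality), fastq_path]


def run_one(fastq_path):
    """Run one trim_galore process and return its exit code."""
    return subprocess.run(build_command(fastq_path), check=False).returncode


def trim_all(fastq_paths, max_workers=4):
    """Run up to max_workers trim_galore processes at the same time.

    Threads are sufficient here because each one only waits for an
    external process to finish.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, fastq_paths))


# Hypothetical usage:
# exit_codes = trim_all(["sample_01.fastq.gz", "sample_02.fastq.gz"], max_workers=8)
```

A non-zero entry in the returned list tells you which file needs another look.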

                              Comment


                              • #90
                                I have only just begun to look at RRBS data. I am trying to use trim_galore to quality-trim and adapter-trim my sequences. I am doing this in OS X.
                                Now when I run 'trim_adaptor filename.fastq.gz' it returns the error "zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
                                This is apparently a problem only in OS X, but it is not clear to me how I can get around it.

                                Any ideas would be appreciated

                                Cheers

                                Comment
