  • fkrueger
    replied
    That would work, too, yes. Let me know if you've got any further questions.



  • Rob Weeks
    replied
    Never mind, I realised that decompressing my file and then running trim_galore will bypass zcat. It then works.

    Cheers

    Rob

    EDIT: thanks Felix. I wrote this before I saw your response - I was "never minding" my question not your response!

    I have a solution which works; I will decompress before running trim_galore
    Last edited by Rob Weeks; 01-19-2016, 11:46 PM. Reason: crossed responses



  • fkrueger
    replied
    Originally posted by Rob Weeks
    I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
    Now when I run 'trim_adaptor filename.fastq.gz' it returns the error "zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
    This is apparently a problem only in OS X, but it is not clear to me how I can get around it.

    Any ideas would be appreciated

    Cheers
    This problem is indeed known and has to do with the version of zcat which is installed on your system. On very old versions it would alter the filename passed to it to append a .Z if there wasn’t one there, so although you’re trying to read a file called Sample1_PE_R1.fastq.gz it’s actually trying to read Sample1_PE_R1.fastq.gz.Z (which doesn’t exist).

    I have changed the way Trim Galore reads from files, from using a zcat stream to using gunzip -c, and it seems to work well. If you send me an email I can send you a copy of this tonight, as I am at the Festival of Genomics in London all day. Alternatively, you could try renaming your input file so that it ends in .gz.Z and see whether that works?
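
    A minimal sketch of the decompress-first workaround, for anyone who would rather avoid zcat altogether. It uses Python's standard gzip module to stream a .gz file to a plain file, roughly what `gunzip -c in.gz > out` does. The function name `decompress_fastq` is hypothetical, and this is not the code Trim Galore itself uses.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastq(gz_path, out_path=None):
    """Stream-decompress a gzipped file, similar in spirit to `gunzip -c`."""
    gz_path = Path(gz_path)
    if out_path is None:
        if gz_path.suffix != ".gz":
            raise ValueError("give out_path explicitly for files not ending in .gz")
        # Sample.fastq.gz -> Sample.fastq
        out_path = gz_path.with_suffix("")
    out_path = Path(out_path)
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copies in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(src, dst)
    return out_path
```

    The decompressed file can then be passed to trim_galore directly, at the cost of the extra disk space.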
    Good luck, Felix



  • Rob Weeks
    replied
    I have only just begun to look at RRBS data. I am trying to use trim_galore to quality trim and adaptor trim my sequences. I am doing this in OS X.
    Now when I run 'trim_adaptor filename.fastq.gz' it returns the error "zcat: can't stat: filename.fastq.gz (filename.fastq.gz.Z): No such file or directory".
    This is apparently a problem only in OS X, but it is not clear to me how I can get around it.

    Any ideas would be appreciated

    Cheers



  • fkrueger
    replied
    Hi whargrea, the absence of documentation for parallelization does indeed mean that reads are trimmed by calling a single instance of Cutadapt at a time. Since trimming is a one-off process that doesn't really take that long (a matter of hours) compared to the data collection process (often a matter of several days) or other downstream operations (up to several weeks?), we don't tend to bother about it very much. The easiest solution would probably be to run all 48 of your trims in parallel (even though this might be quite intense on the disk I/O side), or to try to find another trimmer that supports parallel trimming natively.
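
    One way to run several trims side by side is sketched below, assuming single-end files and no extra options. It starts one trim_galore process per file and caps how many run at once, which helps keep the disk I/O load manageable. The names `build_command` and `trim_all`, and the default of 4 workers, are illustrative choices rather than anything Trim Galore provides.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(fastq_path):
    """Command line for one single-end Trim Galore run (no extra options assumed)."""
    return ["trim_galore", str(fastq_path)]


def trim_all(fastq_paths, max_workers=4, runner=subprocess.run):
    """Run one trim_galore process per file, at most max_workers at a time.

    Threads are sufficient because each worker only waits on an external
    process. Returns the exit codes in the same order as the input files.
    """
    def run_one(path):
        return runner(build_command(path), check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, fastq_paths))
```

    A non-zero entry in the returned list marks a file whose trim failed and needs rerunning. The `runner` argument exists only so the function can be tried out without trim_galore installed.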



  • whargrea
    replied
    Hi,

    I've been going through the documentation and searching forum threads etc. looking to see if trim_galore can be run in a multi-core multi-thread manner. So far the total lack of information in this regard seems to point towards it not having such a capability.

    I'm not sure if this is the appropriate place to ask but I was wondering why this is the case? I have 48 files of ~120mil reads each that I need to perform trimming on and being able to parallelize would greatly boost the speed at which this could be done. It seems to me that since each read is trimmed independently trimming software should easily scale to any number of cores. Am I correct in this assumption or am I missing something?

    Cheers.



  • Alex852013
    replied
    Thanks a lot

    Thanks a lot, I guess I can carry on on my own now!



  • fkrueger
    replied
    Trim Galore should derive its output filenames from the input filenames, so the redirect (>) will only send whatever else would have been printed to the screen into a file. That is not overly useful, but it won't do any harm.

    The algorithm used to trim qualities is described under the Cutadapt option -q:

    Code:
    -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF
                            Trim low-quality bases from 5' and/or 3' ends of reads
                            before adapter removal. If one value is given, only
                            the 3' end is trimmed. If two comma-separated cutoffs
                            are given, the 5' end is trimmed with the first
                            cutoff, the 3' end with the second. The algorithm is
                            the same as the one used by BWA (see documentation).
                            (default: no trimming)
    This means that the qualities are assessed in windows over the read, and the read is trimmed at the position where the score is lowest. If I understand this correctly, a read may temporarily 'dip' below the threshold you have selected, and the sequence survives if the quality comes back up afterwards. So occasionally you might get a few scores that are lower than 20, but I personally wouldn't be too worried about it, as most downstream programs have their own means of dealing with low-quality basecalls.
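
    To make this concrete, here is a minimal sketch of the 3' trimming procedure as the Cutadapt documentation describes it: subtract the cutoff from every quality, add up the results starting from the 3' end, and cut at the position where that running sum is lowest, stopping as soon as the sum becomes positive. The function name `quality_trim_index` is made up for illustration, and this is not Cutadapt's actual code.

```python
def quality_trim_index(qualities, cutoff):
    """Return the length to keep after BWA-style 3' quality trimming.

    qualities: list of Phred scores for one read (already decoded to integers).
    cutoff:    the -q threshold.
    """
    running_sum = 0
    lowest_sum = 0
    keep = len(qualities)  # default: keep the whole read
    # Walk from the 3' end towards the 5' end.
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            # Quality has recovered enough; nothing further upstream is trimmed.
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            keep = i
    return keep
```

    With a cutoff of 20, `quality_trim_index([30, 30, 15, 30, 30], 20)` returns 5: nothing is trimmed even though one base scores 15, because the good bases at the 3' end stop the scan immediately. This is why scores below the threshold can still appear in FastQC after trimming.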



  • GenoMax
    replied
    Originally posted by Alex852013
    Hello everybody,

    This is the line I used for trimming on the Unix command line.
    trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

    Therefore I either misinterpreted something or I did something wrong.
    Could someone please tell me what it is?
    Thanks a lot, Alex
    You appear to be running trim_galore incorrectly. Instead of trying to redirect the output (>) to a file you need to specify an output directory location by using a -o directory_path.

    @felix will confirm. I don't use trim_galore.

    Edit: Looking at trim_galore manual -o is not strictly needed. Program will use the current directory by default.

    Edit2: @felix clarified the effect of output redirect in the post below.
    Last edited by GenoMax; 12-03-2015, 08:59 AM.



  • Alex852013
    replied
    Understand the quality trimming

    Hello everybody,

    This is the first time I have tried to use trim_galore for quality trimming of paired-end reads.
    I checked the sequencing settings with testformat.sh from BBMap, which gives me:
    sanger fastq raw single-ended 150bp
    I'm not sure why single-ended is reported, since the run was paired-end.

    Before I did the quality trimming, I checked with FastQC.
    The program didn't find any adapter sequences (I guess they were already removed by the sequencing service) and showed the following pictures

    Picture before quality trimming:



    This is the line I used for trimming on the Unix command line.
    trim_galore ../name_R1_001.fastq ../name_R2_001.fastq -q 20 --paired --phred33 > trim_BAC-1_S9_R1_001.fastq

    Picture after quality trimming:



    I had expected that everything with a quality below 20 would be cut. Therefore I either misinterpreted something or I did something wrong.
    Could someone please tell me what it is?
    Thanks a lot, Alex
    Last edited by Alex852013; 12-03-2015, 08:43 AM.



  • fkrueger
    replied
    Oh dear, you should never post such things on the internet... but I'm glad it helped!



  • bluepoison
    replied
    Hi Felix,

    Thanks a lot for quick response. It was really helpful for me.

    I just performed a short experiment and wanted to share it with you. I randomly pooled 1M reads and made the following 3 versions:
    version 1: without any trimming
    version 2: trim with Trim Galore with default settings
    version 3: trim with Trim Galore with default settings and trim 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' with cutadapt.

    Results in terms of efficiency after aligning with bismark b2:
    version1: 39.7%
    version2: 58.9%
    version3: 58.2%

    When I checked the qualities in FastQC, even version 3 gave some very short (less than 10bp) overrepresented sequences as 'no hit'. So I guess it will always give some overrepresented sequences anyway, but I have to understand very well what I am trimming.

    One notable thing here is that the efficiency has not improved from version 2 to version 3. Most of the overrepresented sequences have 'GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG' as the first part and the basic standard Illumina paired-end adapter as the second part. So those sequences are already excluded from the alignment after doing version 2. That's why version 3 hasn't changed much.

    btw I saw several posts containing 'felix is a great guy!'. Now it's making a lot more sense. Thanks again!
    Originally posted by fkrueger
    Hi bluepoison,

    The sequence you are seeing overrepresented is most likely some kind of adapter dimer, because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter dimers won't align.

    It would be sufficient for Cutadapt as well as Trim Galore to just specify the first couple of bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above, I would not bother though because these sequences won't align anywhere.

    Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that are present in more than 0.1% of cases, but it doesn't mean you should remove all of them from your sequenced library, especially not if you don't actually know what the sequence is. It might be a biological effect after all.

    In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix



  • fkrueger
    replied
    Hi bluepoison,

    The sequence you are seeing overrepresented is most likely some kind of adapter dimer, because the sequence is lacking the leading A which it would get as a result of A-tailing the fragments. It is not normally required to trim adapter dimers specifically because they won't align to a reference genome anyway. You need to keep in mind though that the mapping efficiency will look worse because adapter dimers won't align.

    It would be sufficient for Cutadapt as well as Trim Galore to just specify the first couple of bp, here GATCGGAAGAGCG, in order to trim all lengths of the occurring sequence. As I mentioned above, I would not bother though because these sequences won't align anywhere.

    Just generally, the overrepresented sequences plot in FastQC is meant as a quick guide for you to spot sequences that are present in more than 0.1% of cases, but it doesn't mean you should remove all of them from your sequenced library, especially not if you don't actually know what the sequence is. It might be a biological effect after all.
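
    A small sketch of why the short prefix is enough: every overrepresented variant listed in the question starts with GATCGGAAGAGCG, so cutting at the first occurrence of that prefix removes all of them regardless of length. This uses plain exact matching for illustration only. Cutadapt itself uses error-tolerant alignment and can also match partial adapters at read ends, which this sketch does not attempt. The function name `trim_at_adapter` is hypothetical.

```python
ADAPTER_START = "GATCGGAAGAGCG"
FULL_ADAPTER = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"


def trim_at_adapter(read, adapter=ADAPTER_START):
    """Cut the read at the first exact occurrence of the adapter start, if any."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


# The ten overrepresented variants from the question: lengths 23 to 32.
variants = [FULL_ADAPTER[:n] for n in range(23, len(FULL_ADAPTER) + 1)]

# Each variant begins with the adapter start, so nothing is left after trimming.
remaining = [trim_at_adapter(v) for v in variants]
```

    Here every entry of `remaining` is an empty string, which is consistent with these reads being adapter dimers that would be discarded as too short rather than aligned.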

    In short: running Trim Galore in default mode will almost certainly do the right thing. Cheers, Felix



  • bluepoison
    replied
    Hi all,

    This is my first time analysing sequencing data. I am having difficulties trimming the adapters/contaminants from the reads. I have got 50bp single paired read. I checked in FastQC that there are overrepresented sequences which are part of 'Illumina Paired End Adapter 2'. But if I trim using the whole 'Illumina Paired End Adapter 2', there will still be plenty of overrepresented sequences left!
    Q1) In that case, how much should I trim?

    I have these overrepresented sequence,
    GATCGGAAGAGCGGTTCAGCAGG
    GATCGGAAGAGCGGTTCAGCAGGA
    GATCGGAAGAGCGGTTCAGCAGGAA
    GATCGGAAGAGCGGTTCAGCAGGAAT
    GATCGGAAGAGCGGTTCAGCAGGAATG
    GATCGGAAGAGCGGTTCAGCAGGAATGC
    GATCGGAAGAGCGGTTCAGCAGGAATGCC
    GATCGGAAGAGCGGTTCAGCAGGAATGCCG
    GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
    GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG (Illumina Paired End Adapter 2)

    I also have another sequence, which all the 'no hit' sequences contain! That sequence is 'GTTATTTTTTTGTTTTAGTTTTT'. I looked at the contaminant file and there is no match for it.
    Q2) Should I trim this sequence without actually knowing where it comes from?


    I planned to trim all the sequences from longest to shortest using Cutadapt, because there is no way to trim multiple adapters at a time in Trim Galore. But later I will also use Trim Galore for quality trimming.
    Q3) Is there any way to minimize these steps?


    All the scenarios described above are true for all seven samples I analysed. Also, there is no way to know the actual adapters used from the dataset.

    Thanks a lot!



  • fkrueger
    replied
    I think if a single base or a few bases dip but the quality then recovers, the read will actually survive. This is a sliding window model which isn't super harsh to the data.

