Seqanswers Leaderboard Ad

**earonesty** · 06-20-2011, 05:00 AM

It only removes the whole read if the remaining length is less than the minimum length threshold, which, by default is 15:

-l N Minimum remaining sequence length (15)

15's pretty low, but you can lower it.

**fabrice** · 06-20-2011, 05:11 AM

But when I run it on the test data. It seems fastq-mcf remove the whole reads. Do I set something wrong?

fastq-mcf adapters.fa test.fastq -q 0 -l 1 -o test.trim
Scale used: 2.2
Threshold used: 1 out of 4
Adapter date_day (CCTTAAGG): counted 3 at the 'end' of 'test.fastq', clip set to 1
Files: 1
Total reads: 4
Too short after clip: 3

----------------adapters.fa----------------
>date_day
CCTTAAGG
>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

------------------test.fastq------------------------
@test1
CCTTAAGGAAAAAAAAAAGGGGGGGGGG
+test1
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test2
CCTTAAGGAAAAAAAAAGGGGGGGGGGG
+test2
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test4
CCTTAAGGTTGACGTGATCGACACCTGG
+test4
[[[[[[[[[[[[[[[[[[[[[[[[[[[[

**earonesty** · 06-20-2011, 06:34 AM

You're right, I'd introduced a bug in detecting starts. I'm sending a release and adding it to my tests. The problem was it should have been saying at 'start' of sequence... not 'end'. By thinking it was an 'end' adapter, it was trimming from the right... hence removing the whole sequence.

(REVISION: The problem was very short adapter sequences in the input file. fastq-mcf can't be used that way. It uses subsampling to get the adapter thresholds, and requires a minimum of 15 character matches during the subsampling pass)

**fabrice** · 06-20-2011, 06:52 AM

In my opions, most of adaptor is at the start of reads. Is it right?

In my RNA-seq data, I just see a pattern at the start, not the end.

In fact, I cannot understand the means of these options.

-s N.N Log scale for clip pct to threshold (2.5)
-t N % occurance threshold before clipping (0.25)
-m N Minimum clip length, overrides scaled auto (1)
-p N Maximum adapter difference percentage (20) This is for mismatch?
-k N sKew percentage causing trimming (2)

**earonesty** · 06-20-2011, 06:58 AM

They do deserve some elaboration in the help:

# -s N.N Log scale for clip pct to threshold (2.5)

Scaling factor which causes more frequent occurrences to clip heavily. The negative log base "scale" of the ratio of adapters found becomes the minimum match-length for the adapter.

# -t N % occurance threshold before clipping (0.25)

Minimum number of times an adapter needs to be found before clipping is considered necessary

# -m N Minimum clip length, overrides scaled auto (1)

Replaces scaling algorithm (above) with hardcoded limit

# -p N Maximum adapter difference percentage (20) This is for mismatch?

Yes this is the max # mismatch

# -k N sKew percentage causing trimming (2)

If one of the bases is only 2% of the reads, then that cycle is considered "skewed".... if it's at the edge of a read, it's trimmmed off.

**fabrice** · 06-20-2011, 09:10 AM

earonesty,

I have tried your new release. The output is fellow.

Threshold used: 1 out of 4. What is the mean here?

Why both Adapter date_day and Adapter adaptor1 counted 4 at the 'start' of 'test.fastq'?

I also noticed that the "N" at the begin/end of reads that are not trimed.

_____________________output___________________________

fastq-mcf test.fa test.fastq -q 0 -l 1 -o test.trim
Scale used: 2.2
Threshold used: 1 out of 4
Adapter date_day (CCTTAAGG): counted 4 at the 'start' of 'test.fastq', clip set to 1
Adapter adaptor1 (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT): counted 4 at the 'start' of 'test.fastq', clip set to 1
Files: 1
Total reads: 4
Too short after clip: 0
Clipped 'start' reads: Count: 3, Mean: 8.00, Sd: 0.00

**earonesty** · 06-20-2011, 09:25 AM

Well, I've never tried this with short adapters like that. Looks like the thing I did to fix your case made other things worse. I need to put the code back the way it was. The original behavior was more correct. The problem is the use of a very short adapter sequence, which is shorter than the scan length the sytem uses (15 bases). Wtih something that short, it will find them everywhere,and it isn't a realistic test. I panicked and gave you a version that was a lot worse.

Solution: Use the original version ( i put it back... v 148), and use a sequence longer than 15 BP for your test adapter.

( If you really need it to work with a very short string like that... I can get it to work... just the code that does the subsampling needs to change... not the rest. )

**fabrice** · 06-21-2011, 02:58 AM

earonesty,

Thanks. I will try it.

**earonesty** · 06-21-2011, 06:20 AM

Also, right now the default algorithm will "not clip" if no adapters are found and no skewing is detected in the subsample (unless you pass -f). I'm about to make a change that will also decide clipping is necessary if there are "significant" "low quality region" at either end of the reads. The definition of significance will be based on the -q parameter. If more than 10% of the reads would be trimmed by that parameter, clipping will proceed.

**fabrice** · 06-21-2011, 06:35 AM

Here I also confused with the parameters -f. If no adapters are found and no skewing is detected in the subsample, set -f what will happen? Will it do the trim?

Why if more than 10% of the reads would be trimmed by that parameter, clipping will proceed? Does it mean fasta-mcf do the trim only when 10% of total reads need to trim?

Do I have some misunderstand?

In fact, here just my think.

1, When we found adaptor at either end of read (for example, 10% mismatch), we do the trim.

2, From the 3' (right part) of read, if the nucleotide's quality is less than the threshold (for example, -q 20), then do the trim.

Because the adaptor contamination and low quality nucleotide will let the mapping not correctly.

Originally posted by earonesty View Post

Also, right now the default algorithm will "not clip" if no adapters are found and no skewing is detected in the subsample (unless you pass -f). I'm about to make a change that will also decide clipping is necessary if there are "significant" "low quality region" at either end of the reads. The definition of significance will be based on the -q parameter. If more than 10% of the reads would be trimmed by that parameter, clipping will proceed.

**earonesty** · 06-21-2011, 07:01 AM

Right.... maybe it should always run ... and -f should be a non-option. I've thought about that. But in my experience, it's better not to clip at all if the percentage clipped is very low. Better to just let those reads get discarded by the aligner... or marked as low-quality mappings and get washed out in the statistics later.

Good aligners take into account quality scores when doing alignment, and variant callers do as well. We generally see higher repeatability on unclipped files... but only when the clipping percentage is low. In the 5-10% range. If 95% of the reads would be left alone anyway.. better not to run at all.

I'll run some stats, we have about 10,000 samples to look at right now, so i can come up with a decent default threshold. Again... -f will force it to run always, so you can just always run it that way and get what you want.

UPDATE: 5% is working well, I'm using it in production for new batches. If you want it to "always" try to clip, regardless of sampling, use -f.
ALSO: I made it so short adapters work as "beginnings of sequence" adapters (they always worked for end of seq tests)

**upendra_35** · 10-07-2011, 02:08 PM

Originally posted by Mark View Post

Hi All

I recently downloaded the FASTX toolkit and tried to use it for trimming fastq reads of adapter sequences. This did not work, the tool simply discarded any reads containing adapter sequences though this is not seemingly its function according to the documentation. I wrote to the help contact for the tool but recieved no response (see below for details). Has anyone used this tool for this purpose successfully?

Thanks for your help

Mark

#############################################
Hello

I recently downloaded the FASTX toolkit (fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2) and attempted to use the fastx_clipper tool. I created a test fastq file (3 of the four sequences contain the default adapter CCTTAAGG):

@test1
CCTTAAGGAAAAAAAAAAGGGGGGGGGG
+test1
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test2
CCTTAAGGAAAAAAAAAGGGGGGGGGGG
+test2
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test4
CCTTAAGGTTGACGTGATCGACACCTGG
+test4
[[[[[[[[[[[[[[[[[[[[[[[[[[[[

And then executed the command (as shown on FASTX toolkit website)

-bash-3.2$ fastx_clipper -v -i test.fastq -a CCTTAAGG
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
Clipping Adapter: CCTTAAGG
Min. Length: 5
Input: 4 reads.
Output: 1 reads.
discarded 0 too-short reads.
discarded 3 adapter-only reads.
discarded 0 N reads.

As you can see, the three reads that contain the adapter are discarded as “adapter-only reads” which (in my way of looking at things) they are not nor are they too short (default <=5) after any trimming. What is going on here? Does this tool actually trim reads or only discard them if they are found. If the former would you please tell me what I am doing incorrectly? Also if the former, is it possible to supply the tool with multiple adapters to trim?

Thanks for your help

Mark

I don't know if you have already sorted out this problem.But i figured out that fastx_clipper throws away all those reads that starts with adapters (adapter only reads) and there is no way you can tell fastx_clipper not to do that. However if the adapter sequence is found anywhere with in the read then it will clip the read starting with the adapter and keeps the rest of the read. Then you can either ask it to throw the clip read away (with -C option) or keep the clipped read (with -c option).

Hope this helps a bit......

Upendra

**nasobema** · 10-19-2011, 05:41 AM

Hi earonesty!

Thanks a lot for this great tool. I found it just today and the first test with fastq-mcf already left the impression that it is both fast and includes a lot of utilities for clippling.

However, I didn't get the meaning of all the options and the output.
In particular I would like to know what is meant by "Filtered x reads on purity flag".
Here's a sample report of a test case where I lose some 15 % due to this filter (see last line):

Code:

Scale used: 2.2
Filtering Illumina reads on purity field
Phred: 33
Warning: Too much skewing found (110), disabling skew clipping
Threshold used: 251 out of 100000
Adapter RNA-seq_PCR-primer_1_reverse (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT): counted 21114 at the 'end' of '../rawdata/ado_pool_PE02_R2.fastq', clip set to 1
Adapter RNA-seq_PCR-primer_2_reverse (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG): counted 21340 at the 'end' of '../rawdata/ado_pool_PE02_R1.fastq', clip set to 1
Adapter RNA-seq_PCR-primer_2_reverse (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG): counted 449 at the 'end' of '../rawdata/ado_pool_PE02_R2.fastq', clip set to 6
Files: 2
Total reads: 42361724
Too short after clip: 420423
Clipped 'end' reads (../rawdata/ado_pool_PE02_R1.fastq): Count 20076452, Mean: 20.70, Sd: 24.83
Trimmed 16073640 reads (../rawdata/ado_pool_PE02_R1.fastq) by an average of 22.84 bases on quality < 10
Clipped 'end' reads (../rawdata/ado_pool_PE02_R2.fastq): Count 18776062, Mean: 22.10, Sd: 25.18
Trimmed 15738855 reads (../rawdata/ado_pool_PE02_R2.fastq) by an average of 21.90 bases on quality < 10
Filtered 6360682 reads on purity flag

I ran fastq-mcf with option -k 100 (this disables skew clipping, does it?).

What is purity? Were the reads bad (and in what sense)?
Is there a way to switch this off?

**earonesty** · 12-10-2011, 11:30 PM

Purity is illumina's purity filter. you can turn this off with -U ... bu you REALLY SHOULD NOT turn it off. Read up on illumina purity filtering... it is the result of confused signal from adjacent clusters.

**earonesty** · 12-10-2011, 11:31 PM

-k 0 disables skew detection. Normally there's no reason to disable it... it can help find problems in data.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News