**maasha** · 11-25-2010, 06:17 AM

I can't help you with the FASTX toolkit, but here is how to do it with Biopieces (www.biopieces.org).

Code:

read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAAGGGGGGGGGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test1
---
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAGGGGGGGGGGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test2
---
SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
ADAPTOR_POS: -1
SEQ_LEN: 28
SEQ_NAME: test3
---
SCORES: [[[[[[[[[[[[[[[[[[[[
SEQ: TTGACGTGATCGACACCTGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test4
---

Use grab to get the entries that were trimmed and finally use write_fastq to create a new file:

Code:

read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before | grab -e 'ADAPTOR_POS>=0' | write_fastq -o test_trimmed.fastq -x
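For readers without Biopieces installed, the trim-then-filter idea can be sketched in plain Python. This is only an illustration under stated assumptions: it uses exact string matching (the real remove_adaptor may tolerate mismatches), the function name `remove_adaptor_before` is hypothetical, and the input reads are made up rather than taken from test.fastq.

```python
def remove_adaptor_before(seq, qual, adaptor):
    """Remove the adaptor and everything before it.

    Returns (seq, qual, pos), where pos is the index at which the adaptor
    was found, or -1 if it was not found (the read is then left unchanged).
    """
    pos = seq.find(adaptor)
    if pos < 0:
        return seq, qual, -1
    cut = pos + len(adaptor)
    return seq[cut:], qual[cut:], pos


# Hypothetical input reads, not the contents of the thread's test.fastq.
reads = [
    ("test1", "CCTTAAGGAAAAAAAAAAGGGGGGGGGG", "H" * 28),
    ("test3", "AGAGAGAGAGAGAGAGAGAGAGAGAGAG", "H" * 28),
]

trimmed = []
for name, seq, qual in reads:
    new_seq, new_qual, pos = remove_adaptor_before(seq, qual, "CCTTAAGG")
    if pos >= 0:  # same idea as: grab -e 'ADAPTOR_POS>=0'
        trimmed.append((name, new_seq, new_qual))

# Print the surviving reads as FASTQ records.
for name, seq, qual in trimmed:
    print(f"@{name}\n{seq}\n+\n{qual}")
```

Only the read that actually contained the adaptor survives the filter; the untouched read is dropped.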

Cheers,

Martin

**maasha** · 11-25-2010, 06:23 AM

Oh, and if you want to trim multiple adaptors either process the fastq file several times or just use remove_adaptor multiple times:

Code:

read_fastq -i test.fastq |
remove_adaptor -a CCTTAAGG -r before |
remove_adaptor -a GACACCTGG -r after

SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAAGGGGGGGGGG
SEQ_NAME: test1
SEQ_LEN: 20
ADAPTOR_POS: -1
---
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAGGGGGGGGGGG
SEQ_NAME: test2
SEQ_LEN: 20
ADAPTOR_POS: -1
---
SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
SEQ_NAME: test3
SEQ_LEN: 28
ADAPTOR_POS: -1
---
SCORES: [[[[[[[[[[[
SEQ: TTGACGTGATC
SEQ_NAME: test4
SEQ_LEN: 11
ADAPTOR_POS: 11
---
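The chaining of a "before" trim and an "after" trim can be sketched in Python as follows. It is a rough illustration, not the Biopieces implementation: matching is exact, the helper name `remove_adaptor` is borrowed only for readability, and the untrimmed test4 read is an assumption reconstructed by prepending the first adaptor to the sequence shown in the earlier output.

```python
def remove_adaptor(seq, adaptor, remove):
    """Trim seq around an exact adaptor match.

    remove="before": drop the adaptor and everything before it.
    remove="after":  drop the adaptor and everything after it.
    Returns (seq, pos); pos is -1 and seq is unchanged if there is no match.
    """
    pos = seq.find(adaptor)
    if pos < 0:
        return seq, -1
    if remove == "before":
        return seq[pos + len(adaptor):], pos
    if remove == "after":
        return seq[:pos], pos
    raise ValueError("remove must be 'before' or 'after'")


# Assumed raw form of test4: first adaptor + the sequence from the output above.
raw = "CCTTAAGG" + "TTGACGTGATCGACACCTGG"

step1, pos1 = remove_adaptor(raw, "CCTTAAGG", "before")
step2, pos2 = remove_adaptor(step1, "GACACCTGG", "after")
print(step2, len(step2), pos2)
```

Under these assumptions the second step reports position 11 and leaves an 11-base read, consistent with the test4 record in the output above.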

M

**gghl** · 11-25-2010, 07:22 AM

Hi Mark,

Based on my understanding, fastx_clipper first finds the adaptor sequence you give it and then trims off the adaptor and any nucleotides after it. I think fastx_clipper is designed for removing adaptors that follow the insert sequence, which is why reads test1, test2 and test4 in your test fastq file were treated as adaptor-only reads.
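A small Python sketch shows why a leading adaptor produces an "adaptor-only" read under this kind of 3' clipping. It assumes exact matching and a hypothetical read layout (adaptor CCTTAAGG followed by the insert); it is not fastx_clipper's actual algorithm.

```python
def clip_3prime(seq, adapter):
    """Drop the adapter and everything after it (3' clipping)."""
    pos = seq.find(adapter)
    return seq if pos < 0 else seq[:pos]


def trim_fixed_5prime(seq, n_bases):
    """Drop a fixed number of bases from the start of the read."""
    return seq[n_bases:]


adapter = "CCTTAAGG"
read = adapter + "AAAAAAAAAAGGGGGGGGGG"  # hypothetical read with a 5' adaptor

clipped = clip_3prime(read, adapter)           # nothing is left of the read
kept = trim_fixed_5prime(read, len(adapter))   # the insert survives
print(repr(clipped), kept)
```

Because the adaptor sits at position 0, clipping "adaptor and everything after" leaves an empty read, whereas removing a fixed-length prefix keeps the insert.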

If what you want is to remove a 5' adaptor in front of the insert sequences, fastx_trimmer might be able to help.

Best wishes,
gghl

**earonesty** · 04-15-2011, 12:27 PM

We rewrote a lot of fastx's toolkit stuff, and posted it here: https://code.google.com/p/ea-utils/. It attempts to do things like adapter removal, trimming, etc... without as much configuration by detecting presence of adapters located in a common file.

**Mark** · 04-16-2011, 07:40 AM

Thanks I'll give it a try

I noticed at your site a tool for stitching pe reads called fastq-join. It doesn't appear to be available yet. When will it be?

**earonesty** · 04-16-2011, 11:01 AM

You can just grab the code... it's POSIX C++ and should compile easily:

http://code.google.com/p/ea-utils/source/browse/trunk/clipper/fastq-join.c

g++ -O3 fastq-join.c -o fastq-join

**earonesty** · 04-21-2011, 12:20 PM

Note: I made a change recently to properly use the "better quality base" in the overlapping region... there was a bug in it that someone pointed out. If you're using it, you'll want the newer version.
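The "better quality base" idea can be illustrated with a minimal Python sketch. This is not fastq-join's code: it assumes the two overlapping segments have already been located, oriented the same way (read 2 reverse-complemented) and cut to equal length, and the function name `merge_overlap` is hypothetical.

```python
def merge_overlap(seq1, qual1, seq2, qual2):
    """Merge two aligned, equal-length overlap segments.

    At each position the base with the higher quality character wins;
    ties keep the base from the first read.
    """
    if not (len(seq1) == len(qual1) == len(seq2) == len(qual2)):
        raise ValueError("segments and qualities must have equal length")
    bases, quals = [], []
    for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2):
        if ord(q1) >= ord(q2):
            bases.append(b1)
            quals.append(q1)
        else:
            bases.append(b2)
            quals.append(q2)
    return "".join(bases), "".join(quals)


# The reads disagree at the third base; the higher-quality call is kept.
print(merge_overlap("ACGT", "II#I", "ACCT", "IIII"))
```

Comparing the raw quality characters is enough here because both reads use the same quality encoding.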

**angelawu** · 06-13-2011, 03:25 PM

Question about fastq-mcf

Hi,

I encountered an issue when using fastq-mcf on my GA2-generated 1x36 reads, and was wondering if you could shed some light.

I made a fasta file containing all the TruSeq adapter sequences and ran fastq-mcf with it, setting the Phred scale to 33 (-P 33) because my files are in Sanger fastq format. All other parameters were left at their defaults.

After trimming was completed, the outfile reports removing about 10 million reads out of 24 million.

I ran the trimmed file through FastQC, and under the "over-represented sequences" tab, I see that partial adapter sequences (e.g. starting from bp #2) are still over-represented in my file, which suggests that they were not trimmed.

My question is, does fastq-mcf remove partial matches to adapter sequences provided, as well as full? If so, am I doing something wrong with the way I am using the tool?

I am pretty new to bioinformatics, so sorry if this is a stupid question...

Thank you!

**earonesty** · 06-14-2011, 06:41 AM

1. It does remove partial matches. It searches only from one "end" of the file. The default settings are very conservative, so if it's removing 10 million reads, that's an enormous number - you may want to change the settings to be more aggressive for that data.

2. Can you post the summary output... it should say how many sequences were removed/clipped and why.

3. Until very recently, GAIIs output Phred+64 qualities by default, not Phred+33, so you may want to double-check that.

EXAMPLE OUTPUT:

Code:

Scale used: 2.2
Threshold used: 101 out of 40000
Adapter ILMN RT_primer_rc (TCGTATGCCGTCTTCTGCTTG): counted 193 at the 'end' of 'example.fastq', clip set to 6
Adapter FLUIDIGM Index-SP (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG): counted 1063 at the 'end' of 'example.fastq', clip set to 4
Files: 1
Total reads: 250000
Too short after clip: 53
Clipped 'end' reads: Count: 16612, Mean: 18.12, Sd: 17.44
Trimmed 24474 reads by an average of 10.81 bases on quality < 10
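One way to double-check the quality offset before passing -P 33 is to look at the lowest quality character in a sample of reads. The sketch below is a heuristic only, with a hypothetical function name; it should be run over many reads, because a Phred+33 file containing only high-quality bases would look like Phred+64.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Returns None when there is no data or the characters fit both encodings.
    """
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33  # characters below ';' do not occur in the +64 encodings
    if lowest >= 64:
        return 64  # plausible for +64; could also be all-high-quality +33 data
    return None    # ';' to '?' is ambiguous (Solexa+64 or Phred+33)


print(guess_phred_offset(["II#I", "IIII"]))
print(guess_phred_offset(["HHHH", "[[[["]))
```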

**angelawu** · 06-14-2011, 09:47 AM

Hi earonesty,

Thanks for getting back to me!
Here is an example of the output I received:

Scale used: 2.2
Threshold used: 101 out of 40000
Adapter TruSeq-Adapter1 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter2 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter3 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter4 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG): counted 10159 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter5 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter6 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter7 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter8 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter9 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter10 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter11 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter12 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Files: 1
Total reads: 21964185
Too short after clip: 35672
Clipped 'start' reads: Count: 13283064, Mean: 1.58, Sd: 1.15
Trimmed 394023 reads by an average of 7.15 bases on quality < 10

So far, what I understand is that my samples probably contain a lot of adapter-pair ligations without any genomic insert. This leads to the entirety of my 36bp read being a portion of the index/adapter. And since the adapter is much longer than 36bp, I think those reads are not being removed, e.g.:

My sample DNA: ADAPTER1adapter2
read: dapte

I say this because when I put the cleaned up reads through FastQC again, I see that all the "Over-represented Sequences" that are TruSeq Indexes are still present in my file.

I've managed to resolve this particular issue by basically copying in the sequence given by FastQC as the overrepresented sequence, and using those in a fasta file as the adapter sequences. It works well for my case, so maybe there is nothing wrong with the toolkit, and it's just my particular sample?
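The situation described above, and the workaround, can be sketched in Python. The adapter is TruSeq-Adapter1 as printed in the summary; the 36 bp read is hypothetical (taken from adapter base 2 onward), the helper names and FASTA record names are made up, and nothing here reflects how fastq-mcf itself searches.

```python
# TruSeq-Adapter1 as printed in the fastq-mcf summary above.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG"


def adapter_offset(read, adapter):
    """Return where the whole read sits inside the adapter, or -1."""
    return adapter.find(read)


def to_fasta(sequences, prefix="overrep"):
    """Format sequences (e.g. copied from FastQC) as FASTA adapter records."""
    return "".join(f">{prefix}_{i}\n{seq}\n"
                   for i, seq in enumerate(sequences, start=1))


read = ADAPTER[1:37]  # hypothetical 36 bp read starting at adapter base 2

print(read.startswith(ADAPTER[:15]))   # the adapter's start is not in the read
print(adapter_offset(read, ADAPTER))   # yet the read lies inside the adapter
print(to_fasta([read]), end="")
```

Writing the over-represented sequences out as their own FASTA records gives the trimmer something that matches such reads from their first base.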

Yes, I understand that Illumina reads use Phred64, but I always convert directly to Sanger Phred33 as soon as I get my files, which is why I put the -P 33 option in there.

Thanks,

Angela

**earonesty** · 06-14-2011, 10:17 AM

- Your adapter file seems to have the same sequence over and over? I'm not sure how that will affect things. TruSeq-Adapter2 is the same as TruSeq-Adapter1.... etc. Try just using 1 per unique sequence. This probably won't help.

- Out of 40000 reads, 10000 had an exact match for 15 base pairs of adapter sequence. That's a lot. So when it says "clip set to 1" it will clip any matching subsequence.

- It only discarded 35672 reads and only a few bases. That's surprising to me considering the number of sequences it found in the subsample with exact matches. I would expect a higher rate of discards, and a higher number of mean bases clipped.

- This is a situation where I wish I could see about 100K reads from your sample and just run it a few times to see what happened and why. It should be walking the adapter along the sequence looking for the best match. It seems to be stopping early on... or perhaps the sequences that match the adapter are somewhere else (at the end...?) and it guessed wrong (you can force -e).

- There's also an undocumented "-d" option that spits out lots of debug info that I find useful.
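"Walking the adapter along the sequence" can be pictured with the following Python sketch. It is a simplified stand-in, not fastq-mcf's algorithm: it slides the adapter's start across the read, accepts the leftmost position whose overlap has at most a given fraction of mismatches, and uses hypothetical parameter names and defaults. The example read is invented; the adapter is the ILMN RT_primer_rc sequence from the example output earlier in the thread.

```python
def find_adapter_start(read, adapter, min_overlap=5, max_mismatch_frac=0.2):
    """Return the leftmost read position where the adapter plausibly begins.

    The adapter may run off the 3' end of the read, so only the overlapping
    part is compared. Returns -1 if no position is acceptable.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        # zip stops at the shorter string, i.e. after `overlap` bases.
        mismatches = sum(1 for a, b in zip(read[start:], adapter) if a != b)
        if mismatches <= max_mismatch_frac * overlap:
            return start
    return -1


adapter = "TCGTATGCCGTCTTCTGCTTG"    # ILMN RT_primer_rc from the example output
read = "AAAAAAAAAA" + "TCGTATGC"     # hypothetical insert + partial adapter

start = find_adapter_start(read, adapter)
print(start, read[:start])
```

Lowering min_overlap makes the search clip on shorter partial matches at the cost of more chance hits.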

**angelawu** · 06-14-2011, 10:22 AM

Oh, the adapter sequences are not identical. If you look closely at the middle portion of the sequences, there is a barcode in the middle that is different for each sequence. But I also do not think this would make any difference...
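The barcode position is easy to confirm programmatically. The sketch below compares TruSeq-Adapter1 and TruSeq-Adapter2 exactly as printed in the summary; the helper name is hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


a1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG"
a2 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"

prefix = common_prefix_len(a1, a2)
suffix = common_prefix_len(a1[::-1], a2[::-1])  # shared tail, via reversal

barcode1 = a1[prefix:len(a1) - suffix]
barcode2 = a2[prefix:len(a2) - suffix]
print(prefix, suffix, barcode1, barcode2)
```

The two adapters share their first 33 and last 24 bases and differ only in a 6-base barcode in between. A shared 33-base start would also be consistent with every adapter reporting nearly the same count in the summary, although that interpretation is a guess.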

In any case, I think I have a solution to my particular application, so I don't know how much time I want to spend debugging this, but thanks for reminding me of the -d option, which will surely come in handy later on as well. The -e option may be the trick, since the barcode only begins in the middle of the adapter sequence?

Thanks once again!

**earonesty** · 06-15-2011, 05:20 AM

I think the barcode in the middle was making it odd. Also, I think your solution is great.

**fabrice** · 06-19-2011, 02:19 PM

I have tried your ea-utils, but it seems to behave the same as the FASTX-Toolkit adapter trimming: ea-utils also removes the whole read if it contains the adapter.

FASTXtoolkit adapter trimming
