trimming barcoded sequences

dawe

Senior Member

Join Date: Apr 2009

Posts: 258
#1

11-02-2009, 06:52 AM

Hi all,
I have an experiment with sequences barcoded by the user. I'm sure that the first 37 bp at the 5' end won't align to the reference genome (because there are adaptor sequences in addition to the barcodes). I'm reading 75 bp with an Illumina GAII.
I've seen that bowtie can trim 5' ends, but the trimmed sequence is not reported in the results (at least in SAM format).
I'm also using bwa, and apparently it has no option for trimming sequences on the fly.
I would like to align the last ~40 bp and still get the whole reads in the results.
I thought I could produce a FASTQ file with trimmed sequences, align those, and then recover the whole reads by matching read names.
Besides text handling tools, can anybody suggest a valid approach to handle this kind of problem?
thanks
d
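
The trim-then-rejoin idea in the post above could be sketched roughly as follows, in plain Python with no FASTQ library. It cuts a fixed number of bases from the 5' end of each record and keeps a dictionary from read name to the untrimmed read, so the full read can be looked up again after alignment. The function names, the strict 4-lines-per-record FASTQ layout, and the in-memory dictionary are assumptions made for illustration; none of this comes from bowtie or bwa.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Read name = header without '@', up to the first whitespace
        name = header.rstrip("\n")[1:].split()[0]
        yield name, seq, qual


def trim_five_prime(records, n_trim):
    """Return (trimmed FASTQ text, dict of untrimmed reads keyed by name)."""
    full_reads = {}
    out = []
    for name, seq, qual in records:
        full_reads[name] = (seq, qual)  # kept for the later lookup by read name
        out.append("@%s\n%s\n+\n%s\n" % (name, seq[n_trim:], qual[n_trim:]))
    return "".join(out), full_reads


if __name__ == "__main__":
    # Tiny made-up record; real data would come from open("reads.fastq")
    demo = io.StringIO("@read1\nACGTACGTAA\n+\nIIIIIIIIII\n")
    trimmed, full = trim_five_prime(read_fastq(demo), 4)
    print(trimmed, end="")
    print(full["read1"][0])
```

For the 75 bp reads described above, `n_trim=37` would leave 38 bases per read for alignment. For a full lane the dictionary may be too large to hold in memory, in which case an on-disk index keyed by read name would be the obvious substitute.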
maubp

Peter (Biopython etc)

Join Date: Jul 2009

Posts: 1543
#2

11-02-2009, 08:01 AM

I'd use a scripting language with good FASTQ support (e.g. BioPerl 1.6.1+ or Biopython 1.51+) to produce a filtered and trimmed set of FASTQ files using the known barcodes and adaptors. You may be fine looking for perfect matches to the barcodes, or perhaps allow one (or more) mismatches.

Then use the trimmed and filtered FASTQ files with the read mapping software of your choice.
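
As a rough illustration of the filtering step described above, here is a sketch in plain Python (not BioPerl or Biopython) that bins reads by a 5' barcode while tolerating a set number of mismatches. The barcode sequences, the `adaptor_len` parameter, and the rule of discarding reads that match more than one barcode are all assumptions for the example.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_barcode(seq, barcodes, max_mismatches=1):
    """Return the one barcode matching the 5' end of seq, else None.

    Reads matching no barcode, or more than one, are treated as unassigned.
    """
    hits = []
    for bc in barcodes:
        prefix = seq[:len(bc)]
        if len(prefix) == len(bc) and hamming(prefix, bc) <= max_mismatches:
            hits.append(bc)
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, barcodes, adaptor_len=0, max_mismatches=1):
    """Group (name, seq, qual) reads by barcode.

    The barcode plus adaptor_len further bases are removed from each read.
    """
    bins = {bc: [] for bc in barcodes}
    for name, seq, qual in reads:
        bc = assign_barcode(seq, barcodes, max_mismatches)
        if bc is None:
            continue  # drop reads without a confident barcode call
        cut = len(bc) + adaptor_len
        bins[bc].append((name, seq[cut:], qual[cut:]))
    return bins


if __name__ == "__main__":
    reads = [("r1", "ACGTGGGG", "IIIIIIII"), ("r2", "TTCAGGGG", "IIIIIIII")]
    for barcode, members in demultiplex(reads, ["ACGT", "TTAA"]).items():
        print(barcode, members)
```

Setting `max_mismatches=0` gives the perfect-match behaviour; whether one mismatch is safe depends on how different the barcodes in the set are from each other.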
hoytpr

Member

Join Date: Dec 2009

Posts: 62
#3

12-08-2009, 10:08 AM

I'm going to risk asking a really dumb question with my first post, but I've been reading around and I haven't seen this addressed completely.

My clients want clean short-read sequence data, or miRNA files as outputs. THEY want to do the blasts. The adapters or primers can be cleaned with some scripts we write for parsing, but these software packages you all are talking about (and I have tried most of them now) seem to use very specific primers of defined length. While I can sometimes ask for mismatches, the problem seems different than what is being discussed. Maybe I just don't understand the format of the regular expression well-enough, but I am trying.

ISSUE: The primer or adapter sequences on the read ends are of different lengths (they can be ~6 nt to a full 16 nt in this example). So I may know the primer is <NNNNNNNNCAGTGC>, but a parse for the full sequence doesn't identify reads where the start of the read is CAGTGC... I can't just say "Aww, just clip off the first five bases". Please correct me if I'm wrong. Our process is painfully manual and based on the final parameter of "How many bases will the client accept as a minimum?". For example, if we have 40 nt reads and the client won't accept anything shorter than a 20 nt read, we have ~20 nt to work with.
SIMPLIFIED PROCESS:
1. Reduce the file to unique elements first
2. If barcodes are present we can search for them and group the sequences
3. TRIMMING:
Assuming a primer/tag/barcode of AGCTCGTAGTACTACG we end up doing sequential searches of the (5')-end starting with:
Round 1: TACTACG
- Eliminate those 5'-elements
Round 2: GTACTACG
-Eliminate those 5'-elements
Round 3: AGTACTACG
-Eliminate those 5'-elements
etc. etc until
Final Round: AGCTCGTAGTACTACG (full primer)
-Eliminate those full primer sequence 5'-elements.

Then we do ~the same for the 3'-end.
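
The round-by-round search above could be automated along these lines. This is only a sketch in plain Python: it checks the 5' end of a read against every suffix of the primer and the 3' end against every prefix, and returns nothing if the read ends up shorter than the client's minimum. Note one deliberate change from the manual process: it tries the longest overlap first rather than the shortest, so a read is cut once at its longest match. The function names and the `min_overlap` / `min_length` defaults are assumptions.

```python
def trim_partial_primer_5p(seq, primer, min_overlap=6):
    """Remove the longest primer suffix (>= min_overlap nt) found at the start of seq."""
    for k in range(len(primer), max(min_overlap, 1) - 1, -1):
        if seq.startswith(primer[-k:]):
            return seq[k:]
    return seq


def trim_partial_primer_3p(seq, primer, min_overlap=6):
    """Remove the longest primer prefix (>= min_overlap nt) found at the end of seq."""
    for k in range(len(primer), max(min_overlap, 1) - 1, -1):
        if seq.endswith(primer[:k]):
            return seq[:-k]
    return seq


def clean_read(seq, primer_5p, primer_3p, min_overlap=6, min_length=20):
    """Trim both ends; return None if less than min_length nt remain."""
    trimmed = trim_partial_primer_5p(seq, primer_5p, min_overlap)
    trimmed = trim_partial_primer_3p(trimmed, primer_3p, min_overlap)
    return trimmed if len(trimmed) >= min_length else None


if __name__ == "__main__":
    primer = "AGCTCGTAGTACTACG"
    # Starts with the 7 nt partial primer from "Round 1" above
    print(trim_partial_primer_5p("TACTACGTTTTGGGG", primer, min_overlap=7))
```

This only handles exact matches. Allowing mismatches or gaps would need a pairwise alignment step instead of `startswith` / `endswith`, and a short `min_overlap` will occasionally clip genuine sequence that happens to look like the primer end.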

Finally, we take the "Trimmed" sequences and can align and blast them to the appropriate database.

While this seems really primitive to me... it also works really well. Isn't there a tool out there that can help us with this? Something that can take two or more primer/adapter/barcodes and sequentially trim raw sequence reads, then create a fasta output of what's left?

Thanks
Pete
First post... be kind.

Last edited by hoytpr; 12-08-2009, 10:12 AM. Reason: spelling
maubp

Peter (Biopython etc)

Join Date: Jul 2009

Posts: 1543
#4

12-08-2009, 10:18 AM

Originally posted by hoytpr

I'm going to risk asking a really dumb question with my first post, but I've been reading around and I haven't seen this addressed completely.

My clients want clean short-read sequence data, or miRNA files as outputs. THEY want to do the blasts. The adapters or primers can be cleaned with some scripts we write for parsing, but these software packages you all are talking about (and I have tried most of them now) seem to use very specific primers of defined length. While I can sometimes ask for mismatches, the problem seems different than what is being discussed. Maybe I just don't understand the format of the regular expression well-enough, but I am trying.

Using BioPerl or Biopython you could use any length primers you like - but it won't be an "off the shelf" solution (unless someone else has a suitable script they can share). You'll probably have to write some code, and using regular expressions does seem sensible here. You could also look at doing pairwise alignment if you want to consider gaps (e.g. with 454 data, where the number of bases in a homopolymer run is often miscalled).

ISSUE: The primer or adapter sequences on the read-ends are of different length (they can be ~6nt to a full 16 nt in this example). So I may know the primer is <NNNNNNNNCAGTGC> but a parse for the full sequence doesn't identify reads where the start of the read is CAGTGC... I can't just say "Aww, just clip off the first five bases".

In that case, len("NNNNNNNNCAGTGC") = 14 so you'd need to trim 14 letters. As a regular expression, the simplest trick would be to replace "N" with "." (which means any character), thus "........CAGTGC" should do the trick.
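
A minimal sketch of that regular expression trick using Python's standard `re` module. The helper names `primer_to_regex` and `trim_primer` are made up for the example, and it assumes N is the only ambiguity code in the primer.

```python
import re


def primer_to_regex(primer):
    """Compile a 5'-anchored pattern, treating each N as any single character."""
    pattern = "".join("." if base == "N" else re.escape(base) for base in primer)
    return re.compile("^" + pattern)


def trim_primer(seq, primer):
    """Strip the primer from the start of seq if the pattern matches, else return seq."""
    match = primer_to_regex(primer).match(seq)
    return seq[match.end():] if match else seq


if __name__ == "__main__":
    print(primer_to_regex("NNNNNNNNCAGTGC").pattern)
    print(trim_primer("ACGTACGTCAGTGCTTTT", "NNNNNNNNCAGTGC"))
```

This pattern requires all 14 positions to be present, so a read that begins directly at CAGTGC is returned unchanged. Catching truncated primers would need extra patterns or a separate check for the shorter forms.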

You could also look at doing this kind of thing with the EMBOSS command line tools.

P.S. If you are using Roche 454, their tools make it easy to filter and group using the Roche MID barcodes.
hoytpr

Member

Join Date: Dec 2009

Posts: 62
#5

12-09-2009, 04:40 PM

Thanks for the suggestions. BioPerl is giving me fits unfortunately. Either Cygwin is interfering with the Perl 5.10.1 I just installed, or I messed up the BioPerl Install using the PPM shell. Or I'm just a nincompoop. The install pages are outdated enough to confuse a relative newbie like myself. My file structure doesn't seem correct. After installing there must be 3 or 4 "bin" folders now ( /Perl/bin, /Perl/site/bin, /Perl/html/bin...). I never got to EMBOSS as I couldn't get the mysql DBD module to work. (Does Perl not have a "shell" you enter when you start perl.exe?) Any links to help me set up BioPerl and EMBOSS would be appreciated... the IRC BioPerl Wikipage is not helping.
maubp

Peter (Biopython etc)

Join Date: Jul 2009

Posts: 1543
#6

12-10-2009, 02:24 AM

Hi hoytpr,

From your mention of Cygwin, it sounds like you are using Windows. Cygwin is a really cool package for running Unix tools on Windows. You don't need it for Perl or EMBOSS.

You can use Perl (or Python) via Cygwin but it can often be simpler to stick to the official Windows packages for Windows itself. In any case, I'm not very knowledgeable about the intricacies of BioPerl installation, and have never used it on Windows personally. I would suggest you sign up to the BioPerl mailing list for expert help:

Bioperl-l Info Page

http://lists.open-bio.org/mailman/listinfo/bioperl-l

Again, with EMBOSS you can try and install it via Cygwin, but the Windows package should be easier:
ftp://emboss.open-bio.org/pub/EMBOSS/windows/

Peter
ewilbanks

Member

Join Date: Mar 2009

Posts: 82
#7

01-05-2010, 02:29 PM

Hey,

Try http://hannonlab.cshl.edu/fastx_toolkit/index.html . Haven't used it yet, but one of the scripts here is supposed to parse barcoded fasta/q files so you can sort, trim and then align.

Lizzy
xuer

Member

Join Date: Sep 2008

Posts: 17
#8

03-26-2010, 06:14 AM

Has anybody used this tool, the FASTX-Toolkit? Is it not recommended?