Biopiece: digest_seq
Summary
Split sequences in the stream at a given restriction enzyme's cleavage sites.
Description
digest_seq split sequences in the stream at a given restriction
enzyme's cleavage sites. For each digestion product a new record is created and
output to the stream. digest_seq requires two arguments:
1. The restriction enzyme recognition pattern
2. Cut position relative to the above match
E.g. the common restriction enzyme BamHI recognizes the pattern GGATCC and the
cut position for this enzyme is 1 indicating that the cleavage site is after
the first G:
cut_pos 1 23456
pattern G|GATCC
^
Patterns can contain IUPAC ambiguity codes.
Records with digestion products have a REC_TYPE: DIGEST key/value pair added.
Furthermore, the sequence coordinates are appended to the sequence name in
brackets (0-based) and as S_BEG and S_END keys.
To obtain the reverse-complement products use reverse_seq and
complement_seq before digest_seq.
Usage
... | digest_seq [options]
Options
[-? | --help] # Print full usage description.
[-p <string> | --pattern=<string>] # Restriction enzyme recognition pattern.
[-c <int> | --cut_pos=<int>] # Cut position relative to pattern match.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Examples
Consider the following sequence in the file test.fna:
>test
cgatcgatcGGATCCgagagggtgtgtagtgGAATTCcgctgc
We can digest this sequence with BamHI like this:
read_fasta -i test.fna | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42]
SEQ: GATCCgagagggtgtgtagtgGAATTCcgctgc
SEQ_LEN: 33
S_BEG: 10
S_END: 42
REC_TYPE: DIGEST
---
To obtain the - strand digestion products we re-read the file and
reverse-complment the sequence before digestion like this:
read_fasta -i test.fna | reverse_seq | complement_seq | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-28]
SEQ: gcagcgGAATTCcactacacaccctctcG
SEQ_LEN: 29
S_BEG: 0
S_END: 28
REC_TYPE: DIGEST
---
SEQ_NAME: test[29-42]
SEQ: GATCCgatcgatcg
SEQ_LEN: 14
S_BEG: 29
S_END: 42
REC_TYPE: DIGEST
---
It is also possible to do restriction enzyme digestion with multipe enzymes
simply by piping the result through digest_seq multiple times. Here we
first digest with BamHI (pattern: GGATCC, cut_pos: 1) and then with EcoRI
(pattern: GAATTC, cut_pos: 1):
read_fasta -i test.fna | digest_seq -p GGATCC -c 1 | digest_seq -p GAATTC -c 1
SEQ_NAME: test[0-9][0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][0-21]
SEQ: GATCCgagagggtgtgtagtgG
SEQ_LEN: 22
S_BEG: 0
S_END: 21
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][22-32]
SEQ: AATTCcgctgc
SEQ_LEN: 11
S_BEG: 22
S_END: 32
REC_TYPE: DIGEST
---
See also
rescan_seq
read_fasta
reverse_seq
complement_seq
Author
Martin Asser Hansen - Copyright (C) - All rights reserved.
[email protected]
September 2010
License
GNU General Public License version 2
Help
digest_seq is part of the Biopieces framework.
Summary
Split sequences in the stream at a given restriction enzyme's cleavage sites.
Description
digest_seq split sequences in the stream at a given restriction
enzyme's cleavage sites. For each digestion product a new record is created and
output to the stream. digest_seq requires two arguments:
1. The restriction enzyme recognition pattern
2. Cut position relative to the above match
E.g. the common restriction enzyme BamHI recognizes the pattern GGATCC and the
cut position for this enzyme is 1 indicating that the cleavage site is after
the first G:
cut_pos 1 23456
pattern G|GATCC
^
Patterns can contain IUPAC ambiguity codes.
Records with digestion products have a REC_TYPE: DIGEST key/value pair added.
Furthermore, the sequence coordinates are appended to the sequence name in
brackets (0-based) and as S_BEG and S_END keys.
To obtain the reverse-complement products use reverse_seq and
complement_seq before digest_seq.
Usage
... | digest_seq [options]
Options
[-? | --help] # Print full usage description.
[-p <string> | --pattern=<string>] # Restriction enzyme recognition pattern.
[-c <int> | --cut_pos=<int>] # Cut position relative to pattern match.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Examples
Consider the following sequence in the file test.fna:
>test
cgatcgatcGGATCCgagagggtgtgtagtgGAATTCcgctgc
We can digest this sequence with BamHI like this:
read_fasta -i test.fna | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42]
SEQ: GATCCgagagggtgtgtagtgGAATTCcgctgc
SEQ_LEN: 33
S_BEG: 10
S_END: 42
REC_TYPE: DIGEST
---
To obtain the - strand digestion products we re-read the file and
reverse-complment the sequence before digestion like this:
read_fasta -i test.fna | reverse_seq | complement_seq | digest_seq -p GGATCC -c 1
SEQ_NAME: test[0-28]
SEQ: gcagcgGAATTCcactacacaccctctcG
SEQ_LEN: 29
S_BEG: 0
S_END: 28
REC_TYPE: DIGEST
---
SEQ_NAME: test[29-42]
SEQ: GATCCgatcgatcg
SEQ_LEN: 14
S_BEG: 29
S_END: 42
REC_TYPE: DIGEST
---
It is also possible to do restriction enzyme digestion with multipe enzymes
simply by piping the result through digest_seq multiple times. Here we
first digest with BamHI (pattern: GGATCC, cut_pos: 1) and then with EcoRI
(pattern: GAATTC, cut_pos: 1):
read_fasta -i test.fna | digest_seq -p GGATCC -c 1 | digest_seq -p GAATTC -c 1
SEQ_NAME: test[0-9][0-9]
SEQ: cgatcgatcG
SEQ_LEN: 10
S_BEG: 0
S_END: 9
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][0-21]
SEQ: GATCCgagagggtgtgtagtgG
SEQ_LEN: 22
S_BEG: 0
S_END: 21
REC_TYPE: DIGEST
---
SEQ_NAME: test[10-42][22-32]
SEQ: AATTCcgctgc
SEQ_LEN: 11
S_BEG: 22
S_END: 32
REC_TYPE: DIGEST
---
See also
rescan_seq
read_fasta
reverse_seq
complement_seq
Author
Martin Asser Hansen - Copyright (C) - All rights reserved.
[email protected]
September 2010
License
GNU General Public License version 2
Help
digest_seq is part of the Biopieces framework.
Leave a comment: