Seqanswers Leaderboard Ad

**Jiafen** · 10-25-2012, 11:49 AM

try 'sed '/>chrUn/d' your_file >out_file
on terminal

**vivek_** · 10-25-2012, 12:11 PM

^ Does that remove the sequence?

This thread might be useful:

FastaHack - FASTA file manipulation and subsequence extraction utilities - SEQanswers

http://seqanswers.com/forums/showthread.php?t=5380

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

**Jiafen** · 10-25-2012, 12:14 PM

Sorry, I didn't see you need remove the seqs as well, the above does not work.
But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

then use syntax

sed 'n1,n2d' filename.txt

to remove the lines between n1 and n2.

**Richard Finney** · 10-25-2012, 01:32 PM

cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

**JQL** · 10-25-2012, 01:35 PM

Hi,

thanks for the idea. This will work if all the headers are sorted by their IDs. Otherwise, you end up removing reads inbetweens. e.g. if the file has reads with ID >chr3 in between >chrUn, we will get rid of chr3 too.

I think my fasta file is sorted, but I can't take the risk of what if.

Originally posted by Jiafen View Post

Sorry, I didn't see you need remove the seqs as well, the above does not work.
But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

then use syntax

sed 'n1,n2d' filename.txt

to remove the lines between n1 and n2.

**maubp** · 10-25-2012, 01:49 PM

It would be wise to learn a scripting/programming language, e.g. Perl, Python or Ruby. Here's how I would solve this using Biopython (untested):

Code:

from Bio import SeqIO
wanted = (rec for rec in SeqIO.parse("input.fq", "fastq") \
                        if not rec.id.startswith("chrUn"))
count = SeqIO.write(wanted, "filtered.fq", "fastq")
print "Saved %i records" % count

**krobison** · 10-25-2012, 06:45 PM

My typical way of solving this with perl

Code:

#!/usr/bin/perl
use strict;
my $off=0;
while ($_ = <>)
{
   if (/^>/)
  {
    $off=/^>chrUn/;
  }
  print $_ if (! $off);
}

It is easy to modify this to read a list of ids to include/exclude into a hash & instead key on those ids.

In a pinch, you can do this as a one-liner

Code:

perl -p -e '$off=/^>chrUn/ if (/^>/); print $_ if (! $off) ' myfile.fa

**JQL** · 10-26-2012, 05:26 AM

Hi Richard,

Thanks. It works nicely. Very elegant use of censor.

How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.

Originally posted by Richard Finney View Post

cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

**JQL** · 10-26-2012, 05:46 AM

Ok, I got it. Im a bit slow this morning

Originally posted by JQL View Post

Hi Richard,

Thanks. It works nicely. Very elegant use of censor.

How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.

**Richard Finney** · 10-26-2012, 05:53 AM

"censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

The usually hiearchy of dealing with tasks is:
If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
If it's a tad harder: use 1) perl or 2) bash script or 3) python
If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

**JQL** · 10-26-2012, 05:58 AM

thanks Krobison. My perl is a bit rusty since I haven't used it since grad school. The first multi-liners worked nicely for me.
a bit unclear with the one-liner.
I have a test file
ran your one-liner, the output looks like this:

Code:

$ perl -p -e '$off=/^>AAA/ if (/^>/); print $_ if (! $off) ' test
>AAA
asdfdfdfdferer
>BBB
>BBB
rereere
rereere
>AAA
dferere
>CCC
>CCC
rereregdreeeeeeeeeeeeeeeeeeeee
rereregdreeeeeeeeeeeeeeeeeeeee
>AAA
dddddddd
>BBB
>BBB
dfd
dfd
>AA
>AA
dfere
dfere

Code:

$ more test
>AAA
asdfdfdfdferer
>BBB
rereere
>AAA
dferere
>CCC
rereregdreeeeeeeeeeeeeeeeeeeee
>AAA
dddddddd
>BBB
dfd
>AA
dfere

**JQL** · 10-26-2012, 06:07 AM

good advice!

thanks everyone for sharing.

Originally posted by Richard Finney View Post

"censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

The usually hiearchy of dealing with tasks is:
If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
If it's a tad harder: use 1) perl or 2) bash script or 3) python
If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

**flacchy** · 07-25-2013, 01:24 AM

remove from fasta file all the reads with Id into a txt file

Hi...
I do have a similar problem...

I have a txt file with all the reads ID (those are the reads I want to remove from my fasta file). How can I use this file to remove the reads I don't want from my fasta???

can anyone give me some advice?

Thanks

F.

**maubp** · 07-25-2013, 01:32 AM

For those of you with a local Galaxy instance, ask your administrator to install this tool:

Galaxy | Tool Shed

http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id

You can also run the Python script it calls directly at the command line if you want, current version of the source code is here:

404 — Bitbucket

https://bitbucket.org/peterjc/galaxy-central/src/61b9ded313f629f70c935809c77beadd1387f381/tools/filters/seq_filter_by_id.py?at=tools

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

remove reads in fasta file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News