Header Leaderboard Ad

Collapse

remove reads in fasta file

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • remove reads in fasta file

    This must be very simply for many of you.

    I want to remove some sequences in my fasta file based on IDs. All reads with headers
    >chrUn..... need to be removed (both ID and seqs)

    Is there any software out there can do this in a fly?

    thanks

  • #2
    try 'sed '/>chrUn/d' your_file >out_file
    on terminal

    Comment


    • #3
      ^ Does that remove the sequence?

      This thread might be useful:

      http://seqanswers.com/forums/showthread.php?t=5380

      Comment


      • #4
        Sorry, I didn't see you need remove the seqs as well, the above does not work.
        But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

        then use syntax

        sed 'n1,n2d' filename.txt

        to remove the lines between n1 and n2.

        Comment


        • #5
          cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

          Comment


          • #6
            Hi,

            thanks for the idea. This will work if all the headers are sorted by their IDs. Otherwise, you end up removing reads inbetweens. e.g. if the file has reads with ID >chr3 in between >chrUn, we will get rid of chr3 too.

            I think my fasta file is sorted, but I can't take the risk of what if.


            Originally posted by Jiafen View Post
            Sorry, I didn't see you need remove the seqs as well, the above does not work.
            But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

            then use syntax

            sed 'n1,n2d' filename.txt

            to remove the lines between n1 and n2.

            Comment


            • #7
              It would be wise to learn a scripting/programming language, e.g. Perl, Python or Ruby. Here's how I would solve this using Biopython (untested):
              Code:
              from Bio import SeqIO
              wanted = (rec for rec in SeqIO.parse("input.fq", "fastq") \
                                      if not rec.id.startswith("chrUn"))
              count = SeqIO.write(wanted, "filtered.fq", "fastq")
              print "Saved %i records" % count
              Last edited by maubp; 10-25-2012, 01:49 PM. Reason: split long line

              Comment


              • #8
                My typical way of solving this with perl

                Code:
                #!/usr/bin/perl
                use strict;
                my $off=0;
                while ($_ = <>)
                {
                   if (/^>/)
                  {
                    $off=/^>chrUn/;
                  }
                  print $_ if (! $off);
                }
                It is easy to modify this to read a list of ids to include/exclude into a hash & instead key on those ids.

                In a pinch, you can do this as a one-liner
                Code:
                perl -p -e '$off=/^>chrUn/ if (/^>/); print $_ if (! $off) ' myfile.fa

                Comment


                • #9
                  Hi Richard,

                  Thanks. It works nicely. Very elegant use of censor.

                  How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.




                  Originally posted by Richard Finney View Post
                  cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

                  Comment


                  • #10
                    Ok, I got it. Im a bit slow this morning

                    Originally posted by JQL View Post
                    Hi Richard,

                    Thanks. It works nicely. Very elegant use of censor.

                    How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.

                    Comment


                    • #11
                      "censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
                      It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
                      Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

                      The usually hiearchy of dealing with tasks is:
                      If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
                      If it's a tad harder: use 1) perl or 2) bash script or 3) python
                      If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

                      There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

                      Comment


                      • #12
                        thanks Krobison. My perl is a bit rusty since I haven't used it since grad school. The first multi-liners worked nicely for me.
                        a bit unclear with the one-liner.
                        I have a test file
                        ran your one-liner, the output looks like this:
                        Code:
                        $ perl -p -e '$off=/^>AAA/ if (/^>/); print $_ if (! $off) ' test
                        >AAA
                        asdfdfdfdferer
                        >BBB
                        >BBB
                        rereere
                        rereere
                        >AAA
                        dferere
                        >CCC
                        >CCC
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        >AAA
                        dddddddd
                        >BBB
                        >BBB
                        dfd
                        dfd
                        >AA
                        >AA
                        dfere
                        dfere
                        Code:
                        $ more test
                        >AAA
                        asdfdfdfdferer
                        >BBB
                        rereere
                        >AAA
                        dferere
                        >CCC
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        >AAA
                        dddddddd
                        >BBB
                        dfd
                        >AA
                        dfere
                        Last edited by JQL; 10-26-2012, 06:07 AM.

                        Comment


                        • #13
                          good advice!

                          thanks everyone for sharing.

                          Originally posted by Richard Finney View Post
                          "censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
                          It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
                          Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

                          The usually hiearchy of dealing with tasks is:
                          If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
                          If it's a tad harder: use 1) perl or 2) bash script or 3) python
                          If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

                          There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

                          Comment


                          • #14
                            remove from fasta file all the reads with Id into a txt file

                            Hi...
                            I do have a similar problem...

                            I have a txt file with all the reads ID (those are the reads I want to remove from my fasta file). How can I use this file to remove the reads I don't want from my fasta???

                            can anyone give me some advice?

                            Thanks

                            F.

                            Comment


                            • #15
                              For those of you with a local Galaxy instance, ask your administrator to install this tool:
                              http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

                              You can also run the Python script it calls directly at the command line if you want, current version of the source code is here:
                              https://bitbucket.org/peterjc/galaxy...id.py?at=tools

                              Comment

                              Working...
                              X