Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • remove reads in fasta file

    This must be very simply for many of you.

    I want to remove some sequences in my fasta file based on IDs. All reads with headers
    >chrUn..... need to be removed (both ID and seqs)

    Is there any software out there can do this in a fly?

    thanks

  • #2
    try 'sed '/>chrUn/d' your_file >out_file
    on terminal

    Comment


    • #3
      ^ Does that remove the sequence?

      This thread might be useful:

      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

      Comment


      • #4
        Sorry, I didn't see you need remove the seqs as well, the above does not work.
        But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

        then use syntax

        sed 'n1,n2d' filename.txt

        to remove the lines between n1 and n2.

        Comment


        • #5
          cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

          Comment


          • #6
            Hi,

            thanks for the idea. This will work if all the headers are sorted by their IDs. Otherwise, you end up removing reads inbetweens. e.g. if the file has reads with ID >chr3 in between >chrUn, we will get rid of chr3 too.

            I think my fasta file is sorted, but I can't take the risk of what if.


            Originally posted by Jiafen View Post
            Sorry, I didn't see you need remove the seqs as well, the above does not work.
            But I can figure out a stupid way, you could grep -n >chrUn you_file to find the rows of >chrUn

            then use syntax

            sed 'n1,n2d' filename.txt

            to remove the lines between n1 and n2.

            Comment


            • #7
              It would be wise to learn a scripting/programming language, e.g. Perl, Python or Ruby. Here's how I would solve this using Biopython (untested):
              Code:
              from Bio import SeqIO
              wanted = (rec for rec in SeqIO.parse("input.fq", "fastq") \
                                      if not rec.id.startswith("chrUn"))
              count = SeqIO.write(wanted, "filtered.fq", "fastq")
              print "Saved %i records" % count
              Last edited by maubp; 10-25-2012, 01:49 PM. Reason: split long line

              Comment


              • #8
                My typical way of solving this with perl

                Code:
                #!/usr/bin/perl
                use strict;
                my $off=0;
                while ($_ = <>)
                {
                   if (/^>/)
                  {
                    $off=/^>chrUn/;
                  }
                  print $_ if (! $off);
                }
                It is easy to modify this to read a list of ids to include/exclude into a hash & instead key on those ids.

                In a pinch, you can do this as a one-liner
                Code:
                perl -p -e '$off=/^>chrUn/ if (/^>/); print $_ if (! $off) ' myfile.fa

                Comment


                • #9
                  Hi Richard,

                  Thanks. It works nicely. Very elegant use of censor.

                  How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.




                  Originally posted by Richard Finney View Post
                  cat whatever.fasta | awk '{if (substr($0,1,6) == ">chrUn") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > fixed.fasta

                  Comment


                  • #10
                    Ok, I got it. Im a bit slow this morning

                    Originally posted by JQL View Post
                    Hi Richard,

                    Thanks. It works nicely. Very elegant use of censor.

                    How does the seqeunces under ID >chrUn got removed? When it finds ID '>chrUn...', it 'censors' it as 1, otherwise, 0. It first appears that only headers are printed.

                    Comment


                    • #11
                      "censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
                      It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
                      Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

                      The usually hiearchy of dealing with tasks is:
                      If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
                      If it's a tad harder: use 1) perl or 2) bash script or 3) python
                      If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

                      There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

                      Comment


                      • #12
                        thanks Krobison. My perl is a bit rusty since I haven't used it since grad school. The first multi-liners worked nicely for me.
                        a bit unclear with the one-liner.
                        I have a test file
                        ran your one-liner, the output looks like this:
                        Code:
                        $ perl -p -e '$off=/^>AAA/ if (/^>/); print $_ if (! $off) ' test
                        >AAA
                        asdfdfdfdferer
                        >BBB
                        >BBB
                        rereere
                        rereere
                        >AAA
                        dferere
                        >CCC
                        >CCC
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        >AAA
                        dddddddd
                        >BBB
                        >BBB
                        dfd
                        dfd
                        >AA
                        >AA
                        dfere
                        dfere
                        Code:
                        $ more test
                        >AAA
                        asdfdfdfdferer
                        >BBB
                        rereere
                        >AAA
                        dferere
                        >CCC
                        rereregdreeeeeeeeeeeeeeeeeeeee
                        >AAA
                        dddddddd
                        >BBB
                        dfd
                        >AA
                        dfere
                        Last edited by JQL; 10-26-2012, 06:07 AM.

                        Comment


                        • #13
                          good advice!

                          thanks everyone for sharing.

                          Originally posted by Richard Finney View Post
                          "censor" is just a name for a "flag". "flag" means "boolean value" , i.e. zero or one.
                          It could be called "censor_flag" or "p" or "j" or "QWERzxcwer". whatever.
                          Think of is as "mode". Mode is either on or off. If mode "on", print out, else mode is off and don't print, i.e. "censor". While "chomping" line by line on a file, if there's a ">chrUn", mode is off, else if there's a ">" it's mode on. Then check "mode", if not is not off (i.e. "on"), then print.

                          The usually hiearchy of dealing with tasks is:
                          If it's simple, just use 1) sed and/or 2) awk and/or 3) grep and/or 4) sort and/or uniq) and/or other unix command line tools.
                          If it's a tad harder: use 1) perl or 2) bash script or 3) python
                          If it's more complicated: use "C", "Java", "C++", "R" ,etc. I..e. a "high level language".

                          There's considerable blurring between what's "just a scripting language" and "a serious, fast high level" tool. Python and Perl are not just "scripting languages" and "C" can be used for a one pass crunching of a flat file. Python's easier but if it's a critical part of a pipeline where every microsecond counts, "C" might be better as it's often many times faster.

                          Comment


                          • #14
                            remove from fasta file all the reads with Id into a txt file

                            Hi...
                            I do have a similar problem...

                            I have a txt file with all the reads ID (those are the reads I want to remove from my fasta file). How can I use this file to remove the reads I don't want from my fasta???

                            can anyone give me some advice?

                            Thanks

                            F.

                            Comment


                            • #15
                              For those of you with a local Galaxy instance, ask your administrator to install this tool:


                              You can also run the Python script it calls directly at the command line if you want, current version of the source code is here:

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Recent Advances in Sequencing Analysis Tools
                                by seqadmin


                                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                                05-06-2024, 07:48 AM
                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 05-14-2024, 07:03 AM
                              0 responses
                              19 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-10-2024, 06:35 AM
                              0 responses
                              44 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-09-2024, 02:46 PM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 05-07-2024, 06:57 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X