  • satishg
    Member
    • Aug 2014
    • 15

    Fasta File Editing

    I have a file with the following text:

    >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >NSVNPDVSQHSPERHFHTSEGTLC

    I need to change it by adding numbers and moving each amino acid sequence to the next line, basically into FASTA format, as follows:
    >1
    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
    >2
    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
    >3
    NSVNPDVSQHSPERHFHTSEGTLC
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Code:
    cat foo | sed 's/>//' | awk '{idx+=1;printf(">%i\n%s\n",idx,$0)}'
    or
    Code:
    cat foo | awk '{idx+=1;$1=substr($1,2,length($1));printf(">%i\n%s\n",idx,$1)}'
    or
    Code:
    cat foo | awk '{idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);print $1}'
    among many other possibilities. You'll find that familiarizing yourself with the command line will come in useful.
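
As a quick check of the first variant, here is a minimal sketch run against a throwaway copy of two sequences from the question. The temporary directory and the `out` variable are assumptions made for illustration, and the `sed` pattern is anchored with `^` as a small precaution. The counter advances on every input line, so this assumes each sequence sits on a single line.

```shell
# Throwaway working directory (illustrative, not part of the original commands).
tmp=$(mktemp -d)
printf '>APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL\n>NSVNPDVSQHSPERHFHTSEGTLC\n' > "$tmp/foo"

# Strip the leading ">" from each line, then print a numbered header before it.
out=$(sed 's/^>//' "$tmp/foo" | awk '{ idx += 1; printf(">%d\n%s\n", idx, $0) }')
printf '%s\n' "$out"

rm -rf "$tmp"
```

This should print `>1`, the first sequence, `>2`, and the second sequence, each on its own line.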


    • bckirkup
      Member
      • Jan 2011
      • 17

      #3
      Also try jEdit.

      Regex and BeanShell can sort your problem out.


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        This should work:
        Code:
        $ perl -p -i.bak -e '$c+=1; s/>/>$c\n/g' your_file


        • satishg
          Member
          • Aug 2014
          • 15

          #5
          Thanks GenoMax. The output is as follows:

          >1

          >2
          >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
          >3

          >4
          >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
          >5

          The order of the sequences is right, but it's introducing blank sequences at >1, >3 and >5.

          Could you please look into it?


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            What OS are you doing this on? Did you edit/open this file on a PC/Mac?

            NOTE: Before you edit or change a file, it is important to make a backup copy (especially if you spent a day or two getting it). I have added a cp command below that preserves an original copy should you need to go back to it.

            Try the following before you use the perl command. It converts the file from Windows to Unix line endings, in case that is the issue, though I am not certain it is. The perl command saved a backup of the original with a .bak extension and modified the original in place, so you can't use the original as it is now. You will need to copy the .bak file back to the original name before you try this:

            Code:
            $ cp your_file.bak your_file.ORIG
            $ cp your_file.bak your_file
            $ awk '{ sub(/\r$/,""); print }' your_file
            Last edited by GenoMax; 08-11-2014, 04:25 PM. Reason: Added notes about keeping an original backup copy
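
A minimal sketch of how one might check for Windows line endings and keep the converted result, assuming a hypothetical two-line file. Because awk prints to standard output and leaves the input file untouched, the sketch redirects to a new file. The name `your_file.unix`, the temporary directory, and the shortened sequences are illustrative and not from the thread.

```shell
tmp=$(mktemp -d)
# Hypothetical input with Windows (CRLF) line endings.
printf '>APEG\r\n>NSVN\r\n' > "$tmp/your_file"

# Count lines ending in a carriage return; a non-zero count suggests CRLF endings.
crlf_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$tmp/your_file")

# Strip the trailing carriage return and save the result to a new file.
awk '{ sub(/\r$/, ""); print }' "$tmp/your_file" > "$tmp/your_file.unix"

# Re-count on the converted copy.
remaining=$(awk '/\r$/ { n++ } END { print n + 0 }' "$tmp/your_file.unix")
echo "before: $crlf_lines, after: $remaining"

rm -rf "$tmp"
```

If the first count is already zero on the real file, line endings are probably not the cause and the conversion changes nothing.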


            • rnaeye
              Member
              • May 2011
              • 80

              #7
              Code:
              sed 's/>//' inputFile | awk '{print ">"NR"\n"$0}'


              • satishg
                Member
                • Aug 2014
                • 15

                #8
                GenoMax - that didn't do anything. The .bak file has no numbers assigned and when I ran the awk command that was suggested it didn't make any changes or add numbers to the output file.

                Thanks rnaeye. Sequence #5 in the original file spans two lines, and the code turns its second line into sequence #6 in the output. I probably need to change the number of characters per line in the original file. Please advise.

                The following are the input and output files:

                INPUT-
                >APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >NSVNPDVSQHSPERHFHTSEGTLC
                >AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                OUTPUT-
                >1
                APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                >2
                ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                >3
                NSVNPDVSQHSPERHFHTSEGTLC
                >4
                AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                >5
                VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                >6
                KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                >7
                MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG


                • satishg
                  Member
                  • Aug 2014
                  • 15

                  #9
                  Thanks dpryan - the third command works, but it skips a number after any sequence that spans two lines. For example, sequence #5 spans two lines, so the output goes from >5 to >7, skipping >6. This shows it better:

                  >4
                  AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                  >5
                  VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHY
                  KEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                  >7
                  MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                  I can live with it for now. I'll follow your advice and try to familiarize myself with the command line. Could you please fix the bug in the third command and let me know?


                  • satishg
                    Member
                    • Aug 2014
                    • 15

                    #10
                    Thanks all - I still have the issue with numbering sequences in order, however. I removed the line break and the output file is now:

                    >1
                    APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                    >2
                    ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                    >3
                    NSVNPDVSQHSPERHFHTSEGTLC
                    >4
                    AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                    >5
                    VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                    >7
                    MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG

                    Please help me fix the issue of numbering sequences in order.


                    • dpryan
                      Devon Ryan
                      • Jul 2011
                      • 3478

                      #11
                      That's less a bug than a feature request, but in any case it's pretty trivial to add support for multi-line entries:

                      Code:
                      cat foo | awk '{if(substr($1,1,1)==">"){idx+=1;sub(/>/,sprintf(">%i\n",idx),$1);}print $1}'
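
The key point is that only lines starting with ">" advance the counter. A minimal sketch of the same idea that also joins wrapped lines into one sequence line, which is an addition beyond the command above. The short sequences, the temporary directory, and the `seq` and `out` names are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical input: the second record is wrapped onto two lines.
printf '>AAAA\n>CCCC\nDDDD\n>EEEE\n' > "$tmp/foo"

out=$(awk '
  /^>/ {
    if (n > 0) print seq      # flush the previous record
    n += 1                    # only header lines advance the counter
    print ">" n
    seq = substr($0, 2)       # sequence text after the ">"
    next
  }
  { seq = seq $0 }            # wrapped line: append to the current record
  END { if (n > 0) print seq }
' "$tmp/foo")
printf '%s\n' "$out"

rm -rf "$tmp"
```

For this input the result is `>1`, `AAAA`, `>2`, `CCCCDDDD`, `>3`, `EEEE`, one per line.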


                      • satishg
                        Member
                        • Aug 2014
                        • 15

                        #12
                        Finally, it all looks good!

                        >1
                        APEGDARPRQSGHPACHELDAADRRQGEIPGVPERRLCDASL
                        >2
                        ADSGGRGGCRRRCGDLPAAALIRGRGDDTDRPVPARRRPGRVRRGAGGPATAAGRARGVDRRAGLRGRA
                        >3
                        NSVNPDVSQHSPERHFHTSEGTLC
                        >4
                        AARHRAGQGARPPGLPPEHQPARRRDRAGAGLGGPASAGAAGRGAGGAATGRAVGAVRADGGR
                        >5
                        VRRLTWHGGGGDIRAFVFFLAKNVKNLDLFGASLFQVASFHPTASLGVSKLVIRSSIFNLLHCNFKKMRLAFFNLLHYKEIRFAMITLIRSTATSGGYGICGFNLLHCHFGEIRFTMITSIRSTATLGGDKIHHGRFDPTYCNFRGIGFMVSLIVTPFSREHDL
                        >6
                        MNGAKAMEGMVCDARGEGDGGDVLQCTGRFGGKLTDLGNLGISEFREIGISESGQTRGKG
                        >7
                        MADPDEVIPTVRDVSDAPFVGSDGSNVILNEDSFGGGDNGLEEFRGEGSMGK

                        Thank you all for your time!


                        • syfo
                          Just a member
                          • Nov 2012
                          • 103

                          #13
                          concise mode:

                          Code:
                          cat input |  awk '/^>/{$1=">"++n"\n"substr($1,2)}1'
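
For anyone puzzling over the idiom: the trailing `1` is an always-true pattern with no action, so awk falls back to its default action and prints the line. A spelled-out sketch of roughly the same logic follows. It assigns to `$0` rather than `$1`, which is an assumption that each line is a single field, and the short sequences and temporary directory are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical input: the second record is wrapped onto two lines.
printf '>AAAA\n>CCCC\nDDDD\n' > "$tmp/input"

out=$(awk '
  /^>/ {                              # header lines only
    n += 1
    $0 = ">" n "\n" substr($0, 2)     # ">SEQ" becomes ">n", newline, "SEQ"
  }
  { print }                           # same effect as the trailing 1
' "$tmp/input")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Unlike joining the lines, this keeps wrapped sequence lines as they are and only renumbers the headers.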
