  • Concatenating files in order with Perl?

    Hello everyone,
    I have some FASTA files like this:
    >ARP3_HUMAN
    ----------------------------------------------MAGRLPACVVDCGT
    GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
    TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
    FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
    PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
    KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
    PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
    RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
    >ARP3_BOVIN
    ----------------------------------------------MAGRLPACVVDCGT
    GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
    TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
    FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
    PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
    KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
    PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
    RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS

    I want to concatenate these files so that each species' sequence is joined with the same species' sequence from the other files, where the species may appear in a different order than in the example file pasted above. Any suggestions for concatenating these files with a Perl script? Thanks in advance.

  • #2
    Try this:

    Code:
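    #!/bin/bash
    # species_splitter.sh
    # Stage 1 puts each record on one line (header, tabs, sequence).
    # Stage 2 appends each record to <SPECIES>.fa, where SPECIES is the header text after the first "_".
    # Output files are opened in append mode, so remove old outputs before re-running.

    perl -e 'while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $_ .= "\t";} else {s/ //g}; print }' "$1" | perl -lane '$species=$1 if $F[0] =~ /_(\S+)/; open(OUT,">>","$species.fa") or die "$species.fa: $!"; print OUT ">$F[0]\n$F[1]"; close(OUT)'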
    and run it as

    Code:
    bash species_splitter.sh  input.fasta
    Your output will be "HUMAN.fa" and "BOVIN.fa".

    • #3
      Thanks zee.
      Because I am not familiar with programming, when I ran your script I saw:
      species_splitter.sh: species_splitter.sh: No such file or directory
      What can I do next?
      Thanks

      • #4
        Oh, I left out the following because I thought it was obvious:

        Save all the code in the first listing to a file called "species_splitter.sh" using a text editor.

        Then run "bash species_splitter.sh <your input fasta file>"

        • #5
          Thanks zee,
          It runs, but the result is still empty. Let's assume we have just one file like that, and I want to concatenate only the sequences from the same species and again get a single file as the result. What would the script be?
          Thanks

          • #6
            I don't understand the question. In the example you pasted, two files will be created by the script:

            1. BOVIN.fa
            2. HUMAN.fa

            So which file are you saying is empty? Can you tell me whether you got files with the same species in each one?

            • #7
              The script works. It takes just ONE file as input, a single large multi-FASTA, like:

              Code:
              >ARP3_HUMAN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
              KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
              PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
              RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
              KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
              PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
              RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              ~
              By calling species_splitter.sh test_file.fa you get two FASTA files. HUMAN.fa looks like:
              Code:
              >ARP3_HUMAN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHIPIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWIKQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRRPLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQRYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              and BOVIN.fa:
              Code:
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHIPIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWIKQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRRPLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQRYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI

              This is probably not exactly what you want, or is it?

              • #8
                I think I misunderstood your question. You would like to separate the sequences not by species names but by GENE names. Is that correct?

                • #9
                  Well, I guess it's still the "old" problem:

                  Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                  But for me it's also not that clear.

                  • #10
                    Sorry, this may be my fault, because I pasted only part of the file. Now I want to concatenate within a file. So assume I have one file with some human FASTA records, some bovine FASTA records, and so on, and the result should again be one file.
                    Thanks

                    • #11
                      To concatenate files, do:

                      cat file1.fa file2.fa > bigfile.fa

                      • #12
                        But there is a problem. I want all protein sequences of a given species under one header line. For example, if we have:
                        >APA human
                        AAAAAAAAAAAA
                        >TSA human
                        BBBBBBBBBBBBB
                        I need this:
                        >human
                        AAAAAAAAAAAA
                        BBBBBBBBBBBBB

                        Thanks for your help.

                        • #13
                          OK, to make it clear: in one file you have something like

                          Code:
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
                          PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >HUMAN
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVR
                          >HUMAN
                          -----PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          and the output should be:
                          Code:
                          >HUMAN
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVR-----PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
                          PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          Correct?

                          Edit: I saw your comment too late. So you even have different headers like >xxx human and >yyy human, which should all go under >human?

                          It's getting complicated. :-/
                          Last edited by Thorondor; 02-28-2011, 06:42 AM.

                          • #14
                            OK, how large is your input file? Could you attach the whole file to this thread?

                            • #15
                              Yes, I want to concatenate all sequences of a given species under one header for that species. It does not matter that the first part of the header differs because the proteins are different, as long as the record belongs to that species, like:
                              APT_human
                              BTC_human
                              Thanks
