  • Concatenating files in order with Perl?

    hello everyone,
    I have some fasta files like this:
    >ARP3_HUMAN
    ----------------------------------------------MAGRLPACVVDCGT
    GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
    TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
    FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
    PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
    KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
    PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
    RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
    >ARP3_BOVIN
    ----------------------------------------------MAGRLPACVVDCGT
    GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
    TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
    FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
    PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
    KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
    PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
    RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS

    I want to concatenate these files so that each species is joined with the same species in the other files, where the species may be in a different order from the example file I pasted. Any suggestions for concatenating these files with a Perl script? Thanks in advance.

  • #2
    Try this:

    Code:
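    #!/bin/bash
    # species_splitter.sh: split a multi-FASTA file into one file per species.
    # The species is the part of the sequence ID after the underscore (ARP3_HUMAN -> HUMAN.fa).
    # The first perl puts each record on one line (ID, description, sequence, tab-separated);
    # the second appends each record to <species>.fa, so delete old output before re-running.
    
    perl -e 'while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $_ .= "\t";} else {s/ //g}; print }' "$1" | perl -F'\t' -lane '$species=$1 if $F[0] =~ /_(\S+)/; open(OUT,">>","$species.fa") or die "$species.fa: $!"; print OUT ">$F[0]\n$F[-1]"'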
    and run it as

    Code:
    bash species_splitter.sh  input.fasta
    Your output will be "HUMAN.fa" and "BOVIN.fa".

    • #3
      Thanks, zee.
      Because I am not familiar with programming, when I ran your script I saw:
      species_splitter.sh: species_splitter.sh: No such file or directory
      What can I do next?
      Thanks

      • #4
        Oh, I left out the following because I thought it was obvious:

        Save all the code in the first listing to a file called "species_splitter.sh" using a text editor.

        Then run "bash species_splitter.sh <your input fasta file>"

        • #5
          Thanks, zee.
          It works, but the result is still empty. Let's assume we have just one file like that, and I want to concatenate only the sequences of the same species and again get a single file as the result. What would the script be?
          Thanks

          • #6
            I don't understand the question. For the example you pasted, the script will create two files:

            1. BOVIN.fa
            2. HUMAN.fa

            So which file are you saying is empty? Can you tell me whether you got files that each contain the sequences of a single species?

            • #7
              the script works, you have just ONE file as input, just one huge multifasta, like:

              Code:
              >ARP3_HUMAN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
              KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
              PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
              RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
              KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
              PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
              RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGT
              GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
              TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
              FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
              ~
              by calling species_splitter.sh test_file.fa

              you get 2 fasta files, HUMAN.fa looks like:
              Code:
              >ARP3_HUMAN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHIPIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWIKQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRRPLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQRYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              and the BOVIN.fa :
              Code:
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHIPIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWIKQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRRPLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQRYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
              >ARP3_BOVIN
              ----------------------------------------------MAGRLPACVVDCGTGYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYATKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFESFNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI

              probably this is not exactly what you want, or is it?

              • #8
                I think I misunderstood your question. You would like to separate the sequences not by species names but by GENE names. Is that correct?

                • #9
                  well i guess it's still the "old" problem:

                  Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                  but for me it's also not that clear.

                  • #10
                    Sorry, that may be my fault, because I pasted just part of the file. Now I want to concatenate within a file. So assume we have one file with some human FASTA records, some BOVIN FASTA records, and so on. The result should again be one file.
                    Thanks

                    • #11
                      To concatenate files, do:

                      cat file1.fa file2.fa > bigfile.fa

                      • #12
                        But there is a problem. I want all the protein sequences of a given species under one header line. For example, if we have:
                        >APA human
                        AAAAAAAAAAAA
                        >TSA human
                        BBBBBBBBBBBBB
                        I need this:
                        >human
                        AAAAAAAAAAAA
                        BBBBBBBBBBBBB

                        thanks for your help

                        • #13
                          ok to make it clear, in one file you have something like

                          Code:
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
                          PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >HUMAN
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVR
                          >HUMAN
                          -----PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          and the output should be:
                          Code:
                          >HUMAN
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVR-----PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS
                          >ARP3_BOVIN
                          ----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          PIAGRDITYFIQQLLRDREVGIPPEQSLETAKAVKERYSYVCPDLVKEFNKYDTDGSKWI
                          KQYTGINAISKKEFSIDVGYERFLGPEIFFHPEFANPDFTQPISEVVDEVIQNCPIDVRR
                          PLYKNIVLSGGSTMFRDFGRRLQRDLKRTVDARLKLSEELSGGRLKPKPIDVQVITHHMQ
                          RYAVWFGGSMLASTPEFYQVCHTKKDYEEIGPSICRHNPVFGVMS----------------------------------------------MAGRLPACVVDCGT
                          GYTKLGYAGNTEPQFIIPSCIAIKESAKVGDQAQRRVMKGVDDLDFFIGDEAIEKP-TYA
                          TKWPIRHGIVEDWDLMERFMEQVIFKYLRAEPEDHYFLLTEPPLNTPENREYTAEIMFES
                          FNVPGLYIAVQAVLALAASWTSRQVGERTLTGTVIDSGDGVTHVIPVAEGYVIGSCIKHI
                          correct?

                          Edit: I saw your comment too late. So you even have different headers, like >xxx human and >yyy human, which should all go under >human?

                          it's getting complicated. :-/
                          Last edited by Thorondor; 02-28-2011, 06:42 AM.

                          • #14
                            OK, how large is your input file? Could you attach the whole file to this thread?

                            • #15
                              Yes, I want to concatenate all the sequences of a given species under one header for that species. It does not matter that the first part of the header differs because the protein is different, as long as it belongs to that species, like:
                              APT_human
                              BTC_human
                              Thanks
