Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging annotations and fasta headings

    I have a fasta file like this:
    >comp0_c0_seq1 len=204 path=[4976:0-203]
    ATTGTACTTAATCTA.....
    >comp0_c1_seq1 len=222 path=[8835:0-221]
    GTACAATCACGGCACATT......
    and an annotation file like this:
    comp1558_c0_seq1 repressor protein
    comp8142_c0_seq1 aspartate aminotransferase
    comp8357_c0_seq1 l-xylulose reductase
    comp8387_c0_seq1 succinyl- synthetase alpha
    comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
    And I want to replace the fasta files` heading by the annotation to produce:
    >comp0_c0_seq1 xxx protein
    ATTGTACTTAATCTA.....
    >comp0_c1_seq1 xxx reductase
    GTACAATCACGGCACATT......
    Here is what I wrote with python trying to do this:
    f = open("test")
    ano = open("annotation.txt")
    output = open("merged.fasta",'w')
    for line in ano:
    x = line.spilt('\t')
    for line in f:
    str = line.split(' ')
    if x[0] == str[0]:
    output.write(x)
    else:
    output.write(line)
    But it does not work, says "AttributeError: 'str' object has no attribute 'spilt'
    "
    ... any better ideas to do this or found any way to fix the codes?

    Thanks

  • #2
    Hum, looks like you’ve type “spilt” when you mean “split”.

    You could also convert you fasta file into tab format then just use a join command. Ie it could be:

    File1:
    Code:
    someheader1	GATC….
    someheader2	ATCG….
    File2:
    Code:
    someheader1	othername1
    someheader2	othername2
    So then it would be:
    Code:
    sort -k1,1 File1 > File1.sort
    sort -k1,1 File2 > File2.sort
    join -j1 -o1.1,2.2,1.2 -e “NA” -a1 File1.sort File2.sort | sed ’s/ /_/‘ | see ’s/ /\t/‘ > Newfile
    That will give you something that would look like this that could be transformed back in fasta format:

    Code:
    someheader1_othername1	GATC...
    someheader2_othername2	ATCG...
    And careful with “\t” in that code above. In case you don’t know, you’ll probably need to replace it with Ctrl-V-tab, to get it Unix to recognize it.

    Also you can you the fastx toolkit to convert to and from tab format.

    Comment


    • #3
      This is a syntax error. You need to use the proper function name. You have two similarly named functions -- 'spilt' and 'split'. Only one of them is giving you an error.

      [edit: I see that Wallysb01 beat me to the answer ... and in a much nicer fashion :-) ]

      Comment


      • #4
        you may use awk:
        I need to do something similar to this post (but with a twist). That is why I am asking. unix shell: replace by dictionary I have a dictionary(dict.txt). It is space separated and it reads like...


        Usage: awk -f foo.awk dict.dat user.dat



        NR == FNR {
        rep[$1] = $2
        next
        }

        {
        for (key in rep) {
        gsub(key, rep[key])
        }
        print
        }

        Comment


        • #5
          Thank you Wallysb01,
          But the thing is, it might need some levels of "header matching",
          the headers in annotation file are not completed in the fasta file..

          someheader1 othername1
          someheader3 othername2


          [QUOTE]
          someheader1
          ATTTTCTG...
          someheader2
          ATCGGG...
          someheader3
          ATCTTG...
          someheader4
          ATCGAG...[QUOTE]

          by the way, my code still not work after fix the syntax error...
          then I wrote a R code trying to do the same thing,,no error message comes out, but the output all containing the " "..

          ano <- read.table("annotation.txt",sep="\t")
          fasta <- readLines("TrinityS_condensed.fasta")
          i<-1
          j<-1
          while (i<5000){
          ano1 <- paste(">",ano[i,1],sep="")
          while (j<40604){
          str <- strsplit(fasta[j]," ")
          item <- str[[1]]
          if (ano1 == item[1]) {
          fasta[j] <- paste(fasta[j],ano[i,2])
          }
          j <- j+2
          }
          i <- i+1
          }
          dput(fasta,file="mergeoutput")

          Comment


          • #6
            I am sorry, the awk does not work in this case, awk use $1 for first word and $2 for the second word.

            I just wrote a python script below, and I tested it, it worked fine.
            myfasta file

            >comp1558_c0_seq1 len=204 path=[4976:0-203]
            ATTGTACTTAATCTA.....
            >comp8142_c0_seq1 len=222 path=[8835:0-221]
            GTACAATCACGGCACATT......
            >comp8570_c0_seq1 len=232 path=[3344:0-232]
            GCATCGCTTATCGGTTTA......

            annotation file:

            comp1558_c0_seq1 repressor protein
            comp8142_c0_seq1 aspartate aminotransferase
            comp8357_c0_seq1 l-xylulose reductase
            comp8387_c0_seq1 succinyl- synthetase alpha
            comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase

            script:
            Code:
            with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation:
                anotation_dict = {}
                for line in annotation:
                    line = line.split()
                    if line: #test whether it is an empty line
                        anotation_dict[line[0]]=line[1:]
                    else:
                        continue
            
            # really should not parse the fasta file by myself. there are
            # many parsers already there. you can use module from Biopython 
            ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w")
            
            with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta:
                for line in myfasta:
                    if line.startswith (">"):
                        line = line[1:] # skip the ">" character
                        line = line.split()
                        if line[0] in anotation_dict:
                            new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]])
                            ofile.write ( new_line + "\n")
                    else:
                        ofile.write(line)
                            
                            
            ofile.close() # always remember to close the file.
            output:
            >comp1558_c0_seq1 repressor protein
            ATTGTACTTAATCTA.....
            >comp8142_c0_seq1 aspartate aminotransferase
            GTACAATCACGGCACATT......
            >comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
            GCATCGCTTATCGGTTTA......

            Comment


            • #7
              thank you crazyhottommy for you codes,
              it works but somehow the lines were not right..
              in fasta file it should looks like 1,3,5,7,9..... lines are the headings and 2,4,6,8,10.... lines are ATCGs...

              and in the output there are ATCGs take one line and some take multiple lines.

              I`m new to python so not sure what was wrong...

              Comment


              • #8
                I see, just changed the second last line code to:

                ofile.write(line + "\n")

                let me know if it works or not.

                Comment


                • #9
                  it seems it still doesn`t work, the multi-lines issue still exists...
                  It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.

                  Comment


                  • #10
                    Originally posted by Simonli View Post
                    it seems it still doesn`t work, the multi-lines issue still exists...
                    It is probably caused by the maximum line length limit while read in the data---now sure how to fix it.
                    see my updated code. you need to add "\n" at the last line of the code.

                    Comment


                    • #11
                      That simple awk command works fine:
                      Code:
                      awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file
                      cat annotation_file
                      Code:
                      comp1558_c0_seq1 repressor protein
                      comp8142_c0_seq1 aspartate aminotransferase
                      comp8357_c0_seq1 l-xylulose reductase
                      comp8387_c0_seq1 succinyl- synthetase alpha
                      comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase 
                      comp0_c1_seq1 bli
                      comp0_c0_seq1 blahblah blah blah
                      cat fasta_file
                      Code:
                      >comp0_c0_seq1 len=204 path=[4976:0-203]
                      ATTGTACTTAATCTA.....
                      >comp0_c1_seq1 len=222 path=[8835:0-221]
                      GTACAATCACGGCACATATCTATGCA
                      CACAGTCAGCTAC
                      resulting output:
                      Code:
                      >comp0_c0_seq1 blahblah blah blah
                      ATTGTACTTAATCTA.....
                      >comp0_c1_seq1 bli
                      GTACAATCACGGCACATATCTATGCA
                      CACAGTCAGCTAC

                      Comment


                      • #12
                        Thank you syfo,
                        but I`m not sure how to save the output into a txt file... awk command seems just print the result.

                        Comment


                        • #13
                          Originally posted by Simonli View Post
                          Thank you syfo,
                          but I`m not sure how to save the output into a txt file... awk command seems just print the result.
                          Just redirect the default output into a file by adding at the end of the command something like

                          > name-of-your-output-file.txt

                          Comment


                          • #14
                            Originally posted by syfo View Post
                            Just redirect the default output into a file by adding at the end of the command something like

                            > name-of-your-output-file.txt
                            which gives you more explicitly:

                            Code:
                            awk 'NR==FNR{des[">"$1]=$0;next}/^>/ && des[$1]{$0=">"des[$1]}1' annotation_file fasta_file > output-file.txt
                            Last edited by syfo; 08-11-2014, 06:31 AM. Reason: typo

                            Comment


                            • #15
                              You are awesome! Thank you so much!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                Yesterday, 10:59 AM
                              • seqadmin
                                Addressing Off-Target Effects in CRISPR Technologies
                                by seqadmin






                                The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                                08-27-2024, 04:44 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 09-06-2024, 08:02 AM
                              0 responses
                              138 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-03-2024, 08:30 AM
                              0 responses
                              141 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 08-27-2024, 04:40 AM
                              0 responses
                              152 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 08-22-2024, 05:00 AM
                              0 responses
                              395 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X