Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Perl script for annotating DEGseq results

    Hi all,

    My output files from DEGseq and edgeR all contain differentially expressed GeneID's from a gff file and I would like a script that returns the "product=" annotation of those GeneID's from the gff file.

    Anyone have a script for this already?

  • #2
    Since gff files can vary, can you post a few lines?

    Comment


    • #3
      Sorry I didn't add this before. It's actually a gff file that I've extracted only the CDS from using a perl script. First three lines:

      SL254 RefSeq CDS 337 2796 . + 0 ID=SL254:thrA:unknown_transcript_1;Parent=SL254:thrA;locus_tag=SNSL254_A0002;EC_number=2.7.2.4;EC_number=1.1.1.3;note=multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine%3B functions in a number of amino acid biosynthetic pathways;transl_table=11;product=bifunctional aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_002039229.1;db_xref=GI:194446390;db_xref=GeneID:6487056;exon_number=1
      SL254 RefSeq CDS 2801 3727 . + 0 ID=SL254:thrB:unknown_transcript_1;Parent=SL254:thrB;locus_tag=SNSL254_A0003;EC_number=2.7.1.39;note=catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate;transl_table=11;product=homoserine kinase;protein_id=YP_002039230.1;db_xref=GI:194444692;db_xref=GeneID:6486461;exon_number=1
      SL254 RefSeq CDS 3734 5017 . + 0 ID=SL254:thrC:unknown_transcript_1;Parent=SL254:thrC;locus_tag=SNSL254_A0004;EC_number=4.2.3.1;note=catalyzes the formation of L-threonine from O-phospho-L-homoserine;transl_table=11;product=threonine synthase;protein_id=YP_002039231.1;db_xref=GI:194444103;db_xref=GeneID:6484714;exon_number=1

      Comment


      • #4
        It's not perl, but this worked for me:

        cut -f9 your_file.gff | awk -F\; '{print $(NF-4)}'

        Comment


        • #5
          That's great! Could you explain how that works? Also how to write the results to a file and possibly make the output 2 columns, one column would be "GeneID" and one column would be "product"?

          Comment


          • #6
            This will get the GeneID and the product fields and output them as a tab-separated file.

            cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file

            cut -f9 takes the 9th field in your_file.gff.
            "|" pipes (sends) the output to the next command.
            Under awk, the -F\; option tells it the input has ";" as a field separator. OFS="\t" makes the output a tab delimited file. print $(NF-1) prints the field to the left of the last field separator, which is the one containing the geneID. print $(NF-4) prints the field to the left of the fourth field separator from the end of the line, which is the one containing the product.

            There are a huge number of ways to parse out the data you want under unix or perl. This was just the first one that came to mind since I've been scripting in awk all morning...
            Last edited by pbluescript; 06-15-2012, 10:47 AM. Reason: Edited based on GenoMax's suggestion.

            Comment


            • #7
              Small extension as noted below to capture the output in a file.

              Originally posted by pbluescript View Post
              This will get the GeneID and the product fields and output them as a tab-separated file.

              cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file_name

              Comment


              • #8
                Originally posted by GenoMax View Post
                Small extension as noted below to capture the output in a file.
                Thank you! I overlooked that part.

                Comment


                • #9
                  Thank you. I am a microbiologist with a bit of coding experience, I am trying to learn perl on my own, but needed this quick fix because I am sitting on a mound of data.

                  Mahalo,

                  -K

                  Comment


                  • #10
                    Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                    mahalo

                    Comment


                    • #11
                      Originally posted by umnklang View Post
                      Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                      mahalo
                      Isn't between "product=" and ";" what you already asked for?

                      Anyway, getting everything to the right of "product=" is just a variation of the code I already mentioned.

                      Code:
                      cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-4),$(NF-3),$(NF-2),$(NF-1),$(NF-0)}'

                      Comment


                      • #12
                        Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

                        Comment


                        • #13
                          Originally posted by umnklang View Post
                          Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.
                          Ah... that would be an issue then.

                          Here's a couple ways that worked for me.
                          Sticking with the cut/awk strategy, but changing the FS to "product=". You are essentially using your search term as a delimiter and then getting everything after that.

                          Code:
                          cut -f9 file.gff | awk '{FS="product="} {print "product="$(NF-0)}'
                          This works, but it is ugly.

                          Here's a better way:
                          Code:
                          sed "s/.*product=/product=/" file.gff

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Non-Coding RNA Research and Technologies
                            by seqadmin




                            Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                            Nobel Prize for MicroRNA Discovery
                            This week,...
                            10-07-2024, 08:07 AM
                          • seqadmin
                            Recent Developments in Metagenomics
                            by seqadmin





                            Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                            09-23-2024, 06:35 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Today, 06:55 AM
                          0 responses
                          9 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-02-2024, 04:51 AM
                          0 responses
                          105 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-01-2024, 07:10 AM
                          0 responses
                          114 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 09-30-2024, 08:33 AM
                          1 response
                          118 views
                          0 likes
                          Last Post EmiTom
                          by EmiTom
                           
                          Working...
                          X