Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Perl script for annotating DEGseq results

    Hi all,

    My output files from DEGseq and edgeR all contain differentially expressed GeneID's from a gff file and I would like a script that returns the "product=" annotation of those GeneID's from the gff file.

    Anyone have a script for this already?

  • #2
    Since gff files can vary, can you post a few lines?

    Comment


    • #3
      Sorry I didn't add this before. It's actually a gff file that I've extracted only the CDS from using a perl script. First three lines:

      SL254 RefSeq CDS 337 2796 . + 0 ID=SL254:thrA:unknown_transcript_1;Parent=SL254:thrA;locus_tag=SNSL254_A0002;EC_number=2.7.2.4;EC_number=1.1.1.3;note=multifunctional homotetrameric enzyme that catalyzes the phosphorylation of aspartate to form aspartyl-4-phosphate as well as conversion of aspartate semialdehyde to homoserine%3B functions in a number of amino acid biosynthetic pathways;transl_table=11;product=bifunctional aspartokinase I%2Fhomoserine dehydrogenase I;protein_id=YP_002039229.1;db_xref=GI:194446390;db_xref=GeneID:6487056;exon_number=1
      SL254 RefSeq CDS 2801 3727 . + 0 ID=SL254:thrB:unknown_transcript_1;Parent=SL254:thrB;locus_tag=SNSL254_A0003;EC_number=2.7.1.39;note=catalyzes the formation of O-phospho-L-homoserine from L-homoserine in threonine biosynthesis from asparate;transl_table=11;product=homoserine kinase;protein_id=YP_002039230.1;db_xref=GI:194444692;db_xref=GeneID:6486461;exon_number=1
      SL254 RefSeq CDS 3734 5017 . + 0 ID=SL254:thrC:unknown_transcript_1;Parent=SL254:thrC;locus_tag=SNSL254_A0004;EC_number=4.2.3.1;note=catalyzes the formation of L-threonine from O-phospho-L-homoserine;transl_table=11;product=threonine synthase;protein_id=YP_002039231.1;db_xref=GI:194444103;db_xref=GeneID:6484714;exon_number=1

      Comment


      • #4
        It's not perl, but this worked for me:

        cut -f9 your_file.gff | awk -F\; '{print $(NF-4)}'

        Comment


        • #5
          That's great! Could you explain how that works? Also how to write the results to a file and possibly make the output 2 columns, one column would be "GeneID" and one column would be "product"?

          Comment


          • #6
            This will get the GeneID and the product fields and output them as a tab-separated file.

            cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file

            cut -f9 takes the 9th field in your_file.gff.
            "|" pipes (sends) the output to the next command.
            Under awk, the -F\; option tells it the input has ";" as a field separator. OFS="\t" makes the output a tab delimited file. print $(NF-1) prints the field to the left of the last field separator, which is the one containing the geneID. print $(NF-4) prints the field to the left of the fourth field separator from the end of the line, which is the one containing the product.

            There are a huge number of ways to parse out the data you want under unix or perl. This was just the first one that came to mind since I've been scripting in awk all morning...
            Last edited by pbluescript; 06-15-2012, 10:47 AM. Reason: Edited based on GenoMax's suggestion.

            Comment


            • #7
              Small extension as noted below to capture the output in a file.

              Originally posted by pbluescript View Post
              This will get the GeneID and the product fields and output them as a tab-separated file.

              cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-1),$(NF-4)}' > output_file_name

              Comment


              • #8
                Originally posted by GenoMax View Post
                Small extension as noted below to capture the output in a file.
                Thank you! I overlooked that part.

                Comment


                • #9
                  Thank you. I am a microbiologist with a bit of coding experience, I am trying to learn perl on my own, but needed this quick fix because I am sitting on a mound of data.

                  Mahalo,

                  -K

                  Comment


                  • #10
                    Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                    mahalo

                    Comment


                    • #11
                      Originally posted by umnklang View Post
                      Can I get awk to print everything to the right of say "product="? or between "product=" and ";"?

                      mahalo
                      Isn't between "product=" and ";" what you already asked for?

                      Anyway, getting everything to the right of "product=" is just a variation of the code I already mentioned.

                      Code:
                      cut -f9 your_file.gff | awk -F\; '{OFS="\t"; print $(NF-4),$(NF-3),$(NF-2),$(NF-1),$(NF-0)}'

                      Comment


                      • #12
                        Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.

                        Comment


                        • #13
                          Originally posted by umnklang View Post
                          Well the above code does work great, but only for those annotations that have the "product" in the 4th position from the end for every CDS. Sometimes the annotations differ between CDS and the "translation table", for example, is on the 4th position for some CDS's within the same gff file.
                          Ah... that would be an issue then.

                          Here's a couple ways that worked for me.
                          Sticking with the cut/awk strategy, but changing the FS to "product=". You are essentially using your search term as a delimiter and then getting everything after that.

                          Code:
                          cut -f9 file.gff | awk '{FS="product="} {print "product="$(NF-0)}'
                          This works, but it is ugly.

                          Here's a better way:
                          Code:
                          sed "s/.*product=/product=/" file.gff

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Genetic Variation in Immunogenetics and Antibody Diversity
                            by seqadmin



                            The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                            11-06-2024, 07:24 PM
                          • seqadmin
                            Choosing Between NGS and qPCR
                            by seqadmin



                            Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                            10-18-2024, 07:11 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 11-08-2024, 11:09 AM
                          0 responses
                          59 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 11-08-2024, 06:13 AM
                          0 responses
                          38 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 11-01-2024, 06:09 AM
                          0 responses
                          35 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-30-2024, 05:31 AM
                          0 responses
                          23 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X