Using AWK to perform VLOOKUP-like command


  • Using AWK to perform VLOOKUP-like command

    Hi all,

    I have a set of files with genotypic data, each with three columns: MARKER, ID, GENOTYPE.

    I would like to use AWK (without changing the order/sorting of the files) to perform a VLOOKUP-like command in order to join the data within the files into a single file as follows:

    File1:

    BIEC2-99962 HOR_233 G_G
    BIEC2-9997 HOR_233 A_G
    BIEC2-999748 HOR_233 C_C
    BIEC2-999848 HOR_233 G_G
    BIEC2-99989 HOR_233 A_A

    File2:

    BIEC2-9997 HOR_250 A_A
    BIEC2-999748 HOR_250 C_C
    BIEC2-99989 HOR_250 A_C

    File3:

    BIEC2-9997 HOR_615 A_G
    BIEC2-999748 HOR_615 A_C
    BIEC2-999848 HOR_615 A_G
    BIEC2-99989 HOR_615 A_C

    Expected result:

    BIEC2-99962 G_G NA NA
    BIEC2-9997 A_G A_A A_G
    BIEC2-999748 C_C C_C A_C
    BIEC2-999848 G_G NA A_G
    BIEC2-99989 A_A A_C A_C

    I would appreciate any help on this.

    Thanks!
    Last edited by sagi.polani; 02-19-2015, 12:25 AM.

  • #2
    Cross-posted on biostars.


    • #3
      You could probably modify this script for your input sets:

      Code:
      #!/usr/bin/env python
      
      input_one = '''
      BIEC2-99962 HOR_233 G_G
      BIEC2-9997 HOR_233 A_G
      BIEC2-999748 HOR_233 C_C
      BIEC2-999848 HOR_233 G_G
      BIEC2-99989 HOR_233 A_A
      '''.strip().split()
      
      input_two = '''
      BIEC2-9997 HOR_250 A_A
      BIEC2-999748 HOR_250 C_C
      BIEC2-99989 HOR_250 A_C
      '''.strip().split()
      
      input_three = '''
      BIEC2-9997 HOR_615 A_G
      BIEC2-999748 HOR_615 A_C
      BIEC2-999848 HOR_615 A_G
      BIEC2-99989 HOR_615 A_C
      '''.strip().split()
      
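      # Every third token starting at 0 is a MARKER; starting at 2, a GENOTYPE.
      one_list = input_one[::3]  # marker order of the first file, kept for output
      one_dict = dict(zip(one_list, input_one[2::3]))
      two_dict = dict(zip(input_two[::3], input_two[2::3]))
      three_dict = dict(zip(input_three[::3], input_three[2::3]))
      
      # One row per marker in the first file; 'NA' where a later file lacks it.
      print('\n'.join([' '.join([k, one_dict[k], two_dict.get(k, 'NA'), three_dict.get(k, 'NA')]) for k in one_list]))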
      The output looks like:

      Code:
      $ ./join_test.py
      BIEC2-99962 G_G NA NA
      BIEC2-9997 A_G A_A A_G
      BIEC2-999748 C_C C_C A_C
      BIEC2-999848 G_G NA A_G
      BIEC2-99989 A_A A_C A_C
      This seems to match your expected output.

      If you want to understand how the script works, use some print statements for each variable before the list comprehension, and then break the list comprehension down into smaller pieces.
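As one way of breaking the list comprehension into smaller pieces, here is a rough sketch that does the same join with an explicit loop. The function name join_rows and the names marker and fields are only illustrative, and the sample data is copied from the first post.

```python
# Sample data from the thread: tokens repeat as MARKER, ID, GENOTYPE.
input_one = """
BIEC2-99962 HOR_233 G_G
BIEC2-9997 HOR_233 A_G
BIEC2-999748 HOR_233 C_C
BIEC2-999848 HOR_233 G_G
BIEC2-99989 HOR_233 A_A
""".split()

input_two = """
BIEC2-9997 HOR_250 A_A
BIEC2-999748 HOR_250 C_C
BIEC2-99989 HOR_250 A_C
""".split()

input_three = """
BIEC2-9997 HOR_615 A_G
BIEC2-999748 HOR_615 A_C
BIEC2-999848 HOR_615 A_G
BIEC2-99989 HOR_615 A_C
""".split()


def join_rows(one, two, three, missing="NA"):
    """Join genotypes by marker, keeping the marker order of `one`."""
    one_list = one[::3]                           # markers of the first file
    one_dict = dict(zip(one_list, one[2::3]))     # marker -> genotype
    two_dict = dict(zip(two[::3], two[2::3]))
    three_dict = dict(zip(three[::3], three[2::3]))

    rows = []
    for marker in one_list:
        # .get() returns `missing` when a marker is absent from that file.
        fields = [
            marker,
            one_dict[marker],
            two_dict.get(marker, missing),
            three_dict.get(marker, missing),
        ]
        rows.append(" ".join(fields))
    return rows


rows = join_rows(input_one, input_two, input_three)
print("\n".join(rows))
```

Printing one_list, one_dict and fields inside the loop shows what each step contributes.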

      Ultimately, you would replace the lists input_one, input_two and input_three with the contents of your input files, read in with open() and readlines().

      Remember to strip() and split() so that each element of the list is separated from the others, regardless of whether the delimiter is a space or newline — use print to investigate one of the sample input lists, if this requirement isn't clear.
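A minimal sketch of reading the real files instead of the inline strings might look like the following. It assumes whitespace-delimited files with at least three columns per line; the helper names read_genotypes and join_files are made up for illustration, and it iterates over the file handle line by line rather than calling readlines(), which gives the same lines. Like the script above, it only reports markers that appear in the first file.

```python
def read_genotypes(path):
    """Return (markers in file order, {marker: genotype}) for one file."""
    markers = []
    genotypes = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # split() also discards the trailing newline
            if len(fields) < 3:
                continue           # skip blank or malformed lines
            marker, genotype = fields[0], fields[2]
            markers.append(marker)
            genotypes[marker] = genotype
    return markers, genotypes


def join_files(paths, missing="NA"):
    """Join the genotype columns of all files, ordered as in the first file."""
    tables = [read_genotypes(path) for path in paths]
    first_markers = tables[0][0]
    rows = []
    for marker in first_markers:
        # One genotype per file, with `missing` where the marker is absent.
        row = [marker] + [genos.get(marker, missing) for _, genos in tables]
        rows.append(" ".join(row))
    return rows
```

With the three files from the first post saved as, say, File1, File2 and File3 (hypothetical names), calling print("\n".join(join_files(["File1", "File2", "File3"]))) should print the joined table in the order of File1.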

      I'd second that awk is not really the ideal tool for this job, and I use it a great deal.
