  • GMDickson
    Junior Member
    • Feb 2014
    • 9

    Data format - Unusable to R and Plink

    Hullo! Apologies for the seemingly basic question but I am in knots over this and hope there is some help.

    I have recently 'inherited' an Illumina dataset containing about 36,000 SNP genotypes for about 400 individuals. The data I have is presented as 3 columns in a text file.

    The columns are 'Indiv_ID', 'SNP_ID' and 'Genotype' (so 36000x400 rows in total) and I need the data in some usable format so that I can extract data for 500 specific SNPs. Ideally I would prefer the data in an individual x SNP matrix.

    Usually I have used R to reshape such data, but this file seems to be too big and it just freezes whilst processing. I have also used Plink in the past to extract specific SNP data from 'column' data, but in the text file I have, the genotype is given in an 'AB' format which Plink doesn't accept as a compound genotype. I have attempted to change all As and Bs to 1s and 2s so that I can input them as compound genotypes to Plink, but the software I was using to do this also adds " "s which then need to be removed. And no text editor I have seems to cope with this for so many lines of text.

    I am not able to get this data in any other format and the only other data I have in relation to this is (just) enough for me to create a .map file for Plink. Otherwise it is just the (seemingly) infinite column. I appreciate that this is probably a very simple task all in all, but at the moment I cannot see the wood for the trees, and am going around in circles. I would welcome any starting points or good reference sites to check out!

    Thanks!
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Just write the names of the SNP_IDs that you're interested in to a file and either use grep to get them out or just write a small python/perl/whatever script to do so.
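As a rough illustration of the "small script" route, here is a minimal Python sketch. It assumes the genotype file is tab-delimited with a header row and the three columns in the order Indiv_ID, SNP_ID, Genotype, and that the SNPs of interest are listed one per line; none of that is confirmed in the thread. The function names and the demo data are made up for the example. The file is streamed row by row, so only the rows for the wanted SNPs are ever held in memory, and the optional A/B to 1/2 recoding is applied to the genotype column only, without adding quote marks.

```python
import csv
import io

# Hypothetical recoding table: Illumina-style A/B alleles to 1/2.
AB_TO_12 = str.maketrans("AB", "12")


def load_wanted(handle):
    """Read one SNP ID per line, ignoring blank lines."""
    return {line.strip() for line in handle if line.strip()}


def long_to_matrix(rows, wanted, recode=None):
    """Stream (indiv, snp, genotype) rows, keeping only the wanted SNPs.

    Returns a nested dict of the form {indiv: {snp: genotype}}.
    """
    table = {}
    for indiv, snp, geno in rows:
        if snp not in wanted:
            continue  # skip the ~35,500 SNPs that are not needed
        if recode is not None:
            geno = geno.translate(recode)
        table.setdefault(indiv, {})[snp] = geno
    return table


def write_matrix(table, snps, out, missing="NA"):
    """Write an individual x SNP matrix as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Indiv_ID"] + snps)
    for indiv in sorted(table):
        writer.writerow([indiv] + [table[indiv].get(s, missing) for s in snps])


# Tiny made-up dataset standing in for the real 3-column file.
demo_long = io.StringIO(
    "Indiv_ID\tSNP_ID\tGenotype\n"
    "ind1\trs1\tAA\n"
    "ind1\trs2\tAB\n"
    "ind1\trs3\tAB\n"
    "ind2\trs1\tBB\n"
    "ind2\trs2\tAA\n"
)
reader = csv.reader(demo_long, delimiter="\t")
next(reader)  # skip the header row (assumed to be present)

wanted = load_wanted(io.StringIO("rs1\nrs3\n"))
table = long_to_matrix(reader, wanted, recode=AB_TO_12)

out = io.StringIO()
write_matrix(table, sorted(wanted), out)
matrix_text = out.getvalue()
print(matrix_text)
```

For the real data, the `io.StringIO` objects would be replaced by `open(...)` handles on the actual files, and the delimiter adjusted if the file is space- or comma-separated. Passing `recode=None` leaves the genotypes in their original AB form.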
