  • upendra_35
    Senior Member
    • Apr 2010
    • 102

    How do I filter row names based on a column value?

    I have a data frame with two columns (target_id and fpkm). I want to keep only the IDs in the first column that are not duplicated. Where an ID is duplicated, I would like to keep just one row, chosen by the value in column 2. An example is below.

    target_id fpkm
    comp247393_c0_seq1 3.197885
    comp257058_c0_seq4 1.624577
    comp242590_c0_seq1 1.750319
    comp77911_c0_seq1 1.293059
    comp241426_c0_seq1 1.626589
    comp288413_c0_seq1 14.828853
    comp294436_c0_seq1 11.555596
    comp63603_c0_seq1 1.982386
    comp267138_c0_seq1 8.594494
    comp267138_c0_seq2 11.134958
    comp321623_c0_seq1 6.934149

    In the data frame above, two row names are almost identical: comp267138_c0_seq1 and comp267138_c0_seq2. I want to keep only comp267138_c0_seq2 because it has the higher value in column 2. Please help me with this.
  • rflrob
    Member
    • May 2010
    • 50

    #2
    Assuming you want to keep the seq number, it could be done with a moderately simple Python script:
    Code:
    with open('file_name') as fh:
        print(fh.readline().rstrip())  # Print the header line unchanged
        best_lines = {}  # id_base -> (fpkm, id_seqnum) of the best line seen so far
        for line in fh:
            if not line.strip():
                continue  # Skip blank lines
            target_id, fpkm = line.split()
            fpkm = float(fpkm)  # Turn into a number
            # Assume everything before _seqN identifies the same transcript
            id_base, id_seqnum = target_id.rsplit('_', 1)

            # Keep this line if the base ID is new or its fpkm beats the stored one
            if id_base not in best_lines or fpkm > best_lines[id_base][0]:
                best_lines[id_base] = (fpkm, id_seqnum)

    for id_base, (fpkm, id_seqnum) in best_lines.items():
        print(id_base + "_" + id_seqnum, fpkm)

    This won't necessarily retain the original order of the file (plain dicts only guarantee insertion order from Python 3.7 onward), but it does handle the case where, for instance, comp267138_c0_seq1 and comp267138_c0_seq2 aren't on adjacent lines.
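
    If the output order does matter, one option is to store each winner under its base ID and let the dict keep the order in which base IDs first appear (guaranteed from Python 3.7). This is only a minimal sketch: it assumes whitespace-separated columns, IDs ending in _seqN, and that the caller has already skipped the header line. The function name best_per_base is made up for illustration.

```python
def best_per_base(lines):
    """Return (target_id, fpkm) pairs, one per base ID, keeping the highest fpkm.

    Output follows the order in which each base ID first appears in the input.
    """
    best = {}  # id_base -> (target_id, fpkm); dicts keep insertion order (3.7+)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        target_id, fpkm = fields[0], float(fields[1])
        id_base = target_id.rsplit('_', 1)[0]  # everything before _seqN
        if id_base not in best or fpkm > best[id_base][1]:
            best[id_base] = (target_id, fpkm)  # updating a key keeps its position
    return list(best.values())


# A few rows taken from the example in the question (header already removed)
rows = [
    "comp63603_c0_seq1 1.982386",
    "comp267138_c0_seq1 8.594494",
    "comp267138_c0_seq2 11.134958",
    "comp321623_c0_seq1 6.934149",
]
for target_id, fpkm in best_per_base(rows):
    print(target_id, fpkm)
```

    Because a replaced entry stays in the slot where its base ID was first seen, the winner may print earlier than its own line appeared in the file.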


    • upendra_35
      Senior Member
      • Apr 2010
      • 102

      #3
      Hi rflrob, it worked perfectly. I had been struggling to write something like this in Perl for a while but couldn't get it to work, and your script worked like a charm. The order of the IDs doesn't matter to me as long as the duplicates are filtered out. Thanks a lot again.
