  • How do I filter row names based on a column value?

    I have a data frame with two columns (target_id and fpkm). I want to keep only the entries in the first column that are not duplicated. If an entry is duplicated, I would like to keep only one copy, chosen by its value in column 2. An example is given below.

    target_id fpkm
    comp247393_c0_seq1 3.197885
    comp257058_c0_seq4 1.624577
    comp242590_c0_seq1 1.750319
    comp77911_c0_seq1 1.293059
    comp241426_c0_seq1 1.626589
    comp288413_c0_seq1 14.828853
    comp294436_c0_seq1 11.555596
    comp63603_c0_seq1 1.982386
    comp267138_c0_seq1 8.594494
    comp267138_c0_seq2 11.134958
    comp321623_c0_seq1 6.934149

    In the data frame above there are two row names that are almost identical, comp267138_c0_seq1 and comp267138_c0_seq2, and I want to keep only comp267138_c0_seq2 because it has the higher value in column 2. Please help me with this.

  • #2
    Assuming you want to keep the seq number, it could be done with a moderately simple Python script:
    Code:
    best_lines = {}
    with open('file_name') as fh:
        print(fh.readline().rstrip())  # Print the header line unchanged
        for line in fh:
            if not line.strip():
                continue  # Skip blank lines
            target_id, fpkm = line.split()
            fpkm = float(fpkm)  # Turn into a number
            # Assume everything before the final _seqN is the same
            id_base, id_seqnum = target_id.rsplit('_', 1)

            # Keep the highest fpkm seen so far for each base ID
            if id_base not in best_lines or fpkm > best_lines[id_base][0]:
                best_lines[id_base] = (fpkm, id_seqnum)

    for id_base in best_lines:
        fpkm, id_seqnum = best_lines[id_base]
        print(id_base + "_" + id_seqnum, fpkm)

    The output is not guaranteed to follow the original order of the file (that depends on how your Python version iterates over dictionaries), but the script does handle the case where, for instance, comp267138_c0_seq1 and comp267138_c0_seq2 aren't on adjacent lines.
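
    If the order of the rows matters, the same idea can be wrapped in a function. What follows is only a sketch under some assumptions: the function name best_per_component is made up for illustration, the input is whitespace-separated with exactly two columns, and it relies on dictionaries keeping insertion order, which Python guarantees from version 3.7 on. The sample rows are taken from the question.

```python
def best_per_component(lines):
    """For each ID prefix (everything before the final '_'), keep the row
    with the highest FPKM. Rows come back in order of first appearance
    of each prefix."""
    best = {}  # id_base -> (fpkm, full target_id); insertion-ordered in Python 3.7+
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        target_id, fpkm_text = fields
        try:
            fpkm = float(fpkm_text)
        except ValueError:
            continue  # skip the header row ("target_id fpkm")
        id_base = target_id.rsplit("_", 1)[0]
        if id_base not in best or fpkm > best[id_base][0]:
            best[id_base] = (fpkm, target_id)
    return [(target_id, fpkm) for fpkm, target_id in best.values()]


# A few rows from the example in the question
sample = """target_id fpkm
comp63603_c0_seq1 1.982386
comp267138_c0_seq1 8.594494
comp267138_c0_seq2 11.134958
comp321623_c0_seq1 6.934149
""".splitlines()

for target_id, fpkm in best_per_component(sample):
    print(target_id, fpkm)
```

    On the sample rows this keeps comp267138_c0_seq2 and drops comp267138_c0_seq1. Ties are resolved in favour of whichever row appears first, since a later row only replaces the stored one when its FPKM is strictly higher.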



    • #3
      Hi rflrob, it worked perfectly. I had been struggling for a while to write something like this in Perl but couldn't get it to work, and your script worked like a charm. Don't worry about the order of the IDs; it doesn't matter to me as long as the duplicates are filtered out. Thanks a lot again.
