Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • chudar
    Junior Member
    • Jul 2015
    • 9

    Finding mismatches between two sequences using R

    Hi,

    I am new to R programming. I have two fasta files namely WT and MT that contains 3 protein sequences which is given below

    WT:
    >seq1
    MGFTIIIJKLPADFDEMMJDHGASFWEIHDI
    >seq2
    IIJHGTIPIOIZTRWERQQMSCVYMMDMMLI
    >seq3
    KPOIHUTWMMKLIZGTIIWHSAKDFJGVAHD

    MT:
    >seq1
    TGFTHIHJKLPADFDEMTJDHGASFWEIHDH
    >seq2
    HHJHGTIPIOIZTRWERQQMSCVYTMDTMLH
    >seq3
    KPOIHUTWTTKLIZGTHHWHSAKDFJGVAHD


    WT contains set of fasta sequences without any mutated aminoacids while the MT contains same protein sequences of WT but some amino acid are mutated.

    I want compare the two sequences and find out counts for particular mutations wherever M mutated to T, I mutated to H for every protein and list them like Seq1 MT_counts:xx;IH_counts:YY Seq2:MT_counts:xx; IH_counts:yy Seq3:MT_counts:xx IH_counts:yy

    I have written a small function in R where I read the fasta sequencces using read.fasta from seqinr package and later using two for loops for iterating through all sequences and another for iterating through every letter of seq in WT and comparing with its counterpart in MT.

    Code:
    library(seqinr)
    wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
    mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
    mismatch_finder=function(wt,mt)
    {
    for (i in 1:length(names(wt)))
    { 
      MT_count=0
      IH_count=0
      #for(j in 1:length(wt$seq1))
      for(j in 1:length(wt[[i]]))
      {  
        if(wt[j]=="m" && mt[j]=="t" )
        {
          MT_count=MT_count+1
        }
      else if(wt[j]=="i" && mt[j]=="h" )
        {
          IH_count=IH_count+1
        }
      }
      print(names(wt[i]))
      print(MT_count)
      print(IH_count)
    }
    }
    But it gives me counts for MT and IH as 0 for all seq1 ,seq2 and seq3. I know there are some mutated aminoacid but I am surprised as the function reports the count as 0. Could some guide me where if there is any mistake in comparison. Kindly help me. Thanks in advance.
    Last edited by chudar; 07-31-2015, 02:04 AM. Reason: Updating R script
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").

    Comment

    • chudar
      Junior Member
      • Jul 2015
      • 9

      #3
      Originally posted by dpryan View Post
      If you ask people to help you with an error message then you need to post the error message. I see a number of problems with your code, from using non-existent variables ("wtf") to incorrect syntax ("wt$[j]").
      Dear DpRyan,

      Thank you very much for your concern. I have updated my R script and have edited my question also. When I ran my updated script it expected it give number of mismatches but it gave me only 0 for all my sequences. I dont whether my comparison code is corrrect. Kindly take a look and guide me please. Thanks in advance

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Grr, the site ate my initial reply!

        I assume you meant to write this, since otherwise you're trying to compare a character and a list (wt[i] == "m"), which will always be false and thus yield counts of 0:

        Code:
        for (i in 1:length(names(wt)))
        { 
          MT_count=0
          IH_count=0
          #for(j in 1:length(wt$seq1))
          for(j in 1:length(wt[[i]]))
          {  
            if(wt[[i]][j]=="m" && mt[[i]][j]=="t" )
            {
              MT_count=MT_count+1
            }
          else if(wt[[i]][j]=="i" && mt[[i]][j]=="h" )
            {
              IH_count=IH_count+1
            }
          }
          print(names(wt[i]))
          print(MT_count)
          print(IH_count)
        }
        This could be done more succinctly (and probably with better performance) with:
        Code:
        for (i in 1:length(names(wt))) {
            print(names(wt)[i])
            print(length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t"))))
            print(length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h"))))
        }

        Comment

        • chudar
          Junior Member
          • Jul 2015
          • 9

          #5
          Hi Dpryan,

          Thank you very much for code. it works fine. I would like to output the data into a data frame so that it looks like

          Code:
          ID     MT  IH
          seq1   2    5
          seq2   4    7
          seq3   6    9
          So I edited the code like below where I created an data frame with NA and used it inside the for loop

          Code:
          mismatch=function(wt,mt)
          {
          df=data.frame(ID=NA,MT=NA,IH=NA)
          for (i in 1:length(names(wt))) {
            df$ID[i]=names(wt)[i]
            df$MT[i]= length(intersect(which(wt[[i]] == "m"), which(mt[[i]] == "t")))
            df$IH[i]=length(intersect(which(wt[[i]] == "i"), which(mt[[i]] == "h")))
          }
           return (df)
          }

          but it gives me an error as follows
          Code:
          Error in `$<-.data.frame`(`*tmp*`, "ID", value = c("seq1", "seq2")) : 
            replacement has 2 rows, data has 1
          I have no clue how am I ending up like this.

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Dataframes need to have identical length rows, so you can't add something to ID without also adding values to MT and IH. The actual R way to do this would be something like:

            Code:
            library(seqinr)
            wt=read.fasta("C:/Users/tsekaran/Documents/sample_ref_protein.fasta")
            mt=read.fasta("C:/Users/tsekaran/Documents/sample_mut_protein.fasta")
            
            getDiffs <- function(wt, mt) {
                c(names(wt),
                length(intersect(which(wt == "m"), which(mt == "t"))),
                length(intersect(which(wt == "i"), which(mt == "h"))))
            }
            
            res <- as.data.frame(t(mapply(getDiffs, wt, mt)))
            names(res) <- c("MT","IT")

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              BTW, you could also just use rbind() if you want to keep the for loop.

              Comment

              Latest Articles

              Collapse

              • GATTACAT
                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by GATTACAT
                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                07-01-2026, 11:43 AM
              • SEQadmin2
                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by SEQadmin2


                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                Here are nine questions we think about, in roughly the order they matter, before...
                06-18-2026, 07:11 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, 07-02-2026, 11:08 AM
              0 responses
              9 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-30-2026, 05:37 AM
              0 responses
              13 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-26-2026, 11:10 AM
              0 responses
              20 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-17-2026, 06:09 AM
              0 responses
              54 views
              0 reactions
              Last Post SEQadmin2  
              Working...