  • De novo DNA motif search

    Hello,
    I have a set of DNA sequences (from a non-genic region) in which I am searching for a motif. I have no idea what motif width I should look for. Which other parameters are important for getting a better result? Please suggest.
    Thanks.

  • #2
    Using grep for motif searching of raw Illumina data

    sikidiri -

    This is my first post. I am probably 100 hrs away from asking a reasonable question, but the monitor of this forum prompted me to say something. I am glad this forum is here; the contributors are especially knowledgeable and impressive, so it'll be a while before I can contribute. Hopefully when I am in a corner I can find some assistance here.

    I am currently getting under way with Bio-Linux 6, and coming from Windows it's been a chore, but I am doing my best to start from the beginning. Right now I am looking at parakeet Illumina data from the web, with reads of around 90 bp plus quality scores, etc.

    In my upcoming workflow I have the chicken genome to align against, but first I must learn how to trim the low-quality reads and perhaps remove duplicate reads, as I have found there are at times several copies of a read when querying the underlying data.

    I am looking for an enzyme that is a paralog (hypothesis) of an existing enzyme found in birds and reptiles (I have several candidate parent enzymes to go on). These enzymes are highly conserved in their framework across orthologs, so I have certain regions of the protein from which I can search. The enzyme was likely the result of a gene duplication event over 400 Myr ago, so there is a chance it may not align against the chicken genome. Right now I am waiting for my Illumina reads from Emory Univ. We harvested retina tissue and the DNA looked pretty good going in.

    I have been very successful in finding many exons from these proteins by using the grep (pattern) command against raw parakeet Illumina data. I have found 2-3 exons that match highly conserved regions of my unknown protein and also appear to have high quality scores when I look at them (Q20-Q35). This approach strikes me as reasonable as a first "look and see" at the raw reads. If the queries are designed across the conserved regions, I generally obtain about 15 hits over 22 gigs of reads, with each query taking about 10-15 minutes. I am running a Mac Pro with OS X. Right now I have 32 gigs of RAM but Bio-Linux (via VirtualBox) only sees 16 gigs, so that is a problem I need to research.

    I made up a quick table for searching protein motifs which I am posting below (perhaps will save another newbie like myself a few minutes). It works real well and the returns on the queries are quite impressive. To wit:

    Code:
    AA => nt   (. = any base; [XY] = X or Y)

    G.........:  GG.              S.........:  TC.
    A.........:  GC.              T.........:  AC.
    G/A.......:  G[GC].           S/T.......:  [AT]C.
    V.........:  GT.              C.........:  TG[TC]
    G/A/V.....:  G[GCT].
    L.........:  CT. OR TT[AG]
    L/V.......:  [GC]T.
    P.........:  CC.

    I.........:  AT[TCA]          F.........:  TT[TC]
    M.........:  ATG              Y.........:  TA[TC]
    I/M.......:  AT.              F/Y.......:  T[TA][TC]
    L/I/M.....:  [CA]T.           W.........:  TGG
    F/L.......:  TT.

    K.........:  AA[AG]
    N.........:  AA[TC]
    K/N.......:  AA.              D.........:  GA[TC]
    R.........:  AG[AG]           E.........:  GA[AG]
    K/R.......:  A[AG][AG]        D/E.......:  GA.

    H.........:  CA[TC]
    Q.........:  CA[AG]
    Q/H.......:  CA.
    K/N/Q/H...:  [AC]A.

    Note: S is also encoded by AG[TC] and R by CG., and the combined L rows omit TT[AG]; the patterns above cover only the codon families shown.
    Say, for example, I have a highly conserved region on my protein of interest such as the following:

    anoCar_cp TMGALLYKHSDLEERVGG

    Bold indicates a highly conserved position. My query, to reduce the number returned, may look like this:

    Query: (T/S)M(A/V/G)(A/V/G)LL(Y/F)(K/R)-(S/T)--EE(R/K)(V/A/G)GG

    In this way I can look for T or S in a given position, and so on. The resulting grep command is made simply by cutting and pasting from the above table:

    Code:
    grep "[AT]C.ATGG[GCT].G[GCT].CT.CT.T[TA][TC]A[GA][AG]...[AT]C.......GA.GA.A[AG][AG]G[GCT].GG.GG" filename.fq
    returning (after translation):

    ..TMGALLYKHSDLEERVGG..
    ..SMVVLLYKHSDLEEKVGG..
    ..TMAALLFRHTELEERAGG..
    ..SMGVLLFKHTELDERAGG..
    ..etc
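The table lookup and the translation step can also be scripted. What follows is only a minimal Python sketch of the same idea, not the method used above: it expands each query position into every codon of the standard genetic code (so it also picks up the AG[TC] serine and CG. arginine codons), searches both strands, and translates the hits. The function names are made up for illustration, and `fastq_sequences` assumes plain four-line FASTQ records.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_query(query):
    """Turn '(T/S)M-..' into a list of allowed residue sets (None = any)."""
    positions = []
    for group, single, gap in re.findall(r"\(([A-Z/]+)\)|([A-Z])|(-)", query):
        if gap:
            positions.append(None)
        elif single:
            positions.append({single})
        else:
            positions.append(set(group.split("/")))
    return positions


def query_to_regex(query):
    """Build a nucleotide regex with one codon alternation per position."""
    parts = []
    for allowed in parse_query(query):
        if allowed is None:
            parts.append("...")  # unconstrained position: any three bases
        else:
            codons = sorted(c for c, aa in CODON_TABLE.items() if aa in allowed)
            parts.append("(?:" + "|".join(codons) + ")")
    return re.compile("".join(parts))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(nt):
    """Translate in frame from the first base; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def find_hits(reads, query):
    """Return (strand, peptide) for every match on either strand of each read."""
    pattern = query_to_regex(query)
    hits = []
    for read in reads:
        read = read.strip().upper()
        for strand, seq in (("+", read), ("-", reverse_complement(read))):
            for match in pattern.finditer(seq):
                hits.append((strand, translate(match.group())))
    return hits


def fastq_sequences(path):
    """Yield only the sequence line of each record (assumes 4-line FASTQ)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()
```

With a file on disk this would be called as `find_hits(fastq_sequences("filename.fq"), query)`, where `query` is the string from the "Query:" line above. Unlike the hand-built grep pattern, a position written as E matches only GA[AG]; write (D/E) to allow both.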

    This may not address your problem but at the level at which I find myself, and the necessity to discover a yet un-annotated paralog that I think has been missed in several genomes, it is a first approach to the data. I also can observe quality scores, the number of repeats for certain reads, etc, and get a feel for underlying data.

    My next step is to trim this test set and truncate low quality scores (thinking about ngs_backbone or similar approach). But like I said, I'm 100 hrs out from even knowing what to ask at this point - I have much reading and trial and error to go before I begin to appreciate the dedication of the people who visit and post here. I am not completely removed from programming but when it comes to RNA-seq I couldn't be more of a newbie - hopefully after several months I will have progressed to the point in which I can align my transcriptome with some confidence.
    Last edited by Louis_Lemire; 07-15-2011, 06:40 PM. Reason: err in grep code; updating


    • #3
      Normalization of ChIP-seq data

      Hello All,
      I have two different ChIP-seq samples. One is the wild type (untreated) and the other is the treated sample.
      After the treatment I see a reduction in tag numbers, so I want to analyse whether this is due to the treatment or to some other cause.

      Is there some way to normalize the ChIP-seq data to make the two samples comparable?

      Or is it okay to compare the two data sets without any normalization?
      Thanks.


      • #4
        Normalization of ChIP-seq data

        Hello All,
        I have two different ChIP-seq samples. One is the wild type (untreated) and the other is the treated sample.
        After the treatment, I see an increase in the signal in my region of interest. However, the two samples were sequenced at different times, and in between the Illumina sequencing was upgraded to give deeper sequencing, so we are getting more reads in the second sample. I want to analyse whether the increase is due to the treatment or whether we are simply getting more tags because of the sequencing upgrade.
        Is there some way to normalise the ChIP-seq data to make the two samples comparable?
        Thanks.
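One common first step when two samples differ in sequencing depth is to scale the tag counts to reads per million mapped reads before comparing them. The sketch below shows only that arithmetic; it is not a recommendation taken from this thread, the function names are made up, and the numbers are hypothetical. Depth scaling assumes most of the genome-wide signal is unchanged between samples, so it cannot by itself separate a treatment effect from a difference in ChIP efficiency.

```python
def reads_per_million(region_count, total_mapped_reads):
    """Scale a raw tag count by the sample's sequencing depth."""
    return region_count * 1_000_000 / total_mapped_reads


def depth_normalized_ratio(treated_count, treated_total,
                           control_count, control_total):
    """Treated / control signal after scaling both samples to RPM."""
    treated = reads_per_million(treated_count, treated_total)
    control = reads_per_million(control_count, control_total)
    return treated / control


# Hypothetical numbers: the treated sample has twice as many tags in the
# region of interest, but it was also sequenced 2.5 times deeper.
wild_type = reads_per_million(150, 10_000_000)
treated = reads_per_million(300, 25_000_000)
ratio = depth_normalized_ratio(300, 25_000_000, 150, 10_000_000)
print(wild_type, treated, ratio)
```

In this made-up case the raw count doubles while the depth-scaled signal actually drops, which is why comparing raw tag numbers across runs of different depth can mislead.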


        • #5
          You can use the following programs for "de novo" motif searches:

          MEME: use the web interface or, if you have a large number of sequences, install it locally. http://meme.nbcr.net/meme/intro.html There is even a specific version for ChIP data.

          SCOPE: http://genie.dartmouth.edu/scope/


          Originally posted by sikidiri View Post
          Hello,
          I have a set of DNA sequences (from a non-genic region) in which I am searching for a motif. I have no idea what motif width I should look for. Which other parameters are important for getting a better result? Please suggest.
          Thanks.


          • #6
            Could anyone tell us what the use of a de novo motif search is?


            • #7
              Originally posted by nicedad View Post
              Could anyone tell us what the use of a de novo motif search is?
              Suppose you have domains where a protein is associated with DNA. A de novo motif search will help you figure out what sort of sequences are important for its association. That's probably the most common situation, but you could easily think of others regarding methylation changes and such.
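To make the idea concrete, here is a toy Python sketch. It is not how MEME works (MEME fits a probabilistic motif model by expectation maximization); it only counts, for each candidate width, the exact k-mer found in the largest number of input sequences. The function names and the three sequences are made up for illustration.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Count how many sequences contain each k-mer at least once."""
    support = Counter()
    for seq in sequences:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support


def best_kmer_per_width(sequences, widths):
    """For each width, return the k-mer present in the most sequences."""
    result = {}
    for k in widths:
        support = kmer_support(sequences, k)
        if support:
            # Ties are broken alphabetically so the result is deterministic.
            kmer, count = max(support.items(), key=lambda kv: (kv[1], kv[0]))
            result[k] = (kmer, count)
    return result


# Hypothetical bound regions that all share the 7-mer TGACTCA.
regions = ["AAGTGACTCAGG", "CCTGACTCATTT", "GTGACTCAC"]
best = best_kmer_per_width(regions, range(6, 10))
for width, (kmer, count) in best.items():
    print(width, kmer, count)
```

Scanning a range of widths and watching where the support drops off is one crude way to guess a motif width when none is known in advance. Real tools also tolerate mismatches and score motifs against a background model, which exact k-mer counting does not.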


              • #8
                Thanks, dpryan. Is motif search applicable to transcriptome data, or only to DNA?
                For instance, I found only two motifs in my transcriptome, what does this mean?
                Best,


                • #9
                  Motif search tools are no different than microarrays or Western blots or any other experimental tool. You can use them on anything, but the results are only meaningful if you're asking a meaningful question that's answerable by the tool. Whether or not the motifs you found are meaningful will depend on the exact experiment that resulted in the transcriptome.


                  • #10
                    Thanks a lot, dpryan, this is helpful.
                    best,
