
**Louis_Lemire** · 07-15-2011, 06:28 PM

Using grep for motif searching of raw Illumina data

sikidiri -

This is my first post. I am probably 100 hrs away from asking a reasonable question, but the monitor of this forum prompted me to say something. I am glad this forum is here. The contributors are especially knowledgeable and impressive, so it'll be a while before I can contribute, but hopefully when I am in a corner I can find some assistance here.

I am currently getting under way with Bio-Linux 6, and coming from Windows it's been a chore, but I am doing my best to start from the beginning. Right now I am looking at parakeet Illumina data from the web, with reads of around 90 bp plus quality scores, etc.

In my upcoming workflow I have the chicken genome to align against, but first I must learn how to trim the low-quality reads and perhaps remove duplicate reads, as I have found there are at times several when querying the underlying data.

I am looking for an enzyme that is a paralog (hypothesis) of an existing enzyme found in birds and reptiles (I have several candidate parent enzymes to go on). These enzymes are highly conserved in their framework across orthologs, so I have certain regions of the protein from which I can search. The enzyme was likely the result of a gene duplication event over 400 Myr ago, so there is a chance it may not align against the chicken genome. Right now I am waiting for my Illumina reads from Emory Univ. We harvested retina tissue and the DNA looked pretty good going in.

I have been very successful in finding many exons from these proteins by using the grep (pattern) command against raw parakeet Illumina data. I have found 2-3 exons that match highly conserved regions of my unknown protein and also appear to have high quality scores when I look at them (Q20-Q35). This approach strikes me as reasonable as a first "look and see" at the raw reads. If the queries are designed across the conserved regions, I generally obtain about 15 hits over 22 gigs of reads, with each query taking about 10-15 minutes. I am running a Mac Pro with OS X. Right now I have 32 gigs of RAM but Bio-Linux (via VirtualBox) only sees 16 gigs, a problem I still need to research.

I made up a quick table for searching protein motifs which I am posting below (perhaps will save another newbie like myself a few minutes). It works real well and the returns on the queries are quite impressive. To wit:

Code:

AA => nt

G.............:  GG.               S.............:  TC.
A.............:  GC.               T.............:  AC.
G/A...........:  G[G/C].           S/T...........:  [A/T]C.
V.............:  GT.               C.............:  TG[T/C]
G/A/V.........:  G[G/C/T].
L.............:  CT. OR TT[A/G]
L/V...........:  [G/C]T.
P.............:  CC.

I.............:  AT[T/C/A]         F.............:  TT[T/C]
M.............:  ATG               Y.............:  TA[T/C]
I/M...........:  AT.               F/Y...........:  T[T/A][T/C]
L/I/M.........:  [C/A]T.           W.............:  TGG
F/L...........:  TT.

K.............:  AA[A/G]
N.............:  AA[T/C]
K/N...........:  AA.               D.............:  GA[T/C]
R.............:  AG[A/G]           E.............:  GA[A/G]
K/R...........:  A[A/G][A/G]       D/E...........:  GA.

H.............:  CA[T/C]
Q.............:  CA[A/G]
Q/H...........:  CA.
K/N/Q/H.......:  [A/C]A.
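For anyone who would rather generate such patterns than paste them, here is a minimal Python sketch that builds them from the standard genetic code. It is an illustration, not the poster's method, and the names `residue_pattern` and `motif_pattern` are made up for this example. Unlike a hand-made shortcut table, it lists every codon for the six-fold degenerate residues (S is also AG[T/C], R is also CG.), at the cost of a longer pattern. The output uses parentheses and `|`, so with grep it would need `grep -E` rather than plain `grep`.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def residue_pattern(residues):
    """Return an alternation of every codon encoding any residue in `residues`."""
    codons = sorted(c for c, aa in CODON_TABLE.items() if aa in residues)
    if not codons:
        raise ValueError(f"no codons for residues {residues!r}")
    return "(" + "|".join(codons) + ")"


def motif_pattern(positions):
    """Join per-position patterns; None means any residue (three wildcard bases)."""
    return "".join("..." if p is None else residue_pattern(p) for p in positions)


# Hypothetical short motif: T or S, then M, then any residue, then W.
pattern = motif_pattern(["TS", "M", None, "W"])
print(residue_pattern("S"))  # (AGC|AGT|TCA|TCC|TCG|TCT)
print(bool(re.fullmatch(pattern, "AGCATGNNNTGG")))  # True
```

Listing whole codons keeps each position exact: a shortcut such as `A[AG][AG]` for K/R misses the CG. codons for arginine, which the full alternation includes.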

The format is a bit off. Say, for example, I have a highly conserved region on my protein of interest such as follows:

anoCar_cp TMGALLYKHSDLEERVGG

Bold indicates a highly conserved position. My query, to reduce the number returned, may look like this:

Query: (T/S)M(A/V/G)(A/V/G)LL(Y/F)(K/R)-(S/T)--EE(R/K)(V/A/G)GG

In this way I can look for T or S in a given position, and so on. The resulting grep command is made simply by cutting and pasting from the above table:

Code:

grep "[A/T]C.ATGG[G/C/T].G[G/C/T].CT.CT.T[T/A][T/C]A[G/A][A/G]...[A/T]C.......GA.GA.A[A/G][A/G]G[G/C/T].GG.GG" filename.fq

returning (after translation):

..TMGALLYKHSDLEERVGG..
..SMVVLLYKHSDLEEKVGG..
..TMAALLFRHTELEERAGG..
..SMGVLLFKHTELDERAGG..
..etc
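Three things about a plain grep over a .fq file are worth knowing. It searches every line, including headers and quality strings. A `/` inside square brackets is a literal character that grep will also accept at that position, and `/` can occur in quality strings (for example under Phred+33 encoding). And a read sequenced from the opposite strand carries the reverse complement of the pattern, so it is missed. The Python sketch below is one hedged way to handle all three; `scan_fastq` and `strip_slashes` are hypothetical helper names, and it assumes plain four-line FASTQ records with uppercase bases.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def strip_slashes(pattern):
    """Turn table-style classes such as [A/T] into [AT]."""
    return re.sub(
        r"\[([^\]]*)\]",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        pattern,
    )


def scan_fastq(lines, pattern):
    """Yield (header, strand, sequence, quality) for reads matching on either strand.

    Assumes plain four-line FASTQ records (header, sequence, '+', quality).
    """
    regex = re.compile(pattern)
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        header, seq, _, qual = record
        record = []
        if regex.search(seq):
            yield header, "+", seq, qual
        elif regex.search(reverse_complement(seq)):
            # Report the read in motif orientation, with qualities reversed to match.
            yield header, "-", reverse_complement(seq), qual[::-1]


# Tiny made-up example: the second read carries the motif on the opposite strand.
demo = ["@r1", "ACGATGTGG", "+", "IIIIIIIII",
        "@r2", "CCACATCGT", "+", "ABCDEFGHI"]
hits = list(scan_fastq(demo, strip_slashes("A[T/C]GTGG")))
for hit in hits:
    print(*hit, sep="\t")
```

For a real file, pass an open file handle instead of the `demo` list, since the function only iterates over lines.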

This may not address your problem, but at the level at which I find myself, and given the need to discover an as yet unannotated paralog that I think has been missed in several genomes, it is a first approach to the data. I can also observe quality scores, the number of repeats for certain reads, etc., and get a feel for the underlying data.

My next step is to trim this test set and truncate low quality scores (thinking about ngs_backbone or a similar approach). But like I said, I'm 100 hrs out from even knowing what to ask at this point. I have much reading and trial and error to go before I begin to appreciate the dedication of the people who visit and post here. I am not completely removed from programming, but when it comes to RNA-seq I couldn't be more of a newbie. Hopefully after several months I will have progressed to the point at which I can align my transcriptome with some confidence.
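To make the trimming and duplicate-removal steps concrete, here is a minimal Python sketch. It is not how ngs_backbone or any other tool works; it only shows the idea of cutting low-quality bases from the 3' end and keeping one copy of each identical read. The function names, the Q20 cutoff, and the Phred+33 default are assumptions for illustration, and older Illumina pipelines wrote Phred+64, so check the `offset` against the data.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def drop_duplicates(reads):
    """Yield (sequence, quality) pairs, keeping the first copy of each exact sequence."""
    seen = set()
    for seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            yield seq, qual


# Made-up reads: 'I' decodes to Q40 and '#' to Q2 under Phred+33.
print(trim_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
unique = list(drop_duplicates([("ACGT", "IIII"), ("ACGT", "IIHH"), ("TTGA", "IIII")]))
print(len(unique))  # 2
```

Trimming only from the 3' end reflects the usual drop in quality toward the end of a read; a windowed or running-sum trimmer would be stricter.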

**sikidiri** · 10-21-2011, 12:30 AM

Normalization of chip-seq data

Hello All,
I have two different ChIP-seq samples. One is the wild type (untreated) and the other is a treated sample.
After the treatment I see a reduction in tag numbers, so I want to analyse whether this is an effect of the treatment or due to some other reason.

Do you think there is some way to normalize the ChIP-seq data in order to make the two samples comparable?

Or is it okay to compare the two data sets without any normalization?
Thanks.

**sikidiri** · 10-21-2011, 02:23 AM

Normalization of chip-seq data

Hello All,
I have two different ChIP-seq samples. One is the wild type (untreated) and the other is a treated sample.
After the treatment, I see an increase in the signal on my region of interest. However, the two samples were sequenced at different times, and in between the Illumina sequencing was upgraded to give deeper sequencing, so we are getting more reads in the second sample. I want to analyse whether the increase is an effect of the treatment or whether we are simply getting more tags because of the sequencing upgrade.
Do you think there is some way to normalise the ChIP-seq data in order to make the two samples comparable?
Thanks.
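One common first step when two libraries differ in depth is to scale tag counts by the total number of mapped reads, for example as reads per million (RPM). The Python sketch below illustrates only that arithmetic. The function names and all numbers are made up, and depth scaling alone does not correct for differences in IP efficiency or background, which usually call for input controls or a dedicated tool.

```python
def reads_per_million(region_count, total_mapped):
    """Scale a tag count in a region by library size."""
    if total_mapped <= 0:
        raise ValueError("total_mapped must be positive")
    return region_count * 1_000_000 / total_mapped


def depth_normalized_ratio(treated_count, treated_total, control_count, control_total):
    """Ratio of treated to control signal after reads-per-million scaling."""
    treated = reads_per_million(treated_count, treated_total)
    control = reads_per_million(control_count, control_total)
    return treated / control


# Hypothetical numbers: three times the tags in the region, but also three
# times the sequencing depth, so the scaled signal is unchanged.
print(reads_per_million(200, 10_000_000))                          # 20.0
print(depth_normalized_ratio(600, 30_000_000, 200, 10_000_000))    # 1.0
```

A ratio near 1 after scaling suggests the raw increase mostly reflects deeper sequencing, while a ratio well above 1 is worth following up with replicates and controls.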

**GenoMax** · 10-21-2011, 04:44 AM

You can use the following programs for "de novo" motif searches:

MEME. Use the web interface or, if you have a large number of sequences, install it locally. http://meme.nbcr.net/meme/intro.html There is even a specific version for ChIP data.

SCOPE: http://genie.dartmouth.edu/scope/

Originally posted by sikidiri:

Hello,
I have a set of DNA sequences (from non-genic regions) in which I am searching for a motif. I have no idea what motif width I should look for. What other parameters are important for getting a better result? Please suggest.
Thanks.

**nicedad** · 12-07-2011, 06:38 AM

Could anyone tell us what the use of a de novo motif search is?

**dpryan** · 12-07-2011, 08:07 AM

Originally posted by nicedad:

Could anyone tell us what the use of a de novo motif search is?

Suppose you have domains where a protein is associated with DNA. A de novo motif search will help you figure out what sort of sequences are important for its association. That's probably the most common situation, but you could easily think of others regarding methylation changes and such.
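As a toy illustration of the idea, the Python sketch below counts how many of a set of bound regions contain each k-mer, over a few candidate widths. This is not how MEME or SCOPE work, since real tools fit probabilistic models and compare against a background. The sequences and function names are made up for the example.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set makes each k-mer count once per sequence.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def top_kmers(sequences, widths=(6, 7, 8), n=1):
    """Return the n best-supported k-mers for each candidate width."""
    return {k: kmer_support(sequences, k).most_common(n) for k in widths}


# Hypothetical bound regions sharing the 7-mer TGACTCA.
regions = ["AATGACTCATT", "CCTGACTCAGG", "GTGACTCAAC"]
print(kmer_support(regions, 7)["TGACTCA"])  # 3
print(top_kmers(regions, widths=(7,)))      # {7: [('TGACTCA', 3)]}
```

Trying several widths is a crude way around not knowing the motif width in advance; exact k-mers cannot capture degenerate positions, which is one reason dedicated motif finders exist.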

**nicedad** · 12-08-2011, 04:11 AM

Thanks dpryan. Is motif search applicable to a transcriptome, or only to DNA?
For instance, I found only two motifs in my transcriptome. What does this mean?
Best,

**dpryan** · 12-08-2011, 05:07 AM

Motif search tools are no different than microarrays or Western blots or any other experimental tool. You can use them on anything, but the results are only meaningful if you're asking a meaningful question that's answerable by the tool. Whether or not the motifs you found are meaningful will depend on the exact experiment that resulted in the transcriptome.

**nicedad** · 12-08-2011, 06:03 AM

Thanks a lot dpryan, this is helpful.
best,


De novo DNA motif search
