  • GimmeMotifs: a ChIP-seq motif prediction pipeline

    Hello all,

    As we're working with a lot of ChIP-seq data in our lab, we needed a tool to reliably predict motifs de novo from our peaks. The approach we developed might be useful to others, so I'd like to point you to the website:


    Basically, the approach is to run several different algorithms (as was suggested in some benchmark studies and reviews), and combine the output into a non-redundant list of motifs. Long-time favorites such as MEME and MotifSampler are included, as well as some more recent tools developed for ChIP-seq (or ChIP-chip) data including trawler and MoAn.
    To rank and evaluate the motifs we predict motifs on a part of the dataset, and use the rest for evaluation (enrichment, ROC curve, MNCP score).
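The hold-out idea described above can be sketched roughly as follows. This is only an illustration and not the GimmeMotifs code itself: the function names (`split_peaks`, `roc_auc`), the 20% prediction fraction and the toy scores are all assumptions made for the example.

```python
import random


def split_peaks(peaks, predict_fraction=0.2, seed=42):
    """Shuffle the peaks and split them into a prediction set and a held-out set.

    The fraction and the seed are illustrative choices, not GimmeMotifs defaults.
    """
    shuffled = list(peaks)
    random.Random(seed).shuffle(shuffled)
    n_predict = int(len(shuffled) * predict_fraction)
    return shuffled[:n_predict], shuffled[n_predict:]


def roc_auc(fg_scores, bg_scores):
    """ROC AUC computed as the probability that a randomly chosen foreground
    score is higher than a randomly chosen background score (ties count half)."""
    wins = 0.0
    for f in fg_scores:
        for b in bg_scores:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(fg_scores) * len(bg_scores))


# Toy data: ten peak names, plus made-up best motif scores per sequence.
peaks = ["peak%d" % i for i in range(10)]
predict_set, eval_set = split_peaks(peaks)

fg = [0.9, 0.8, 0.4]  # scores in held-out peak sequences (hypothetical)
bg = [0.5, 0.3, 0.2]  # scores in background sequences (hypothetical)
auc = roc_auc(fg, bg)
print(len(predict_set), len(eval_set), round(auc, 3))
```

Motifs would be predicted on `predict_set` only, and the scores for the evaluation would come from scanning `eval_set` and a background set with each predicted motif.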

    You can see an example of the output here (this is for a ChIP-seq experiment with the transcription factor p63):


    The package is implemented in Python, and can be freely downloaded. Installation is somewhat of a hassle as all the different tools need to be installed and configured separately, but other than that I hope that the installation procedure is smooth and documented.

    Please let me know if you find GimmeMotifs useful, have any questions or notice any bugs or omissions in the documentation.

    Simon

  • #2
    Hi Simon - this looks pretty neat, I'm installing it now and will pester you with questions!

    • #3
      The first problem I've overcome is a strange incompatibility between parallel python (python-pp version 1.5.7-1) and numpy when using the Ubuntu 10.04 repository versions. I solved this by installing version 1.6.0-RC5 of parallel python from here, and I am now up and running the included example using meme, Weeder, MDmodule and gadem.

      Which version of parallel python are you developing with? It could be a bug specific to my system, as it hasn't had a clean install since Ubuntu 8.04.

      • #4
        Hmm that's strange. I'm using version 1.5.7 of pp in combination with numpy version 1.4.1, and that works fine. Which version of numpy is in the Ubuntu repositories? Are you running Python 2.6?
        Was it similar to this bug: http://www.parallelpython.com/compon...9/topic,413.0?

        Let me know if using pp 1.6.0 resolves the issue.

        • #5
          Yeah, that link is where I got the idea to install pp 1.6.0. Ubuntu's numpy is only version 1.3.0; if I have more trouble I'll try upgrading that next. This is all using Python 2.6.

          I've run into a few bugs in gimmemotifs that I'm fixing along the way, so you should see a pull request on your GitHub soon! (though I'm no Python developer)

          • #6
            OK, I've gotten it to successfully run the included example. What I had to do was remove the Ubuntu versions of numpy (and therefore matplotlib), scipy and parallel python, and install these from source:

            numpy-1.4.1
            scipy-0.8.0rc1
            pp-1.5.7 (doesn't work with pp-1.6.0rc5)
            matplotlib-0.99.3

            It's now running on one of my .bed files output from MACS. I had to trim it down to a 3-column BED to get it to work; what does gimmemotifs use the 4th column for?
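For anyone hitting the same problem, trimming a BED file down to its first three columns can be sketched like this. The function names and the example lines are made up for illustration; real MACS output may contain different extra columns.

```python
def trim_bed_line(line, n_columns=3):
    """Keep only the first n tab-separated columns of one BED line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:n_columns])


def trim_bed(lines, n_columns=3):
    """Trim every data line; skip blank lines and track/browser/comment lines."""
    trimmed = []
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        trimmed.append(trim_bed_line(line, n_columns))
    return trimmed


# Hypothetical five-column peak lines with a name and a score.
lines = [
    "track name=macs_peaks\n",
    "chr1\t100\t500\tMACS_peak_1\t53.2\n",
    "chr2\t10\t90\tMACS_peak_2\t12.0\n",
]
trimmed = trim_bed(lines)
print("\n".join(trimmed))
```

To process a real file, the same function can be fed an open file handle, since it only iterates over lines.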

            But so far this looks pretty useful, thanks for releasing it.

            • #7
              Thanks for finding and fixing some of the bugs.

              I will have a look at the input format. I should fix it so that any file in valid BED format is accepted. The fourth column is used to sort the peaks (we usually have the number of reads in there). This is for the benefit of MDmodule, which actually uses the ranking of the sequences in the motif search. However, if there is no numerical value in the fourth column, it should just be left unused, instead of choking on that input.
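The intended behaviour could look roughly like the sketch below: sort on the fourth column when every line has a numeric value there, and otherwise leave the input order alone. This is not the actual GimmeMotifs code; `parse_score` and `order_peaks` are hypothetical names, and sorting with the highest value first is an assumption.

```python
def parse_score(fields):
    """Return the fourth column as a float, or None if it is missing or not numeric."""
    if len(fields) < 4:
        return None
    try:
        return float(fields[3])
    except ValueError:
        return None


def order_peaks(lines):
    """Rank peaks by the fourth column (highest first, an assumed convention).

    If any line lacks a numeric fourth column, the original order is kept.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    scores = [parse_score(row) for row in rows]
    if any(score is None for score in scores):
        return rows
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return [rows[i] for i in order]


with_counts = ["chr1\t100\t300\t12\n", "chr1\t500\t700\t87\n", "chr2\t50\t250\t40\n"]
with_names = ["chr1\t100\t300\tpeak_a\n", "chr1\t500\t700\tpeak_b\n"]

ranked = order_peaks(with_counts)
untouched = order_peaks(with_names)
print([row[3] for row in ranked], [row[3] for row in untouched])
```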

              • #8
                Please add an entry in the software wiki; otherwise you're stuck with what I put there!

                • #9
                  Ah, yes, that was on my to-do list, it's good to be reminded. Done.

                  • #10
                    I just wanted to let you know that GimmeMotifs has been accepted for publication in Bioinformatics:
                    doi: 10.1093/bioinformatics/btq636.

                    The installation procedure has been simplified, and packages for Ubuntu, Debian and Fedora are now available. If you need motif prediction for ChIP-seq data, give it a try and let me know what you think: http://www.ncmls.nl/bioinfo/gimmemotifs/.

                    • #11
                      Hi Simon,

                      First of all, thank you for the tool. I am now preparing to try it out, but since my data is a tad tricky I was wondering if you could give some hints on how best to set up the run.

                      The issue is that the peaks are not from ChIP-seq but from DamID-seq. This means that the motif is not necessarily located in the middle of the peak, and the peaks (if one can call them that) can be quite broad (from 100bp to >5kb). This is for a transcription factor, btw.

                      So the question is, do you have any recommendations for analysing data from this type of experiment (or similar)? At the moment I am selecting peaks shorter than 1kb to use as input.

                      • #12
                        This is indeed trickier than a typical ChIP-seq run, but most likely not impossible. Basically there are two important things here. The first is the fact that the motif is not located in the center of the peak. Most motif programs that are run by GimmeMotifs do not take the location of the motif in the sequence into account. However, by default GimmeMotifs truncates the input sequences to 200 basepairs. This is probably too strict in your case, so I would change the -w parameter to 1000 to use 1kb sequences for searching. Otherwise, even if your input sequences are 1kb, only 200bp would be used as input.
                        The second is the "peak" size. If you have enough regions smaller than 1kb, I would indeed use these for motif searching. You can always check the presence of the motif in the larger sequences later. Otherwise you can just use all regions as input, as GimmeMotifs will truncate the larger sequences. If there are enough sequences that contain a motif, this should not be that big of a problem.
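To make the effect of the width setting concrete, here is a small sketch of selecting regions by length and truncating the longer ones to a fixed width. The thread does not say which part of a region GimmeMotifs keeps, so truncating around the midpoint is an assumption, and `select_regions` and `truncate_region` are hypothetical helper names.

```python
def select_regions(regions, max_length=1000):
    """Keep only regions that are at most max_length bp long."""
    return [r for r in regions if r[2] - r[1] <= max_length]


def truncate_region(chrom, start, end, width=200):
    """Truncate a region to `width` bp around its midpoint.

    Regions that are already `width` bp or shorter are returned unchanged.
    Centring on the midpoint is an assumption made for this example.
    """
    if end - start <= width:
        return chrom, start, end
    center = (start + end) // 2
    new_start = center - width // 2
    return chrom, new_start, new_start + width


# Hypothetical DamID-seq regions of very different sizes.
regions = [("chr1", 1000, 2000), ("chr1", 5000, 11000), ("chr2", 100, 250)]

short_enough = select_regions(regions, max_length=1000)
default_width = [truncate_region(*r) for r in short_enough]             # 200 bp
wide = [truncate_region(*r, width=1000) for r in short_enough]          # 1 kb
print(default_width)
print(wide)
```

With the 200 bp width only the middle fifth of the 1kb region survives, while a width of 1000 keeps it whole.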

                        • #13
                          Thanks a lot for the suggestions. After posing the question, I selected regions up to 500bp and also up to 1kb (always setting the -w parameter), and got similar motifs with both, which is comforting. pwmscan.py also came in handy.

                          Just another couple of things:

                          1. I looked at the manual, but could not find a description of the output of pwmscan.py.

                          2. The results I have for my best motif look good from my interpretation of the report. Is this correct? Here are the results:
                          random
                          enrichment 6.00
                          p-value 0.00
                          ROC_AUC 0.703
                          MNCP 4.116

                          genomic_matched
                          enrichment 2.25
                          p-value 0.00
                          ROC_AUC 0.695
                          MNCP 1.808


                          The p-value=0 is the one that is bugging me.
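One possible explanation, which nobody in the thread confirms: a very small p-value printed with two decimals simply shows up as 0.00. The sketch below only illustrates that rounding effect. The value `3.2e-9` is made up, and whether the report uses this kind of fixed-point formatting is an assumption.

```python
def format_fixed(p, decimals=2):
    """Format a p-value with a fixed number of decimals, as a report table might."""
    return "%.*f" % (decimals, p)


def format_scientific(p, decimals=2):
    """Format a p-value in scientific notation so that small values stay visible."""
    return "%.*e" % (decimals, p)


p_value = 3.2e-9  # hypothetical, very significant p-value

fixed = format_fixed(p_value)
scientific = format_scientific(p_value)
print(fixed, scientific)
```

If that is what happens here, a displayed p-value of 0.00 would mean "smaller than 0.005" rather than exactly zero.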

                          • #14
                            Dear Simon


                            We are contacting you as users of your GimmeMotifs pipeline.
                            We are trying to use the roc.py and cluster.py scripts with a file (PWMFILE) which is not derived from GimmeMotifs. Instead, the matrix I am trying to run is composed of results I've got from other predictor scripts. The error message I get when trying to run the ROC script is:

                            command:
                            gimme roc -o kentaro_roc.pdf kentaro2_julio2016 nuevalista_junio2016.fasta 10000_random_promoters_1500pb_masked_not_E011.fasta

                            error:
                            failed to initialize cache
                            global name 'make_region' is not defined
                            Traceback (most recent call last):
                              File "/tools/anaconda2/bin/gimme", line 469, in <module>
                                args.func(args)
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/commands/roc.py", line 40, in roc
                                for scores in s.best_score(fg_file):
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 270, in best_score
                                for matches in self.scan(seqs, 1, scan_rc, cutoff=0):
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 355, in scan
                                for result in it:
                              File "/tools/anaconda2/lib/python2.7/site-packages/gimmemotifs/scanner.py", line 418, in _scan_sequences
                                motif_digest = self.checksum[motif_file]
                            KeyError: 'kentaro2_julio2016.txt'


                            In a previous version of GimmeMotifs I was able to do this, but I noticed that after the GimmeMotifs update the input file (PWMFILE) is not recognized. I have attached the matrix in question so that you can see where the error could be.
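When a motif file comes from other tools, a quick sanity check of its layout can at least rule out a malformed file. The sketch below does not explain the KeyError above and does not use any GimmeMotifs function. It assumes the common layout of a `>` header line with the motif name followed by rows of four numbers (A, C, G, T); `read_pwm` is a hypothetical helper name.

```python
def read_pwm(lines):
    """Parse motifs from lines in an assumed '>name' + four-column matrix layout.

    Returns a dict mapping motif name to a list of [A, C, G, T] rows and
    raises ValueError on the first line that does not fit the layout.
    """
    motifs = {}
    current = None
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        if line.startswith(">"):
            current = line[1:].strip()
            if not current:
                raise ValueError("line %d: empty motif name" % lineno)
            motifs[current] = []
            continue
        if current is None:
            raise ValueError("line %d: matrix row before any '>' header" % lineno)
        values = line.split()
        if len(values) != 4:
            raise ValueError("line %d: expected 4 columns, got %d" % (lineno, len(values)))
        motifs[current].append([float(v) for v in values])
    return motifs


# Made-up example content; a real file would be opened and passed in instead.
example = [
    ">motif_1",
    "0.90\t0.03\t0.03\t0.04",
    "0.10\t0.10\t0.70\t0.10",
    ">motif_2",
    "0.25\t0.25\t0.25\t0.25",
]
motifs = read_pwm(example)
print(sorted(motifs), len(motifs["motif_1"]))
```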

                            To try to bypass this problem, I started from the very beginning and ran the whole GimmeMotifs pipeline (including all predictors). However, at the step where I have to provide a FASTA file with the whole genome sequence to use as background, I failed to index the whole tomato genome (my samples are from this species). The error message I got this time is:

                            command:
                            gimme background -i SL_todoscrom.fa -f SL.fa -g 2.3 -n 1

                            error:
                            background: error: too few arguments


                            Thank you very much in advance for your help with this. Your comments and suggestions are more than welcome.

                            Best wishes,
