Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • sudders
    Member
    • Dec 2011
    • 32

    Introducing CGAT: computational genomics analysis toolkit

    We have just published CGAT: computational genomics analysis toolkit in Bioinformatics.

    The toolkit grew out of the code base that we have developed over the years working on many genomics and NGS projects. It currently contains over 50 tools covering tasks such as converting, filtering, comparison, conversion, summrization and annotation of genomic intervals, genesets and sequences.

    The tools follow a common command-line interface, and can be integrated into workflow tools such as galaxy.

    Full documentation, including examples, recipes and tool reference is here.

    Installation of the released version on a linux system should be as easy as:
    Code:
    pip install cgat
    For more detailed instructions and troubleshooting see here.

    The project is being actively developed and we welcome community input. The actively developed code is hosted on the github repository at:
    Do not use - please refer to our newest code: https://github.com/cgat-developers/cgat-apps - CGATOxford/cgat
    Last edited by sudders; 01-09-2014, 06:02 AM. Reason: Fixed documentation link
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    FYI: The documentation link is not working. http://www.cgat.org/~andreas/documen.../cgat/cgat.htm

    Comment

    • sudders
      Member
      • Dec 2011
      • 32

      #3
      Sorry, should now be fixed.

      Comment

      • crazyhottommy
        Senior Member
        • Apr 2012
        • 187

        #4
        Hi thanks for making this tool
        could you please give the R script for making the average plot and the heatmap for bam2geneprofile


        Tommy

        Comment

        • sudders
          Member
          • Dec 2011
          • 32

          #5
          Hi Tommy,

          Thanks for your interest.

          Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

          You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

          What is the binding profile of NFKB across gene models?

          If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

          Code:
          > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
          
          > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
          
          > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
          
          > lines(profile_chip$bin, profile_chip$counts, col = "blue")
          
          > lines(profile_input$bin, profile_input$counts, col = "red")
          
          > abline(v = c(1000, 2000), lty = 2)
          
          > mtext("upstream", adj = 0.1)
          
          > mtext("exons", adj = 0.5)
          
          > mtext("downstream", adj = 0.9)
          The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

          bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

          Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

          Code:
          > library( gplots )
          
          > library( RColorBrewer )
          
          > # read the H3K4me3 matrix into R
          > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
          
          > # convert to matrix
          > me3.matrix <- as.matrix( me3 )
          
          > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
          > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
          
          > # the remainder are plotted
          > cols <- brewer.pal( 9, "Blues" )
          
          > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
          
          > # A second plot can be produced for the H3K4me1 data
          > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
          
          > me1.matrix <- as.matrix( me3 )
          
          > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
          
          > cols <- brewer.pal( 9, "Greens" )
          
          > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
          I hope this helps. Do let me know if I can help further.

          Ian
          ---

          Comment

          • crazyhottommy
            Senior Member
            • Apr 2012
            • 187

            #6
            Hi Ian,

            Thank you so much for your kind reply. meta gene plot is becoming very routine in the NGS papers. Your tool is very helpful.

            I was using Homer + R for heatmap and HTSeq for meta-gene profile.

            Thank you again.

            Tommy




            Originally posted by sudders View Post
            Hi Tommy,

            Thanks for your interest.

            Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

            You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

            What is the binding profile of NFKB across gene models?

            If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

            Code:
            > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
            
            > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
            
            > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
            
            > lines(profile_chip$bin, profile_chip$counts, col = "blue")
            
            > lines(profile_input$bin, profile_input$counts, col = "red")
            
            > abline(v = c(1000, 2000), lty = 2)
            
            > mtext("upstream", adj = 0.1)
            
            > mtext("exons", adj = 0.5)
            
            > mtext("downstream", adj = 0.9)
            The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

            bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

            Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

            Code:
            > library( gplots )
            
            > library( RColorBrewer )
            
            > # read the H3K4me3 matrix into R
            > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
            
            > # convert to matrix
            > me3.matrix <- as.matrix( me3 )
            
            > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
            > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
            
            > # the remainder are plotted
            > cols <- brewer.pal( 9, "Blues" )
            
            > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
            
            > # A second plot can be produced for the H3K4me1 data
            > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
            
            > me1.matrix <- as.matrix( me3 )
            
            > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
            
            > cols <- brewer.pal( 9, "Greens" )
            
            > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
            I hope this helps. Do let me know if I can help further.

            Ian
            ---

            Comment

            • crazyhottommy
              Senior Member
              • Apr 2012
              • 187

              #7
              Hi Ian,

              I do have another question. for bam2geneprofile, one needs to provide a gtf file.

              for bam2peakshape, one needs to provide a bed file (generated from MACS)

              If I want to generate an average plot with the ChIP-seq data ( similar to the meta-gene plot, but I am plotting the average on the peak intervals rather than the gene-model), I can still use bam2peakshape, get the matrix (the matrix is used for the heatmap), and calculate the colMeans for each bin, and then plot a line graph.

              Or I can use the bam2geneprofile, but I need to convert the MACS bed file to gtf file first.

              Am I correct?

              Thanks!
              Tommy




              Originally posted by sudders View Post
              Hi Tommy,

              Thanks for your interest.

              Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

              You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

              What is the binding profile of NFKB across gene models?

              If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

              Code:
              > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
              
              > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
              
              > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
              
              > lines(profile_chip$bin, profile_chip$counts, col = "blue")
              
              > lines(profile_input$bin, profile_input$counts, col = "red")
              
              > abline(v = c(1000, 2000), lty = 2)
              
              > mtext("upstream", adj = 0.1)
              
              > mtext("exons", adj = 0.5)
              
              > mtext("downstream", adj = 0.9)
              The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

              bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

              Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

              Code:
              > library( gplots )
              
              > library( RColorBrewer )
              
              > # read the H3K4me3 matrix into R
              > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
              
              > # convert to matrix
              > me3.matrix <- as.matrix( me3 )
              
              > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
              > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
              
              > # the remainder are plotted
              > cols <- brewer.pal( 9, "Blues" )
              
              > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
              
              > # A second plot can be produced for the H3K4me1 data
              > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
              
              > me1.matrix <- as.matrix( me3 )
              
              > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
              
              > cols <- brewer.pal( 9, "Greens" )
              
              > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
              I hope this helps. Do let me know if I can help further.

              Ian
              ---

              Comment

              • sudders
                Member
                • Dec 2011
                • 32

                #8
                Hi Tommy,

                Sorry for the slow reply, i've been away. In future, if you use the CGAT user group (https://groups.google.com/forum/?fro...gat-user-group), your message will go to more people, so someone will reply to you even if i'm not around.

                As to your question:
                If I want to generate an average plot with the ChIP-seq data ( similar to the meta-gene plot, but I am plotting the average on the peak intervals rather than the gene-model), I can still use bam2peakshape, get the matrix (the matrix is used for the heatmap), and calculate the colMeans for each bin, and then plot a line graph.

                Or I can use the bam2geneprofile, but I need to convert the MACS bed file to gtf file first.
                I would recommend using the second of these two methods. The tool bed2gff will do the conversion for you.

                Code:
                  zcat my_bed_file.bed.gz 
                | cgat bed2gff -a 
                | cgat bam2geneprofile --bamfile=my_bam_file.bam --gtffile=- --method=intervalprofile --reporter=transcript
                Along with what ever normalisation and output options you want. The - in --gtfile tells bam2geneprofile to use stdin for the interval file, and the -a on bed2gff tells it to output gtf.

                Let me know if you have any further problems.

                Ian
                ---

                Comment

                • crazyhottommy
                  Senior Member
                  • Apr 2012
                  • 187

                  #9
                  Hi Ian,

                  Thank you very much!

                  Tommy
                  Originally posted by sudders View Post
                  Hi Tommy,

                  Sorry for the slow reply, i've been away. In future, if you use the CGAT user group (https://groups.google.com/forum/?fro...gat-user-group), your message will go to more people, so someone will reply to you even if i'm not around.

                  As to your question:


                  I would recommend using the second of these two methods. The tool bed2gff will do the conversion for you.

                  Code:
                    zcat my_bed_file.bed.gz 
                  | cgat bed2gff -a 
                  | cgat bam2geneprofile --bamfile=my_bam_file.bam --gtffile=- --method=intervalprofile --reporter=transcript
                  Along with what ever normalisation and output options you want. The - in --gtfile tells bam2geneprofile to use stdin for the interval file, and the -a on bed2gff tells it to output gtf.

                  Let me know if you have any further problems.

                  Ian
                  ---

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Pathogen Surveillance with Advanced Genomic Tools
                    by seqadmin




                    The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                    Yesterday, 11:48 AM
                  • seqadmin
                    New Genomics Tools and Methods Shared at AGBT 2025
                    by seqadmin


                    This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                    The Headliner
                    The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                    03-03-2025, 01:39 PM
                  • seqadmin
                    Investigating the Gut Microbiome Through Diet and Spatial Biology
                    by seqadmin




                    The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                    02-24-2025, 06:31 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-20-2025, 05:03 AM
                  0 responses
                  26 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-19-2025, 07:27 AM
                  0 responses
                  33 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-18-2025, 12:50 PM
                  0 responses
                  25 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-03-2025, 01:15 PM
                  0 responses
                  190 views
                  0 reactions
                  Last Post seqadmin  
                  Working...