
  • Correlation coefficient in cluster

    I have a question about SeqMonk. When we perform clustering on any analysed data, we can get different subsets of clusters depending on the correlation coefficient. How does it calculate the correlation coefficient? Does it calculate the correlation on whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?

    Comment


    • Originally posted by mathew View Post
      I have a question about SeqMonk. When we perform clustering on any analysed data, we can get different subsets of clusters depending on the correlation coefficient. How does it calculate the correlation coefficient? Does it calculate the correlation on whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?
      The correlation clustering is an iterative process where you start by making a set of clusters with only one probe in each. In each round the program finds the two most correlated clusters and joins them together. It keeps doing this until all of the clusters are joined together. Since the most strongly correlated clusters are always joined in each round, the level of correlation decreases as the clustering continues. It also means that every cluster join has a specific R value associated with it.

      When you adjust the clustering stringency with the slider in SeqMonk, what you're actually doing is moving through the cluster tree to find the largest cluster set for which the R value that joined each cluster is at or above the R value you set. High R values will most likely be found early on in the clustering but will generate only small clusters; smaller or negative R values will be late-stage joins of large clusters. Adjusting this threshold therefore allows you to define the stringency of clustering.
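
A minimal Python sketch of the join-and-cut idea described above, not SeqMonk's actual code. It assumes Pearson correlation between probe profiles and average linkage (the mean pairwise R between members) as the score between two clusters. The function names and that linkage rule are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = sqrt(vx * vy)
    return cov / denom if denom else 0.0  # flat profiles carry no correlation


def cluster_r(c1, c2, data):
    """Assumed linkage rule: mean pairwise R between members of two clusters."""
    rs = [pearson(data[a], data[b]) for a in c1 for b in c2]
    return sum(rs) / len(rs)


def cluster_at_threshold(data, min_r):
    """Start with one probe per cluster and repeatedly join the two most
    correlated clusters, stopping once the best available join falls below
    min_r. Returns the clusters plus the (R value, merged cluster) history."""
    clusters = [[name] for name in data]
    joins = []
    while len(clusters) > 1:
        best_r, i, j = max(
            (cluster_r(clusters[a], clusters[b], data), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_r < min_r:
            break
        merged = clusters[i] + clusters[j]
        joins.append((best_r, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, joins
```

Each entry in `joins` carries the R value of that join, which is why a single threshold can cut the whole tree: raising `min_r` stops the joining earlier and leaves more, smaller clusters.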

      Hope this clears things up.

      Comment


      • Originally posted by Neuromancer View Post
        Hi Simon,

        Just a short question about genome versions:
        As far as I know, SeqMonk genomes are derived from ENSEMBL genome releases, right?
        So is the current SeqMonk mouse genome (GRCm38) the same as the annotation and coordinates in ENSEMBL release 73 (i.e. GRCm38p1 + new annotations by ENSEMBL)?

        [This current release has 38561 genes (Ensembl gene IDs), SeqMonk's probe generator (v0.25.0) generates 32029 genes (feature probes over genes, nothing removed)...]

        What's the status of the SeqMonk (mouse) genome then?
        In general we only update the genomes for new assemblies and the gene builds we distribute are the initial builds for that assembly. GRCm38 hasn't changed its sequence since the initial Ensembl build so the gene models are still on Ensembl v68. If there is a significant improvement in the gene builds then we can update these and SeqMonk will pick up the updates, but we didn't build in a place to record the specific annotation version when we built the back end (would have been nice in retrospect) so we're generally reluctant to do this.

        If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.
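
If you would rather tag the features in the file yourself before importing, something like the following Python sketch could work. It assumes the feature type in column 3 is the field you want to distinguish. The function name and the example prefix are hypothetical, and this is not a description of what SeqMonk's own import does.

```python
def prefix_feature_types(lines, prefix):
    """Yield GTF/GFF lines with the feature type (column 3) prefixed."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line  # keep comments and blank lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            fields[2] = prefix + fields[2]
            yield "\t".join(fields) + "\n"
        else:
            yield line  # not a standard 9-column record, leave as-is
```

For example, passing the lines of a downloaded GTF file through `prefix_feature_types(handle, "v73_")` would turn `gene` records into `v73_gene` records, so they can be told apart from the core features.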

        Comment


        • Originally posted by simonandrews View Post
          If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.
          Thank you, that was what I had in mind as a solution as well! Thanks for the quick answer.

          Comment


          • I've just released SeqMonk v0.26.0 onto our project web site. The immediate reason for this is to fix a problem which occurred with the program launcher in the new OSX Mavericks release. We have also, though, included another tool we've been working on which makes it much easier to create and work with custom genomes, so that if you just have a collection of fasta files or a GTF file then it's now much easier to use these with SeqMonk.

            Please try out the new release and send your experiences either back to us directly or post them in this forum.

            Comment


            • how to predict gene from transcriptome data by mapping of transcriptome to genome

              Hi everyone,
              I want to ask how to predict genes from transcripts by mapping them to a genome. Please reply if someone knows about it. I am new to this field.

              Comment


              • Originally posted by rajeshgazara View Post
                Hi everyone,
                I want to ask how to predict genes from transcripts by mapping them to a genome. Please reply if someone knows about it. I am new to this field.
                Probably the most commonly used tool for this would be cufflinks. Since you're asking in a SeqMonk thread I should point out that we've done this kind of analysis and then loaded the raw mapped data and the GTF file from cufflinks in to SeqMonk to check the results. We've found that it's been very variable whether the predictions it made matched with what we expected from looking at the data ourselves.

                Comment


                • Hi Simon.

                  I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

                  I've been loading the gff file - Kiwifruit_pseudomolecule.gff3 - it appears to accept it and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.

                  Is there some modification to the gff file I need to make that I'm missing? The seqmonk version I'm using was downloaded a week or so ago, so that should be current.

                  Any thoughts/pointers would be much appreciated.

                  Cheers
                  Ben.

                  Comment


                  • Originally posted by tirohia View Post
                    Hi Simon.

                    I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

                    I've been loading the gff file - Kiwifruit_pseudomolecule.gff3 - it appears to accept it and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.
                    Hi Ben,

                    I had a look at this. It's a bug in SeqMonk - it doesn't pick up files with .gff3 extensions when it creates the default annotation set for a new custom genome. If you change the extension to just .gff and rebuild the custom genome it should work.
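
If you'd rather script the rename, here is a small Python sketch of the workaround. The function name is hypothetical, and it copies rather than renames so the original .gff3 file is left in place.

```python
from pathlib import Path
import shutil


def copy_with_gff_extension(path):
    """Copy a .gff3 file to a sibling with a .gff extension; return the new path."""
    src = Path(path)
    if src.suffix.lower() != ".gff3":
        raise ValueError(f"expected a .gff3 file, got {src.name}")
    dest = src.with_suffix(".gff")
    shutil.copyfile(src, dest)
    return dest
```

Calling it on Kiwifruit_pseudomolecule.gff3 would leave a Kiwifruit_pseudomolecule.gff copy next to it, which is the file to use when rebuilding the custom genome.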

                    I'll fix this in the next release. Thanks for spotting and reporting this.

                    Cheers

                    Simon.

                    Comment


                    • Ah. That works. Brilliant.

                      Ta muchly.

                      Ben.

                      Comment


                      • Hi Simon.

                        Possibly a rehash (of sorts) of an old question, if I may. I'm getting an error when I try to import a bam file into seqmonk - "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

                        I've read the article that you linked to in your response.

                        So my gff file, where the reference data came from, has entries like this:

                        Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
                        Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
                        Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
                        Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

                        The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

                        @SQ SN:Achn215061 LN:4998

                        Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?

                        I'm assuming at this point that I'll have to add something to the names of the various genes in the sam file, but the article indicates that no regexp matching is attempted on the aliases provided in the files. That would mean that if I added a chromosome name as a prefix/suffix to all the entries in the sam file (Achn215061chr6 maybe), it wouldn't pick them up. I'm not sure where/how I would add the chromosome information in the SAM file.

                        Am I missing something obvious?

                        Cheers
                        Ben.

                        Comment


                        • Originally posted by tirohia View Post
                          I'm getting an error when I try and import a bam file into seqmonk - "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

                          I've read the article that you linked to in your response.

                          So my gff file, where the reference data came from, has entries like this:

                          Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
                          Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
                          Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
                          Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

                          The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

                          @SQ SN:Achn215061 LN:4998

                          Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?
                          Setting up chromosome aliases is fairly simple. You simply need to create a file called aliases.txt in the folder containing your seqmonk genome and then add in alias[tab]chromosome name pairs to allow seqmonk to do the lookup when importing.
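
As a rough illustration, this Python sketch writes such a file. The function name and the example alias names are hypothetical. Only the file name aliases.txt and the alias[tab]chromosome name layout come from the description above.

```python
from pathlib import Path


def write_aliases(genome_dir, aliases):
    """Write alias<TAB>chromosome-name pairs to aliases.txt in genome_dir."""
    lines = [f"{alias}\t{chrom}\n" for alias, chrom in aliases.items()]
    out = Path(genome_dir) / "aliases.txt"
    out.write_text("".join(lines))
    return out
```

For instance, `write_aliases(genome_dir, {"chr6": "Chr6", "6": "Chr6"})` would produce two lines, each mapping an alternative name found in the mapped data onto the chromosome name used in the genome.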

                          However, in this case I suspect you might have a different problem. It's difficult to tell from the information you've supplied but I think your data might have been mapped against a transcriptome rather than a genome, so although your genome has assembled chromosomes, the coordinates in your BAM file might be within transcripts rather than being genomic positions. If this is the case then it's not just a case of adding an alias since the positions will be offset in the genome. The aliases file does allow for supplying an offset position as well as an alias, but if you're working in a species which does splicing then even this isn't going to be enough since you will have a different offset for each exon.
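
One quick way to check for this is to compare the reference names in the SAM header with the sequence names in the first column of the GFF file. The Python sketch below assumes whitespace-separated fields and uses hypothetical function names. If most header names look like gene or transcript IDs rather than chromosomes, the data was probably mapped against a transcriptome.

```python
def sam_reference_names(header_lines):
    """Collect the SN: values from @SQ lines of a SAM header."""
    names = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def gff_sequence_names(gff_lines):
    """Collect the sequence names (column 1) from GFF records."""
    return {
        line.split()[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def unmatched_references(header_lines, gff_lines):
    """Return SAM reference names that are not sequence names in the GFF."""
    known = gff_sequence_names(gff_lines)
    return [name for name in sam_reference_names(header_lines) if name not in known]
```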

                          It's theoretically possible to translate transcriptome coordinates to genomic coordinates (tophat does this internally, for example), but I've never actually tried it and don't know of a simple way to do it. If your BAM file is mapped against a transcriptome and you want to view the data on a genome, though, this is what you'd need to do.
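
A sketch of that per-exon translation in Python, assuming you already have the exon (or CDS) coordinates for each transcript, for example from a GFF file. The function name is hypothetical and positions are 1-based and inclusive. It only converts a single position. Rewriting a whole BAM file would also mean adjusting alignments that span introns, which is not attempted here.

```python
def transcript_to_genome(pos, exons, strand="+"):
    """Map a 1-based transcript position to a 1-based genomic position.

    exons is a list of (start, end) genomic intervals, 1-based inclusive.
    """
    if pos < 1:
        raise ValueError("transcript positions are 1-based")
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length  # skip this exon; the offset changes at every exon
    raise ValueError("position lies beyond the end of the transcript")
```

With the two CDS intervals quoted above for Achn215061-TA on the + strand, transcript position 291 lands on the last base of the first interval (14845647) and position 292 jumps to the first base of the second (14848222), which is why a single offset per transcript is not enough.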

                          If you can give us a bit more information about where this data came from and how the mapping was done we can probably give a more concrete answer.

                          Comment


                          • So the reference genome that I've loaded into Seqmonk was the one that I was trying to load a week or two ago - from gff file at http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi. (Your fix worked well, thanks for that).

                            I've taken the corresponding file of coding sequences from that site (ftp://bioinfo.bti.cornell.edu/pub/ki...ruit_cds.fa.gz), which correspond to the 39040 genes in the gff file, and used that to create an index for bwa.
                            I then used bwa to map transcriptome data against those coding sequences, ending up with a sam file. I imagine this is where the problem is. The genes get split up into chromosomes when the reference is loaded into seqmonk, but there's no chromosome information in the cds file - thus no chromosome information when the mapped reads are put into the sam file.

                            Though, for reference, when I first started, I mapped my transcriptome data against the kiwifruit pseudomolecule sequence - which are the chromosomes and I was getting the same error though a lot less of them.

                            Any other info that would be helpful?

                            Comment


                            • Sorry. I may have answered my own question. When I was using the kiwifruit pseudomolecule sequences (i.e. the chromosome sequences) to map my transcriptome data against, I was getting the same error - but I hadn't, at the time, found the link about the mapping of the chromosomes.

                              I'm repeating the mapping with the chromosome sequences, and I'll see if I get results wherein I can figure out how to set up the aliases.

                              Cheers
                              Ben.

                              Comment


                              • Hi Ben,

                                You should be fine if you map against the chromosome sequences. You'll need to use a splice aware mapper such as tophat to do the mapping, but you can also pass in your GTF file to the mapper so that it will effectively map against the transcriptome first, but will give you genomic coordinates.

                                Let me know if it works out OK, but hopefully this batch of mapped reads will be OK.

                                Simon.

                                Comment
