-
Originally posted by mathew
I am using SeqMonk to analyse paired-end (PE) RNA-seq reads. When I run the pipeline I will use the option to get raw RPKM values. Will it give me FPKM values if I use PE reads? Thanks.
If you want RPKM then you'd run the RNA-Seq pipeline making the following changes to the defaults - turn off log transforming and turn on the transcript length correction. Again this will give RPKM and you'd manually correct to get FPKM.
-
Seqmonk Bismark import tool
Hi all,
I have an issue with the Bismark Import tool in Seqmonk. After using the methylation_extractor command, I obtain the desired .txt file with 5 columns: seq-ID, methylation state, chromosome, start position (= end position), methylation call. The last column is a letter indicating the cytosine context and methylation state. However when I try to import it I get the following error:
Location 44953034-X was not an integer
Each cytosine produces the same error. It looks as if the import tool is merging columns 4 and 5 with a hyphen, and of course it can't find the location on the chromosome. Removing the 5th column does not work because the import tool expects 5 columns and refuses to work on 4.
Is there a way around the problem?
Cheers,
Quentin
-
Originally posted by QuentinG
Hi all,
I have an issue with the Bismark Import tool in Seqmonk. After using the methylation_extractor command, I obtain the desired .txt file with 5 columns: seq-ID, methylation state, chromosome, start position (= end position), methylation call. The last column is a letter indicating the cytosine context and methylation state. However when I try to import it I get the following error:
Location 44953034-X was not an integer
Each cytosine produces the same error. It looks as if the import tool is merging columns 4 and 5 with a hyphen, and of course it can't find the location on the chromosome. Removing the 5th column does not work because the import tool expects 5 columns and refuses to work on 4.
Is there a way around the problem?
Cheers,
Quentin
If you want to import files generated by the Bismark methylation extractor, you need to use the generic text import, where you simply specify the chromosome (col 3), start and end both as the position (col 4), and strand as col 2.
The Bismark import tool was designed to extract methylation values straight from the now outdated --vanilla format.
Best,
Felix
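As a rough illustration of the column mapping described above, here is a minimal Python sketch. It assumes the five tab-separated columns listed in the question (seq-ID, methylation state, chromosome, position, methylation call); the function name and the example rows are invented for illustration and are not part of Bismark or SeqMonk.

```python
def extractor_row_to_generic(line):
    """Map one methylation-extractor row to (chromosome, start, end, strand).

    Returns None for rows that do not have five tab-separated fields with an
    integer position (for example a header line).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5 or not fields[3].isdigit():
        return None
    _seq_id, meth_state, chromosome, position, _call = fields
    # Start and end are both the cytosine position (col 4);
    # the methylation state in col 2 (+/-) is used as the strand.
    return chromosome, int(position), int(position), meth_state


# Hypothetical rows, invented for illustration only.
example = [
    "some header line\n",
    "read_1\t+\t5\t44953034\tZ\n",
    "read_2\t-\t5\t44953101\tz\n",
]
records = [r for r in map(extractor_row_to_generic, example) if r is not None]
for chromosome, start, end, strand in records:
    print(chromosome, start, end, strand, sep="\t")
```

In the generic text import itself no conversion is needed; the sketch only shows which column ends up in which role.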
-
Hi Simon,
I have a question about the hierarchical graph mentioned here:
http://www.bioinformatics.babraham.a...le%20Plot.html
Is it reasonable that I compare ChIP-seq data sets from different antibodies?
Say I have three ChIP-seq datasets for transcription factors TF1, TF2, and TF3.
I want to see whether, at the probes I defined, TF2 binding is also strong when TF1 binding is strong, or TF3 binding is strong when TF1 binding is strong. Basically, I want to see whether any patterns exist.
The aligned probe plot and the probe trend plot (which is an average view) are not what I want.
Also, is it possible to extract probes that show correlated changes?
Thanks so much!
I've been using Seqmonk for a while, and am getting familiar with its functionalities. It is a great tool!
-
Thank you so much!
Originally posted by simonandrews
You can't import this type of data into SeqMonk at the moment. You would have to transform this into a 2 line format with the interacting pairs on consecutive lines and then import this using the generic text import and selecting the option that this is HiC data. If this is becoming common it wouldn't be that hard to extend the BED importer to notice that this is BED12 data and import that as a HiC dataset.
-
Hi Simon,
I did as you suggested and split each pair into two consecutive lines, like this:
chr1 851440 855732
chr1 932773 936146
chr1 857967 862189
chr1 1243062 1245066
The first two lines are one pair and the next two lines are another pair. I imported the file using the generic text import and indicated that it is Hi-C data.
But the data track only showed me those fragments individually. Presumably there should be a line connecting the two interacting fragments of each pair, right?
Thanks again.
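For anyone wanting to script the split described above, here is a minimal Python sketch. The six-column input layout (chromosome, start, end for each partner) is an assumption, since the original paired file is not shown in this thread; the function name is made up, and the coordinates are the ones from this post.

```python
def pairs_to_two_line(rows):
    """Turn (chr1, start1, end1, chr2, start2, end2) rows into consecutive lines."""
    lines = []
    for chr_a, start_a, end_a, chr_b, start_b, end_b in rows:
        # Each interacting pair becomes two consecutive tab-separated lines.
        lines.append(f"{chr_a}\t{start_a}\t{end_a}")
        lines.append(f"{chr_b}\t{start_b}\t{end_b}")
    return lines


# Assumed input layout; coordinates copied from the example in this post.
pairs = [
    ("chr1", 851440, 855732, "chr1", 932773, 936146),
    ("chr1", 857967, 862189, "chr1", 1243062, 1245066),
]
two_line = pairs_to_two_line(pairs)
print("\n".join(two_line))
```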
Originally posted by simonandrews
You can't import this type of data into SeqMonk at the moment. You would have to transform this into a 2 line format with the interacting pairs on consecutive lines and then import this using the generic text import and selecting the option that this is HiC data. If this is becoming common it wouldn't be that hard to extend the BED importer to notice that this is BED12 data and import that as a HiC dataset.
Last edited by crazyhottommy; 06-24-2013, 12:25 PM.
-
Originally posted by crazyhottommy
The first two lines are one pair and the next two lines are another pair. I imported the file using the generic text import and indicated that it is Hi-C data.
But the data track only showed me those fragments individually. Presumably there should be a line connecting the two interacting fragments of each pair, right?
See:
https://www.youtube.com/watch?v=-N2DHLvVpTU
..and
https://www.youtube.com/watch?v=SbSD-xgStMs
-
Originally posted by crazyhottommy
Hi Simon,
I have a question about the hierarchical graph mentioned here:
http://www.bioinformatics.babraham.a...le%20Plot.html
Is it reasonable that I compare ChIP-seq data sets from different antibodies?
Say I have three ChIP-seq datasets for transcription factors TF1, TF2, and TF3.
I want to see whether, at the probes I defined, TF2 binding is also strong when TF1 binding is strong, or TF3 binding is strong when TF1 binding is strong. Basically, I want to see whether any patterns exist.
Yes, you can do this. First you'd need to decide where to put your probes - you could either do separate peak calling in your different datasets and then make a combined set of peaks in which to measure. If you wanted to do this then the process would be:
1. Call peaks for TF1
2. Turn the probes into an annotation track called Peaks using File > Import Annotation > Active Probe Lists
3. Repeat steps 1 and 2 for TF2 and TF3
4. Use the feature probe generator to make probes over all features of type Peak
5. Use the deduplication probe generator to remove duplicate probes where there was a peak in more than one TF
Alternatively if you can identify a common feature where all of the TFs bind (promoters for example) then you could simply make probes over all promoters regardless of whether they showed a peak in any of your datasets.
Once you have the probes defined you can quantitate them - a simple log transformed read count would probably suffice.
You then have a number of different ways you can look at these data. For comparing two TFs you could use a scatterplot, but for 3 or more the hierarchical heatmap is probably better. It will cluster your probes into groups which show correlated patterns across your different TF datasets. The more TFs you have the better this type of plot will be.
You can also use this plot to extract out subgroups which respond in similar ways. You can use the slider on the left to set the threshold for clustering probes - higher correlation will make smaller more tightly correlated groups, lower thresholds will make fewer larger more diffuse groups. Once you're happy you can use the "Save Clusters" button at the bottom to save the groups as probe lists in your main project so you can see what they are and do further analysis on them.
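To make the idea of grouping probes by a correlation threshold more concrete, here is a small Python sketch. It is not SeqMonk's clustering algorithm; it is a simple greedy grouping written for illustration, and the probe names and read counts are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cluster_by_correlation(profiles, threshold):
    """Greedy grouping: a probe joins the first cluster whose seed profile
    it correlates with at or above the threshold, else it starts a new one."""
    clusters = []  # list of (seed_profile, [probe names])
    for name, profile in profiles.items():
        for seed, members in clusters:
            if pearson(seed, profile) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _seed, members in clusters]


# Hypothetical read counts per probe for TF1, TF2, TF3.
counts = {
    "probe_a": [100, 90, 5],
    "probe_b": [50, 40, 2],
    "probe_c": [3, 4, 80],
    "probe_d": [6, 5, 120],
}
# Simple log-transformed read count (log2 of count + 1).
log_counts = {name: [math.log2(c + 1) for c in row] for name, row in counts.items()}
clusters = cluster_by_correlation(log_counts, threshold=0.9)
print(clusters)
```

Raising the threshold in a scheme like this tends to split probes into smaller, tighter groups, which is the same trade-off the slider controls.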
-
Thanks, Simon, for your detailed answer. It is clear to me now how to do this in SeqMonk.
I do have a question related to this kind of analysis, though; it is more about statistics, I guess.
The quality of different antibodies differs, so the library sizes for each TF are different.
Also, the total number of binding sites in the genome varies intrinsically because these are different TFs, so the total reads in the libraries are different.
So when comparing binding strength in this case, SeqMonk uses the log2 counts (potentially normalized to library size? People tend to use counts per million (CPM) to represent the data). Is it reasonable to compare between different TFs?
I've got the raw read counts for my probe sets using bedtools (coverage) from the bam file for each TF. I ended up with a dataframe like this:
probes TF1 TF2 TF3 TF4 TF5
MACS_peak_291 7 10 18 2 19
MACS_peak_33 27 64 47 22 12
MACS_peak_46 5 2 20 1 5
MACS_peak_6 15 9 14 9 206
MACS_peak_7 7 12 9 6 1
MACS_peak_8 37 5 24 10 18
I can then get rid of the outliers, normalize the counts to counts per million, and log2-transform them.
Now I can use heatmap.2 in gplots to generate a heatmap. I am not sure whether it is reasonable to do it like this because, as I mentioned, they are different TFs. That's why I turned to SeqMonk.
I think the problem with us pure biologists is that we lack sufficient knowledge of statistics. We do not know the underlying statistical models applied in software like SeqMonk. But that's the whole point of SeqMonk: hiding the detailed statistical model and providing a handy tool for wet-lab biologists. However, I do want to know a little bit more...
Well, my thought was that I could rank the counts (assign a new value) for each TF at my probe set from strong to weak, based on the whole-genome binding profile for the same TF. Then I could cluster them. I am not sure exactly how to do it, though...
I am eager to see your insight about this kind of question.
I appreciate your help
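For reference, the counts-per-million and log2 step mentioned above could look roughly like this in Python. The raw counts are the ones from the table in this post; the library sizes are hypothetical placeholders (the real totals are not given here), and the pseudocount of 1 is just one common choice to avoid taking the log of zero.

```python
import math


def log2_cpm(count, library_size, pseudocount=1.0):
    """log2 of counts per million, with a pseudocount to avoid log2(0)."""
    cpm = count * 1_000_000 / library_size
    return math.log2(cpm + pseudocount)


tfs = ["TF1", "TF2", "TF3", "TF4", "TF5"]
raw_counts = {
    "MACS_peak_291": [7, 10, 18, 2, 19],
    "MACS_peak_33": [27, 64, 47, 22, 12],
    "MACS_peak_46": [5, 2, 20, 1, 5],
    "MACS_peak_6": [15, 9, 14, 9, 206],
    "MACS_peak_7": [7, 12, 9, 6, 1],
    "MACS_peak_8": [37, 5, 24, 10, 18],
}
# Hypothetical library sizes (total mapped reads per TF); replace with real totals.
library_sizes = {
    "TF1": 10_000_000,
    "TF2": 25_000_000,
    "TF3": 15_000_000,
    "TF4": 8_000_000,
    "TF5": 30_000_000,
}

normalised = {
    probe: [log2_cpm(c, library_sizes[tf]) for tf, c in zip(tfs, row)]
    for probe, row in raw_counts.items()
}
for probe, values in normalised.items():
    print(probe, " ".join(f"{v:.2f}" for v in values))
```

Note that the library size should be the total for the whole library, not the sum over the selected probes.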
Originally posted by simonandrews
Yes, you can do this. First you'd need to decide where to put your probes - you could either do separate peak calling in your different datasets and then make a combined set of peaks in which to measure. If you wanted to do this then the process would be:
1. Call peaks for TF1
2. Turn the probes into an annotation track called Peaks using File > Import Annotation > Active Probe Lists
3. Repeat steps 1 and 2 for TF2 and TF3
4. Use the feature probe generator to make probes over all features of type Peak
5. Use the deduplication probe generator to remove duplicate probes where there was a peak in more than one TF
Alternatively if you can identify a common feature where all of the TFs bind (promoters for example) then you could simply make probes over all promoters regardless of whether they showed a peak in any of your datasets.
Once you have the probes defined you can quantitate them - a simple log transformed read count would probably suffice.
You then have a number of different ways you can look at these data. For comparing two TFs you could use a scatterplot, but for 3 or more the hierarchical heatmap is probably better. It will cluster your probes into groups which show correlated patterns across your different TF datasets. The more TFs you have the better this type of plot will be.
You can also use this plot to extract out subgroups which respond in similar ways. You can use the slider on the left to set the threshold for clustering probes - higher correlation will make smaller more tightly correlated groups, lower thresholds will make fewer larger more diffuse groups. Once you're happy you can use the "Save Clusters" button at the bottom to save the groups as probe lists in your main project so you can see what they are and do further analysis on them.
Last edited by crazyhottommy; 06-25-2013, 08:50 AM.
-
Originally posted by crazyhottommy
Thanks, Simon, for your detailed answer. It is clear to me now how to do this in SeqMonk.
I do have a question related to this kind of analysis, though; it is more about statistics, I guess.
The quality of different antibodies differs, so the library sizes for each TF are different.
Also, the total number of binding sites in the genome varies intrinsically because these are different TFs, so the total reads in the libraries are different.
So when comparing binding strength in this case, SeqMonk uses the log2 counts (potentially normalized to library size? People tend to use counts per million (CPM) to represent the data). Is it reasonable to compare between different TFs?
I guess the answer to this depends on what you are looking for in the data. A direct quantitative comparison of ChIP data for different factors is pretty tenuous - there will be wide variations in the levels of enrichment, the number of sites and maybe the genomic positioning of the enrichment you're looking at. Doing simple log2 CPM is normally a good place to start to look at this data though.
In terms of the analysis there are some things which are pretty simple to do - you can correlate the abundances of different factors. Just yesterday I was looking at two ChIP datasets and it was quickly apparent that there was an inverse relationship between the two datasets. The scaling was very different but the trend was very apparent. You can also do simple visualisations such as scatterplots for two samples or clusters for more than two which will quickly give you an impression of whether the different factors are acting the same way, and whether there are obvious subgroups of sites acting in different ways.
You can also look at coverage trend plots to see if the binding sites of factors are similar in two or more datasets. We use this a lot for things like histone modification ChIP data, but it could apply to other types as well.
For the quantitation there are various ways to look at and, if need be, normalise your basic quantitation. You can use the distribution matching tools in SeqMonk to correct for differences in enrichment efficiency between different samples. On a broader scale you can transform your raw values into z-scores or even into ranks if your distributions are really very different and you want to try to make them more comparable.
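As a sketch of the z-score and rank transformations mentioned above (done outside SeqMonk, in plain Python), the following works per dataset on a list of probe values. The two example columns are the TF1 and TF5 raw counts from the table earlier in the thread; using the population standard deviation and average ranks for ties are choices made here for illustration.

```python
import statistics


def z_scores(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def ranks(values):
    """Rank from 1 (lowest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


tf1 = [7, 27, 5, 15, 7, 37]
tf5 = [19, 12, 5, 206, 1, 18]
print(ranks(tf1))
print(ranks(tf5))
print([round(z, 2) for z in z_scores(tf5)])
```

Ranks discard the scale entirely, so a single very large value (such as 206 in TF5) no longer dominates the comparison.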
-
Thank you
Originally posted by simonandrews
I guess the answer to this depends on what you are looking for in the data. A direct quantitative comparison of ChIP data for different factors is pretty tenuous - there will be wide variations in the levels of enrichment, the number of sites and maybe the genomic positioning of the enrichment you're looking at. Doing simple log2 CPM is normally a good place to start to look at this data though.
In terms of the analysis there are some things which are pretty simple to do - you can correlate the abundances of different factors. Just yesterday I was looking at two ChIP datasets and it was quickly apparent that there was an inverse relationship between the two datasets. The scaling was very different but the trend was very apparent. You can also do simple visualisations such as scatterplots for two samples or clusters for more than two which will quickly give you an impression of whether the different factors are acting the same way, and whether there are obvious subgroups of sites acting in different ways.
You can also look at coverage trend plots to see if the binding sites of factors are similar in two or more datasets. We use this a lot for things like histone modification ChIP data, but it could apply to other types as well.
For the quantitation there are various ways to look at and, if need be, normalise your basic quantitation. You can use the distribution matching tools in SeqMonk to correct for differences in enrichment efficiency between different samples. On a broader scale you can transform your raw values into z-scores or even into ranks if your distributions are really very different and you want to try to make them more comparable.
-
Hi Simon,
Sorry to bother you again. I have two ChIP-seq datasets. I ran a MACS peak call on the first one, got some peaks as probes, and then imported them as annotations to keep them.
I then did the same thing for my second dataset, and also imported the probes as annotations.
How can I pool these two probe sets (the union of the two)? "the combining existing list" only works on active probes, and I did not find any function in the define probes tab either.
I can, however, export the probes to a text file and then combine the two using bedtools. I am just wondering if there is a direct way to do it in SeqMonk.
Thanks!
Tommy
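For the export-and-combine route mentioned above, the union of two peak lists can also be sketched in a few lines of Python. This is only an illustration of merging overlapping intervals after pooling both lists; the function name and the example coordinates are hypothetical, and it is not a SeqMonk feature.

```python
def union_peaks(*peak_sets):
    """Pool any number of peak lists and merge overlapping or touching intervals.

    Each peak is a (chromosome, start, end) tuple.
    """
    intervals = sorted(p for peaks in peak_sets for p in peaks)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical peaks from two datasets.
peaks_1 = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_2 = [("chr1", 150, 250), ("chr2", 100, 200)]
union = union_peaks(peaks_1, peaks_2)
for chrom, start, end in union:
    print(chrom, start, end, sep="\t")
```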
-
Originally posted by crazyhottommy
Hi Simon,
Sorry to bother you again. I have two ChIP-seq datasets. I ran a MACS peak call on the first one, got some peaks as probes, and then imported them as annotations to keep them.
I then did the same thing for my second dataset, and also imported the probes as annotations.
How can I pool these two probe sets (the union of the two)? "the combining existing list" only works on active probes, and I did not find any function in the define probes tab either.