**simonandrews** · 06-11-2013, 12:13 AM

Originally posted by crazyhottommy

Hi Simon,

I am wondering whether SeqMonk supports the BED12 format?

I have a BED12 file produced from ChIA-PET data, like this:
chr14 69441719 69522938 chr14:69441719..69443220-chr14:69520758..69522938,2 200 . 69441719 69522938 255,0,0 2 1501,2180 0,79039

How can I visualize it in SeqMonk?

You can't import this type of data into SeqMonk at the moment. You would have to transform it into a two-line format, with the interacting pairs on consecutive lines, then import it using the generic text import and select the option that marks it as HiC data. If this format is becoming common, it wouldn't be that hard to extend the BED importer to recognise BED12 data and import it as a HiC dataset.
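The two-line conversion described above can be sketched in a few lines of Python. This is only an illustration, not part of SeqMonk: the function name `bed12_to_pair_lines` is made up, it assumes each record has exactly two blocks on the same chromosome (as in the example record), and it keeps the BED coordinates unchanged, so any 0-based versus 1-based adjustment the importer might need is left to the reader.

```python
def bed12_to_pair_lines(line):
    """Split one two-block BED12 record into two 'chrom<TAB>start<TAB>end' lines.

    Hypothetical helper: assumes both anchors of the interaction are the two
    BED12 blocks on the same chromosome. Coordinates are copied as-is.
    """
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("expected 12 BED fields, got %d" % len(fields))

    chrom = fields[0]
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if block_count != 2 or len(sizes) != 2 or len(offsets) != 2:
        raise ValueError("expected exactly two blocks per record")

    lines = []
    for size, offset in zip(sizes, offsets):
        # Block starts are offsets relative to chromStart.
        start = chrom_start + offset
        lines.append("%s\t%d\t%d" % (chrom, start, start + size))
    return lines


# The record posted in the question above.
record = ("chr14 69441719 69522938 "
          "chr14:69441719..69443220-chr14:69520758..69522938,2 "
          "200 . 69441719 69522938 255,0,0 2 1501,2180 0,79039")
pair = bed12_to_pair_lines(record)
```

For the posted record the two output lines reproduce the two intervals already encoded in its name field, which is a handy sanity check before importing.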

**simonandrews** · 06-11-2013, 12:15 AM

Originally posted by mathew

I am using SeqMonk to analyse paired-end (PE) RNA-seq reads. If I run the pipeline with the option for raw RPKM, will it give me FPKM values when I use PE reads? Thanks.

The raw option in the RNA-Seq pipeline doesn't generate RPKM; it generates completely raw counts - the sort of thing you'd need for programs like DESeq. The values are read counts, so you'd need to transform them into fragment counts. You could do this with the manual normalisation option.

If you want RPKM then you'd run the RNA-Seq pipeline with two changes to the defaults: turn off log transformation and turn on the transcript length correction. Again, this will give RPKM, and you'd correct it manually to get FPKM.
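For reference, the standard definition of RPKM (reads per kilobase of feature per million mapped reads) can be written as a short function. This is the textbook formula, not SeqMonk's own code, and the function and parameter names are made up for illustration. FPKM uses the same formula with fragment counts, and a fragment total, in place of read counts.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    Textbook definition: count / (length in kb * library size in millions),
    which simplifies to count * 1e9 / (length in bp * library size).
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: 100 reads on a 2 kb transcript, 10 million mapped reads.
example_value = rpkm(100, 2000, 10_000_000)
```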

**QuentinG** · 06-20-2013, 06:50 AM

SeqMonk Bismark import tool

Hi all,
I have an issue with the Bismark import tool in SeqMonk. After running the methylation_extractor command, I get the expected .txt file with 5 columns: seq-ID, methylation state, chromosome, start position (= end position), methylation call. The last column is a letter indicating the cytosine context and methylation state. However, when I try to import it I get the following error:
Location 44953034-X was not an integer
Each cytosine produces the same error. It looks as if the import tool is merging columns 4 and 5 with a hyphen, and of course it can't find the location on the chromosome. Removing the 5th column does not work because the import tool expects 5 columns and refuses to work on 4.
Is there a way around the problem?
Cheers,
Quentin

**fkrueger** · 06-20-2013, 06:57 AM

Hi Quentin,

If you want to import files generated by the Bismark methylation extractor, you need to use the generic text import, where you simply specify the chromosome (col 3), the start and end both as the position (col 4), and the strand as col 2.
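To make that column mapping concrete, here is a minimal Python sketch of how one extractor line maps onto the fields chosen in the generic text import. It only illustrates the mapping and is not needed for the import itself. The function name and the sample line are hypothetical, the file is assumed to be tab-separated, and any header line would have to be skipped first.

```python
def parse_extractor_line(line):
    """Map one 5-column extractor line onto chromosome/start/end/strand.

    Column order follows the post above: seq-ID, methylation state,
    chromosome, position, methylation call.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 5:
        raise ValueError("expected 5 columns, got %d" % len(cols))
    seq_id, meth_state, chrom, position, call = cols
    pos = int(position)
    return {
        "chromosome": chrom,   # col 3
        "start": pos,          # col 4
        "end": pos,            # col 4 again: a single-base position
        "strand": meth_state,  # col 2 is used as the strand
    }


# Hypothetical example line, not taken from a real file.
example = parse_extractor_line("read_1\t+\tchr5\t44953034\tZ")
```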

The Bismark import tool was designed to extract methylation values straight from the now outdated --vanilla format.

Best,
Felix

**QuentinG** · 06-20-2013, 06:58 AM

Ah, too easy. Thanks!

**crazyhottommy** · 06-24-2013, 09:48 AM

Hi Simon,

I have a question about the hierarchical graph mentioned here:

http://www.bioinformatics.babraham.a...le%20Plot.html

Is it reasonable to compare ChIP-seq datasets from different antibodies?
Say I have three ChIP-seq datasets for transcription factors TF1, TF2 and TF3.
At the probes I defined, I want to see whether TF2 binding is also strong when TF1 binding is strong, or whether TF3 binding is strong when TF1 binding is strong. Basically, I want to see whether any patterns exist.

The aligned probe plot and the probe trend plot (which is an average view) are not what I want.

Also, is it possible to extract probes that show correlated changes?

Thanks so much!
I've been using SeqMonk for a while and am getting familiar with its functionality. It is a great tool!

**crazyhottommy** · 06-24-2013, 09:49 AM

Thank you so much!

Originally posted by simonandrews

You can't import this type of data into SeqMonk at the moment. You would have to transform this into a 2 line format with the interacting pairs on consecutive lines and then import this using the generic text import and selecting the option that this is HiC data. If this is becoming common it wouldn't be that hard to extend the BED importer to notice that this is BED12 data and import that as a HiC dataset.

**crazyhottommy** · 06-24-2013, 10:32 AM

Hi Simon,

I did as you suggested and split each pair into two consecutive lines, like this:

chr1 851440 855732
chr1 932773 936146
chr1 857967 862189
chr1 1243062 1245066

The first two lines are one pair and the next two lines are another. I imported the file using the generic text import and indicated that it is Hi-C data.

But the data track only shows those fragments individually. Supposedly, there should be a line connecting the two interacting fragments of each pair, right?

Thanks again.

**simonandrews** · 06-24-2013, 11:30 PM

Originally posted by crazyhottommy

The first two lines are one pair and the next two lines are another. I imported the file using the generic text import and indicated that it is Hi-C data.

But the data track only shows those fragments individually. Supposedly, there should be a line connecting the two interacting fragments of each pair, right?

No, as long as the data view shows [HiC] before the dataset name, it has been imported correctly. In general we don't try to show individual pairs within a HiC set, since the range of distances and the number of trans-chromosomal pairs mean that the display just ends up a mess. Instead, you should now be able to use the various HiC-specific views to look at your data.

See:

https://www.youtube.com/watch?v=-N2DHLvVpTU

...and

https://www.youtube.com/watch?v=SbSD-xgStMs

**simonandrews** · 06-24-2013, 11:41 PM

Originally posted by crazyhottommy

Hi Simon,

I have a question about the hierarchical graph mentioned here:

http://www.bioinformatics.babraham.a...le%20Plot.html

Is it reasonable to compare ChIP-seq datasets from different antibodies?
Say I have three ChIP-seq datasets for transcription factors TF1, TF2 and TF3.
At the probes I defined, I want to see whether TF2 binding is also strong when TF1 binding is strong, or whether TF3 binding is strong when TF1 binding is strong. Basically, I want to see whether any patterns exist.

Yes, you can do this. First you'd need to decide where to put your probes. One option is to call peaks separately in each of your datasets and then make a combined set of peaks in which to measure. If you wanted to do this then the process would be:

1. Call peaks for TF1
2. Turn the probes into an annotation track called Peaks using File > Import Annotation > Active Probe Lists
3. Repeat steps 1 and 2 for TF2 and TF3
4. Use the feature probe generator to make probes over all features of type Peak
5. Use the deduplication probe generator to remove duplicate probes where there was a peak in more than one TF

Alternatively if you can identify a common feature where all of the TFs bind (promoters for example) then you could simply make probes over all promoters regardless of whether they showed a peak in any of your datasets.

Once you have the probes defined you can quantitate them - a simple log transformed read count would probably suffice.

You then have a number of different ways you can look at these data. For comparing two TFs you could use a scatterplot, but for 3 or more the hierarchical heatmap is probably better. It will cluster your probes into groups which show correlated patterns across your different TF datasets. The more TFs you have the better this type of plot will be.

You can also use this plot to extract out subgroups which respond in similar ways. You can use the slider on the left to set the threshold for clustering probes - higher correlation will make smaller more tightly correlated groups, lower thresholds will make fewer larger more diffuse groups. Once you're happy you can use the "Save Clusters" button at the bottom to save the groups as probe lists in your main project so you can see what they are and do further analysis on them.
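The effect of the correlation threshold can be illustrated outside SeqMonk with a small Python sketch. It groups probes by Pearson correlation using single linkage, which is equivalent to cutting a single-linkage tree at the chosen threshold. The thread does not say which linkage SeqMonk uses, so treat that as an assumption; the function names and the example values are hypothetical too.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cluster_probes(profiles, min_r):
    """Group probes linked by a chain of pairs with correlation >= min_r."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_r:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Largest groups first, then alphabetical, for a stable result.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Hypothetical log-transformed values for five probes across four TF datasets.
profiles = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.1, 6.0, 8.2],
    "p3": [4.0, 3.0, 2.0, 1.0],
    "p4": [8.0, 6.2, 4.1, 2.0],
    "p5": [1.0, 3.0, 2.0, 2.5],
}
tight = cluster_probes(profiles, 0.9)
loose = cluster_probes(profiles, 0.5)
```

With the higher threshold, p5 stays on its own; lowering the threshold lets it join the p1/p2 group, which mirrors the "fewer, larger, more diffuse groups" behaviour described above.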

**crazyhottommy** · 06-25-2013, 02:50 AM

Thanks, Simon, for your detailed answer. I am now clear on how to do it in SeqMonk.
But I do have a question related to this kind of analysis; it is more about statistics, I guess.

The quality of different antibodies varies, so the library sizes for each TF are different.
Also, the total number of TF binding sites in the genome intrinsically varies because they are different TFs! The total reads in the libraries are different.

So, when comparing binding strength in this case, SeqMonk uses the log2 counts (potentially normalized to library size? People tend to use counts per million (CPM) to represent the data). Is it reasonable to compare between different TFs?

I've got the raw read counts for my probe sets using bedtools (coverage) from the BAM file for each TF. I ended up with a data frame like this:

probes TF1 TF2 TF3 TF4 TF5
MACS_peak_291 7 10 18 2 19
MACS_peak_33 27 64 47 22 12
MACS_peak_46 5 2 20 1 5
MACS_peak_6 15 9 14 9 206
MACS_peak_7 7 12 9 6 1
MACS_peak_8 37 5 24 10 18

I can then remove the outliers, normalize the counts to counts per million and log2-transform them.
Now I can use heatmap.2 in gplots to generate a heatmap. I am not sure whether it is reasonable to do it like this because, as I mentioned, they are different TFs. That's why I turned to SeqMonk.

I think the problem with us pure biologists is that we lack sufficient knowledge of statistics. We do not know the underlying statistical models applied in software like SeqMonk. But that's the whole point of SeqMonk: hiding the detailed statistical model and providing a handy tool for wet-lab biologists.

However, I do want to know a little bit more....

Well, my thought was that, for each TF, I could rank the counts at my probe set (assigning a new value) from strong to weak, based on the whole-genome binding profile for the same TF. Then I could cluster them. I am not sure how to do it exactly, though...

I am eager to hear your insight on this kind of question.

I appreciate your help

**simonandrews** · 06-26-2013, 12:16 AM

Originally posted by crazyhottommy

Thanks, Simon, for your detailed answer. I am now clear on how to do it in SeqMonk.
But I do have a question related to this kind of analysis; it is more about statistics, I guess.

The quality of different antibodies varies, so the library sizes for each TF are different.
Also, the total number of TF binding sites in the genome intrinsically varies because they are different TFs! The total reads in the libraries are different.

So, when comparing binding strength in this case, SeqMonk uses the log2 counts (potentially normalized to library size? People tend to use counts per million (CPM) to represent the data). Is it reasonable to compare between different TFs?

I guess the answer to this depends on what you are looking for in the data. A direct quantitative comparison of ChIP data for different factors is pretty tenuous - there will be wide variations in the levels of enrichment, the number of sites and maybe the genomic positioning of the enrichment you're looking at. Doing simple log2 CPM is normally a good place to start to look at this data though.
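As a rough illustration of the log2 CPM starting point, here is a minimal Python sketch. The counts come from one row of the table posted earlier in the thread (MACS_peak_33), but the library sizes are invented because the thread does not give them. Adding a pseudocount before scaling is one common convention and is an assumption here, not a description of what SeqMonk does.

```python
import math


def log2_cpm(count, library_size, pseudocount=1.0):
    """log2 of counts per million; the pseudocount keeps zero counts finite."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return math.log2((count + pseudocount) * 1e6 / library_size)


# Raw counts for MACS_peak_33 from the table posted above.
probe_counts = {"TF1": 27, "TF2": 64, "TF3": 47}
# Hypothetical total mapped reads per library (not given in the thread).
library_sizes = {"TF1": 14_000_000, "TF2": 32_500_000, "TF3": 24_000_000}

normalised = {
    tf: log2_cpm(probe_counts[tf], library_sizes[tf]) for tf in probe_counts
}
```

With these made-up library sizes the three raw counts, which differ more than twofold, end up with the same log2 CPM. That is the kind of library-size effect the normalisation is meant to remove; it does not correct for differences in enrichment efficiency between antibodies.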

In terms of the analysis there are some things which are pretty simple to do - you can correlate the abundances of different factors. Just yesterday I was looking at two ChIP datasets and it was quickly apparent that there was an inverse relationship between the two datasets. The scaling was very different but the trend was very apparent. You can also do simple visualisations such as scatterplots for two samples or clusters for more than two which will quickly give you an impression of whether the different factors are acting the same way, and whether there are obvious subgroups of sites acting in different ways.

You can also look at coverage trend plots to see if the binding sites of factors are similar in two or more datasets. We use this a lot for things like histone modification ChIP data, but it could apply to other types as well.

For the quantitation there are various ways to look at and, if need be, normalise your basic quantitation. You can use the distribution matching tools in SeqMonk to correct for differences in enrichment efficiency between different samples. On a broader scale you can transform your raw values into z-scores or even into ranks if your distributions are really very different and you want to try to make them more comparable.
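The z-score and rank transformations mentioned above are easy to sketch in plain Python, applied to one dataset at a time across all probes. The function names are hypothetical and this is not SeqMonk's implementation. The example values are the TF5 column from the count table posted earlier in the thread, which contains one very large value.

```python
import statistics


def z_scores(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def average_ranks(values):
    """Rank from 1 (lowest) to n (highest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


# TF5 counts for the six probes in the table posted above.
tf5 = [19, 12, 5, 206, 1, 18]
tf5_z = z_scores(tf5)
tf5_ranks = average_ranks(tf5)
```

Ranks discard the size of the gap between values, so the single large count simply becomes the top rank. That makes very differently shaped distributions comparable, at the cost of losing magnitude information.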

**crazyhottommy** · 06-27-2013, 01:27 PM

Thank you

**crazyhottommy** · 07-02-2013, 06:53 AM

Hi Simon,

Sorry to bother you again. I have two ChIP-seq datasets. I ran a MACS peak call on the first, got some peaks as probes, and then imported them as annotations to keep them.

I then did the same for my second dataset and also imported its probes as annotations.

How can I pool these two probe sets (take the union of the two)? "the combining existing list" only works on active probes, and I did not find a suitable function in the define probes tab either.

I can, however, export the probes to text files and then combine the two using bedtools. I am just wondering whether there is a direct way to do it in SeqMonk.

Thanks!
Tommy

**simonandrews** · 07-02-2013, 06:59 AM

Originally posted by crazyhottommy

Hi Simon,

Sorry to bother you again. I have two ChIP-seq datasets. I ran a MACS peak call on the first, got some peaks as probes, and then imported them as annotations to keep them.

I then did the same for my second dataset and also imported its probes as annotations.

How can I pool these two probe sets (take the union of the two)? "the combining existing list" only works on active probes, and I did not find a suitable function in the define probes tab either.

If you give the two sets of results the same feature name ('peak' for example) then they'll be merged together in the same track and you can use the feature probe generator to make probes from the combined set. You can always separate them later by renaming one of the tracks using the controls in the Annotation Sets folder of the data view.
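For anyone taking the export route mentioned in the question instead, pooling two exported peak lists and merging overlapping intervals can be sketched as follows. This is an outside-SeqMonk alternative, not what the annotation track does internally. The function name and the peak coordinates are hypothetical, and intervals that merely touch are treated as overlapping here.

```python
def merge_peaks(*peak_lists):
    """Pool (chrom, start, end) peaks from several lists and merge overlaps."""
    pooled = sorted(peak for peaks in peak_lists for peak in peaks)
    merged = []
    for chrom, start, end in pooled:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it.
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical peaks from two datasets.
tf1_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
tf2_peaks = [("chr1", 150, 250), ("chr2", 100, 200)]
union = merge_peaks(tf1_peaks, tf2_peaks)
```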
