Seqanswers Leaderboard Ad

**ikremsky** · 04-01-2010, 04:43 PM

Hello,

I am attempting to run the function getGeneExp but am running into problems. After loading the DEGseq package and setting the working directory to the one containing the bed and refFlat files, I type:
>getGeneExp(c("nameofbedfile.bed.txt"), refFlat = "nameofrefFlatfile.txt")

The output is as follows:
SampleFiles:
nameofbedfile.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait...

It then continues indefinitely, as if it is in an infinite loop. I know it is not a memory issue, as I have run this on a large Unix cluster, and I also tried using only a small fraction of the original .bed file, but after up to 12 hours of runtime the job eventually times out and nothing is returned.

I am running the latest version of R, version 2.10.1. I noticed earlier in the thread that you recommend using version 2.10.0, so perhaps there is a compatibility problem with the new version? This is just a passing thought; I would appreciate some advice though.

Thanks!

**Xi Wang** · 04-01-2010, 07:22 PM

Originally posted by ikremsky View Post

Hello,

I am attempting to run the function getGeneExp but am running into problems. After loading the DEGseq package and setting the working directory to the one containing the bed and refFlat files, I type:
>getGeneExp(c("nameofbedfile.bed.txt"), refFlat = "nameofrefFlatfile.txt")

The output is as follows:
SampleFiles:
nameofbedfile.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait...

It then continues indefinitely, as if it is in an infinite loop. I know it is not a memory issue, as I have run this on a large Unix cluster, and I also tried using only a small fraction of the original .bed file, but after up to 12 hours of runtime the job eventually times out and nothing is returned.

I am running the latest version of R, version 2.10.1. I noticed earlier in the thread that you recommend using version 2.10.0, so perhaps there is a compatibility problem with the new version? This is just a passing thought; I would appreciate some advice though.

Thanks!

Thanks for your question.

Usually, it takes several minutes for the function getGeneExp to get a result. It seems weird in your case. Could you please make sure two things? First, I am wondering if the program really found the input files. Using the full path instead of the file name could be helpful. Second, make sure your BED and REFFLAT files following the instruction of file formats in DEGseq manual. BED file format here is a little bit different from its original version. An example of BED format:

Code:

chr12 7 38 readID 2 +

which means: chromosome, start_pos, end_pos, read_ID, mis_matches (actually this column not used), strand.

I don't think there are some compatibility problems here.

Hope this helps. Any further questions, just let me know.

**ikremsky** · 04-02-2010, 12:22 AM

Originally posted by Xi Wang View Post

Thanks for your question.

Usually, it takes several minutes for the function getGeneExp to get a result. It seems weird in your case. Could you please make sure two things? First, I am wondering if the program really found the input files. Using the full path instead of the file name could be helpful. Second, make sure your BED and REFFLAT files following the instruction of file formats in DEGseq manual. BED file format here is a little bit different from its original version. An example of BED format:

Code:

chr12 7 38 readID 2 +

which means: chromosome, start_pos, end_pos, read_ID, mis_matches (actually this column not used), strand.

I don't think there are some compatibility problems here.

Hope this helps. Any further questions, just let me know.

Hi Xi,

Thank you for your quick answer. Unfortunately, it did not solve the problem. I verified the file format, and used the full file addresses, but the problem persists. In addtion, I ruled out any potential compatibility issues, as I tried running it in R version 2.10.0 (actually it is still running, but it has been running for over 6 hours in the same spot it always gets stuck). Actually the same thing happens when I run DEGseq, it never moves past counting the number of reads.

Is there a special way the files should be converted to .txt? Initially the files have .bed extensions, but I simply added the .txt extension to the end to match more closely with your examples. It does the samet thing whether or not I add the .txt at the end actually. Any other ideas what the problem might be? The bed files are the output read files from Rmap. I tried going through the example from your guide, but was getting some error message there as well (although that was coming before being put into the getGeneExp function).

Thanks!

**Xi Wang** · 04-02-2010, 07:29 AM

Originally posted by ikremsky View Post

Hi Xi,

Thank you for your quick answer. Unfortunately, it did not solve the problem. I verified the file format, and used the full file addresses, but the problem persists. In addtion, I ruled out any potential compatibility issues, as I tried running it in R version 2.10.0 (actually it is still running, but it has been running for over 6 hours in the same spot it always gets stuck). Actually the same thing happens when I run DEGseq, it never moves past counting the number of reads.

Is there a special way the files should be converted to .txt? Initially the files have .bed extensions, but I simply added the .txt extension to the end to match more closely with your examples. It does the samet thing whether or not I add the .txt at the end actually. Any other ideas what the problem might be? The bed files are the output read files from Rmap. I tried going through the example from your guide, but was getting some error message there as well (although that was coming before being put into the getGeneExp function).

Thanks!

Hi,

1. Don't need to convert the files to .TXT, but .TXT could also work.
2. As you also got errors when you run the examples, it seems that the DEGseq was not correctly installed. It is recommended to re-install the DEGseq package, and the release version is better. It can be got via

DEGseq

http://bioconductor.org/packages/release/bioc/html/DEGseq.html

DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

I'd like to know if DEGseq works well this time.

Good luck!

**ikremsky** · 04-02-2010, 03:26 PM

Originally posted by Xi Wang View Post

Hi,

1. Don't need to convert the files to .TXT, but .TXT could also work.
2. As you also got errors when you run the examples, it seems that the DEGseq was not correctly installed. It is recommended to re-install the DEGseq package, and the release version is better. It can be got via

DEGseq

http://bioconductor.org/packages/release/bioc/html/DEGseq.html

DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

I'd like to know if DEGseq works well this time.

Good luck!

Ok I reinstalled DEGseq. I was able to run the example, but I am still unable to run it on the files I have produced. I noticed that on the readID section in your sample bed file, that all id's have the '>' symbol before the id, whereas this symbol is not in the bed files I have produced. For example, the entry in your sample file looks like ">CMLIVERKIDNEY_7:1:1:67:406" (without quotes) whereas in the file I've created, they look like "SRR001357.7181352". Could that be the problem? What do those numbers mean on your sample id's?

Thanks!

**ikremsky** · 04-02-2010, 05:41 PM

I am starting to think the problem has to do with the large file size. I tried simply scanning the bed file into R using function read.delim(), but it took quite about 20 minutes. Accessing parts of the table was also slow.

What is the best way to deal with very large bed files with millions of reads? R seems to be inherently very slow in dealing with this type of file.

Thanks!

**Xi Wang** · 04-02-2010, 06:30 PM

Originally posted by ikremsky View Post

Ok I reinstalled DEGseq. I was able to run the example, but I am still unable to run it on the files I have produced. I noticed that on the readID section in your sample bed file, that all id's have the '>' symbol before the id, whereas this symbol is not in the bed files I have produced. For example, the entry in your sample file looks like ">CMLIVERKIDNEY_7:1:1:67:406" (without quotes) whereas in the file I've created, they look like "SRR001357.7181352". Could that be the problem? What do those numbers mean on your sample id's?

Thanks!

I think it has nothing to do with the read ids. The numbers are just related to the sequencing platform.

**Xi Wang** · 04-02-2010, 06:36 PM

Originally posted by ikremsky View Post

I am starting to think the problem has to do with the large file size. I tried simply scanning the bed file into R using function read.delim(), but it took quite about 20 minutes. Accessing parts of the table was also slow.

What is the best way to deal with very large bed files with millions of reads? R seems to be inherently very slow in dealing with this type of file.

Thanks!

How many reads are there in your bed file? DEGseq can deal with millions of reads up on our test. Actually, C/C++ code is embeded in DEGseq to deal with read-counting. It will be much faster than R.

Could you please paste a few lines of your BED file and your REFFLAT file here, or email to me ([email protected])? I will look into them and test them on my machine. Thanks.

**ikremsky** · 04-02-2010, 07:04 PM

Originally posted by Xi Wang View Post

How many reads are there in your bed file? DEGseq can deal with millions of reads up on our test. Actually, C/C++ code is embeded in DEGseq to deal with read-counting. It will be much faster than R.

Could you please paste a few lines of your BED file and your REFFLAT file here, or email to me ([email protected])? I will look into them and test them on my machine. Thanks.

There are a little over 11 million reads.

Here are the first few lines of the bed file:
chr7 33062200 33062233 SRR001357.7181352 0 +
chr7 33062200 33062233 SRR001357.7208189 0 +
chr4 91653375 91653408 SRR001357.758827 0 -
chrX 47443645 47443678 SRR001357.1304368 0 +

and the refFlat file:
Xkr4 NM_001011874 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
Rp1 NM_011283 chr1 - 4334223 4350473 4334680 4342906 4 4334223,4341990,4342282,4350280, 4340172,4342162,4342918,4350473,
Sox17 NM_011441 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
Mrpl15 NM_025300 chr1 - 4764014 4775768 4764532 4775758 5 4764014,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775768,

**Xi Wang** · 04-02-2010, 07:27 PM

Originally posted by ikremsky View Post

There are a little over 11 million reads.

Here are the first few lines of the bed file:
chr7 33062200 33062233 SRR001357.7181352 0 +
chr7 33062200 33062233 SRR001357.7208189 0 +
chr4 91653375 91653408 SRR001357.758827 0 -
chrX 47443645 47443678 SRR001357.1304368 0 +

and the refFlat file:
Xkr4 NM_001011874 chr1 - 3204562 3661579 3206102 3661429 3 3204562,3411782,3660632, 3207049,3411982,3661579,
Rp1 NM_011283 chr1 - 4334223 4350473 4334680 4342906 4 4334223,4341990,4342282,4350280, 4340172,4342162,4342918,4350473,
Sox17 NM_011441 chr1 - 4481008 4486494 4481796 4483487 5 4481008,4483180,4483852,4485216,4486371, 4482749,4483547,4483944,4486023,4486494,
Mrpl15 NM_025300 chr1 - 4764014 4775768 4764532 4775758 5 4764014,4767605,4772648,4774031,4775653, 4764597,4767729,4772814,4774186,4775768,

It works well on my machine. Did you try the 4-line files on your cluster? (copying the 4 lines directly from the web does not work, as TAB separator is required in the input files.)
And how about the example in the manual for getGeneExp? Was there any warning message when you install DEGseq package?

**ikremsky** · 04-03-2010, 01:13 PM

Originally posted by Xi Wang View Post

It works well on my machine. Did you try the 4-line files on your cluster? (copying the 4 lines directly from the web does not work, as TAB separator is required in the input files.)
And how about the example in the manual for getGeneExp? Was there any warning message when you install DEGseq package?

Ever since I installed R version 2.10.0 a couple days ago in parallel with versoin 2.10.1, I have been getting the message: "In fun(...) : no DISPLAY variable so Tk is not available" when loading the DEGseq package. (I think the problem was caused when I attempted to uninstall R 2.10.1 by going into the head directory and typing "make uninstall"; it did not uninstall as I had expected it to, and afterward I decided to keep it installed. However, the message occurs when I use both version 2.10.0 and 2.10.1, so maybe that wasn't the cause afterall.) I'm not sure if that is the problem though, because I wasn't getting that message earlier but was still having the same exact problem.

I tried running the first 4 lines like you suggested (I loaded the full files into MS Excel and then copied and pasted the top 4 lines into a new file and saved as tab-delimited text.) Here are my input and output messages:

getGeneExp("/auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt", refFlat = "/auto/cmb-01/ikremsky/mouse_liver/refFlat_mm9short.txt")
Please wait...

SampleFiles:
/auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait ...
total 1 unique genes
processed 0 reads (supershort.bed.txt)Wrong Format!
There is something wrong!Please check...

**Xi Wang** · 04-03-2010, 05:33 PM

Originally posted by ikremsky View Post

Ever since I installed R version 2.10.0 a couple days ago in parallel with versoin 2.10.1, I have been getting the message: "In fun(...) : no DISPLAY variable so Tk is not available" when loading the DEGseq package. (I think the problem was caused when I attempted to uninstall R 2.10.1 by going into the head directory and typing "make uninstall"; it did not uninstall as I had expected it to, and afterward I decided to keep it installed. However, the message occurs when I use both version 2.10.0 and 2.10.1, so maybe that wasn't the cause afterall.) I'm not sure if that is the problem though, because I wasn't getting that message earlier but was still having the same exact problem.

It just because you can't use R display for figures. Have nothing to do with DEGseq itself. So the installation should be ok.

Originally posted by ikremsky View Post

I tried running the first 4 lines like you suggested (I loaded the full files into MS Excel and then copied and pasted the top 4 lines into a new file and saved as tab-delimited text.) Here are my input and output messages:

getGeneExp("/auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt", refFlat = "/auto/cmb-01/ikremsky/mouse_liver/refFlat_mm9short.txt")
Please wait...

SampleFiles:
/auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt
Count the number of reads mapped to each gene.
This will take several minutes.
Please wait ...
total 1 unique genes
processed 0 reads (supershort.bed.txt)Wrong Format!
There is something wrong!Please check...

It is clear now that the formats are not recognized. I think it is related to the line field mark. There is something wrong when switching the Windows and Linux platform. Please try to use the Linux command

Code:

cat /auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt

to check if the file dose appear in one line. I'll get back to you later if you still can't solve this problem.

**ikremsky** · 04-03-2010, 06:34 PM

Originally posted by Xi Wang View Post

It just because you can't use R display for figures. Have nothing to do with DEGseq itself. So the installation should be ok.

It is clear now that the formats are not recognized. I think it is related to the line field mark. There is something wrong when switching the Windows and Linux platform. Please try to use the Linux command

Code:

cat /auto/cmb-01/ikremsky/mouse_liver/supershort.bed.txt

to check if the file dose appear in one line. I'll get back to you later if you still can't solve this problem.

Alright, I have fixed the problem! Thanks very much for all your help Xi.

**gen2prot** · 04-05-2010, 10:38 AM

Hi Xi,

I am using the DEGseq package and the DEGexp utility to find differentially expressed genes between male and female drosophila flies. If I leave the default values and use the MARS method or the FET method, I get a almost 7000 genes which are differentially expressed. I found that genes with a read count difference of 5000 has similar p-values to genes with a difference of 300. Is there a way to select only the most significant of these differences? How is the threshold for gene expression difference decided. I intend to remove from my dataset all genes (as much as computationally possible) that are sex-associated and I am not sure if 7000 genes qualify as sex-associated.

Thanks
Abhijit

**Xi Wang** · 04-05-2010, 07:11 PM

Originally posted by gen2prot View Post

Hi Xi,

I am using the DEGseq package and the DEGexp utility to find differentially expressed genes between male and female drosophila flies. If I leave the default values and use the MARS method or the FET method, I get a almost 7000 genes which are differentially expressed. I found that genes with a read count difference of 5000 has similar p-values to genes with a difference of 300. Is there a way to select only the most significant of these differences? How is the threshold for gene expression difference decided. I intend to remove from my dataset all genes (as much as computationally possible) that are sex-associated and I am not sure if 7000 genes qualify as sex-associated.

Thanks
Abhijit

Number of absolute difference can not reflect the significance of difference expressions levels. For example, suppose we have two genes and measured their expression levels at two conditions using read count, one is 5001:1 and the other is 200,000:205,000. In this case, both genes have the same difference of read count, but we are more confident that the first gene is expressed differentially than the second. So p-value can give you a rank the extend that observed difference read counts deviate from expectation if gene is not differentially expressed. As for your second question, you should first plot the distribution of p-values, and choose an appropriate cut-off, and remove genes fall below the chosen significance level.

Hope this helps.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News