Hello all,
I have RNA-Seq expression data, available as both RPKM and read counts. Each dataset contains ~35,000 genes.
The samples are organised as follows:
Tissue #1: 5 stages, 2 replicates per stage. Total: 10 samples
Tissue #2: 5 stages, 2 replicates per stage. Total: 10 samples
What I want to do with the expression data is basically clustering. For now, I need to cluster each tissue separately from the other.
A couple of issues I came across:
- What is the correct way to deal with missing values? I have many of them scattered all over the dataset.
- I read that it is good to remove genes that have zero read counts in more than a certain percentage (30%, 40%, 50%, ...) of the samples. That makes sense to me, but I don't know if it is correct.
- What kind of data should go into the clustering process: RPKM, read counts, or transformed values (add 0.00001 to all RPKM values, take the log2, then calculate the standardized value of each RPKM)?
- Should the replicates go into the clustering as they are, or should the values from both replicates of each stage be combined in some way to produce one value (this way I would end up with 5 columns for each tissue)?
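
To make the options above concrete, here is a minimal Python sketch of the steps I am asking about: filtering genes by their fraction of zero counts, the add-0.00001 / log2 / standardize transformation, and averaging replicates. I am not claiming this is the correct pipeline. The function names, the toy values, and the 50% threshold are illustrative assumptions, and the columns are assumed to be ordered by stage with the two replicates next to each other. It does not address the missing-values question.

```python
import math
import statistics

def filter_zero_genes(counts, max_zero_frac=0.5):
    """Keep genes whose fraction of zero-count samples is at most max_zero_frac."""
    kept = {}
    for gene, row in counts.items():
        zero_frac = sum(1 for c in row if c == 0) / len(row)
        if zero_frac <= max_zero_frac:
            kept[gene] = row
    return kept

def average_replicates(row, n_reps=2):
    """Average each run of n_reps adjacent columns (assumed replicate order)."""
    return [sum(row[i:i + n_reps]) / n_reps for i in range(0, len(row), n_reps)]

def log2_zscore(row, pseudo=0.00001):
    """Add a pseudocount, take log2, then standardize the gene across samples."""
    logged = [math.log2(v + pseudo) for v in row]
    mean = statistics.mean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        # A flat gene carries no pattern; return zeros instead of dividing by 0.
        return [0.0] * len(row)
    return [(v - mean) / sd for v in logged]

# Toy data for one tissue: 5 stages x 2 replicates = 10 columns (made-up values).
counts = {
    "geneA": [10, 12, 0, 0, 5, 7, 20, 22, 30, 28],
    "geneB": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
}
rpkm = {"geneA": [1.0, 3.0, 2.0, 2.0, 4.0, 6.0, 8.0, 8.0, 16.0, 16.0]}

kept = filter_zero_genes(counts, max_zero_frac=0.5)
averaged = average_replicates(rpkm["geneA"])  # one value per stage
z = log2_zscore(averaged)                     # values that would be clustered
```

Whether to average before or after the log transformation, or to keep all 10 columns, is exactly the part I am unsure about.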
I would really appreciate any help with each of the above issues.