Hi all,
I have a specific stats question that is a bit deep in the weeds (for me anyway; it also reveals some of my shortcomings in understanding the statistics behind random forests).
I'm attempting to use the out-of-bag error (OOBE) classification approach suggested in Seurat's pbmc33k code on my own dataset. The general idea is to over-cluster your single-cell data and then merge the clusters that are very similar (as assessed by the out-of-bag error statistic). I do have a couple of questions, however:
1) How does one assess how many nodes need to be merged? In the pbmc example, the authors choose 8, based on "High OOBE" scores. From reading a bit, however, it appears those scores are fairly dataset-specific. Do you have any suggestions for a methodology for making this choice? (I have pasted the OOBE results from my dataset below.)
2) Also, it seems possible that one could have over-clustering within two (or more) transcriptionally distinct sets of cells (let's say within dendritic cells and within macrophages). Presumably, this means you'd have to merge twice somehow: once within DCs and once within Macs? Is this possible? My understanding is that the list below is ranked by OOBE, so you could have different nodes with high OOBE that arise from very different cell types that you'd not want to merge together?
Thanks so much, in advance.
All the best,
Josh
node oobe
59 0.2836879433
57 0.2592592593
60 0.2268518519
46 0.1829971182
49 0.1818181818
54 0.1736694678
56 0.1533546326
47 0.1404682274
51 0.1338582677
43 0.1255374033
36 0.1246226822
58 0.1213592233
48 0.1197771588
61 0.1168831169
33 0.1158497772
45 0.1148036254
39 0.1109399076
53 0.1069958848
50 0.0796812749
44 0.0705882353
37 0.0672131148
40 0.0599078341
34 0.0588235294
38 0.0510752688
55 0.0474777448
41 0.0471063257
42 0.0340531561
35 0.0196319018
52 0.0122699387
32 0.0006361323
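
To make question 1 concrete, here is a rough sketch of one heuristic that could be applied to the table above: rank the nodes by OOBE and cut at the largest drop between consecutive scores. This is only an illustration in Python, not something taken from Seurat or the pbmc33k code; `largest_gap_cutoff` is a hypothetical helper name, and whether a "largest gap" rule is statistically sensible here is exactly the open question.

```python
# OOBE per tree node, copied from the table above (node -> oobe).
oobe = {
    59: 0.2836879433, 57: 0.2592592593, 60: 0.2268518519,
    46: 0.1829971182, 49: 0.1818181818, 54: 0.1736694678,
    56: 0.1533546326, 47: 0.1404682274, 51: 0.1338582677,
    43: 0.1255374033, 36: 0.1246226822, 58: 0.1213592233,
    48: 0.1197771588, 61: 0.1168831169, 33: 0.1158497772,
    45: 0.1148036254, 39: 0.1109399076, 53: 0.1069958848,
    50: 0.0796812749, 44: 0.0705882353, 37: 0.0672131148,
    40: 0.0599078341, 34: 0.0588235294, 38: 0.0510752688,
    55: 0.0474777448, 41: 0.0471063257, 42: 0.0340531561,
    35: 0.0196319018, 52: 0.0122699387, 32: 0.0006361323,
}


def largest_gap_cutoff(scores):
    """Hypothetical heuristic: split the ranked scores at the largest drop.

    Returns the nodes ranked above that drop and the size of the drop.
    """
    # Rank nodes from highest to lowest OOBE.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Drop in OOBE between each node and the next one in the ranking.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    # Index of the largest drop; everything up to and including it is "high".
    split = max(range(len(gaps)), key=gaps.__getitem__)
    return [node for node, _ in ranked[: split + 1]], gaps[split]


candidates, gap = largest_gap_cutoff(oobe)
print("merge candidates:", candidates)
print("largest gap:", round(gap, 4))
```

On my numbers this picks out only the top three nodes (59, 57, 60), which is far fewer than the 8 used in the pbmc example, so I am not sure a rule like this is appropriate.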