
PAM clustering: using the results on another data set

pam data-mining cluster-analysis machine-learning r

EdM · Tech community · 8 years ago

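I have successfully run a clustering with the pam function (cluster package in R). Now I would like to use the results to assign new observations to the previously defined clusters/medoids.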

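Another way to put the problem: given the k medoids found by the pam function, which one is closest to an additional observation that was not in the initial data set?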

x<-matrix(c(1,1.2,0.9,2.3,2,1.8,
            3.2,4,3.1,3.9,3,4.4),6,2)
x
     [,1] [,2]
[1,]  1.0  3.2
[2,]  1.2  4.0
[3,]  0.9  3.1
[4,]  2.3  3.9
[5,]  2.0  3.0
[6,]  1.8  4.4
pam(x,2)

Observations 1, 3 and 5 are clustered together, as are observations 2, 4 and 6; observations 1 and 6 are the medoids:

Medoids:
     ID        
[1,]  1 1.0 3.2
[2,]  6 1.8 4.4
Clustering vector:
[1] 1 2 1 2 1 2

Now, to which cluster/medoid should y be assigned?

y<-c(1.5,4.5)

Oh, and if you have several solutions: computation time matters, because the data set I have is large.

1 answer | last activity 6 years ago

Sandipan Dey · 8 years ago

In general, try this for k clusters:

library(cluster) # provides pam()

k <- 2 # pam with k clusters
res <- pam(x,k)

y <- c(1.5,4.5) # new point

# get the index of the medoid (cluster) closest to the new point
# which.min breaks ties by returning the first of several equally close medoids

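# non-vectorized function (takes the number of medoids from res instead of the global k)
get.cluster1 <- function(res, y) which.min(sapply(seq_len(nrow(res$medoids)), function(i) sum((res$medoids[i,]-y)^2)))

# vectorized function, faster (y is recycled down each column of t(res$medoids))
get.cluster2 <- function(res, y) which.min(colSums((t(res$medoids)-y)^2))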

get.cluster1(res, y)
#[1] 2
get.cluster2(res, y)
#[1] 2

# comparing the two implementations (the vectorized function takes noticeably less time)
library(microbenchmark)
microbenchmark(get.cluster1(res, y), get.cluster2(res, y))

#Unit: microseconds
#                 expr    min     lq     mean median     uq     max neval cld
# get.cluster1(res, y) 31.219 32.075 34.89718 32.930 33.358 135.995   100   b
# get.cluster2(res, y) 17.107 17.962 19.12527 18.817 19.245  41.483   100  a
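
As a quick sanity check of the nearest-medoid rule, the sketch below applies get.cluster2 to the original observations and compares the result with the clustering vector reported by pam. It assumes pam was run with its default Euclidean metric, as in the question; the variable name pred is only illustrative.

```r
library(cluster)  # provides pam()

# data from the question
x <- matrix(c(1, 1.2, 0.9, 2.3, 2, 1.8,
              3.2, 4, 3.1, 3.9, 3, 4.4), 6, 2)
res <- pam(x, 2)

# vectorized nearest-medoid lookup from the answer
get.cluster2 <- function(res, y) which.min(colSums((t(res$medoids) - y)^2))

# assign each training observation to its nearest medoid
pred <- apply(x, 1, function(row) get.cluster2(res, row))

# TRUE if the rule reproduces pam's own assignment on the training data
all(pred == res$clustering)
```

If the comparison is TRUE, the same function can reasonably be used for observations that were not part of the fit.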

Extension to an arbitrary distance function:

# distance function
euclidean.func <- function(x, y) sqrt(sum((x-y)^2))
manhattan.func <- function(x, y) sum(abs(x-y))

get.cluster3 <- function(res, y, dist.func=euclidean.func) which.min(sapply(seq_len(nrow(res$medoids)), function(i) dist.func(res$medoids[i,], y)))
get.cluster3(res, y) # use Euclidean as default
#[1] 2
get.cluster3(res, y, manhattan.func) # use Manhattan distance
#[1] 2
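
Since the question mentions a large data set, it may help to assign many new observations in one call rather than one point at a time. The following is a minimal sketch under that assumption: assign.clusters is a hypothetical helper (not part of the cluster package), it uses squared Euclidean distance, and the second row of new.obs is a made-up point added for illustration.

```r
library(cluster)  # provides pam()

# data from the question
x <- matrix(c(1, 1.2, 0.9, 2.3, 2, 1.8,
              3.2, 4, 3.1, 3.9, 3, 4.4), 6, 2)
res <- pam(x, 2)

# hypothetical helper: assign every row of `newdata` to its nearest medoid
assign.clusters <- function(res, newdata) {
  med <- res$medoids
  # accept a single observation given as a plain vector
  if (is.null(dim(newdata))) newdata <- matrix(newdata, nrow = 1)
  n <- nrow(newdata)
  # one column of squared distances per medoid; the loop runs over the k medoids only
  d2 <- vapply(seq_len(nrow(med)),
               function(i) rowSums(sweep(newdata, 2, med[i, ])^2),
               numeric(n))
  d2 <- matrix(d2, nrow = n)  # keep an n x k matrix even when n == 1
  # index of the smallest distance in each row; ties go to the first medoid
  max.col(-d2, ties.method = "first")
}

new.obs <- rbind(c(1.5, 4.5),   # the point y from the question
                 c(1.1, 3.0))   # extra made-up point for illustration
assign.clusters(res, new.obs)
```

The loop runs over the k medoids rather than over the new observations, so the per-row work stays inside vectorized matrix operations. If pam was fitted with metric = "manhattan", the squared difference would need to be replaced by an absolute difference to stay consistent with the fit.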
