代码之家 › 专栏 › 技术社区 › s__

r,计算两个数据集的最小欧几里得距离,并自动标记。

euclidean-distance r

s__ · 技术社区 · 6 年前

我正在和 Euclidean Distance 一对数据集。首先,我的数据。

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

我的目标 是为了找到 points 离中心的最小距离 centers ,并将中心名称附加到 点 数据集(显然是最小的),使这个过程自动化。

所以我从基地开始:

#Euclidean distance
sqrt(sum((x-y)^2))

事实上,我已经在我的头脑中,它应该如何工作,但我无法管理如何使它自动。

选择一行 点 以及 中心
计算行与每行之间的欧几里得距离 中心
选择最小距离
贴上最小距离的标签
对第二行重复…直到年底 点

因此,我设法手动完成了这项工作,并采取了所有步骤使其自动化:

# 1.  
x = (points[1,2:3])   # select the first of points
y1 = (centers[1,1:2]) # select the first center
y2 = (centers[2,1:2]) # select the second center
y3 = (centers[3,1:2]) # select the third center
y4 = (centers[4,1:2]) # select the fourth center

# 2.
# then the distances
distances <- data.frame(distance = c(
                                    sqrt(sum((x-y1)^2)),
                                    sqrt(sum((x-y2)^2)),
                                    sqrt(sum((x-y3)^2)),
                                    sqrt(sum((x-y4)^2))),
                                    center = centers$center
                                    )

# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]

# 4.
# last, I put the label near the point
cbind(points[1,],d)

# 5. 
# then I restart for the second point

问题是我不能自动管理它。你有什么办法使这一程序对每一点都自动进行吗? 点 ? 此外,我是否在重新设计车轮,也就是说,它是否存在一个我不知道的更快的过程(作为一个函数)?

2 回复 | 直到 6 年前

AntoniosK 6 年前

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest() %>%                       # create all posible combinations of points and centers as a dataframe
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a

WaltS 6 年前

与 dplyr 包装,您可以使用 group_by 在每个点上循环 mutate 要形成距离列表,请设置 distance 作为列表的最小值,并设置 center 作为最小距离中心的名称。对于重复行或点名称的情况,我已经包含了两个备选方案。

    library(dplyr)
   centers <- data.frame(x_ce = c(300,180,450,500),
                        y_ce = c(23,15,10,20),
                        center = c('a','b','c','d'))
   points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 245),
                       y_p = c(7,23,56,12, 12))
#
#  If duplicate rows need to be removed
#
  result1 <- points %>% group_by(point) %>%  distinct() %>% 
                                  mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                                  distance=min(unlist(lst)), 
                                  center = centers$center[which.min(unlist(lst))]) %>%
             select(-lst)

结果是什么?

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a

和

#
# Alternative if point names are not unique
#
  points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                       x_p = c(160,600,400,245, 550),
                       y_p = c(7,23,56,12, 25))
  result2 <- points %>% rowwise() %>%
                    mutate( lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ), 
                               distance=min(unlist(lst)), 
                              center = centers$center[which.min(unlist(lst))]) %>%
                    ungroup() %>% select(-lst)

结果是

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d