
R: compute the minimum Euclidean distance between two datasets and label the result automatically

  •  2
  • s__  · Tech Community  · 6 years ago

    I'm working with the Euclidean distance between a pair of datasets. First, my data.

    centers <- data.frame(x_ce = c(300,180,450,500),
                          y_ce = c(23,15,10,20),
                          center = c('a','b','c','d'))
    
    points <- data.frame(point = c('p1','p2','p3','p4'),
                         x_p = c(160,600,400,245),
                         y_p = c(7,23,56,12))
    

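    My goal is to find, for each point in points, the center in centers at the minimum distance, and to append that center's name to the points dataset (the nearest one, obviously), with the whole process automated.

    So I started from the basics: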

    #Euclidean distance
    sqrt(sum((x-y)^2))
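
As a small illustration of the formula above, it can be wrapped in a helper function. The name `euclid` is hypothetical and does not appear in the original post; this is only a sketch.

```r
# Hypothetical helper (the name is an assumption, not from the post):
# Euclidean distance between two numeric vectors of equal length
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Point p1 and center b from the data above
euclid(c(160, 7), c(180, 15))  # sqrt(20^2 + 8^2) = sqrt(464), about 21.54
```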
    

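    In fact, I already have in my head how it should work, but I can't work out how to automate it.

    1. Select one row of points and all of the centers
    2. Calculate the Euclidean distance between that row and each row of centers
    3. Choose the smallest distance
    4. Attach the label of the center with the smallest distance
    5. Repeat for the second row ... until the end of points

    So I managed to do it by hand, going through all the steps that I want to automate: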
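
The steps above can be sketched as a plain base R loop. This is a minimal sketch under the assumption that the new column names `center` and `distance` are acceptable; it is not code from the original post.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))

points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Result columns (names are assumptions), filled in row by row
points$center <- NA_character_
points$distance <- NA_real_

for (i in seq_len(nrow(points))) {
  # steps 1-2: distance from row i of points to every center at once
  d <- sqrt((points$x_p[i] - centers$x_ce)^2 + (points$y_p[i] - centers$y_ce)^2)
  # step 3: index of the smallest distance
  j <- which.min(d)
  # step 4: attach the label and the distance
  points$center[i] <- as.character(centers$center[j])
  points$distance[i] <- d[j]
}                                  # step 5: the loop repeats for every row

points
```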

    # 1.  
    x = (points[1,2:3])   # select the first of points
    y1 = (centers[1,1:2]) # select the first center
    y2 = (centers[2,1:2]) # select the second center
    y3 = (centers[3,1:2]) # select the third center
    y4 = (centers[4,1:2]) # select the fourth center
    
    # 2.
    # then the distances
    distances <- data.frame(distance = c(
                                        sqrt(sum((x-y1)^2)),
                                        sqrt(sum((x-y2)^2)),
                                        sqrt(sum((x-y3)^2)),
                                        sqrt(sum((x-y4)^2))),
                                        center = centers$center
                                        )
    
    # 3.
    # then I choose the row with the smallest distance
    d <- distances[which(distances$distance==min(distances$distance)),]
    
    # 4.
    # last, I put the label near the point
    cbind(points[1,],d)
    
    # 5. 
    # then I restart for the second point
    

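    The problem is that I can't manage to automate it. Do you have any idea how to make this procedure run automatically for every point? Also, am I reinventing the wheel, i.e. is there a faster way (for example an existing function) that I don't know about?

    2 replies  |  up to 6 years ago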
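
On the question of a faster route, one possible base R sketch builds the whole point-by-center distance matrix with `outer` and then picks the nearest column per row. The result column names `center` and `distance` are assumptions for illustration.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))

points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Pairwise coordinate differences: rows = points, columns = centers
dx <- outer(points$x_p, centers$x_ce, "-")
dy <- outer(points$y_p, centers$y_ce, "-")
d  <- sqrt(dx^2 + dy^2)

# Column index of the nearest center for each point (first one on ties)
nearest <- apply(d, 1, which.min)

points$center   <- as.character(centers$center[nearest])
points$distance <- d[cbind(seq_len(nrow(d)), nearest)]
points
```

The matrix has one entry per point-center pair, so memory use grows with the product of the two row counts.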
        1
  •  2
  •   AntoniosK    6 years ago
    centers <- data.frame(x_ce = c(300,180,450,500),
                          y_ce = c(23,15,10,20),
                          center = c('a','b','c','d'))
    
    points <- data.frame(point = c('p1','p2','p3','p4'),
                         x_p = c(160,600,400,245),
                         y_p = c(7,23,56,12))
    
    library(tidyverse)
    
    points %>%
      mutate(c = list(centers)) %>%
      unnest(c) %>%                      # create all possible combinations of points and centers as a data frame
      rowwise() %>%                      # for each combination
      mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
      ungroup() %>%                                            # forget the grouping
      group_by(point, x_p, y_p) %>%                            # for each point
      summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
      ungroup()                                                # forget the grouping
    
    # # A tibble: 4 x 4
    #   point   x_p   y_p closest_center
    #   <fct> <dbl> <dbl> <fct>         
    # 1 p1      160     7 b             
    # 2 p2      600    23 d             
    # 3 p3      400    56 c             
    # 4 p4      245    12 a
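
The same "all combinations, then keep the minimum per point" idea can also be sketched without tidyverse. This variant is an assumption-laden alternative, not the answer's own method: it uses `merge(..., by = NULL)`, which returns the Cartesian product of the two data frames, and `ave` to get the per-point minimum.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c('a', 'b', 'c', 'd'))

points <- data.frame(point = c('p1', 'p2', 'p3', 'p4'),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Every point paired with every center (16 rows)
pairs <- merge(points, centers, by = NULL)

# Vectorised distance for each pair
pairs$d <- sqrt((pairs$x_p - pairs$x_ce)^2 + (pairs$y_p - pairs$y_ce)^2)

# Keep the rows whose distance equals the minimum within their point
closest <- pairs[pairs$d == ave(pairs$d, pairs$point, FUN = min), ]
closest <- closest[order(closest$point), c("point", "x_p", "y_p", "center", "d")]
closest
```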
    
        2
  •  1
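  •   WaltS    6 years ago

    With the dplyr package, you can use group_by to loop over each point, then use mutate to form the list of distances, set distance to the minimum of that list, and set center to the name of the center at the minimum distance. I have included two alternatives, one for duplicate rows and one for non-unique point names.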

    library(dplyr)
    centers <- data.frame(x_ce = c(300,180,450,500),
                          y_ce = c(23,15,10,20),
                          center = c('a','b','c','d'))
    points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                         x_p = c(160,600,400,245, 245),
                         y_p = c(7,23,56,12, 12))
    #
    #  If duplicate rows need to be removed
    #
    result1 <- points %>%
      group_by(point) %>%
      distinct() %>%
      mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
             distance = min(unlist(lst)),
             center = centers$center[which.min(unlist(lst))]) %>%
      select(-lst)
    

    which gives:

    # A tibble: 4 x 5
    # Groups:   point [4]
      point   x_p   y_p distance center
      <fct> <dbl> <dbl>    <dbl> <fct> 
    1 p1      160     7     21.5 b     
    2 p2      600    23    100.  d     
    3 p3      400    56     67.9 c     
    4 p4      245    12     56.1 a 
    

    #
    # Alternative if point names are not unique
    #
    points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                         x_p = c(160,600,400,245, 550),
                         y_p = c(7,23,56,12, 25))
    result2 <- points %>%
      rowwise() %>%
      mutate(lst = with(centers, list(sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2))),
             distance = min(unlist(lst)),
             center = centers$center[which.min(unlist(lst))]) %>%
      ungroup() %>%
      select(-lst)
    

    which gives:

    # A tibble: 5 x 5
      point   x_p   y_p distance center
      <fct> <dbl> <dbl>    <dbl> <fct> 
    1 p1      160     7     21.5 b     
    2 p2      600    23    100.  d     
    3 p3      400    56     67.9 c     
    4 p4      245    12     56.1 a     
    5 p4      550    25     50.2 d
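
One detail worth keeping in mind with `which.min`, used in the code above: when two centers are exactly equidistant, it returns only the first one, whereas a comparison against `min()` keeps every tie. A tiny sketch with made-up distances (the vector `d` is purely illustrative):

```r
# Illustrative distances with a tie between the first two entries
d <- c(5, 5, 7)

which.min(d)        # index of the first minimum only: 1
which(d == min(d))  # indices of all tied minima: 1 2
```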