
Creating a function to standardize labels for each ID

group-by numpy dataframe pandas python

JLuu · Tech Community · 9 months ago

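I'm trying to write a function that standardizes the label column for each ID according to a given rule.

I want to standardize on the most common label for that ID. If there is no common/majority label, the first observation should simply be used as the default standard.

So far, my function looks like this: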

def standardize_labels(df, id_col, label_col):
    # Function to find the most common label or the first one if there's a tie
    def most_common_label(group):
        labels = group[label_col].value_counts()
        # Check if the top two labels have the same count
        if len(labels) > 1 and labels.iloc[0] == labels.iloc[1]:
            return group[label_col].iloc[0]
        return labels.idxmax()

    # Group by the ID column and apply the most_common_label function
    common_labels = df.groupby(id_col).apply(most_common_label)

    # Map the IDs in the original DataFrame to their common labels
    df['standardized_label'] = df[id_col].map(common_labels)

    return df

It mostly works, but I've noticed that in some cases the standardized label still changes from row to row within a given ID, as shown below:

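ID	raw_label	standardized_label
222	Los Angeles Metro	Los Angeles Metro
222	Los Angeles Metro	Los Angeles Metro
222	Los Angeles Metro	Los Angeles Metro
222	Los Angeles Metro	Los Angeles Metro
222	Los Angeles Metro	Los Angeles Metro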

Instead, I want every standardized_label to be Los Angeles Metro, since that is the majority label for this ID.

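1 answer | as of 9 months ago

cottontail 9 months ago

The code works as expected for me. However, you could use mode to make it easier to read. You can also call transform on the groupby to assign directly to the column, which turns the whole operation into a one-liner.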

df['standardized_label'] = df.groupby('ID')['raw_label'].transform(lambda x: x.mode()[0])
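One caveat about ties may be worth a quick sketch. `Series.mode()` returns all tied values in sorted order, so `mode()[0]` picks the first label in sort order rather than the first observation, which is not quite the rule stated in the question. The sample data and the helper name `mode_or_first` below are made up for illustration; only the `ID` and `raw_label` column names come from the answer.

```python
import pandas as pd

# Hypothetical sample data: ID 222 has a clear majority, ID 333 is a tie
df = pd.DataFrame({
    "ID": [222, 222, 222, 333, 333],
    "raw_label": ["LA Metro", "Los Angeles Metro", "LA Metro", "Zeta", "Alpha"],
})

# The one-liner from the answer: on a tie, mode() sorts the tied labels,
# so [0] is the first label in sort order
df["standardized_label"] = df.groupby("ID")["raw_label"].transform(lambda x: x.mode()[0])


def mode_or_first(labels: pd.Series):
    """Hypothetical helper: the single mode, or the first observation on a tie."""
    modes = labels.mode()
    return modes.iloc[0] if len(modes) == 1 else labels.iloc[0]


# Tie-aware variant that falls back to the first observed label
df["tie_aware_label"] = df.groupby("ID")["raw_label"].transform(mode_or_first)

print(df)
```

For ID 333 the two columns differ: the one-liner yields Alpha, while the tie-aware variant keeps Zeta, the first row seen for that ID. If ties never occur in your data, the one-liner is enough.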

Alternatively, you can use groupby.apply and map. Either way, the function would look like this:

def standardize_labels(df, id_col, label_col):
    # Function to find the most common label; on a tie, mode() sorts the tied
    # labels, so [0] is the first one in sort order (not the first observation)
    def most_common_label(group):
        return group.mode()[0]

    # Group by the ID column and apply the most_common_label function
    common_labels = df.groupby(id_col)[label_col].apply(most_common_label)

    # Map the IDs in the original DataFrame to their common labels
    df['standardized_label'] = df[id_col].map(common_labels)

    return df

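Since value_counts() also works on DataFrames, we can use it directly without groupby, so the function can be rewritten as follows. This is a refactoring of a function I wrote for a different question.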

def standardize_labels(df, id_col, label_col):
    # Count each (ID, label) pair; value_counts sorts by count in descending order
    labels_counts = df.value_counts([id_col, label_col])
    # Keep only the first (most frequent) row for each ID
    dup_idx_msk = ~labels_counts.droplevel(label_col).index.duplicated()
    common_labels = labels_counts[dup_idx_msk]
    # Move the label out of the index, leaving a Series of labels indexed by ID
    common_labels = common_labels.reset_index(level=1)[label_col]
    # Map the IDs in the original DataFrame to their common labels
    df['standardized_label'] = df[id_col].map(common_labels)
    return df

df = standardize_labels(df, 'ID', 'raw_label')
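To try the value_counts version end to end, here is a small self-contained sketch. The function body is the one above; the sample DataFrame is made up, and each ID has a clear majority so tie ordering does not come into play. The final check compares the result against the groupby/mode one-liner.

```python
import pandas as pd


def standardize_labels(df, id_col, label_col):
    # Count each (ID, label) pair, most frequent pairs first
    labels_counts = df.value_counts([id_col, label_col])
    # Keep only the first (most frequent) row for each ID
    dup_idx_msk = ~labels_counts.droplevel(label_col).index.duplicated()
    common_labels = labels_counts[dup_idx_msk]
    # Series of labels indexed by ID
    common_labels = common_labels.reset_index(level=1)[label_col]
    df['standardized_label'] = df[id_col].map(common_labels)
    return df


# Hypothetical sample data with an unambiguous majority label per ID
sample = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "raw_label": ["A", "A", "B", "C", "D", "D"],
})

out = standardize_labels(sample, "ID", "raw_label")

# Cross-check against the groupby/mode approach
expected = sample.groupby("ID")["raw_label"].transform(lambda x: x.mode()[0])
print(out)
print((out["standardized_label"] == expected).all())
```

Note that the function adds the column to the DataFrame passed in and returns that same object, so `sample` and `out` refer to the same data.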