Counting the frequency of unique combinations in a market basket

sklearn-pandas frequency python

zsad512 · Tech Community · 7 years ago

The data is organized as follows:

[in] print(training_df.head(n=5))

[out]                     product_id
transaction_id                      
0000001                   [P06, P09]
0000002         [P01, P05, P06, P09]
0000003                   [P01, P06]
0000004                   [P01, P09]
0000005                   [P06, P09]

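In this example, [P06, P09] has a frequency of 2, and every other combination has a frequency of 1. I created the following binary matrix and computed the frequency of each individual item: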

# Create a matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ['P{:02d}'.format(i+1) for i in range(10)]

mlb = MultiLabelBinarizer(classes = product_ids)
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
training_df1 = training_df.drop(columns='product_id').join(pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                          columns=mlb.classes_,
                          index=training_df.index))

# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = sum(training_df1[column]>0)

How can I compute the frequency of every unique combination of 1 to 4 items present in the data?

2 replies | as of 7 years ago

jacoblaw 7 years ago

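Something like df.groupby('product_id').count() would be the natural approach, but lists are not hashable and so cannot serve as group keys; this is the best I could come up with. We build a dict that uses the string representation of each list as the key and counts how many times it occurs.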

# Count identical baskets, using each list's string representation as a hashable key
counts = {}
for basket in df['product_id']:
    key = repr(basket)
    counts[key] = counts.get(key, 0) + 1
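
A variation on the same counting idea, offered as a sketch and not as part of the original answer: `collections.Counter` can do the bookkeeping, and a sorted tuple can stand in for the string key so that `[P09, P06]` and `[P06, P09]` count as the same basket. The sample frame below is an assumption that simply mirrors the five rows shown in the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data mirroring the five transactions in the question
df = pd.DataFrame(
    {
        "product_id": [
            ["P06", "P09"],
            ["P01", "P05", "P06", "P09"],
            ["P01", "P06"],
            ["P01", "P09"],
            ["P06", "P09"],
        ]
    },
    index=pd.Index(
        ["0000001", "0000002", "0000003", "0000004", "0000005"],
        name="transaction_id",
    ),
)

# Sorted tuples are hashable and make the key independent of item order
basket_counts = Counter(tuple(sorted(basket)) for basket in df["product_id"])

print(basket_counts.most_common(1))  # [(('P06', 'P09'), 2)]
```

Note that this, like the loop above, counts whole baskets only; it does not count smaller combinations contained inside a larger basket.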

dashiell 7 years ago

df['frozensets'] = df.apply(lambda row: frozenset(row.product_id),axis=1)
df['frozensets'].value_counts()

Create a column of frozensets from product_id (hashable, and insensitive to ordering), then count the occurrences of each unique value.
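
The value_counts approach tallies complete baskets. If the goal is instead the support of every 1 to 4 item combination that appears inside any basket, one possible sketch is to enumerate sub-combinations with `itertools.combinations` and reuse frozensets as keys. The function name `count_itemsets`, the `max_size` parameter, and the sample `baskets` list are assumptions made for illustration; in practice `baskets` could be `df['product_id']`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data mirroring the five transactions in the question
baskets = [
    ["P06", "P09"],
    ["P01", "P05", "P06", "P09"],
    ["P01", "P06"],
    ["P01", "P09"],
    ["P06", "P09"],
]


def count_itemsets(baskets, max_size=4):
    """Count how many baskets contain each combination of 1..max_size items."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate so each basket counts once
        for size in range(1, min(max_size, len(items)) + 1):
            counts.update(frozenset(combo) for combo in combinations(items, size))
    return counts


itemset_counts = count_itemsets(baskets)
print(itemset_counts[frozenset({"P06", "P09"})])  # 3
print(itemset_counts[frozenset({"P01"})])         # 3
```

Here {P06, P09} has a count of 3 because it also occurs inside the four-item basket. The number of combinations grows quickly with basket size, so capping `max_size` keeps the enumeration manageable.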
