The data is organized as follows:
[in] print(training_df.head(n=5))
[out]
                          product_id
transaction_id
0000001                   [P06, P09]
0000002         [P01, P05, P06, P09]
0000003                   [P01, P06]
0000004                   [P01, P09]
0000005                   [P06, P09]
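In this example, [P06, P09] has a frequency of 2, and every other combination has a frequency of 1. I created the following binary matrix and calculated the frequency of each individual item: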
# Create a binary matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ['P{:02d}'.format(i + 1) for i in range(10)]
mlb = MultiLabelBinarizer(classes=product_ids)
# One 0/1 column per product, aligned on the transaction index
binary_matrix = pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                             columns=mlb.classes_,
                             index=training_df.index)
training_df1 = training_df.drop(columns='product_id').join(binary_matrix)

# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = int((training_df1[column] > 0).sum())
How can I calculate the frequency of every unique combination of 1 to 4 items that occurs in the data?
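One possible approach, sketched here under some assumptions: enumerate every 1- to 4-item subset of each transaction with itertools.combinations and tally them in a collections.Counter. The helper name count_itemsets and the small stand-in training_df (rebuilt from the five rows shown above) are hypothetical and only there to make the sketch self-contained. Note that this counts how many transactions contain a combination as a subset, so (P06, P09) comes out as 3 rather than 2, because transaction 0000002 also contains both items. If only exact basket matches should count, tallying tuple(sorted(items)) per transaction would be the simpler variant.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for training_df, rebuilt from the five rows shown above.
training_df = pd.DataFrame(
    {"product_id": [
        ["P06", "P09"],
        ["P01", "P05", "P06", "P09"],
        ["P01", "P06"],
        ["P01", "P09"],
        ["P06", "P09"],
    ]},
    index=pd.Index(
        ["0000001", "0000002", "0000003", "0000004", "0000005"],
        name="transaction_id",
    ),
)


def count_itemsets(transactions, max_size=4):
    """Count, for each itemset of 1..max_size items, the transactions containing it."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple key.
        unique_items = sorted(set(items))
        for size in range(1, min(max_size, len(unique_items)) + 1):
            counts.update(combinations(unique_items, size))
    return counts


itemset_support = count_itemsets(training_df["product_id"])

# Show the most frequent itemsets first.
for itemset, count in itemset_support.most_common(5):
    print(itemset, count)
```

The number of subsets grows quickly with basket size, so this brute-force enumeration is only practical when transactions are short. For larger data, a frequent-itemset algorithm such as Apriori with a minimum support threshold may be a better fit.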