
Output a pandas Series to a txt file

file-io dataframe numpy pandas python

mrsquid · 7 years ago

I have a pandas Series object

<class 'pandas.core.series.Series'>

which looks like this:

userId
1          3072 1196 838 2278 1259
2               648 475 1 151 1035
3               457 150 300 21 339
4          1035 7153 953 4993 2571
5           260 671 1210 2628 7153
6          4993 1210 2291 589 1196
7               150 457 111 246 25
8       1221 8132 30749 44191 1721
9           296 377 2858 3578 3256
10          2762 377 2858 1617 858
11           527 593 2396 318 1258
12        3578 2683 2762 2571 2580
13        7153 150 5952 35836 2028
14        1197 2580 2712 2762 1968
15        1245 1090 1080 2529 1261
16         296 2324 4993 7153 1203
17       1208 1234 6796 55820 1060
18            1377 1 1073 1356 592
19           778 1173 272 3022 909
20              329 534 377 73 272
21            608 904 903 1204 111
22       1221 1136 1258 4973 48516
23        1214 1200 1148 2761 2791
24             593 318 162 480 733
25               314 969 25 85 766
26        293 253 4878 46578 64614
27          1193 2716 24 2959 2841
28         318 260 58559 8961 4226
29            318 260 1196 2959 50
30        1077 1136 1230 1203 3481

642            123 593 750 1212 50
643         750 671 1663 2427 5618
644            780 3114 1584 11 62
645         912 2858 1617 1035 903
646           608 527 21 2710 1704
647         1196 720 5060 2599 594
648         46578 50 745 1223 5995
649            318 300 110 529 246
650            733 110 151 318 364
651         1240 1210 541 589 1247
652      4993 296 95510 122900 736
653            858 1225 1961 25 36
654        333 1221 3039 1610 4011
655           318 47 6377 527 2028
656          527 1193 1073 1265 73
657             527 349 454 357 97
658            457 590 480 589 329
659              474 508 1 288 477
660         904 1197 1247 858 1221
661           780 1527 3 1376 5481
662             110 590 50 593 733
663          2028 919 527 2791 110
664    1201 64839 1228 122886 1203
665        1197 858 7153 1221 6539
666            318 300 161 500 337
667            527 260 318 593 223
668            161 527 151 110 300
669          50 2858 4993 318 2628
670          296 5952 508 272 1196
671         1210 1200 7153 593 110

What is the best way to output this to a txt file (e.g. output.txt) so that the format looks like the following?

User-id1 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
User-id2 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5

The leftmost value is the userId and the other values are movieIds.

Here is the code that generated the above:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                                index='movieId',
                                                values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)

res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))



#Do not understand why this does not work for me but works below
df = pd.DataFrame.from_items(zip(res.index, res.str.split(' ')))
#print(df)
df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
df.to_csv('filepath.txt', sep=' ', index=False)

I tried implementing @emmet02's solution but got this error, and I don't understand why:

ValueError: Length mismatch: Expected axis has 671 elements, new values have 5 elements

If you have any suggestions, or if you need more information or clarification, please let me know.

5 answers | up to 7 years ago

emmet02 6 years ago

I would suggest converting your pd.Series into a pd.DataFrame first.

df = pd.DataFrame.from_items(zip(series.index, series.str.split(' '))).T

As long as every entry in the Series has the same number of values, separated by spaces, this will return a DataFrame in this format

Out[49]: 
      0     1    2     3     4
0  3072   648  457  1035   260
1  1196   475  150  7153   671
2   838     1  300   953  1210
3  2278   151   21  4993  2628
4  1259  1035  339  2571  7153
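
For readers on current pandas: `DataFrame.from_items` was deprecated in pandas 0.23 and removed in 1.0, so the `from_items` call above no longer runs there. A minimal sketch of one alternative, using `Series.str.split` with `expand=True`; the three-row `series` below is a hypothetical stand-in for the question's data, not the original 671 users.

```python
import pandas as pd

# Hypothetical stand-in for the question's Series:
# space-separated movie ids, indexed by userId.
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=pd.Index([1, 2, 3], name="userId"),
)

# expand=True spreads the split pieces across columns,
# giving one row per user and one column per movie position.
wide = series.str.split(" ", expand=True)
wide.columns = ["movie-id{}".format(i + 1) for i in range(wide.shape[1])]
```

This already has users as rows, so no transpose is needed. The cell values are still strings at this point.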

Next, I would name the columns appropriately

df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
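
The column assignment above only works when the frame has exactly five columns. That seems to be what the `ValueError` in the question is about: without the trailing `.T`, the frame has one column per user (671 of them), so five names do not fit. A small sketch of the effect, assuming a three-user stand-in Series and building the frame from a plain dict instead of `from_items`:

```python
import pandas as pd

# Hypothetical stand-in for the question's Series (3 users instead of 671).
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=[1, 2, 3],
)
names = ["movie-id1", "movie-id2", "movie-id3", "movie-id4", "movie-id5"]

# One column per user and one row per movie position (shape 5 x 3).
per_user_columns = pd.DataFrame(dict(zip(series.index, series.str.split(" "))))

try:
    per_user_columns.columns = names  # five names for three columns
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)  # "Length mismatch: ..."

# Transposing gives one row per user and five columns, so the names fit.
per_user_rows = per_user_columns.T
per_user_rows.columns = names
```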

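Finally, the DataFrame is indexed by customer id (I am assuming this based on your Series index). We want to move that into the DataFrame as a column and then reorder the columns.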

df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]

This leaves you with a DataFrame like this

  customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0            0      3072       648       457      1035       260
1            1      1196       475       150      7153       671
2            2       838         1       300       953      1210
3            3      2278       151        21      4993      2628
4            4      1259      1035       339      2571      7153

I would suggest you use

df.to_csv('filepath.csv', index=False)

However, if you want to write it as a text file with only spaces as separators, you can use the same function but pass the separator.

df.to_csv('filepath.txt', sep=' ', index=False)
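
The format requested in the question has no header row. If that matters, `to_csv` also accepts `header=False`. A rough sketch with a hypothetical two-row frame, written to an in-memory buffer so the result is easy to inspect; a file path can be passed instead of the buffer.

```python
import io

import pandas as pd

# Hypothetical two-user frame in the layout built above.
df = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "movie-id1": [3072, 648],
        "movie-id2": [1196, 475],
        "movie-id3": [838, 1],
        "movie-id4": [2278, 151],
        "movie-id5": [1259, 1035],
    }
)

buffer = io.StringIO()
# sep=' ' separates fields with single spaces; header=False drops the column names.
df.to_csv(buffer, sep=" ", index=False, header=False)
text = buffer.getvalue()
```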

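I don't think a Series object is the right choice of data structure for the problem you want to solve. Treating numeric data as numeric data (and in a DataFrame) is much easier than maintaining "space-delimited string" conversions, IMO.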
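
To illustrate the point about numeric data, here is a small sketch that converts the split strings to integers so that ordinary comparisons work. The `series` is again a hypothetical three-user stand-in, and the "which users were recommended movie 1" query is only an example.

```python
import pandas as pd

# Hypothetical stand-in for the question's Series.
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=pd.Index([1, 2, 3], name="userId"),
)

# Split into columns, then convert every cell from str to int.
movies = series.str.split(" ", expand=True).astype(int)

# With integers, comparisons are direct: which users have movie 1 in their top five?
has_movie_1 = (movies == 1).any(axis=1)
```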

holypriest 7 years ago

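You can use the following approach to convert the Series object (which I call s) into a DataFrame object (which I call df), building one list per row:

df = pd.DataFrame([[user_id] + movies.split(' ') for user_id, movies in s.items()])

The [user_id] + movies.split(' ') part prepends the index value to the list of movie IDs, and this is done for every row in the Series.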

After that, you can dump the DataFrame to a .txt file using a space as the separator:

df.to_csv('output.txt', sep=' ', index=False)

As mentioned before, you can also name the columns before dumping them.

matanster 6 years ago

It is also worth skipping the csv-writing hacks when the Series is text, to avoid escaping/quoting hell. Along these lines:

with open(filename, 'w') as f:
    for entry in df['target_column']:
        f.write(entry + '\n')  # write() adds no newline itself, so add one per entry

Of course, you can add the Series index yourself inside the loop if you need it.
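
A minimal sketch of that loop with the index included, assuming a two-row stand-in Series named `s`. The temporary directory and the `out_path` name are only there to keep the example self-contained; any writable path would do.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the question's Series.
s = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

# Written to a temporary directory here; replace with the real target path.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(out_path, "w") as f:
    for user_id, movies in s.items():
        # Index value first, then the string as-is, then a newline.
        f.write("{} {}\n".format(user_id, movies))
```

Since the strings are written untouched, nothing gets quoted or escaped along the way.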

sgDysregulation 7 years ago

I suggest modifying the code as follows

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                                index='movieId',
                                                values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)

res = df.apply(lambda col: (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])
).nlargest(5).index.tolist()
).apply(pd.Series).rename(
    columns=lambda col_name: 'movie-id{}'.format(col_name + 1)).reset_index(
).rename(columns={'userId': 'customer_id'})
# convert to csv
res.to_csv('filepath.txt', sep=' ', index=False)

res.head()

In [2]: res.head()
Out[2]: 
   customer_id  movie-id1  movie-id2  movie-id3  movie-id4  movie-id5
0            1       3072       1196        838       2278       1259
1            2        648        475          1        151       1035
2            3        457        150        300         21        339
3            4       1035       7153        953       4993       2571
4            5        260        671       1210       2628       7153

Showing the file

In [3]: ! head -5 filepath.txt
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571

Grijesh Chauhan Anand Krishnan 3 years ago

(Old question, but adding an answer in case it helps.)

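From the title of the question, it seems the user wants to dump the console output to a file. Use the to_string() method to dump a DataFrame (or Series) to a text file in the same format we see on the console. For example, I copied the OP's sample and read it in with pd.read_clipboard():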

>>> df = pd.read_clipboard(index_col=0, names=['movie-id1', 
                                               'movie-id2', 
                                               'movie-id3', 
                                               'movie-id4', 
                                               'movie-id5'])
>>> df.index.name = 'userId'
>>> with open("/home/grijesh/Downloads/example.txt", 'w') as of:
         df.to_string(buf=of)

You can also learn more about the formatting code from io/formats/format.py

PS: this works well for fairly large datasets when you want to inspect them in text mode.
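
The same idea applies directly to the Series from the question, without going through the clipboard. A small sketch, assuming a two-row stand-in Series and an in-memory buffer in place of a real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the question's Series.
s = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

buffer = io.StringIO()
# Series.to_string writes the console-style rendering into the buffer.
s.to_string(buf=buffer)
text = buffer.getvalue()
```

Note that this rendering pads values for alignment, as in the console output shown in the question, so it is not the single-space-separated layout the question asks for.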