I have a big pandas DataFrame (7 GiB) that I read from a CSV. I need to merge this DataFrame with another, much smaller one; let's say its size is negligible. On my machine (16 GiB of RAM), the merge fails because it runs out of memory.
I also tried running the merge on a Mac, which also has 16 GiB. By default, the system consumes about 3 GiB of RAM. The merge completes on the Mac, and memory usage never goes above 10 GiB.
How is this possible? The pandas version is the same, and the DataFrames are the same. What is going on here?
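For reference, a DataFrame's real footprint can be checked with memory_usage; when string columns are stored as object dtype, deep=True is needed to count the Python string objects themselves, and the result is often far larger than the source CSV. A minimal, self-contained sketch (toy data, not the frames from this question):

import pandas as pd

# Toy frame: one string (object-dtype) column, one float64 column.
df = pd.DataFrame({
    "id_station": ["FR0037A"] * 1_000_000,  # hypothetical station id
    "concentration": [1.0] * 1_000_000,
})

# Without deep=True, object columns report only the 8-byte pointers;
# deep=True also counts the Python string objects, giving the real cost.
print(df.memory_usage(deep=False))
print(df.memory_usage(deep=True))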
Edit:
import os
import pandas as pd

# POLLUTANTS (list of pollutant names) and col_to_rename (dict mapping raw
# column names to English ones) are defined elsewhere in the script.

# Read the data for the stations, stored in a separate file
stations = pd.read_csv("stations_with_id.csv", index_col=0)
# Note: set_index returns a new DataFrame; the result is discarded here,
# so stations keeps "id_station" as a regular column.
stations.set_index("id_station")

list_data = list()
data = pd.DataFrame()

# Merge all pollutants data in one dataframe
# Probably not the most optimized approach ever (see the sketch after
# this block)...
for pollutant in POLLUTANTS:
    path_merged_data_per_pollutant = os.path.join("raw_data", f"{pollutant}_merged")
    print(f"Pollutant: {pollutant}")
    for f in os.listdir(path_merged_data_per_pollutant):
        if ".csv" not in f:
            print(f"passing {f}")
            continue
        print(f"loading {f}")
        df = pd.read_csv(
            os.path.join(path_merged_data_per_pollutant, f),
            sep=";",
            na_values="mq",
            dtype={"concentration": "float64"},
        )
        # Drop useless columns and translate useful ones to English.
        # Do that here to limit memory usage.
        df = df.rename(index=str, columns=col_to_rename)
        df = df[list(col_to_rename.values())]
        # Date formatted as YYYY-MM
        df["date"] = df["date"].str[:7]
        # Another no-op: the result is unused, and the merge below joins
        # on the "id_station" column anyway.
        df.set_index("id_station")
        df = pd.merge(df, stations, left_on="id_station", right_on="id_station")
        # Filter entries to France only (metropolitan area) based on GPS coordinates
        df = df[(df.longitude > -5) & (df.longitude < 12)]
        list_data.append(df)
    print("\n")

data = pd.concat(list_data)
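A sketch of the lower-memory variant hinted at in the comment above (assumptions: the raw CSV column names are exactly the keys of col_to_rename, and id_station/date values repeat heavily). usecols keeps unused columns from ever being materialized, and category dtype stores each distinct string once instead of one Python object per row:

# Hypothetical replacement for the read/clean steps inside the loop above.
df = pd.read_csv(
    os.path.join(path_merged_data_per_pollutant, f),
    sep=";",
    na_values="mq",
    usecols=list(col_to_rename.keys()),  # assumes these are the raw names
    dtype={"concentration": "float64"},
)
df = df.rename(columns=col_to_rename)
# Repeated strings collapse to one stored value per distinct entry.
df["id_station"] = df["id_station"].astype("category")
df["date"] = df["date"].str[:7].astype("category")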
The only column that is not a string is concentration, and its type is specified when reading the CSV. The stations DataFrame is < 1 MiB.
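A quick way to double-check both of these claims, using the data and stations frames built above (sketch):

# Every column except "concentration" should show dtype object (i.e. strings).
print(data.dtypes)
# Total size of the stations frame, Python strings included, in MiB.
print(stations.memory_usage(deep=True).sum() / 2**20, "MiB")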