代码之家 › 专栏 › 技术社区 › Java

如何使用第一列中的值从csv文件中提取某些数据块

regex pandas python

-1

Java · 技术社区 · 1 年前

我试图只提取从第三行开始的两个部分(“入学”和“重新入学”)中的数据;a的第2列 关键词 在第1列中。

请参阅底部csv文件中的数据集示例。

Admissions
Not Started:12  Sent:3            Completed:3

Division        Community         ResidentName   Date        DocumentStatus  Last Update
                Test Station      Jane Doe       9/12/2023   Sent            9/12/2023
                Test Station 2    John Doe       9/12/2023   Not Started       
                Alibaba Fizgerald Super Man      9/12/2023   Not Started  
                Iceland Kingdom   Super Woman    9/12/2023   Not Started

,,,,,

Readmissions  
Not Started:1   Sent:0            Completed:1 

Division        Community         Resident Name   Date         DocumentStatus  Last Update
                Station Kingdom   Pretty Woman    9/12/2023    Not Started  
                My Goodness       Ugly Man        7/21/2023    Completed       7/26/2023

,,,,

Discharge 

Division        Community         Resident Name   Date            
                Station Kingdom1  Pretty Woman2   8/22/2023      
                My Goodness1      Ugly Man1       4/8/2023     
                Landmark2         Nice Guys       9/12/2023     
                Iceland Kingdom2  Mr. Heroshi2    7/14/2023    
                More Kingdom 2    King Kong       8/31/2023

因此,逻辑是:

查找包含数据的行,其中column1= '招生' 或 '重新任务' 。

往下走3行,向右走1列。获取5列的所有数据,直到它碰到一个没有数据的行。

因此,我想获得蓝框部分作为每个的输出:

这是一幅插图:

我从 previous post ,但我有几个星期没能解决这个问题/找到解决方案,所以我发布了这个新消息。

import re 
import pandas as pd
from io import StringIO

def read_block(names, igidx=True):
    with open("Test1.csv") as f:
        pat = r"(\w+),+$\n+(.+?)(?=\n\w+,+\n$|\Z)"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
        ], keys=names, ignore_index=igidx)

df = read_block(names=["Admissions"])    
print(df)

底部是电流输出,但它连接了所有三个部分。

我希望使用关键字(“ 招生 “,” 重新分配 “)作为代码中的变量。

          Sent: 3    Completed: 3 Unnamed: 3       Unnamed: 4   Unnamed: 5
0           Community   Resident Name       Date  Document Status  Last Update
1        Test Station       Jane Doe   9/12/2023             Sent    9/12/2023
2      Test Station 2        John Doe  9/12/2023      Not Started          NaN
3   Alibaba Fizgerald       Super Man  9/12/2023      Not Started          NaN
4     Iceland Kingdom     Super Woman  9/12/2023      Not Started          NaN
5             Sent: 0    Completed: 1        NaN              NaN          NaN
6           Community   Resident Name       Date  Document Status  Last Update
7     Station Kingdom   Pretty Woman   9/12/2023      Not Started          NaN
8         My Goodness        Ugly Man  7/21/2023        Completed    7/26/2023
9           Community   Resident Name       Date              NaN          NaN
10  Station Kingdom1   Pretty Woman2   8/22/2023              NaN          NaN
11      My Goodness1        Ugly Man1   4/8/2023              NaN          NaN
12          Landmark2       Nice Guys  9/12/2023              NaN          NaN
13   Iceland Kingdom2    Mr. Heroshi2  7/14/2023              NaN          NaN
14     More Kingdom 2      King Kong   8/31/2023              NaN          NaN

如何修改当前代码使其工作?

1 回复 | 直到 1 年前

Timeless 1 年前

你的 previous 问题是 移动目标 ,我不排除这一次也有可能。旧的 regex 模式在您的新csv上失败,只是因为数据集的格式现在与 this 一( 最初为其制作图案 )。

现在,您可以尝试以下操作:

 pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})" # to update in `read_block`

 df = read_block(names=["Admissions", "Readmissions"], igidx=False)

演示:[ regex101 ]

向下滚动,因为我使用了给定的 screenshot /csv

输出:

print(df)

                                  Community ResidentName         Date DocumentStatus Last Update Resident Name
Admissions   0                 Test Station     Jane Doe    9/12/2023           Sent   9/12/2023           NaN
             1               Test Station 2     John Doe    9/12/2023    Not Started         NaN           NaN
             2  Alibaba Fizgerald Super Man    9/12/2023  Not Started            NaN         NaN           NaN
             3              Iceland Kingdom  Super Woman    9/12/2023   Not Started          NaN           NaN
Readmissions 0              Station Kingdom          NaN    9/12/2023    Not Started         NaN  Pretty Woman
             1                  My Goodness          NaN    7/21/2023      Completed  7/26/2023       Ugly Man

我希望在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量,得到一个单独的输出。

然后使用 df.droplevel(1).reset_index(names="block") 。