代码之家  ›  专栏  ›  技术社区  ›  Java

如何使用第一列中的值从csv文件中提取某些数据块

  •  -1
  • Java  · 技术社区  · 1 年前

    我试图只提取从第三行开始的两个部分(“入学”和“重新入学”)中的数据;a的第2列 关键词 在第1列中。

    请参阅底部csv文件中的数据集示例。

    Admissions
    Not Started:12  Sent:3            Completed:3
    
    Division        Community         ResidentName   Date        DocumentStatus  Last Update
                    Test Station      Jane Doe       9/12/2023   Sent            9/12/2023
                    Test Station 2    John Doe       9/12/2023   Not Started       
                    Alibaba Fizgerald Super Man      9/12/2023   Not Started  
                    Iceland Kingdom   Super Woman    9/12/2023   Not Started 
    

    ,,,,,

    Readmissions  
    Not Started:1   Sent:0            Completed:1 
    
    Division        Community         Resident Name   Date         DocumentStatus  Last Update
                    Station Kingdom   Pretty Woman    9/12/2023    Not Started  
                    My Goodness       Ugly Man        7/21/2023    Completed       7/26/2023 
    

    ,,,,

    Discharge 
    
    Division        Community         Resident Name   Date            
                    Station Kingdom1  Pretty Woman2   8/22/2023      
                    My Goodness1      Ugly Man1       4/8/2023     
                    Landmark2         Nice Guys       9/12/2023     
                    Iceland Kingdom2  Mr. Heroshi2    7/14/2023    
                    More Kingdom 2    King Kong       8/31/2023
    

    因此,逻辑是:

    查找包含数据的行,其中column1= '招生' '重新任务'

    往下走3行,向右走1列。 获取5列的所有数据,直到它碰到一个没有数据的行。

    因此,我想获得蓝框部分作为每个的输出:

    这是一幅插图: enter image description here

    我从 previous post ,但我有几个星期没能解决这个问题/找到解决方案,所以我发布了这个新消息。

    import re 
    import pandas as pd
    from io import StringIO
    
    def read_block(names, igidx=True):
        with open("Test1.csv") as f:
            pat = r"(\w+),+$\n+(.+?)(?=\n\w+,+\n$|\Z)"
            return pd.concat([
                pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                    .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                        pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
            ], keys=names, ignore_index=igidx)
    
    df = read_block(names=["Admissions"])    
    print(df)  
    

    底部是电流输出,但它连接了所有三个部分。

    我希望使用关键字(“ 招生 “,” 重新分配 “)作为代码中的变量。

              Sent: 3    Completed: 3 Unnamed: 3       Unnamed: 4   Unnamed: 5
    0           Community   Resident Name       Date  Document Status  Last Update
    1        Test Station       Jane Doe   9/12/2023             Sent    9/12/2023
    2      Test Station 2        John Doe  9/12/2023      Not Started          NaN
    3   Alibaba Fizgerald       Super Man  9/12/2023      Not Started          NaN
    4     Iceland Kingdom     Super Woman  9/12/2023      Not Started          NaN
    5             Sent: 0    Completed: 1        NaN              NaN          NaN
    6           Community   Resident Name       Date  Document Status  Last Update
    7     Station Kingdom   Pretty Woman   9/12/2023      Not Started          NaN
    8         My Goodness        Ugly Man  7/21/2023        Completed    7/26/2023
    9           Community   Resident Name       Date              NaN          NaN
    10  Station Kingdom1   Pretty Woman2   8/22/2023              NaN          NaN
    11      My Goodness1        Ugly Man1   4/8/2023              NaN          NaN
    12          Landmark2       Nice Guys  9/12/2023              NaN          NaN
    13   Iceland Kingdom2    Mr. Heroshi2  7/14/2023              NaN          NaN
    14     More Kingdom 2      King Kong   8/31/2023              NaN          NaN
    

    如何修改当前代码使其工作?

    1 回复  |  直到 1 年前
        1
  •  1
  •   Timeless    1 年前

    你的 previous 问题是 移动目标 ,我不排除这一次也有可能。旧的 regex 模式在您的新csv上失败,只是因为数据集的格式现在与 this 一( 最初为其制作图案 )。

    现在,您可以尝试以下操作:

     pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})" # to update in `read_block`
    
     df = read_block(names=["Admissions", "Readmissions"], igidx=False)
    

    演示:[ regex101 ]

    向下滚动,因为我使用了给定的 screenshot /csv

    输出:

    print(df)
    
                                      Community ResidentName         Date DocumentStatus Last Update Resident Name
    Admissions   0                 Test Station     Jane Doe    9/12/2023           Sent   9/12/2023           NaN
                 1               Test Station 2     John Doe    9/12/2023    Not Started         NaN           NaN
                 2  Alibaba Fizgerald Super Man    9/12/2023  Not Started            NaN         NaN           NaN
                 3              Iceland Kingdom  Super Woman    9/12/2023   Not Started          NaN           NaN
    Readmissions 0              Station Kingdom          NaN    9/12/2023    Not Started         NaN  Pretty Woman
                 1                  My Goodness          NaN    7/21/2023      Completed  7/26/2023       Ugly Man
        
    

    我希望在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量,得到一个单独的输出。

    然后使用 df.droplevel(1).reset_index(names="block")