
Extracting dates in different formats with regular expressions and sorting them - pandas

  •  2
  • JPV  · Tech Community  · 7 years ago

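    I'm new to text mining and need to extract dates from a *.txt file and sort them. The dates are embedded in sentences (one per line), and they may appear in any of the following formats: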

    04/20/2009; 04/20/09; 4/20/09; 4/3/09
    Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
    20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
    Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
    Feb 2009; Sep 2009; Oct 2010
    6/2008; 12/2009
    2009; 2010
    

    If the day is missing, assume the 1st of the month; if the month is missing, assume January.

    import pandas as pd
    
    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)
    
    df = pd.Series(doc)
    
    df2 = pd.DataFrame(df,columns=['text'])
    
    def myfunc(x):
        if len(x)==4:
            x = '01/01/'+x
        else:
            if not re.search('/',x):
                example = re.sub('[-]','/',x)
                terms = re.split('/',x)
                if (len(terms)==2):
                    if len(terms[-1])==2:
                        x = '01/'+terms[0]+'/19'+terms[-1]
                    else:
                        x = '01/'+terms[0]+'/'+terms[-1] 
                elif len(terms[-1])==2:
                    x = terms[0].zfill(2)+'/'+terms[1].zfill(2)+'/19'+terms[-1]
        return x
    
    df2['text'] = df2.text.str.replace(r'(((?:\d+[/-])?\d+[/-]\d+)|\d{4})', lambda x: myfunc(x.groups('Date')[0]))
    

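    I have only handled the numeric date formats so far, and I'm a bit confused about how to handle the dates that contain month names.

    I know the code is rough, but it's what I have.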

    1 answer  |  as of 7 years ago
  •  18
  •   Bharath M Shetty    7 years ago

    I think this is one of the Coursera text mining assignments. You can solve it with regular expressions and `str.extract`. Given dates.txt:

    import pandas as pd

    doc = []
    with open('dates.txt') as file:
        for line in file:
            doc.append(line)
    
    df = pd.Series(doc)
    
    def date_sorter():
        # Dates written with a month name. expand=False returns a Series
        # instead of a one-column DataFrame, which to_datetime needs below.
        one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
        # Fully numeric dates (month, day and year)
        two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
        # Dates with no day, i.e. month and year, or year only
        three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
        # Fill the NaNs in one from two, then from three, fix two month names
        # that are misspelled in the text file, and convert to datetime.
        # Note: on pandas 2.0+, to_datetime may need format='mixed' for mixed formats.
        dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
        return pd.Series(dates.sort_values())
    
    date_sorter()
    

    Output:

    9     1971-04-10
    84    1971-05-18
    2     1971-07-08
    53    1971-07-11
    28    1971-09-12
    474   1972-01-01
    153   1972-01-13
    13    1972-01-26
    129   1972-05-06
    98    1972-05-13
    111   1972-06-10
    225   1972-06-15
    31    1972-07-20
    171   1972-10-04
    191   1972-11-30
    486   1973-01-01
    335   1973-02-01
    415   1973-02-01
    36    1973-02-14
    405   1973-03-01
    323   1973-03-01
    422   1973-04-01
    375   1973-06-01
    380   1973-07-01
    345   1973-10-01
    57    1973-12-01
    481   1974-01-01
    436   1974-02-01
    104   1974-02-24
    299   1974-03-01
    

    To return only the index, use `return pd.Series(dates.sort_values().index)` instead.
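The answer leaves the "missing day means the 1st, missing month means January" rule to `pd.to_datetime`. To make that rule explicit, here is a minimal standard-library sketch covering only the two numeric patterns (the second and third above). The helper name `parse_numeric` is hypothetical, and treating two-digit years as 19xx is an assumption borrowed from the `'/19'` prefix in the question's code; pandas may resolve two-digit years differently.

```python
import re
from datetime import date

# The answer's second and third patterns, with a capture group per field
FULL_NUMERIC = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})')
MONTH_YEAR = re.compile(r'(?:(\d{1,2})[/-])?(\d{4})')

def parse_numeric(text):
    """Hypothetical helper: date of the first numeric date in text, or None."""
    # Try the most specific pattern first, as the fillna chain does
    m = FULL_NUMERIC.search(text)
    if m:
        month, day, year = (int(g) for g in m.groups())
        if year < 100:
            year += 1900  # assumption: two-digit years mean 19xx
        return date(year, month, day)
    m = MONTH_YEAR.search(text)
    if m:
        month = int(m.group(1)) if m.group(1) else 1  # missing month -> January
        return date(int(m.group(2)), month, 1)        # missing day -> the 1st
    return None

lines = ["seen 04/20/2009", "since 6/2008", "born 2009", "visit 4/20/71"]
for d in sorted(parse_numeric(line) for line in lines):
    print(d)
```

Trying the full month/day/year pattern before the month/year pattern matters: if the order were reversed, `04/20/2009` would be read as `20/2009`, which is not a valid month.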

    Breakdown of the first regular expression

     # (?: ... ) is a non-capturing group
    
    ((?:\d{,2}\s)? # Optional leading day: zero to two digits followed by one whitespace character. The trailing `?` makes the whole group optional. Because `\d{,2}` can match zero digits, this group can also match a lone space, so a match may start with a space.
    
     (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # A month abbreviation followed by any number (`*`) of lowercase letters, so both Mar and March match.
    
     (?:-|\.|\s|,) # Exactly one separator: hyphen, period, whitespace, or comma
    
     \s? # An optional whitespace character (`?` applies only to the preceding token, `\s`)
    
     \d{,2}[a-z]* # Zero to two digits followed by any number (`*`) of lowercase letters, e.g. a day with an ordinal suffix (1st, 13th, 22nd)
    
     (?:-|,|\s)? # An optional separator (hyphen, comma, or whitespace); the trailing `?` means it may be absent
    
     \s? # An optional whitespace character (at most one; `?` applies only to `\s`)
    
     \d{2,4}) # The year: two to four digits
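As a quick check of the breakdown above, the first pattern can be run with the standard `re` module against a few of the formats listed in the question. This is only a sketch: the sample sentences and the helper name `find_word_date` are made up for illustration, and the result is stripped because the optional leading group can match a lone space before the month name.

```python
import re

# First pattern from the answer, unchanged (split across two string literals)
WORD_DATE = re.compile(
    r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
    r'(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
)

def find_word_date(text):
    """Hypothetical helper: first month-name date in text, or None."""
    m = WORD_DATE.search(text)
    # Strip the leading space the optional day group may have matched
    return m.group(1).strip() if m else None

samples = [
    "Seen on Mar-20-2009 for review",
    "Admitted 20 March, 2009",
    "Visit of Mar 21st, 2009",
    "Since Feb 2009",
    "Lab on 04/20/2009",  # numeric only, so this pattern finds nothing
]
for s in samples:
    print(repr(s), '->', repr(find_word_date(s)))
```

Lines where this pattern returns `None` are the ones the answer fills in from the second and third patterns.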