Parsing data from a structured document with ambiguous labels

  •  1
  • Thomas  · 7 years ago

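    I am trying to move legal documents from archaic SGML files into a database. Using regular expressions in Java, I have had decent luck. However, I have run into a small problem: the labels for the sections of a document do not appear to be standardized from file to file. For example, the most common labelling is: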

    (<numeric>)
        (<alpha>)
            (<ROMAN>)
                (<ALPHA>)
    

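    Has anyone run into a problem like this? Does anyone have any suggestions?

    Edit:

    Here are the regular expressions I use to parse the different items: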

    Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
    SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
    Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
    SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
    SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)
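
    One reason labels like these are ambiguous is that a single capital letter such as (I), (V), (X) or (L) satisfies both the Roman-numeral fragment and the upper-case alpha fragment used above. The following is only a minimal sketch of that overlap: the class name LabelClassifier, the Level enum and the simplified patterns are assumptions made for illustration, not the expressions from the question.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Pattern;

public class LabelClassifier {
    /** The four nesting levels from the question, outermost first. */
    public enum Level { NUMERIC, LOWER_ALPHA, ROMAN, UPPER_ALPHA }

    // Simplified patterns for the label text without its parentheses,
    // each allowing a ".5"-style suffix.
    private static final Pattern NUMERIC = Pattern.compile("\\d+(\\.\\d+)?");
    private static final Pattern LOWER_ALPHA = Pattern.compile("[a-z](\\.\\d+)?");
    // The lookahead rejects the empty string, which the bare Roman fragment would accept.
    private static final Pattern ROMAN = Pattern.compile(
            "(?=[IVXL])(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\\.\\d+)?");
    private static final Pattern UPPER_ALPHA = Pattern.compile("[A-Z](\\.\\d+)?");

    /** Returns every level whose pattern matches the whole label, e.g. "II" or "b". */
    public static Set<Level> candidates(String label) {
        Set<Level> result = EnumSet.noneOf(Level.class);
        if (NUMERIC.matcher(label).matches()) result.add(Level.NUMERIC);
        if (LOWER_ALPHA.matcher(label).matches()) result.add(Level.LOWER_ALPHA);
        if (ROMAN.matcher(label).matches()) result.add(Level.ROMAN);
        if (UPPER_ALPHA.matcher(label).matches()) result.add(Level.UPPER_ALPHA);
        return result;
    }

    public static void main(String[] args) {
        for (String label : new String[] {"1", "b", "II", "A", "I"}) {
            System.out.println("(" + label + ") -> " + candidates(label));
        }
    }
}
```

    A label that yields more than one candidate, such as (I), can only be resolved from context, for instance from the level of the label that preceded it.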
    

    <tab><b>SECTION 5.</b>  In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
    introductory portion, (1)(b), and (3)(b)(II) as follows:
    
    <tab><b>13-5-142.  National instant criminal background check system - reporting.</b>
    (1)  On and after March 20, 2013, the state court administrator shall send electronically
    the following information to the Colorado bureau of investigation created pursuant to
    section 24-33.5-401, referred to in this section as the "bureau":
    
    <tab>(b)  The name of each person who has been committed by order of the court to the
    custody of the office of behavioral health in the department of human services pursuant
    to section 27-81-112 or 27-82-108; and
    
    <tab>(3)  The state court administrator shall take all necessary steps to cancel a record
    made by the state court administrator in the national instant criminal background check
    system if:
    
    <tab>(b)  No less than three years before the date of the written request:
    
    <tab>(II)  The period of commitment of the most recent order of commitment or
    recommitment expired, or a court entered an order terminating the person's incapacity or
    discharging the person from commitment in the nature of habeas corpus, if the record in
    the national instant criminal background check system is based on an order of
    commitment to the custody of the office of behavioral health in the department of human
    services; except that the state court administrator shall not cancel any record pertaining to
    a person with respect to whom two recommitment orders have been entered pursuant to
    section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
    27-81-112 (11) on the grounds that further treatment is not likely to bring about
    significant improvement in the person's condition; or
    
    2 replies
        1
  •  1
  •   Gene    7 years ago

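    Your statement of the problem is vague, so the only possible answer is a general approach. I have done document conversions with imprecise formats like this one.

    One tool that computer science provides is the state machine. It is appropriate if you can detect (for example, with a regular expression) that the format is changing to a new convention. Detecting that changes the state, which in this case amounts to the translator to use on the current and subsequent chunks of text. That translator stays in effect until the next state change. Overall, the algorithm looks like this: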

    translator = DEFAULT 
    while (chunks of input remain) {
      chunk = GetNextChunkOfInput // a line, paragraph, etc.
      new_translator = ScanChunkForStateChange(chunk, translator)
      if (new_translator != null) translator = new_translator // found a state change!
      print(translator.Translate(chunk))  // use the translator on the chunk
    }
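
    The loop above might look roughly like this in Java. This is only a sketch: the Translator interface, the two conventions (PAREN_LABELS and DOTTED_LABELS) and their cue patterns are made up for illustration and are not taken from the documents in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TranslatorStateMachine {
    /** A translator converts one chunk of input under one labelling convention. */
    public interface Translator {
        String translate(String chunk);
    }

    // Two hypothetical conventions, invented for illustration only.
    public static final Translator PAREN_LABELS = chunk -> "[paren] " + chunk.trim();
    public static final Translator DOTTED_LABELS = chunk -> "[dotted] " + chunk.trim();

    // Hypothetical cues that a chunk starts using a given convention.
    private static final Pattern PAREN_CUE = Pattern.compile("^\\s*\\(\\d+\\)");
    private static final Pattern DOTTED_CUE = Pattern.compile("^\\s*\\d+\\.\\s");

    /** Returns the translator to switch to, or null when the chunk signals no change. */
    public static Translator scanForStateChange(String chunk, Translator current) {
        if (current != PAREN_LABELS && PAREN_CUE.matcher(chunk).find()) return PAREN_LABELS;
        if (current != DOTTED_LABELS && DOTTED_CUE.matcher(chunk).find()) return DOTTED_LABELS;
        return null;
    }

    /** Translates every chunk with whichever translator is current at that point. */
    public static List<String> run(List<String> chunks) {
        Translator translator = PAREN_LABELS; // the DEFAULT state
        List<String> out = new ArrayList<>();
        for (String chunk : chunks) {
            Translator next = scanForStateChange(chunk, translator);
            if (next != null) translator = next;  // found a state change
            out.add(translator.translate(chunk)); // stays in effect until the next change
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
                "(1) First item", "continuation text", "2. Second item", "more text");
        run(chunks).forEach(System.out::println);
    }
}
```

    Chunks that carry no cue of their own, such as the continuation lines, are handled by whichever translator was selected last, which is the point of keeping state.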
    

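    Within this framework, designing the translators and the state-change predicates is an involved process. All you can hope to do is try something, inspect the output, fix the problems, and repeat until the results stop improving. At that point you have probably found a maximal structure in the input, so an algorithm that uses only pattern matching (without trying to model semantics, for example with AI) will not get you much further.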

        2
  •  0
  •   imhotap    7 years ago

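    The text fragment you posted can be parsed by an SGML parser, given a suitable DOCTYPE declaration. Assuming that <tab> in your example represents an actual tab start-element tag, store your text in a file named data.ent, then create the following SGML file, doc.sgm: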

    <!DOCTYPE doc [
        <!ELEMENT doc O O (tab)+>
        <!ELEMENT tab - O (((b,c?)|c),text)>
        <!ELEMENT text O O (#PCDATA|b)+>
        <!ELEMENT b - - (#PCDATA)>
        <!ELEMENT c - - (#PCDATA)>
        <!ENTITY data SYSTEM "data.ent">
        <!ENTITY startc "<c>">
        <!ENTITY endc "</c>">
        <!SHORTREF intab "(" startc ")" endc>
        <!USEMAP intab tab>
        <!USEMAP #EMPTY text>
    ]>
    &data
    

    The result of parsing your data with these DTD rules (produced by running osgmlnorm doc.sgm) is:

    <DOC>
      <TAB>
        <B>SECTION 5.</B>
        <TEXT>In Colorado Revised Statutes, 13-5-142, <B>amend</B> (1)
          introductory portion, (1)(b), and (3)(b)(II) as follows:
        </TEXT>
      </TAB>
      <TAB>
        <B>13-5-142.  National instant criminal background check system
          reporting.</B>
        <C>1</C>
        <TEXT>On and after March 20, 2013, the state court administrator
          shall send electronically the following information to the
          Colorado bureau of investigation created pursuant to section
          24-33.5-401, referred to in this section as the "bureau":
        </TEXT>
      </TAB>
      <TAB>
        <C>b</C>
        <TEXT>The name of each person who has been committed by order
          of the court to the custody of the office of behavioral health
          in the department of human services pursuant to section 27-81-112
          or 27-82-108; and
        </TEXT>
      </TAB>
      <TAB>
        <C>3</C>
        <TEXT>The state court administrator shall take all necessary steps
          to cancel a record made by the state court administrator in the
          national instant criminal background check system if:
        </TEXT>
      </TAB>
      <TAB>
        <C>b</C>
        <TEXT>No less than three years before the date of the written
          request:
        </TEXT>
      </TAB>
      <TAB>
        <C>II</C>
        <TEXT>The period of commitment of the most recent order of
          commitment or recommitment expired, or a court entered an order
          terminating the person's incapacity or discharging the person
          from commitment in the nature of habeas corpus, if the record in 
          the national instant criminal background check system is based on
          an order of commitment to the custody of the office of behavioral
          health in the department of human services; except that the state
          court administrator shall not cancel any record pertaining to
          a person with respect to whom two recommitment orders have been
          entered pursuant to section 27-81-112 (7) and (8), or who was
          discharged from treatment pursuant to section 27-81-112 (11) on
          the grounds that further treatment is not likely to bring about
          significant improvement in the person's condition; or
        </TEXT>
      </TAB>
    </DOC>
    

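    Explanation:

    • The SGML DTD I created uses SGML tag inference to infer a synthetic DOC element as the document element, as well as artificial TEXT and C elements; the main purpose is to impose a document structure as a sequence of TAB elements, each containing a section identifier (such as <b>SECTION 5.</b> or (c)) followed by the body text
    • I also made up a special element C for the text placed in parentheses (the ( and ) characters); its start- and end-element tags are inserted automatically by the SGML processor because of the DTD's SHORTREF map rules; these tell SGML that, within TAB elements, it should replace every ( character with the value of the startc entity (which expands to <C>) and every ) character with the value of the endc entity (</C>)
    • <!USEMAP #EMPTY text> switches off the expansion of parentheses within the TEXT body part of a TAB element, so that references such as (7) and (8) in the body text are left unchanged (although they could be turned into something like HTML links with SGML as well)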

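    If you are using <tab> to stand for a tab character (ASCII 9), SGML can handle that as well, for example by converting tab characters into <TAB> with a SHORTREF rule similar to the ones shown.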
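
    For comparison with the regex approach in the question, the effect of these SHORTREF rules can be approximated in plain Java: treat only a leading parenthesized label as markup and leave parentheses in the body text alone. This is a rough sketch that follows the content model ((b,c?)|c),text above; it is not a substitute for an SGML parser, and the LabelSplitter and TabChunk names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelSplitter {
    /** One parsed chunk; heading and label are null when absent. */
    public static final class TabChunk {
        public final String heading;
        public final String label;
        public final String text;

        TabChunk(String heading, String label, String text) {
            this.heading = heading;
            this.label = label;
            this.text = text;
        }
    }

    // <tab>, an optional <b>heading</b>, an optional (label), then the body text.
    // DOTALL lets the heading and the body span several lines.
    private static final Pattern CHUNK = Pattern.compile(
            "^<tab>\\s*(?:<b>(.*?)</b>)?\\s*(?:\\(([^()]+)\\))?\\s*(.*)$",
            Pattern.DOTALL);

    /** Splits one chunk that starts with <tab> into heading, label and body text. */
    public static TabChunk parse(String chunk) {
        Matcher m = CHUNK.matcher(chunk);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a <tab> chunk: " + chunk);
        }
        return new TabChunk(m.group(1), m.group(2), m.group(3).trim());
    }

    public static void main(String[] args) {
        TabChunk c = parse("<tab>(b)  No less than three years before the date of the written request:");
        System.out.println(c.label + " | " + c.text);
    }
}
```

    Because only the first label after <tab> (or after the bold heading) is captured, references such as (7) and (8) inside the body stay part of the text.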

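    The osgmlnorm program is part of the OpenSP package; you can install it with sudo apt-get install opensp if you are on Ubuntu, and in a similar way on other Linux distributions and on Mac OS. For your application, you may want to use osx, which is also part of OpenSP and converts SGML to XML.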
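
    Once the document is available as well-formed XML, the rows for a database can be pulled out with the XML APIs in the JDK. The sketch below assumes the XML uses the same element names as the normalized output shown earlier (DOC, TAB, B, C, TEXT); the TabExtractor class and the "label | text" row format are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TabExtractor {
    /** Returns one "label | text" row per TAB element; the label is empty when there is no C child. */
    public static List<String> rows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        NodeList tabs = doc.getElementsByTagName("TAB");
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tabs.getLength(); i++) {
            Element tab = (Element) tabs.item(i);
            rows.add(childText(tab, "C") + " | " + childText(tab, "TEXT"));
        }
        return rows;
    }

    /** Text of the first descendant with the given tag, or "" when there is none. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) return "";
        // Collapse the line breaks and indentation that pretty-printing introduces.
        return nodes.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<DOC><TAB><C>b</C><TEXT>No less than three years before\n"
                + "  the date of the written request:</TEXT></TAB></DOC>";
        rows(xml).forEach(System.out::println);
    }
}
```

    In a real import, each row would be written to the database with a prepared statement instead of being joined into a string.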