
Compare common words and replace them with a placeholder

awk
  •  1
  • shantanuo  · Tech community  · 4 years ago

    I am trying to compare two text files, and awk seems to be the right tool for the job:

    # cat remove.txt
    test
    junk
    trash
    unwanted
    bad
    worse
    
    # cat corpus.txt
    this is a test message to check if bad words are removed correctly. The second line may or may not have unwanted words. The third line also need not be as clean as first and second line.
    There can be paragraphs in the text corpus and the entire file should be checked for trash.
    

    This command works as expected, but I need to replace those words with XXX instead of simply deleting them.

    awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next}
    !($1 in seen) {ORS=RT; print}' remove.txt corpus.txt
    

    Output:

    this is a message to check if words are removed correctly. The second line may or may not have words. The third line also need not be as clean as first and second line.
    There can be paragraphs in the text corpus and the entire file should be checked for trash.
    

    The expected output looks like this:

    this is a xxx message to check if xxx words are removed correctly. The second line may or may not have xxx words. The third line also need not be as clean as first and second line.
    There can be paragraphs in the text corpus and the entire file should be checked for trash.
    

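    If I simply remove those common words, there is no way to tell where they used to be, hence the need for a placeholder. This is a roughly 400 MB English corpus that may contain non-English Unicode characters.

    2 replies  |  as of 4 years ago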
        1
  •  2
  •   anubhava    4 years ago

    You can use this awk:

    awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next} $1 in seen {$1 = "xxx"} {ORS=RT} 1' remove.txt corpus.txt
    
    this is a xxx message to check if xxx words are removed correctly. The second line may or may not have xxx words. The third line also need not be as clean as first and second line.
    There can be paragraphs in the text corpus and the entire file should be checked for trash.
    

    $1 in seen {$1 = "xxx"}  # if word is from remove list then set it to xxx
    {ORS=RT}                 # set output record separator as RT
    1                        # print each record
    
        2
  •  2
  •   RavinderSingh13 Nikita Bakshi    4 years ago

    With the samples you have shown, could you please try the following. Written and tested in GNU awk.

    awk -v RS='[[:space:]]+' '
    FNR == NR{
      seen[$1]
      next
    }
    {
      $1=($1 in seen?"XXX":$1)
      ORS=RT
      print
    }
    ' remove.txt corpus.txt
    

    Explanation:

    awk -v RS='[[:space:]]+' '     ##Starting awk program from here and setting RS as spaces here.
    FNR == NR{                     ##Checking condition which will be TRUE when remove.txt is being read.
      seen[$1]                     ##Creating seen with index of 1st field.
      next                         ##next will skip all further statements from here.
    }
    {
      $1=($1 in seen?"XXX":$1)     ##Checking condition if $1 is in seen then set it to XXX else keep $1.
      ORS=RT                       ##Setting ORS value as RT.
      print                        ##Printing current line here.
    }
    ' remove.txt corpus.txt        ##Mentioning Input_file names here.
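
Both answers depend on `RT`, which is a GNU awk extension, and on a regular expression as `RS`, which POSIX awk does not guarantee. Where only a plain awk is available, the same idea can be sketched by walking each line with `match()` and copying the whitespace between words unchanged. This is a minimal sketch, not something taken from the answers above: the function name `replace_words` is hypothetical, words are assumed to be separated by spaces and tabs, and the placeholder is hard-coded as `xxx`.

```shell
#!/bin/sh
# Hypothetical helper: replace_words REMOVE_FILE CORPUS_FILE
# Prints the corpus with every listed word replaced by "xxx",
# leaving the original spaces, tabs and line breaks untouched.
replace_words() {
  awk '
    # First file: remember each word of the remove list.
    FNR == NR { seen[$1]; next }

    # Second file: consume the line one word at a time.
    {
      line = $0
      out = ""
      while (match(line, /[^ \t]+/)) {
        word = substr(line, RSTART, RLENGTH)
        rep = (word in seen) ? "xxx" : word
        # Copy the whitespace before the word, then the word or placeholder.
        out = out substr(line, 1, RSTART - 1) rep
        line = substr(line, RSTART + RLENGTH)
      }
      # Whatever is left is trailing whitespace.
      print out line
    }
  ' "$1" "$2"
}

# Small demonstration with throwaway files.
tmp=$(mktemp -d)
printf 'test\nbad\n' > "$tmp/remove.txt"
printf 'a test  of\tbad words\n' > "$tmp/corpus.txt"
replace_words "$tmp/remove.txt" "$tmp/corpus.txt"
rm -rf "$tmp"
```

As with the GNU awk versions, a word with punctuation attached (such as `trash.`) is compared as a whole and is therefore left alone. On very long lines the repeated `substr` calls may be noticeably slower than the `RT` approach, so for a 400 MB corpus the GNU awk one-liners are probably the better choice when gawk is installed.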