Compare against a list of common words and replace matches with a placeholder

awk

shantanuo · Tech Community · 4 years ago

I am trying to compare two text files, and awk seems to do the job:

# cat remove.txt
test
junk
trash
unwanted
bad
worse

# cat corpus.txt
this is a test message to check if bad words are removed correctly. The second line may or may not have unwanted words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.

The command below works as expected, but I need to replace these words with XXX instead of simply deleting them.

awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next}
!($1 in seen) {ORS=RT; print}' remove.txt corpus.txt

Output:

this is a message to check if words are removed correctly. The second line may or may not have words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.

The expected output looks like this:

this is a xxx message to check if xxx words are removed correctly. The second line may or may not have xxx words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.

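If I simply delete the common words, there is no way to tell where they used to be, so I need a placeholder. The corpus is about 400 MB of English text and may contain non-English Unicode characters.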

2 replies | as of 4 years ago

anubhava 4 years ago

You can use this awk:

awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next} $1 in seen {$1 = "xxx"} {ORS=RT} 1' remove.txt corpus.txt

this is a xxx message to check if xxx words are removed correctly. The second line may or may not have xxx words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.

$1 in seen {$1 = "xxx"}  # if word is from remove list then set it to xxx
{ORS=RT}                 # set output record separator as RT
1                        # print each record
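RT and a regex-valued RS are GNU awk extensions, so the one-liner above depends on gawk. For awk implementations without RT, the following is a minimal sketch that should produce the same whitespace-delimited matching by walking each line with match(). The output file name masked.txt and the shortened sample corpus are assumptions made for illustration.

```shell
# Sample inputs modeled on the question (corpus shortened for the example).
printf '%s\n' test junk trash unwanted bad worse > remove.txt
printf '%s\n' 'this is a test message to check if bad words are removed correctly.' \
  'The entire file should be checked for trash.' > corpus.txt

# Portable sketch: default RS (newline), no RT.
awk '
FNR == NR { seen[$1]; next }   # first file: load the word list
{
  line = $0; out = ""
  # Walk the line one whitespace-delimited token at a time.
  while (match(line, /[^ \t]+/)) {
    word = substr(line, RSTART, RLENGTH)
    # Copy the whitespace before the token unchanged, then the token or the placeholder.
    out = out substr(line, 1, RSTART - 1) ((word in seen) ? "xxx" : word)
    line = substr(line, RSTART + RLENGTH)
  }
  print out line               # whatever is left in line is trailing whitespace
}' remove.txt corpus.txt > masked.txt

cat masked.txt
```

As in the gawk version, "trash." keeps its trailing period and therefore does not match the list entry "trash".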

RavinderSingh13 Nikita Bakshi 4 years ago

With the samples you have shown, could you please try the following. Written and tested in GNU awk.

awk -v RS='[[:space:]]+' '
FNR == NR{
  seen[$1]
  next
}
{
  $1=($1 in seen?"XXX":$1)
  ORS=RT
  print
}
' remove.txt corpus.txt

Explanation:

awk -v RS='[[:space:]]+' '     ##Starting awk program from here and setting RS as spaces here.
FNR == NR{                     ##Checking condition which will be TRUE when remove.txt is being read.
  seen[$1]                     ##Creating seen with index of 1st field.
  next                         ##next will skip all further statements from here.
}
{
  $1=($1 in seen?"XXX":$1)     ##Checking condition if $1 is in seen then set it to XXX else keep $1.
  ORS=RT                       ##Setting ORS to RT (gawk-only: the text that matched RS) to keep the original whitespace.
  print                        ##Printing current record (one word) here.
}
' remove.txt corpus.txt        ##Mentioning Input_file names here.
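Both answers compare whole whitespace-delimited tokens, so a word followed by punctuation (such as "trash." in the sample) or written in a different case is left alone, which agrees with the expected output in the question. If such matches should also be replaced, which is an assumption beyond what was asked, a sketch along these lines may help. It treats runs of ASCII letters, digits and underscores as words and compares them in lower case. The file names corpus2.txt and masked2.txt are made up for the example.

```shell
printf '%s\n' test junk trash unwanted bad worse > remove.txt
printf '%s\n' 'A Test message; all (bad) words go in the trash.' > corpus2.txt

awk '
FNR == NR { seen[tolower($1)]; next }   # store the word list in lower case
{
  line = $0; out = ""
  # Assumption: a word is a run of ASCII letters, digits or underscores.
  # Non-ASCII characters are copied through unchanged as separators.
  while (match(line, /[A-Za-z0-9_]+/)) {
    word = substr(line, RSTART, RLENGTH)
    # Keep punctuation and whitespace before the word, then the word or the placeholder.
    out = out substr(line, 1, RSTART - 1) ((tolower(word) in seen) ? "XXX" : word)
    line = substr(line, RSTART + RLENGTH)
  }
  print out line
}' remove.txt corpus2.txt > masked2.txt

cat masked2.txt
```

Because non-ASCII letters act as separators here, a listed word that sits directly next to an accented letter would also be replaced. It is worth checking that behaviour on a sample of the real corpus first.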
