How do I read a dictionary file and replace words in another file?

awk

Jay Gray · 6 years ago

We have a source file ("source-a") that looks like this (any blue text you see comes from Stack Overflow, not from the text file):

The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences

Each sentence in "source-a" is on its own line and ends with a newline (\n).

We have a dictionary/conversion file ("converse-b") that looks like this:

aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes

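"converse-b" is a tab-separated, two-column file. Each equivalence mapping (left-term <tab> right-term) is on its own line and is terminated by a newline (\n).

How can we read "converse-b" and replace terms in "source-a", so that each term from column 1 of "converse-b" is replaced with the corresponding term from column 2, and then write the result to an output file ("output-c")?

For example, "output-c" would look like this: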

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.

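The tricky part is the word potato.

If a "simple" awk solution cannot handle both a singular term (potato) and a plural term (potatoes), we will fall back to manual substitution. The awk solution can skip this use case.

In other words, an awk solution may stipulate that it only works for unambiguous words, or for terms composed of unambiguous words separated by spaces.

An awk solution would get us 90% of the way there; the remaining 10% would be done manually.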
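For reference, the kind of awk approach described above could look roughly like the following. This is only a minimal sketch under stated assumptions: the function names `apply_dict` and `replace_all` are hypothetical, matching is literal and case-sensitive, and no word-boundary check is made, so a term can also match inside a longer word.

```shell
# Sketch: apply a tab-separated dictionary to a text file with awk.
# Usage (hypothetical helper): apply_dict converse-b source-a > output-c
apply_dict() {
  awk -F'\t' '
    # First file (the dictionary): remember each mapping in input order.
    NR == FNR {
      if (NF >= 2 && $1 != "") { from[++n] = $1; to[n] = $2 }
      next
    }
    # Second file (the text): apply every mapping, in order, to the line.
    {
      line = $0
      for (i = 1; i <= n; i++) line = replace_all(line, from[i], to[i])
      print line
    }
    # Literal (non-regex) replacement of every occurrence of old with repl.
    function replace_all(s, old, repl,    out, p) {
      out = ""
      while ((p = index(s, old)) > 0) {
        out = out substr(s, 1, p - 1) repl
        s = substr(s, p + length(old))
      }
      return out s
    }
  ' "$1" "$2"
}
```

Entries are applied in dictionary order, so the output of an earlier replacement can be matched again by a later entry. With the sample dictionary this does not cause a problem, but overlapping terms would need the manual pass mentioned above.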

1 answer · 6 years ago

karakfa · 6 years ago

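sed may be a better fit, since this is just phrase/word substitution. Note that if the same word appears in more than one phrase, it is first come, first served, so adjust the order of the dictionary accordingly.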

$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
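On the ordering caveat: one way to keep a short entry from firing before a longer phrase that contains it is to sort the dictionary so that longer left-hand terms come first. The helper below is a sketch, not part of the original answer. The name `longest_first` is hypothetical, and it assumes a `sort` that supports the `-s` (stable) option, as the GNU and BSD versions do.

```shell
# Sketch: reorder a tab-separated dictionary so longer left-hand terms come first.
# Usage (hypothetical helper): longest_first dict > dict.sorted
longest_first() {
  # 1. awk prefixes every line with the length of its first column.
  # 2. sort orders numerically on that prefix, descending; -s keeps ties in input order.
  # 3. cut drops the length prefix again.
  awk -F'\t' '{ print length($1) "\t" $0 }' "$1" |
    sort -t "$(printf '\t')" -k1,1nr -s |
    cut -f2-
}
```

The sorted file can then be used in place of `dict` in the command above. Length ordering is only a heuristic: it handles terms nested inside longer terms, but not replacements whose output matches another entry.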

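The sed statement inside the file (process) substitution converts the dictionary entries into sed expressions, and the main sed uses them to replace the content.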
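To make the two stages concrete, the inner command can be run on its own to inspect the script it generates. The wrapper name `gen_script` is hypothetical. The sketch assumes GNU sed, which understands `\t` in a regular expression.

```shell
# Sketch: print the sed script that the inner command builds from a dictionary file.
# Assumes GNU sed (for \t in the pattern); gen_script is a hypothetical wrapper name.
gen_script() {
  sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$1"
}

# For the dictionary in the question, `gen_script dict` starts with:
#   s/aluminium/aluminum/g
#   s/analyse/analyze/g
#   s/white spirit/mineral spirits/g
```

Each dictionary line becomes one `s/left/right/g` command. The outer `sed -f` reads those commands as its script and applies them to every line of the content file in order.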

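NB: a production-quality script should take letter case and word boundaries into account to eliminate unwanted substring substitutions; both are ignored here.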
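As an illustration of the word-boundary point, the script-generating sed command can wrap each pattern in `\b`. This is a sketch only: `\b` is a GNU sed extension, the name `gen_script_wb` is hypothetical, and letter case and regex metacharacters inside dictionary terms are still not handled.

```shell
# Sketch: generate sed commands whose patterns are wrapped in \b (GNU sed word boundary).
# gen_script_wb is a hypothetical name; the \\b in the replacement emits a literal \b.
gen_script_wb() {
  sed -E 's_(.+)\t(.+)_s/\\b\1\\b/\2/g_' "$1"
}

# For the entry "stag night<tab>bachelor party" this prints:
#   s/\bstag night\b/bachelor party/g
```

With the boundaries in place, a term no longer matches when it is only part of a longer word. The generated script is used the same way, as the argument to the outer `sed -f`.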
