
wget `--reject-regex` not working?

wget download regex

Brian · Tech Community · 6 years ago

Why does the following command download index.html from www.example.com?

wget --reject-regex .* http://www.example.com/

$ wget --reject-regex .* http://www.example.com/
--2018-03-05 11:21:26--  http://.keystone_install_lock/
Resolving .keystone_install_lock... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘.keystone_install_lock’
--2018-03-05 11:21:26--  http://www.example.com/
Resolving www.example.com... 93.184.216.34
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: ‘index.html’

index.html                                                    100%[=================================================================================================================================================>]   1.24K  --.-KB/s    in 0s

2018-03-05 11:21:27 (4.49 MB/s) - ‘index.html’ saved [1270/1270]

FINISHED --2018-03-05 11:21:27--
Total wall clock time: 0.4s
Downloaded: 1 files, 1.2K in 0s (4.49 MB/s)

The wget man page says:

--accept-regex urlregex

--reject-regex urlregex

Specify a regular expression to accept or reject the complete URL.

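And the regular expression .* matches everything. (You can check this at freeformatter.com.)

I expected everything wget downloads to be rejected because of the --reject-regex .* option.

.* matches www.example.com, doesn't it?

Why doesn't wget ignore everything at www.example.com?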

3 replies

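builder-7000 · 6 years ago

--reject-regex only rejects URL links, not the markup text inside index.html. For example, if the website contains a URL pointing to a CSS file main.css, then this command downloads the website recursively but excludes main.css: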

wget -r --reject-regex 'main.css' www.somewebsite.com
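To preview which links a pattern like this would exclude, here is a small offline sketch. It uses grep -E as a stand-in for wget's URL matching (an approximation, not wget's actual code path), and the URL list is made up for illustration. It also shows that an unescaped dot matches any character, so 'main.css' rejects more than just main.css.

```shell
# Hypothetical URLs that a recursive crawl might discover (not real output).
urls='http://www.somewebsite.com/index.html
http://www.somewebsite.com/css/main.css
http://www.somewebsite.com/css/mainXcss.html
http://www.somewebsite.com/js/app.js'

# grep -v drops every line that matches, approximating a reject rule.
# Loose pattern: the dot matches any character, so mainXcss.html is dropped too.
kept_loose=$(printf '%s\n' "$urls" | grep -Ev 'main.css')

# Strict pattern: the escaped dot only matches a literal dot.
kept_strict=$(printf '%s\n' "$urls" | grep -Ev 'main\.css')

echo "kept with 'main.css':"
echo "$kept_loose"
echo "kept with 'main\\.css':"
echo "$kept_strict"
```

If the stricter behaviour is what you want, passing 'main\.css' to --reject-regex should be the safer choice.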

To ignore some of the text in the page, use sed. A few examples:

# Ignores the word 'Sans'
wget -qO- example.com | sed "s/Sans//g" > index.html

# Ignores everything
wget -qO- example.com | sed "s/.*//g" > index.html
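The same sed expressions can be tried without a network connection. This is only a sketch: the two-line page below is a made-up stand-in for what wget -qO- would print. It shows that s/.*//g blanks each line but keeps the line breaks, so the output is not a zero-byte file.

```shell
# Hypothetical two-line stand-in for a downloaded page.
page='<p>Open Sans</p>
<p>Sans serif</p>'

# Remove every occurrence of the word "Sans".
without_sans=$(printf '%s\n' "$page" | sed 's/Sans//g')

# Blank every line, then count how many empty lines remain.
blank_lines=$(printf '%s\n' "$page" | sed 's/.*//g' | grep -c '^$')

echo "$without_sans"
echo "empty lines left: $blank_lines"
```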

Quantum7 · 6 years ago

Use the -np option to reject the index file. --reject-regex only applies to recursively retrieved files (any links found in the index file).

   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.
       This is a useful option, since it guarantees that only the
       files below a certain hierarchy will be downloaded.

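Louis Strous · 3 years ago

Part of the answer is that the .* in your command is probably being expanded by your shell into a list of matching file names in the current working directory, because it is not enclosed in appropriate quotes. The .keystone_install_lock in the output you got is most likely the name of a file in your current working directory. wget reports it even before it tries to connect to www.example.com. Try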

wget --reject-regex '.*' http://www.example.com/

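or possibly "" instead of '', depending on which shell you use.

With that command, index.html is still retrieved for me, so my answer is incomplete.

With -np, as Quantum7 suggested, I still get index.html, so that does not complete the answer either.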
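The shell expansion described in this answer can be reproduced offline. The sketch below assumes a POSIX-style shell; the scratch directory and the show_args helper are made up for illustration, and show_args simply stands in for wget so the arguments it would receive become visible.

```shell
# Hypothetical scratch directory standing in for the asker's working directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch .keystone_install_lock

# Print each received argument on its own line, wrapped in brackets.
show_args() { printf '[%s]\n' "$@"; }

# Unquoted: the shell replaces .* with matching file names before the
# command runs, so the literal pattern never reaches the program.
unquoted=$(show_args --reject-regex .* http://www.example.com/)

# Quoted: the pattern is passed through unchanged as a single argument.
quoted=$(show_args --reject-regex '.*' http://www.example.com/)

echo "unquoted:"
echo "$unquoted"
echo "quoted:"
echo "$quoted"
```

In the unquoted case the file name .keystone_install_lock appears among the arguments, which matches the stray http://.keystone_install_lock/ request in the question's output. Whether . and .. also show up depends on the shell and its version.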