A little help with Perl HTML parsing

expression parsing perl html

shinjuo · 15 years ago

First, I get an uninitialized value warning when I run this code:

#!/usr/bin/perl -w
use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";
$html =~ m{Hail Reports} || die;
my $hail = $1;
print "$hail\n";

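Second, I think regular expressions are the simplest way to do what I want, but I'm not sure whether they can do it. I want my program to search for the words Hail Reports and return the information between the words Hail Reports and Wind Reports. Can a regular expression do this, or should I use a different method?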

     <tr><th colspan="8">Hail Reports (<a href="last3hours_hail.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_hail.csv">Raw Hail CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr> 

#The Data here will change throughout the day so normally there will be more info.
      <tr><td colspan="8" class="highlight" align="center">No reports received</td></tr> 
      <tr><th colspan="8">Wind Reports (<a href="last3hours_wind.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_wind.csv">Raw Wind CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr>

4 replies | latest 15 years ago

d5e5 · 15 years ago

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
    or die "Could not fetch NWS page.";

$html =~ m{Hail Reports(.*)Wind Reports}s || die; #Parentheses indicate capture group
my $hail = $1; # $1 contains whatever matched in the (.*) part of above regex
print "$hail\n";
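
The /s modifier is what lets the (.*) above span several lines of HTML. A minimal sketch of the difference, using a hypothetical three-line string ($sample is made up for illustration) in place of the fetched page:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page: markers on separate lines.
my $sample = "Hail Reports\nNo reports received\nWind Reports\n";

# Without /s, "." does not match a newline, so the pattern cannot span lines.
my $matched_without_s = $sample =~ /Hail Reports(.*)Wind Reports/ ? 1 : 0;

# With /s, "." also matches newlines, so the capture covers the lines between.
my ($between) = $sample =~ /Hail Reports(.*)Wind Reports/s;

print "without /s matched: $matched_without_s\n";
print "with /s captured: [$between]\n";
```

Assigning the match to a list, as in my ($between) = ..., leaves $between undefined when the pattern fails, so it can be checked before use.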

Jim Davis · 15 years ago

The uninitialized value warning comes from $1: it is not defined or set anywhere.

For a line-level rather than byte-level "between", you could use:

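for (split(/\n/, $html)) {
    # The range operator is true from the Hail line through the Wind line;
    # the second test drops the two marker lines themselves.
    # split removed the newlines, so add one back when printing.
    print "$_\n" if (/Hail Reports/ .. /Wind Reports/ and !/(?:Hail|Wind) Reports/);
}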
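
As a self-contained sketch of the same line-level idea, the range (flip-flop) operator can be wrapped in a small helper and tried on sample rows. The helper name lines_between and the table rows are hypothetical, made up for illustration, and are not taken from the real page:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input standing in for the fetched page.
my $html = join "\n",
    '<tr><th colspan="8">Hail Reports (CSV)</th></tr>',
    '<tr><td>1955</td><td>100</td><td>3 W Example</td></tr>',
    '<tr><td>2010</td><td>175</td><td>Sampletown</td></tr>',
    '<tr><th colspan="8">Wind Reports (CSV)</th></tr>',
    '<tr><td>2030</td><td>UNK</td><td>Elsewhere</td></tr>';

# Return the lines strictly between the line matching $start
# and the line matching $end.
sub lines_between {
    my ($text, $start, $end) = @_;
    my @kept;
    for my $line (split /\n/, $text) {
        # In scalar context ".." is a flip-flop: true from the start
        # marker line through the end marker line, inclusive.
        next unless $line =~ $start .. $line =~ $end;
        # Skip the marker lines themselves.
        next if $line =~ $start or $line =~ $end;
        push @kept, $line;
    }
    return @kept;
}

print "$_\n" for lines_between($html, qr/Hail Reports/, qr/Wind Reports/);
```

One caveat with this design: the flip-flop keeps its state between calls, so if the end marker never appears, a later call to the helper starts out already "inside" the range.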

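user376314 · 15 years ago

This uses single-line and multi-line matching. It also picks up only the text between the first pair of matches, which will be a bit faster than a greedy match.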

#!/usr/bin/perl -w

use strict;
use LWP::Simple;

   sub main{
      my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html")
                 or die "Could not fetch NWS page.";

      # match single and multiple lines + not greedy
      my ($hail, $between, $wind) = $html =~ m/(Hail Reports)(.*?)(Wind Reports)/sm
                 or die "No Hail/Wind Reports";

      print qq{
               Hail:         $hail
               Wind:         $wind
               Between Text: $between
            };
   }

   main();
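
To see what the non-greedy (.*?) changes, here is a small sketch on a hypothetical string ($text is made up for illustration) that contains the end marker twice. Note that /m only changes how ^ and $ behave, so in a pattern without those anchors it is /s that does the work:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text in which "Wind Reports" appears twice.
my $text = "Hail Reports\nrow A\nWind Reports\nrow B\nWind Reports (summary)\n";

# Greedy: (.*) runs to the LAST "Wind Reports" in the string.
my ($greedy) = $text =~ /Hail Reports(.*)Wind Reports/s;

# Non-greedy: (.*?) stops at the FIRST "Wind Reports".
my ($lazy) = $text =~ /Hail Reports(.*?)Wind Reports/s;

print "greedy capture: [$greedy]\n";
print "lazy capture:   [$lazy]\n";
```

When the end marker appears only once, both forms capture the same text; the difference only shows up when it repeats.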

runrig · 15 years ago

Parentheses capture strings in a regular expression. Your regex has no parentheses, so $1 is not set to anything. If you had:

$html =~ m{(Hail Reports)} || die;

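then $1 would be set to "Hail Reports" if that string exists in the $html variable. Since you only want to know whether it matched, you don't really need to capture anything at this point, and you could write something like: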

unless ( $html =~ /Hail Reports/ ) {
  die "No Hail Reports in HTML";
}

To capture what is between the strings, you could do something like:

if ( $html =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s ) {
  print "Got $1\n";
}
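
As a rough comparison, the lookaround form and a plain capture group should yield the same text, and both can return undef on failure so that $1 is never read after a failed match. The helper names between_capture and between_lookaround and the trimmed-down sample rows are hypothetical, chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, trimmed-down version of the rows shown in the question.
my $html = join "\n",
    '<tr><th colspan="8">Hail Reports (CSV)</th></tr>',
    '<tr><td colspan="8" class="highlight" align="center">No reports received</td></tr>',
    '<tr><th colspan="8">Wind Reports (CSV)</th></tr>';

# Plain capture group between the two literal markers.
sub between_capture {
    my ($text) = @_;
    return $text =~ /Hail Reports(.*?)Wind Reports/s ? $1 : undef;
}

# Same idea with lookbehind and lookahead, so the markers
# are not part of the overall match.
sub between_lookaround {
    my ($text) = @_;
    return $text =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s ? $1 : undef;
}

my $captured = between_capture($html);
my $looked   = between_lookaround($html);

print "same result\n" if defined $captured && defined $looked && $captured eq $looked;
print defined between_capture('no markers here') ? "matched\n" : "no match, nothing captured\n";
```

Only reading $1 after a successful match avoids the uninitialized value warning from the original script.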
