
How to fix noncompliant HTML so that Expat will parse it (htmltidy doesn't work)

  •  1
  • Norman Ramsey  · 15 years ago

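    I'm trying to scrape information from http://www.nfl.com/scores (in particular, to find out when a game ends, so that my computer can stop recording). I can download the HTML easily enough, and it claims to be standards compliant: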

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    

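    But:

    1. An attempt to parse it with Expat fails with the error: not well-formed (invalid token)

    2. The W3C's online validation service reports 399 errors and 121 warnings.

    3. I tried running HTML Tidy (invoked simply as tidy) with the -xml option, but tidy reports 56 warnings and 117 errors and cannot recover a good XML file. The errors look like this: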

      line 409 column 122 - Warning: unescaped & or unknown entity "&role"
      ...
      line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq"
      ...
      line 1208 column 65 - Error: unexpected </td> in <br>
      line 1209 column 57 - Error: unexpected </tr> in <br>
      line 1210 column 49 - Error: unexpected </table> in <br>
      

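      But when I inspect the input, the "unknown entities" appear to be part of correctly quoted URLs, so I don't know whether a double quote is missing somewhere.

    I know something out there must be able to cope with this markup. What tool can fix the noncompliant HTML so that I can parse it with Expat?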
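
One possible reading of the "unescaped &" warnings is that the URLs are quoted correctly but contain a raw ampersand in the query string, which XML requires to be written as an entity. The Python 3 sketch below uses the standard library's Expat binding to illustrate that. The link in `raw` is hypothetical, modeled on the `&role` warning rather than copied from the page.

```python
import xml.parsers.expat


def expat_error(document):
    """Return Expat's error message for `document`, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# Hypothetical link: a raw "&" inside a correctly quoted attribute value.
raw = '<a href="/players?id=1&role=qb">x</a>'
# The same link with the ampersand written as the entity "&amp;".
escaped = raw.replace("&", "&amp;")

print(expat_error(raw))      # an error message: "&role=" is not a valid entity reference
print(expat_error(escaped))  # None
```

If this is what is happening, the quoting is fine and the problem is the ampersands themselves, although that alone would not explain the mismatched-tag errors on lines 1208 to 1210.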

    3 replies  |  last active 8 years ago
        1
  •  4
  •   Jed Smith    15 years ago

    They're using some kind of JavaScript on the score boxes, so you'll have to play smarter tricks (line breaks mine):

    /* box of awesome */
    // iscurrentweek ? true;
    (new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
    wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
    awayabbr:'CAR'}));
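
One such trick, sketched in Python 3 under the assumption that every game on the page is emitted in exactly the form shown above, is to skip HTML parsing for this part and pull the `state` value out of the script text with a regular expression. Only the value `'pre'` appears in the snippet; what the other states are called is not known here. The names `GAME_CALL` and `game_states` are made up for the example.

```python
import re

# Matches the start of a `new nfl.scores.Game(...)` call as it appears in the
# snippet above: two quoted IDs, then an options object opening with `state`.
GAME_CALL = re.compile(
    r"nfl\.scores\.Game\(\s*'(?P<game_id>\d+)'\s*,\s*'(?P<second_id>\d+)'"
    r"\s*,\s*\{\s*state:'(?P<state>[^']*)'"
)


def game_states(page_source):
    """Map each game ID found in the page source to its `state` string."""
    return {m.group("game_id"): m.group("state")
            for m in GAME_CALL.finditer(page_source)}


sample = """(new nfl.scores.Game('2009112905','54635',{state:'pre',container:'scorebox-2009112905',
wrapper:'sb-wrapper-2009112905',template:($('scorebox-2009112905').innerHTML),homeabbr:'NYJ',
awayabbr:'CAR'}));"""

print(game_states(sample))  # {'2009112905': 'pre'}
```

A regular expression like this is brittle: it stops matching as soon as the site reorders the arguments or changes its quoting.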
    

    But, to answer your question, BeautifulSoup (seems to) parse it just fine:

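    # Python 2 with BeautifulSoup 3
    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup

    # Download the whole page; read() with no argument reads until EOF
    fp = urlopen("http://www.nfl.com/scores")
    data = fp.read()
    fp.close()

    # Parse the (malformed) HTML and print the <title> element
    soup = BeautifulSoup(data)
    print soup.contents[2].contents[1].contents[1]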
    

    Output:

    <title>NFL Scores: 2009 - Week 12</title>
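
The general idea here is to use a lenient HTML parser rather than a strict XML one. As a rough illustration that does not depend on BeautifulSoup, the Python 3 sketch below uses the standard library's `html.parser`, which tolerates unescaped ampersands and stray tags. The `broken` fragment is a hypothetical stand-in with the same kinds of defects tidy complained about, not the real page, and `TitleFinder` is a name invented for the example.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical fragment: raw "&" in a URL and an unclosed <br> inside a cell.
broken = (
    "<html><head><title>NFL Scores: 2009 - Week 12</title></head>"
    '<body><table><tr><td><a href="/x?id=1&role=qb&tabSeq=2">x</a><br></td></tr>'
    "</table></body></html>"
)

finder = TitleFinder()
finder.feed(broken)
finder.close()
print(finder.title)  # NFL Scores: 2009 - Week 12
```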
    

    Yahoo's NFL scoreboard looks like an easier target, in my opinion... in fact, I've gone ahead and tried it.


    Edit:

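    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

    # YAHOO_SCOREBOARD is assumed to hold the Yahoo scoreboard page's HTML as a string.
    def main():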
        soup = BeautifulSoup(YAHOO_SCOREBOARD)
        on_first_team = True
        scores = []
        hold = None
    
        # Iterate the tr that contains a team's box score
        for item in soup(name="tr", attrs={"align": "center", "class": "ysptblclbg5"}):
            # Easy
            team = item.b.a.string
    
            # Get the box scores since we're industrious
            boxscore = []
            for quarter in item(name="td", attrs={"class": "yspscores"}):
                boxscore.append(int(quarter.string))
    
            # Final score
            sub = item(name="span", attrs={"class": "yspscores"})[0]
            if sub.b:
                # Winning score
                final = int(sub.b.string)
            else:
                data = sub.string.replace("&nbsp;", "")
                if ":" in data:
                    # Catch TV: XXX and 0:00pm ET
                    final = None
                else:
                    try:
                        final = int(data)
                    except ValueError:
                        # Anything that is not a plain number counts as "no score yet"
                        final = None
    
            if on_first_team:
                hold = { team : (boxscore, final) }
                on_first_team = False
            else:
                hold[team] = (boxscore, final)
                scores.append(hold)
                on_first_team = True
    
        for game in scores:
            print "--- Game ---"
            for team in game:
                print team, game[team]

    if __name__ == "__main__":
        main()
    

    --- Game ---
    Green Bay ([0, 13, 14, 7], 34)
    Detroit ([7, 0, 0, 5], 12)
    --- Game ---
    Oakland ([0, 0, 7, 0], 7)
    Dallas ([3, 14, 0, 7], 24)
    

    Look at that, I got the box scores too... For a game that hasn't been played yet, we get:

    --- Game ---
    Washington ([], None)
    Philadelphia ([], None)
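
Given that shape, telling an unplayed game from one with scores is a matter of checking for an empty box score. The Python 3 sketch below does only that. It does not detect that a game has finished, because the output above shows no explicit "final" marker. `has_started` is a name invented for the example, and the two dictionaries are copied from the output above.

```python
def has_started(game):
    """True if any team in `game` has at least one quarter score recorded."""
    return any(len(boxscore) > 0 for boxscore, _final in game.values())


played = {"Green Bay": ([0, 13, 14, 7], 34), "Detroit": ([7, 0, 0, 5], 12)}
upcoming = {"Washington": ([], None), "Philadelphia": ([], None)}

print(has_started(played))    # True
print(has_started(upcoming))  # False
```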
    

    Anyway, that should give you a starting point to jump off from. Good luck.

        2
  •  3
  •   rtucker    15 years ago

    http://www.nfl.com/liveupdate/scorestrip/ss.xml

    This may be easier to parse than the HTML scoreboard.
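
If the feed is well-formed XML, Expat can read it directly, which is what the question asked for. The Python 3 sketch below shows the pattern with the standard library's Expat binding. The sample document is hypothetical: the element name `g` and the attributes `home`, `away`, and `status` are assumptions for illustration and not the feed's actual schema, so inspect the real file before relying on any of them.

```python
import xml.parsers.expat

# Hypothetical sample; element and attribute names are assumptions.
SAMPLE = """<?xml version="1.0"?>
<scores>
  <g home="NYJ" away="CAR" status="pre"/>
  <g home="DET" away="GB" status="final"/>
</scores>"""


def collect_games(document, element="g"):
    """Return the attribute dict of every `element` start tag Expat reports."""
    games = []

    def start(name, attrs):
        if name == element:
            games.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(document, True)
    return games


for game in collect_games(SAMPLE):
    print(game["home"], game["away"], game["status"])
```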

        3
  •  2
  •   bmargulies    15 years ago

    Look into tagsoup.