
How to read PGN data into a data frame

chess dataframe regex r

Griffin Kennedy · Tech Community · 7 years ago

[Event "FIDE World Cup 2017"]
[Site "Tbilisi GEO"]
[Date "2017.09.05"]
[Round "1.1"]
[White "Carlsen, Magnus"]
[Black "Balogun, Oluwafemi"]
[Result "1-0"]
[WhiteTitle "GM"]
[BlackTitle "FM"]
[WhiteElo "2822"]
[BlackElo "2255"]
[ECO "B00"]
[Opening "King's pawn opening"]
[WhiteFideId "1503014"]
[BlackFideId "8501246"]
[EventDate "2017.09.03"]

1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O         
8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 
14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 
20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 
26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. 
Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 
38. Nxd6 Kg6 39. Nf5 1-0

[Event "FIDE World Cup 2017"]    
etc...

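I want to create a data frame from this data in which the column headers are the tag names (the word on the left of each quoted string) and the values are the quoted strings, plus a separate column holding the PGN movetext.

I tried the following, inspired by R: How to read in a PGN as a Data Frame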

pgn <- read.table("~/Desktop/GitHub/Chess/test.pgn", quote="", 
stringsAsFactors=FALSE)

# get column names
column_names <- sub("\\[(\\w+).+", "\\1", pgn[1:17,1])
column_names[17] <- "PGN"
#create DF
pgn.df <- data.frame(matrix(sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1", 
                     pgn[,1]),byrow=TRUE, ncol=17))

names(pgn.df) <- column_names

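The problem is that my PGN movetext spans multiple lines. Is there a way to account for this in the regex? Or is there a way to automatically rewrite the file so that each game's movetext sits on a single line?

Thanks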

3 answers

hrbrmstr 7 years ago

Installation:

devtools::install_github("hrbrmstr/pigeon")

Usage (the tidyverse package isn't actually required, but IMO it prints data frames more cleanly than base R's built-in print method):

library(pigeon)
library(tidyverse)

fide <- read_pgn(system.file("extdata", "r7.pgn", package="pigeon"))

fide
## # A tibble: 2 x 12
##            Event    Site       Date Round               White               Black  Result WhiteElo BlackElo   ECO
## *          <chr>   <chr>      <chr> <chr>               <chr>               <chr>   <chr>    <chr>    <chr> <chr>
## 1 World Cup 2017 Tbilisi 2017.09.23  44.1 Aronian Levon (ARM)    Ding Liren (CHN) 1/2-1/2     2799     2777   A18
## 2 World Cup 2017 Tbilisi 2017.09.24  45.1    Ding Liren (CHN) Aronian Levon (ARM) 1/2-1/2     2777     2799   E06
## # ... with 2 more variables: LiveChessVersion <chr>, Moves <list>

glimpse(fide)
## Observations: 2
## Variables: 12
## $ Event            <chr> "World Cup 2017", "World Cup 2017"
## $ Site             <chr> "Tbilisi", "Tbilisi"
## $ Date             <chr> "2017.09.23", "2017.09.24"
## $ Round            <chr> "44.1", "45.1"
## $ White            <chr> "Aronian Levon (ARM)", "Ding Liren (CHN)"
## $ Black            <chr> "Ding Liren (CHN)", "Aronian Levon (ARM)"
## $ Result           <chr> "1/2-1/2", "1/2-1/2"
## $ WhiteElo         <chr> "2799", "2777"
## $ BlackElo         <chr> "2777", "2799"
## $ ECO              <chr> "A18", "E06"
## $ LiveChessVersion <chr> "1.4.8", "1.4.8"
## $ Moves            <list> [c("c4", "Nf6", "Nc3", "e6", "e4", "d5", "cxd5", "exd5", "e5", "Ne4", "Nf3", "Bf5", "Be2"...

Here is a larger test:

tf <- tempfile(fileext = ".zip")
td <- tempdir()
download.file("https://www.pgnmentor.com/players/Adams.zip",  tf)
fil <- unzip(tf, exdir = td)

adams <- read_pgn(fil)

adams
## # A tibble: 2,982 x 11
##             Event      Site       Date Round              White              Black  Result WhiteElo BlackElo   ECO
##  *          <chr>     <chr>      <chr> <chr>              <chr>              <chr>   <chr>    <chr>    <chr> <chr>
##  1 Lloyds Bank op    London 1984.??.??     1     Adams, Michael    Sedgwick, David     1-0                     C05
##  2 Lloyds Bank op    London 1984.??.??     3     Adams, Michael  Dickenson, Neil F     1-0              2230   C07
##  3 Lloyds Bank op    London 1984.??.??     4       Hebden, Mark     Adams, Michael     1-0     2480            B10
##  4 Lloyds Bank op    London 1984.??.??     5    Pasman, Michael     Adams, Michael     0-1     2310            D42
##  5 Lloyds Bank op    London 1984.??.??     6     Adams, Michael   Levitt, Jonathan 1/2-1/2              2370   B99
##  6 Lloyds Bank op    London 1984.??.??     9     Adams, Michael Saeed, Saeed Ahmed     1-0              2430   B56
##  7         BCF-ch Edinburgh 1985.??.??     1     Adams, Michael   Singh, Sukh Dave 1/2-1/2     2360     2080   B70
##  8         BCF-ch Edinburgh 1985.??.??     2 Abayasekera, Roger     Adams, Michael     1-0     2200     2360   B13
##  9         BCF-ch Edinburgh 1985.??.??     3     Adams, Michael    Jackson, Sheila 1/2-1/2     2360     2225   C85
## 10         BCF-ch Edinburgh 1985.??.??     4     Muir, Andrew J     Adams, Michael 1/2-1/2     2285     2360   E45
## # ... with 2,972 more rows, and 1 more variables: Moves <list>

glimpse(adams)
## Observations: 2,982
## Variables: 11
## $ Event    <chr> "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds ...
## $ Site     <chr> "London", "London", "London", "London", "London", "London", "Edinburgh", "Edinburgh", "Edinburgh",...
## $ Date     <chr> "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1985.??.??", ...
## $ Round    <chr> "1", "3", "4", "5", "6", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "?", "1", "...
## $ White    <chr> "Adams, Michael", "Adams, Michael", "Hebden, Mark", "Pasman, Michael", "Adams, Michael", "Adams, M...
## $ Black    <chr> "Sedgwick, David", "Dickenson, Neil F", "Adams, Michael", "Adams, Michael", "Levitt, Jonathan", "S...
## $ Result   <chr> "1-0", "1-0", "1-0", "0-1", "1/2-1/2", "1-0", "1/2-1/2", "1-0", "1/2-1/2", "1/2-1/2", "1-0", "1/2-...
## $ WhiteElo <chr> "", "", "2480", "2310", "", "", "2360", "2200", "2360", "2285", "2360", "2250", "2360", "2225", "2...
## $ BlackElo <chr> "", "2230", "", "", "2370", "2430", "2080", "2360", "2225", "2360", "2245", "2360", "2260", "2360"...
## $ ECO      <chr> "C05", "C07", "B10", "D42", "B99", "B56", "B70", "B13", "C85", "E45", "C84", "B10", "C85", "A22", ...
## $ Moves    <list> [c("e4", "e6", "d4", "d5", "Nd2", "Nf6", "e5", "Nfd7", "f4", "c5", "c3", "Nc6", "Ndf3", "cxd4", "...

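One benefit of using a full-fledged C "library" (technically it's not a library, but I shoehorned it into one) is that it does more than pattern matching. If a game file is malformed, it won't parse correctly (as it shouldn't).

I still need to run it through ASAN/UBSAN/Valgrind to make sure there are no memory leaks, but if this ends up being useful, let me know and I'll round out the corners on the pkg.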
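
To make the contrast with pure pattern matching concrete, here is a minimal base-R sketch of the pattern-matching route that produces the layout asked for in the question. It is not how pigeon works internally. `pgn_to_df` is a hypothetical helper name, and the sketch assumes one tag pair per line, no duplicate tags within a game, and that a game starts at the first tag line following movetext. It performs no validation, so a malformed file would be read silently rather than rejected.

```r
# Hypothetical helper (not part of pigeon): pattern-matching PGN reader in base R.
pgn_to_df <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]                     # drop blank lines
  is_tag <- grepl('^\\[\\w+ ".*"\\]$', lines)       # e.g. [White "Carlsen, Magnus"]
  # A new game starts at each tag line that does not follow another tag line.
  game <- cumsum(is_tag & !c(FALSE, head(is_tag, -1)))
  rows <- lapply(split(seq_along(lines), game), function(idx) {
    tags <- lines[idx][is_tag[idx]]
    row <- as.list(sub('^\\[\\w+ "(.*)"\\]$', "\\1", tags))
    names(row) <- sub('^\\[(\\w+) .*$', "\\1", tags)
    # Movetext may span several lines; collapse it into one string.
    row$PGN <- paste(lines[idx][!is_tag[idx]], collapse = " ")
    row
  })
  # Games may carry different tag sets: fill missing tags with NA, keep PGN last.
  cols <- unique(unlist(lapply(rows, names)))
  cols <- c(setdiff(cols, "PGN"), "PGN")
  filled <- lapply(rows, function(r) {
    r[setdiff(cols, names(r))] <- NA_character_
    as.data.frame(r[cols], stringsAsFactors = FALSE)
  })
  do.call(rbind, filled)
}

# Self-contained demonstration on a made-up two-game file.
demo <- tempfile(fileext = ".pgn")
writeLines(c(
  '[Event "Demo A"]', '[Result "1-0"]', '',
  '1. e4 d6 2. d4 g6 ', '3. Bc4 Nf6 1-0', '',
  '[Event "Demo B"]', '[Result "0-1"]', '[ECO "B00"]', '',
  '1. e4 Nc6 0-1'
), demo)
games <- pgn_to_df(demo)
str(games)
```

Because the movetext lines are pasted together per game, the multi-line problem from the question does not arise here, and no fixed count of 17 lines per game is assumed.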

wp78de 7 years ago

I would still suggest using an (updated) replacement regex in a preparation step to remove the unwanted line breaks, like this:

/(?:[^\[\]\n\S])\s*\n/ /g

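This was demonstrated with the PGN as input text. However, like you, I had some trouble escaping the special characters in R, so I decided to use Perl instead.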

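use strict;
use warnings;
use File::Slurp;

# Read the whole PGN file named on the command line.
my $text = read_file($ARGV[0]);
# Replace trailing whitespace plus the following line break with a single space.
$text =~ s/(?:[^\[\]\n\S])\s*\n/ /g;
# Overwrite the original file with the joined text.
write_file($ARGV[0], $text);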

This can be called from R like this:

system("perl ~/Desktop/regex.pl ~/Desktop/test.pgn")
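
The same substitution can probably be done without leaving R. The sketch below is an assumption-laden equivalent, not a tested drop-in replacement: `join_movetext` is a hypothetical helper name, the file is assumed to fit in memory, and the pattern is shortened to `[^\S\n]\s*\n`, which should match the same text because `[` and `]` are already non-whitespace. Two things make the escaping work: every backslash is doubled inside the R string, and `perl = TRUE` is passed so that `\S` is understood inside a character class.

```r
# Hypothetical helper: join lines that were wrapped after trailing whitespace.
join_movetext <- function(path) {
  # Read the whole file into a single string, keeping the line breaks.
  text <- paste(readLines(path, warn = FALSE), collapse = "\n")
  # [^\S\n] is any whitespace character except a newline.
  gsub("[^\\S\\n]\\s*\\n", " ", text, perl = TRUE)
}

# Self-contained demonstration on made-up wrapped movetext.
demo <- tempfile(fileext = ".pgn")
writeLines(c(
  '[Event "Demo"]', '[Result "1-0"]', '',
  '1. e4 d6 2. d4 g6 ', '3. Bc4 Nf6 1-0'
), demo)
joined <- join_movetext(demo)
cat(joined, "\n")

# Unlike the Perl script, this does not overwrite the input file;
# writeLines(joined, demo) would do so if that is wanted.
```

Like the Perl version, this only joins lines that end in trailing whitespace, as the wrapped movetext lines in the sample do. A tag line with trailing whitespace would be joined to the line after it as well.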

DangerCat 2 years ago

https://pypi.org/project/pgn2data/

from converter.pgn_data import PGNData as pgnd
import pandas as pd

# This creates two output files, one for game info 
# (white_elo, black_elo, rating_diff, time_control... etc), 
# and one for moves.
 
filename = 'path to .pgn file'
pgn_data = pgnd(filename)
result = pgn_data.export()
result.print_summary()

# Then read the csv with pandas
# Change path to where your files output

path = 'Documents/github/project/folder/'
df_info = pd.read_csv(path + '_game_info.csv')
df_moves = pd.read_csv(path + '_moves.csv')