更新:
对于更复杂的行,包括自定义的一元分析器。
使用列表monad进行简单分析
与megaparsec相比,list monad提供了更好的回溯“parser”。例如,要分析单元格:
row :: [String]
row = ["23", "95489", "0", "20", "9888"]
在满足特定绑定(例如小于30)的恰好三列值中,可以使用以下方法生成所有可能的分析:
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
rowResults :: [String] -> [[Double]]
rowResults = cols 3
where cols :: Int -> [String] -> [[Double]]
cols 0 [] = pure [] -- good, finished on time
cols 0 _ = empty -- bad, didn't use all the data
-- otherwise, parse exactly @n@ columns from cells @xs@
cols n xs = do
-- form @d@ from one or two cells
(d, ys) <- num1 xs <|> num2 xs
-- only accept @d < 30@
guard $ d < 30
ds <- cols (n-1) ys
return $ d : ds
-- read number from a single cell
num1 (x:xs) | ok1 x = pure (read x, xs)
num1 _ = empty
-- read number from two cells
num2 (x:y:zs) | ok1 x && ok2 y = pure (read (x ++ "." ++ y), zs)
num2 _ = empty
-- first cell: "0" is okay, but otherwise can't start with "0"
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- second cell: can't end with "0" (or *be* "0")
ok2 xs = last xs /= '0'
上面的基于列表的解析器试图通过假设“xxx,yyy”是一个数字,“xxx”不会以零开头(除非它只是“0”),而“yyy”不会以零结尾(或者就此而言,是单个“0”)来减少歧义。如果这不正确,只需修改
ok1
和
ok2
视情况而定。
应用于
row
,这将给出一个明确的分析:
> rowResults row
[[23.95489,0.0,20.9888]]
应用于不明确的行,它提供所有分析:
> rowResults ["0", "12", "5", "0", "8601"]
[[0.0,12.5,0.8601],[0.0,12.5,0.8601],[0.12,5.0,0.8601]]
无论如何,我建议使用标准的csv解析器将文件解析为
String
这样的单元格:
dat :: [[String]]
dat = [ ["23", "95489", "0", "20", "9888"]
, ["0", "12", "5", "0", "8601"]
, ["23", "2611", "2", "233", "14", "422"]
]
然后使用
rowResults
上面获取不明确的行数:
> map fst . filter ((>1) . snd) . zip [1..] . map (length . rowResults) $ dat
[2]
>
或不可分析:
> map fst . filter ((==0) . snd) . zip [1..] . map (length . rowResults) $ dat
[]
>
假设没有不可分析的行,则可以重新生成一个可能的固定文件,即使某些行不明确,但只需获取每行的第一个成功解析:
> putStr $ unlines . map (intercalate "," . map show . head . rowResults) $ dat
23.95489,0.0,20.9888
0.0,12.5,0.8601
23.2611,2.233,14.422
>
使用基于列表monad的自定义monad进行更复杂的解析
对于更复杂的分析,例如,如果要分析如下行:
type Stream = [String]
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
使用字符串和数字的混合,实际上编写一个基于列表monad的一元解析器并不难,它可以生成所有可能的解析器。
关键思想是将解析器定义为一个函数,该函数接受一个流并生成一个可能的解析器列表,其中每个可能的解析器表示为
元组
从流的开始部分成功地分析了对象,并与流的其余部分进行了配对。我们的并行解析器被包装成一个新类型,如下所示:
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
注意与类型的相似性
ReadS
从
Text.ParserCombinators.ReadP
从技术上讲,它也是一个“所有可能的分析”解析器(尽管通常只需要一个,但从
reads
呼叫):
type ReadS a = String -> [(a, String)]
总之,我们可以定义
Monad
的实例
PParser
就像这样:
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
这里没有什么太棘手的事情:
pure x
返回单个可能的分析,即结果
x
具有不变的流
s
,而
p >>= f
应用第一个分析器
p
要生成可能的分析列表,请在列表monad中逐个使用它们来计算下一个分析程序
q
要使用(与通常的一元操作一样,它可以依赖于第一个解析的结果),并生成返回的可能的最终解析的列表。
这个
Alternative
和
MonadPlus
实例非常简单——它们只是从单子列表中消除了空性和交替:
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
要运行解析器,我们有:
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
现在我们可以介绍原语:
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
和组合器:
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
以及csv行中各种“术语”的解析器:
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
然后,对于特定的行模式,我们可以编写一个行解析器,就像我们通常使用一元解析器那样:
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
并获得所有可能的分析:
> parse rowResults row0
[Row "Apple" 15 1.5016 2.0 5.3 1801 "11/13/2018" "X101"
,Row "Apple" 15 1.5016 2.5 3.0 1801 "11/13/2018" "X101"]
>
完整的程序是:
{-# LANGUAGE DeriveFunctor #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
type Stream = [String]
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
main = print $ parse rowResults row0
现成的解决方案
我觉得有点奇怪,我找不到一个现有的解析器库来提供这种“所有可能的解析器”。里面的东西
text.parserCombinators.readp文本.parserCombinators.readp
采用正确的方法,但它假定您正在从
弦
而不是来自其他流的任意令牌(在我们的例子中,
弦
S来自A
[String]
)
也许其他人可以指出一个现成的解决方案,它可以避免您必须扮演自己的解析器类型、实例和原语。