
How to create Yacc/Lex rules to embed C source code snippets?

lex yacc c

juhist · Tech Community · 6 years ago

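I'm implementing a custom parser generator with an embedded lexer and parser to parse HTTP headers in an event-driven state machine way. Here are some definitions the eventual parser generator could consume to parse a single header field without the CRLF at the end: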

token host<prio=1> = "[Hh][Oo][Ss][Tt]" ;
token ospace = "[ \t]*" ;
token htoken = "[-!#$%&'*+.^_`|~0-9A-Za-z]+" ;
token hfield = "[\t\x20-\x7E\x80-\xFF]*" ;
token space = " " ;
token htab = "\t" ;
token colon = ":" ;

obsFoldStart = 1*( space | htab ) ;
hdrField =
  obsFoldStart hfield
| host colon ospace hfield<print>
| htoken colon ospace hfield
  ;

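The lexer is based on the maximal munch rule, and tokens are dynamically turned on and off depending on the context, so there is no conflict between htoken and hfield; the priority value resolves the conflict between host and htoken. I'm planning to implement the parser as an LL(1) table parser. I haven't yet decided whether to implement regexp token matching by simulating a nondeterministic finite automaton or to go all the way and expand it into a deterministic finite automaton.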
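To make the selection rule concrete, here is a minimal sketch of maximal munch with context-dependent tokens and a priority tie-break. Everything in it is an assumption for illustration: the names TokenDef, Match, matchHost, matchHtoken and longestMatch are made up, and hand-written matcher functions stand in for the NFA/DFA that has not been chosen yet.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Hypothetical token description; "enabled" models tokens being switched
// on and off by the parser depending on the context.
struct TokenDef {
    std::string name;
    // Returns the length matched at the start of the input, 0 if none.
    std::function<std::size_t(const std::string&)> match;
    int priority;
    bool enabled;
};

struct Match {
    std::string name;    // empty if no enabled token matched
    std::size_t length;  // number of characters consumed
};

// Stand-in for "[Hh][Oo][Ss][Tt]".
std::size_t matchHost(const std::string& in)
{
    static const char word[] = "host";
    if (in.size() < 4) return 0;
    for (std::size_t i = 0; i < 4; ++i)
        if (std::tolower(static_cast<unsigned char>(in[i])) != word[i]) return 0;
    return 4;
}

// Stand-in for "[-!#$%&'*+.^_`|~0-9A-Za-z]+".
std::size_t matchHtoken(const std::string& in)
{
    static const char punct[] = "-!#$%&'*+.^_`|~";
    std::size_t n = 0;
    while (n < in.size() && in[n] != '\0'
           && (std::isalnum(static_cast<unsigned char>(in[n]))
               || std::strchr(punct, in[n]) != nullptr))
        ++n;
    return n;
}

// Maximal munch: the longest match among the enabled tokens wins;
// on equal length the higher priority decides.
Match longestMatch(const std::vector<TokenDef>& tokens, const std::string& input)
{
    Match best{"", 0};
    int bestPrio = 0;
    for (const TokenDef& t : tokens) {
        if (!t.enabled) continue;
        const std::size_t len = t.match(input);
        if (len == 0) continue;  // empty matches are ignored in this sketch
        if (best.name.empty() || len > best.length
            || (len == best.length && t.priority > bestPrio)) {
            best = Match{t.name, len};
            bestPrio = t.priority;
        }
    }
    return best;
}
```

With host (priority 1) and htoken (priority 0) both enabled, the input "Host: a" yields two matches of length 4 and the priority picks host, while "Hostname: a" goes to htoken because its match is longer.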

Now, I would like to include some C source code in my parser generator input:

hdrField =
  obsFoldStart hfield
| host {
  parserState->userdata.was_host = 1;
} colon ospace hfield<print>
| htoken {
  parserState->userdata.was_host = 0;
} colon ospace hfield
  ;

So, what I need is some way to read a text token that ends when the number of } characters read equals the number of { characters read.

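How can I do this? I'm handling comments using BEGIN(COMMENTS) and BEGIN(INITIAL), but I don't believe such a strategy would work for embedded C source. Also, comment handling could complicate the handling of embedded C source a lot, because I don't believe a single token can have a comment in the middle of it.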

Basically, I need the embedded C snippet as a C string that I can store in my data structures.

1 answer | up to 6 years ago

Scheff's Cat · 6 years ago

So, I took some of my generated lex code and made it stand-alone.

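(I hope it is OK that I used C++ code, although I noticed the question is tagged c only. As far as I can see, this only concerns the buffer handling, which is delegated to std::string here.)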

scanC.l:

%{

#include <iostream>
#include <string>

#ifdef _WIN32
/// disables #include <unistd.h>
#define YY_NO_UNISTD_H
#endif // _WIN32

// buffer for collected C/C++ code
static std::string cCode;
// counter for braces
static int nBraces = 0;

%}

/* Options */

/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap


EOL ("\n"|"\r"|"\r\n")
SPC ([ \t]|"\\"{EOL})*
LITERAL "\""("\\".|[^\\"])*"\""

%s CODE

%%

<INITIAL>"{" { cCode = '{'; nBraces = 1; BEGIN(CODE); }
<INITIAL>. |
<INITIAL>{EOL} { std::cout << yytext; }
<INITIAL><<EOF>> { return 0; }

<CODE>"{" {
  cCode += '{'; ++nBraces;
  //updateFilePos(yytext, yyleng);
} break;
<CODE>"}" {
  cCode += '}'; //updateFilePos(yytext, yyleng);
  if (!--nBraces) {
    BEGIN(INITIAL);
    //return new Token(filePosCCode, Token::TkCCode, cCode.c_str());
    std::cout << '\n'
      << "Embedded C code:\n"
      << cCode << "// End of embedded C code\n";
  }
} break;

<CODE>"/*" { // C comments
  cCode += "/*"; //_filePosCComment = _filePos;
  //updateFilePos(yytext, yyleng);
  // yyinput() returns int; depending on the flex version, end of input
  // is reported as EOF or as 0, so both are checked.
  int c0, c1 = ' ';
  bool closed = false;
  while (!closed) {
    c0 = c1; c1 = yyinput();
    if (c1 == EOF || c1 == 0) break;
    switch (c1) {
      case '\r': break;
      case '\n':
        cCode += '\n'; //updateFilePos(&c1, 1);
        break;
      default:
        if (c0 == '\r') { // a lone CR is normalized to LF
          cCode += '\n'; //updateFilePos("\n", 1);
        }
        cCode += (char)c1; //updateFilePos(&c1, 1);
    }
    closed = c0 == '*' && c1 == '/';
  }
  if (!closed) {
    //ErrorFile error(_filePosCComment, "'/*' without '*/'!");
    //throw ErrorFilePrematureEOF(_filePos);
    std::cerr << "ERROR! '/*' without '*/'!\n";
    return -1;
  }
} break;
<CODE>"//"[^\r\n]* | /* C++ one-line comments */
<CODE>"'"("\\".|[^\\'])+"'" | /*"/* C/C++ character constants */
<CODE>{LITERAL} | /* C/C++ string constants */
<CODE>"#"[^\r\n]* | /* preprocessor commands */
<CODE>[ \t]+ | /* non-empty white space */
<CODE>[^\r\n] { // any other character except EOL
  cCode += yytext;
  //updateFilePos(yytext, yyleng);
} break;
<CODE>{EOL} { // special handling for EOL
  cCode += '\n';
  //updateFilePos(yytext, yyleng);
} break;
<CODE><<EOF>> { // premature EOF
  //ErrorFile error(_filePosCCode,
  //  compose("%1 '{' without '}'!", _nBraces));
  //_errorManager.add(error);
  //throw ErrorFilePrematureEOF(_filePos);
  std::cerr << "ERROR! Premature end of input. (Not enough '}'s.)\n";
  return -1; // an <<EOF>> action has to end the scan, otherwise the rule fires again
}

%%

int main(int argc, char **argv)
{
  return yylex();
}
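
If the final generator cannot rely on flex (the question mentions an embedded lexer), the same brace counting can also be done by hand. The following is only a minimal sketch under the assumption that the whole input is already in a std::string; findBlockEnd is a made-up name, and unlike the flex rules above it does not treat preprocessor lines specially.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index one past the '}' that closes the
// '{' at src[start], or std::string::npos if the block is not terminated.
// String constants, character constants and comments are skipped so that
// braces inside them are not counted.
std::size_t findBlockEnd(const std::string& src, std::size_t start)
{
    const std::size_t n = src.size();
    if (start >= n || src[start] != '{') return std::string::npos;
    int depth = 0;
    std::size_t i = start;
    while (i < n) {
        const char c = src[i];
        if (c == '"' || c == '\'') {               // string or char constant
            const char quote = c;
            ++i;
            while (i < n && src[i] != quote) {
                if (src[i] == '\\') ++i;           // skip the escaped character
                ++i;
            }
            if (i >= n) return std::string::npos;  // unterminated constant
            ++i;                                   // closing quote
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {  // C comment
            const std::size_t end = src.find("*/", i + 2);
            if (end == std::string::npos) return std::string::npos;
            i = end + 2;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {  // C++ comment
            const std::size_t end = src.find('\n', i);
            if (end == std::string::npos) return std::string::npos;
            i = end;                               // continue at the newline
        } else {
            if (c == '{') ++depth;
            else if (c == '}' && --depth == 0) return i + 1;
            ++i;
        }
    }
    return std::string::npos;                      // not enough '}'
}
```

The snippet itself is then src.substr(start, end - start), where end is the returned index.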

Sample text to scan, scanC.txt:

Hello juhist.

The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:
{ // the start of C code
  // a C++ comment
  /* a C comment
   * (Remember that nested /*s are not supported.)
   */
  #define MAX 1024
  static char buffer[MAX] = "", empty="\"\"";

  /* It is important that tokens are recognized to a limited amount.
   * Otherwise, it would be too easy to fool the scanner with }}}
   * where they have no meaning.
   */
  char *theSameForStringConstants = "}}}";
  char *andCharConstants = '}}}';

  int main() { return yylex(); }
}
This code should be just copied
(with a remark that the scanner recognized the C code a such.)

Greetings, Scheff.

Compiled and tested in cygwin64:

$ flex --version
flex 2.6.4

$ flex -o scanC.cc scanC.l

$ g++ --version
g++ (GCC) 7.3.0

$ g++ -std=c++11 -o scanC scanC.cc

$ ./scanC < scanC.txt
Hello juhist.

The text without braces doesn't need to have any syntax.
It just echoes the characters until it finds a block:

Embedded C code:
{ // the start of C code
  // a C++ comment
  /* a C comment
   * (Remember that nested /*s are not supported.)
   */
  #define MAX 1024
  static char buffer[MAX] = "", empty="\"\"";

  /* It is important that tokens are recognized to a limited amount.
   * Otherwise, it would be too easy to fool the scanner with }}}
   * where they have no meaning.
   */
  char *theSameForStringConstants = "}}}";
  char *andCharConstants = '}}}';

  int main() { return yylex(); }

}// End of embedded C code
This code should be just copied
(with a remark that the scanner recognized the C code a such.)

Greetings, Scheff.
$
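
The question asks for the snippet as a C string, and the commented-out Token line in scanC.l passes cCode.c_str() on. Since cCode is reused for the next block, a plain C data structure would need its own copy. A minimal sketch of that hand-over, assuming a hypothetical C-style record named Action (storeAction and freeAction are made-up names as well):

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical C-style record of the parser generator.
struct Action {
    char *code;  // owned, NUL-terminated copy of the embedded C snippet
};

// Copies the collected snippet into malloc'ed memory so that plain C code
// can keep it after the std::string buffer is reused.
// Returns false if the allocation fails.
bool storeAction(Action& action, const std::string& cCode)
{
    char *copy = static_cast<char *>(std::malloc(cCode.size() + 1));
    if (!copy) return false;
    std::memcpy(copy, cCode.c_str(), cCode.size() + 1);  // includes the '\0'
    action.code = copy;
    return true;
}

// Releases the copy made by storeAction().
void freeAction(Action& action)
{
    std::free(action.code);
    action.code = nullptr;
}
```

In the scanner, storeAction() would be called where the closing } brings nBraces back to 0, before BEGIN(INITIAL) hands control back to the outer rules.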

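Notes:

This is taken from a helper tool (not for sale). Hence, it is not bullet-proof, but it is good enough for productive code.
It is surely possible to fool this tool with a creative combination of macros and unbalanced { }.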

So, this might at least be a start for further development.

To check this against a C lex specification, I referred to ANSI C grammar, Lex specification.