JavaScript: map a regex with multiple matches at the right occurrence

text-processing regex javascript

loretoparisi · 6 years ago

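I have an array of tokens to map, and a regex that finds the begin and end position of each token within the input sentence. This works fine when a token occurs once. When a token occurs several times, the global RegExp finds every position where the token matches in the text, so the position recorded for the i-th occurrence of the token ends up being the last position found.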

For example, given the text

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";

the first occurrence of the token down is mapped to the last position in the text matched by the RegExp, so I get:

 {
    "index": 2,
    "word": "down",
    "characterOffsetBegin": 70,
    "characterOffsetEnd": 73
  }

Running this example makes the problem clear:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    var wordStart = match.index;
    var wordEnd = wordStart + word.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
  }
  return item;
});
console.log(annotations)

whereas the first occurrence of the token down should be mapped to the first matching position:

 {
    "index": 2,
    "word": "down",
    "characterOffsetBegin": 6,
    "characterOffsetEnd": 9
  }

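So, assuming the token positions have been mapped for every occurrence of a token in the text, i.e. the first occurrence of down to the first match, the second occurrence to the second match, and so on, I can reconstruct the text from characterOffsetBegin and characterOffsetEnd like this: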

// `results` is assumed to hold the annotated sentences, each with a `tokens` array
var newtext = '';
results.sentences.forEach(sentence => {
  sentence.tokens.forEach(token => {
    newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + ' ';
  });
  newtext += '\n';
});
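
To make the round trip concrete, here is a minimal sketch of that reconstruction on a short made-up sentence. The function name `rebuildText`, the sample text, and the shape of `sampleResults` (sentences holding tokens with inclusive offsets) are assumptions for illustration only; the offsets were written by hand.

```javascript
// Rebuild text from tokens whose end offset is inclusive (hence the + 1).
function rebuildText(text, results) {
  let newtext = "";
  results.sentences.forEach(sentence => {
    sentence.tokens.forEach(token => {
      newtext += text.substring(token.characterOffsetBegin, token.characterOffsetEnd + 1) + " ";
    });
    newtext += "\n";
  });
  return newtext;
}

// Hypothetical sample data: each occurrence of "down" carries its own offsets.
const sampleText = "down the street down";
const sampleResults = {
  sentences: [{
    tokens: [
      { word: "down", characterOffsetBegin: 0, characterOffsetEnd: 3 },
      { word: "the", characterOffsetBegin: 5, characterOffsetEnd: 7 },
      { word: "street", characterOffsetBegin: 9, characterOffsetEnd: 14 },
      { word: "down", characterOffsetBegin: 16, characterOffsetEnd: 19 }
    ]
  }]
};

console.log(JSON.stringify(rebuildText(sampleText, sampleResults)));
```

Note that every token is followed by a space and every sentence by a newline, so the rebuilt string only matches the original up to whitespace.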

2 Answers

Felix Kling · 6 years ago

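The problem is not that the expression is greedy. The while loop finds every occurrence of the token in the input string, so each later match overwrites the offsets of the earlier one.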

You have to do two things:

Stop iterating once you have found a match.
Keep track of previous matches so that you can ignore them.

I believe this is what you want:

var text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
var tokens = text.split(/\s+/g);
const seen = new Map();

var annotations = tokens.map((word, tokenIndex) => { // for each token
  let item = {
    "index": (tokenIndex + 1),
    "word": word
  }
  var wordRegex = RegExp("\\b(" + word + ")\\b", "g");
  var match = null;
  while ((match = wordRegex.exec(text)) !== null) {
    if (match.index > (seen.get(word) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + word.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(word, wordEnd);
      break;
    }
  }
  return item;
});
console.log(annotations)

The seen map keeps track of the end position of the most recent match for each token.

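This version does not tell the regex engine where to begin, so it still uses a while loop and skips any match that does not start after the end of the previous match for that token: if (match.index > (seen.get(word) || -1)). A global regex can also be started at a given position by setting its lastIndex before calling exec.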
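
As a variation on the loop above, a regex created with the `g` flag reads its `lastIndex` property as the starting position for `exec`, so the search can resume right after the previous match instead of re-scanning from the start. The sketch below is one way to do that, not the answer's own code; `annotate`, `escapeRegExp`, and `nextStart` are names made up for this illustration, and the escaping step is an addition so that tokens containing regex metacharacters do not break the pattern.

```javascript
// Escape regex metacharacters so a token is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function annotate(text) {
  const tokens = text.split(/\s+/g);
  const nextStart = new Map(); // token -> position just past its previous match

  return tokens.map((word, tokenIndex) => {
    const item = { index: tokenIndex + 1, word: word };
    const wordRegex = new RegExp("\\b(" + escapeRegExp(word) + ")\\b", "g");
    // With the "g" flag, exec() starts searching at lastIndex.
    wordRegex.lastIndex = nextStart.get(word) || 0;
    const match = wordRegex.exec(text);
    if (match !== null) {
      item.characterOffsetBegin = match.index;
      item.characterOffsetEnd = match.index + word.length - 1; // inclusive end
      nextStart.set(word, match.index + word.length);
    }
    return item;
  });
}

const text = "Steve down walks warily down the street down\nWith the brim pulled way down low";
console.log(annotate(text).filter(a => a.word === "down"));
```

One caveat that applies to both versions: `\b` only matches at a boundary between a word character and a non-word character, so a token that starts or ends with punctuation (for example a word with a trailing comma) would not be found and its offsets would stay undefined.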

Tiny Giant · 6 years ago

@Felix's answer covers the cause of your problem, but I'd like to take it a step further.

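I would put everything in a class (or constructor function) to keep it self-contained, and separate the logic that extracts a token's matches from the text from the iteration over the tokens.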

class Annotations {
  constructor(text) {
    if(typeof text !== 'string') throw new TypeError('text must be a string') // a class constructor cannot return null
    const opt = { enumerable: false, configurable: false, writable: false }
    Object.defineProperty(this, 'text', { value: text, ...opt })
    Object.defineProperty(this, 'tokens', { value: text.split(/\s+/g), ...opt })
    for(let token of this.tokens) this[token] = Array.from(this.matchAll(token))
  }
  * matchAll(token) {
    if(typeof token === 'string' && this.text.indexOf(token) > -1) {
      const expression = new RegExp("\\b" + token + "\\b", "g")
      let match = expression.exec(this.text)

      while(match !== null) {
        const start = match.index
        const end = start + token.length - 1
        yield { start, end }
        match = expression.exec(this.text)
      }
    }
  }
}

const annotations = new Annotations("Steve down walks warily down the street down\nWith the brim pulled way down low")

console.log(annotations.text)
console.log(annotations.tokens)
console.log(annotations)
console.log(Array.from(annotations.matchAll('foo'))) // []
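
The class collects every match per token, but the question asks for one annotation per token occurrence. A minimal sketch of that last step follows: pair the k-th occurrence of a token with the k-th entry of its match list. The name `pairOccurrences` and the hand-written sample data are assumptions for illustration; the `{ start, end }` shape mirrors what `matchAll` yields above.

```javascript
// tokens: words in reading order.
// matchesByToken: object mapping each token to its ordered list of { start, end }.
function pairOccurrences(tokens, matchesByToken) {
  const used = new Map(); // token -> how many of its occurrences were consumed so far
  return tokens.map((word, i) => {
    const k = used.get(word) || 0;
    used.set(word, k + 1);
    // Only read own properties, so tokens like "constructor" do not hit the prototype.
    const list = Object.prototype.hasOwnProperty.call(matchesByToken, word)
      ? matchesByToken[word]
      : [];
    const item = { index: i + 1, word: word };
    if (list[k]) {
      item.characterOffsetBegin = list[k].start;
      item.characterOffsetEnd = list[k].end;
    }
    return item;
  });
}

// Hypothetical sample: matches for the text "down the street down".
const sampleTokens = ["down", "the", "street", "down"];
const sampleMatches = {
  down: [{ start: 0, end: 3 }, { start: 16, end: 19 }],
  the: [{ start: 5, end: 7 }],
  street: [{ start: 9, end: 14 }]
};

console.log(pairOccurrences(sampleTokens, sampleMatches));
```

With the class above, the call would presumably be `pairOccurrences(annotations.tokens, annotations)`, since each token's match list is stored as a property of the instance.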
