代码之家 › 专栏 › 技术社区 › Jonathan Sampson

代码高尔夫:从文本快速构建关键字列表,包括实例

rosetta-stone language-agnostic text-parsing code-golf

Jonathan Sampson · 技术社区 · 5 年前

我已经用PHP为自己设计了这个解决方案,但我很好奇它是如何以不同的方式完成的——甚至更好。我主要感兴趣的两种语言是PHP和JavaScript,但我感兴趣的是,在今天的任何其他主要语言(主要是C、Java、Java等)中都可以看到这一速度。

仅返回出现次数大于x的单词
仅返回长度大于y的单词
忽略“and,is,the,etc”等常用术语
在处理之前,请随意删除标点符号(即“john”变为“john”)。
返回集合/数组中的结果

额外信贷

把引用的陈述放在一起(即“它们太好了,显然不真实”)。
如果“太好了,不可能是真的”,那么实际的陈述就是

额外学分

你的剧本能根据单词在一起的频率来决定应该放在一起的单词吗?这是在不事先知道单词的情况下完成的。例子:
*“果蝇在医学研究方面是一件大事。对果蝇进行了大量的研究,取得了许多突破。在未来,果蝇将继续被研究,但我们的方法可能会改变。*
很明显,这里的单词是“果蝇”,我们很容易找到。你的搜索脚本也能确定这一点吗?

源文本: http://sampsonresume.com/labs/c.txt

应答格式

除了操作持续的时间外,还可以看到代码、输出的结果。

13 回复 | 直到 5 年前

liori 15 年前

GNU脚本

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

结果:

  7 be
  6 to
[...]
  1 2.
  1 -

出现大于x时:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

仅返回长度大于y的单词(在第二个grep中输入y+1点):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

忽略“and,is,the,etc”等常用术语(假设常用术语在“ignored”文件中)

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

在处理之前,请随意删除标点符号(即“john”变为“john”):

sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c

返回集合/数组中的结果:它已经类似于shell的数组:第一列是count,第二列是word。

Luke Girvin Nathan Bedford 14 年前

Perl只有43个字符。

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

下面是它的使用示例:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

如果只需要列出小写版本,则还需要两个字符。

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

要处理指定的文本,需要58个字符。

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'

real    0m0.679s
user    0m0.304s
sys     0m0.084s

下面是最后一个扩展了一点的示例。

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}

Juliet 15 年前

弗斯 304个字符

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)

Robert K 15 年前

红宝石

当“缩小”时,此实现将变为165个字符长。它使用 array#inject 给出一个起始值(默认值为0的散列对象),然后循环遍历元素,然后将元素卷进散列;然后从最小频率中选择结果。

注意,我没有计算要跳过的单词的大小,这是一个外部常量。当对常量进行计数时,解决方案的长度为244个字符。

撇号和破折号不会被去掉,而是包含在内;它们的使用会修改单词,因此,如果不删除符号之外的所有信息,就不能简单地去掉。

实施

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

试验台

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

试验结果

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times

Noldorin 15 年前

C 3.0(带LINQ)

这是我的解决方案。它利用LINQ/扩展方法的一些非常好的特性来保持代码的简短。

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

然而,这远远不是最有效的方法, O(n^2) 用字数,而不是 O(n) 我相信在这种情况下这是最佳的。我会看看是否可以创建一个稍微长一点的更有效的方法。

下面是对示例文本运行函数的结果(最小出现次数:3,最小长度:2)。

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

我的测试程序:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}

Alex Feinman 7 年前

#! perl
use strict;
use warnings;

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  print "$word occurred $words{$word} times.";
}

这就是简单的形式。如果需要排序、筛选等:

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
    print "$word occurred $words{$word} times.";
  }
}

您还可以很容易地对输出进行排序:

...
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
    push @output, "$word occurred $words{$word} times.";
  }
}
$re = qr/occurred (\d+) /;
print sort {
  $a = $a =~ $re;
  $b = $b =~ $re;
  $a <=> $b
} @output;

一个真正的Perl黑客可以很容易地在一行或两行上获取这些信息,但我追求的是可读性。

Edit: this is how I would rewrite this last example

...
for my $word (
  sort { $words{$a} <=> $words{$b} } keys %words
){
  next unless length($word) >= $MINLEN;
  last unless $words{$word) >= $MIN_OCCURRENCE;

  print "$word occurred $words{$word} times.";
}

或者如果我需要它运行得更快,我甚至可以这样写:

for my $word_data (
  sort {
    $a->[1] <=> $b->[1] # numerical sort on count
  } grep {
    # remove values that are out of bounds
    length($_->[0]) >= $MINLEN &&      # word length
    $_->[1] >= $MIN_OCCURRENCE # count
  } map {
    # [ word, count ]
    [ $_, $words{$_} ]
  } keys %words
){
  my( $word, $count ) = @$word_data;
  print "$word occurred $count times.";
}

它使用地图来提高效率, grep删除多余元素, 当然,排序就是进行排序。 (按顺序进行)

这是 Schwartzian transform .

gooli 15 年前

另一个python解决方案,247个字符。实际的代码是一行由134个字符组成的高度密集的python行,用一个表达式计算整个过程。

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

一个更长的版本,有大量的评论供您阅读:

# High and low count boundaries.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A special function that groups similar strings in a list into a 
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace 
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper)) 
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurences and common words using yet another 
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print result

这里的主要技巧是使用itertools.groupby函数来计算排序列表中出现的次数。不知道它是否真的保存了字符,但它确实允许在一个表达式中进行所有处理。

结果:

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}

Kamarey 14 年前

C代码:

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

processText(文本,3,2)调用的输出:

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators

leppie 15 年前

C中:

使用linq,特别是group by,然后按group count过滤,并返回扁平(selectmany)列表。
使用LINQ,按长度过滤。
使用Linq,用“BadWords”筛选。包含。

Gregory Higley 15 年前

雷布尔

也许是冗长的,所以肯定不是赢家,而是完成任务。

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, its good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

输出是:

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing

Jeremy Mullin 15 年前

蟒蛇 (258个字符,包括首行66个字符和删除标点符号30个字符):

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

输出:

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types

Kuroki Kaze 15 年前

下面是我的变体,在php中:


   $str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");

$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';

foreach ($array as $word) { // words longer than n (=5)
//    if(strlen($word) > 5)echo $word.'<br/>';
}

   
   
   
    输出:
   
   statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1

   
    
     var_dump
    
    语句只显示一致性。此变量保留双引号表达式。
   
   
    对于提供的文件,此代码结束于
    
     零点零四七
    
    秒。虽然较大的文件会消耗大量的内存(因为
    
     file
    
    函数)。

Sinan Ünür 14 年前

这不会赢得任何高尔夫球奖,但它会将引用的短语放在一起,并考虑到停止词(和利用 CPAN 模块 Lingua::StopWords 和 Text::ParseWords )

此外,我使用 to_S 从 Lingua::EN::Inflect::Number 只计算单词的单数形式。

你可能还想看看 Lingua::CollinsParser .

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

输出:

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1