
Code golf: quickly build a list of keywords from text, including the number of instances

  •  12
  • Jonathan Sampson  · Tech Community  · 5 years ago

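    I've already worked out this solution for myself in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also be interested in seeing how quickly this can be done in any other major language in use today (mostly C#, Java, etc.).

    1. Return only words with an occurrence count greater than X
    2. Return only words with a length greater than Y
    3. Ignore common terms like "and, is, the, etc."
    4. Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John").
    5. Return the results in a collection/array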
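    The five requirements above can be sketched in Python roughly as follows. This is only an illustration, not any poster's solution: the function name `keywords`, the `STOP_WORDS` set and the choice to drop a possessive "'s" before stripping other punctuation are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the question only names "and, is, the, etc."
STOP_WORDS = {"and", "is", "the", "of", "to", "a", "in"}

def keywords(text, min_count, min_length, stop_words=STOP_WORDS):
    """Return {word: count} for words that occur more than min_count times
    and are longer than min_length characters."""
    # Requirement 4: drop a possessive 's, so "John's" is counted as "john".
    text = re.sub(r"'s\b", "", text.lower())
    # Keeping only runs of letters discards the remaining punctuation.
    words = re.findall(r"[a-z]+", text)
    # Requirement 3: ignore common terms while counting.
    counts = Counter(w for w in words if w not in stop_words)
    # Requirements 1, 2 and 5: filter by count and length, return a dict.
    return {w: n for w, n in counts.items()
            if n > min_count and len(w) > min_length}

sample = "John's dog and the dog of John. The dog is John's."
print(keywords(sample, 2, 2))
```

    Both thresholds are exclusive here ("greater than"), following the wording of the question.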

    Extra Credit

    1. Keep quoted statements together (i.e. "They were 'too good to be true' apparently").
      Here 'too good to be true' would be the actual statement.
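    One possible way to keep quoted statements together is to tokenize with a regular expression that tries a quoted run first and falls back to single words. The sketch below assumes phrases are delimited by double quotes (single quotes would collide with apostrophes); `TOKEN` and `tokenize` are names invented for the example.

```python
import re

# First alternative: a double-quoted phrase (captured without the quotes).
# Second alternative: a single run of word characters.
TOKEN = re.compile(r'"([^"]+)"|(\w+)')

def tokenize(text):
    """Split text into lowercase tokens, keeping quoted phrases whole."""
    tokens = []
    for quoted, word in TOKEN.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        tokens.append((quoted or word).lower())
    return tokens

print(tokenize('They were "too good to be true" apparently'))
```

    The resulting token list can then be counted like any other list of words.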

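    Extra-Extra Credit

    1. Can your script determine which words should be kept together based on how often they are found together? This should be done without knowing the words beforehand. Example:
      *"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
      Clearly the word here is "fruit fly", which is easy for us to find. Can your search script determine this too?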
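    A simple heuristic for this is to count adjacent word pairs and keep the pairs that recur. The sketch below is one assumed approach, not something taken from the answers: `frequent_pairs` and its `STOP_WORDS` set are hypothetical, and pairs containing a stop word are skipped so that "the fruit" does not tie with "fruit fly".

```python
import re
from collections import Counter

# Hypothetical stop words, chosen to cover the sample paragraph.
STOP_WORDS = {"the", "a", "is", "it", "to", "on", "in",
              "and", "has", "be", "but", "our"}

def frequent_pairs(text, min_count=2, stop_words=STOP_WORDS):
    """Return {"word1 word2": count} for adjacent pairs seen at least
    min_count times, ignoring pairs that contain a stop word."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(
        (a, b) for a, b in zip(words, words[1:])
        if a not in stop_words and b not in stop_words
    )
    return {" ".join(p): n for p, n in pairs.items() if n >= min_count}

text = ("The fruit fly is a great thing when it comes to medical research. "
        "Much study has been done on the fruit fly in the past, and has led "
        "to many breakthroughs. In the future, the fruit fly will continue "
        "to be studied, but our methods may change.")
print(frequent_pairs(text))  # {'fruit fly': 3}
```

    Pairs are counted across sentence boundaries in this sketch; splitting on sentences first would be a reasonable refinement.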

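    Source text: http://sampsonresume.com/labs/c.txt

    Answer Format

    1. It would be great to see the results of your code and its output, in addition to how long the operation took.
    13 replies  |  up to 5 years ago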
        1
  •  11
  •   liori    15 years ago

    GNU scripting

    sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
    

    Results:

      7 be
      6 to
    [...]
      1 2.
      1 -
    

    With an occurrence count greater than X:

    sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
    

    Return only words with a length greater than Y (put Y+1 dots in the second grep):

    sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
    

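    Ignore common terms like "and, is, the, etc." (assuming the common terms are listed one per line in the file 'ignored'; -x and -F make grep match whole words literally rather than substrings or patterns):

    sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vxFf ignored | sort | uniq -c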
    

    Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"):

    sed -e 's/[,.:"'\'']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c
    

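    Return the results in a collection/array: the output is already array-like as far as the shell is concerned: the first column is the count, the second is the word.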
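    For comparison, the `sort | uniq -c | sort -nr` core of these pipelines corresponds roughly to the following Python sketch. It is not part of the answer; `count_words` is a name made up for the example.

```python
from collections import Counter

def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    # str.split() with no argument splits on any whitespace and drops
    # empty fields, covering both the sed and the grep -v '^ *$' steps.
    return Counter(text.split()).most_common()

for word, count in count_words("to be or not to be to"):
    print(count, word)
```

    As in the shell version, punctuation and stop words still have to be filtered in a separate step.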

        2
  •  6
  •   Luke Girvin Nathan Bedford    14 years ago

    Perl in only 43 characters.

    perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
    

    Here is an example of its use:

    echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
    
    ---
    a: 3
    aa: 1
    b: 2
    c: 1
    d: 1
    e: 1
    

    If you only need to list the lowercase versions, it takes two more characters.

    perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
    

    To make it work on the specified text, it takes 58 characters.

    curl http://sampsonresume.com/labs/c.txt |
    perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
    
    real    0m0.679s
    user    0m0.304s
    sys     0m0.084s
    

    Here is the last example, expanded a bit.

    #! perl
    use 5.010;
    use YAML;
    
    while( my $line = <> ){
      for my $elem ( split '\W+', $line ){
        $_{ lc $elem }++
      }
      END{
        say Dump \%_;
      }
    }
    
        3
  •  4
  •   Juliet    15 years ago

    F#: 304 chars

    let f =
        let bad = Set.ofSeq ["and";"is";"the";"of";"are";"by";"it"]
        fun length occurrence msg ->
            System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
            |> Seq.countBy (fun a -> a)
            |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
    
        4
  •  3
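  •   Robert K    15 years ago

    Ruby

    When "minified", this implementation is 165 characters long. It uses Array#inject with a starting value (a Hash object with a default of 0), loops through the elements, and rolls each one into the hash; the result is then filtered by the minimum frequency.

    Note that I didn't count the size of the list of words to skip, since that is an external constant. When the constant is counted too, the solution is 244 characters long.

    Apostrophes and dashes are not stripped but kept; they modify the word, so they cannot simply be stripped without also removing all the information after the symbol.

    Implementation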

    CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
    def get_keywords(text, minFreq=0, minLen=2)
      text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
        inject(Hash.new(0)) do |result,w|
          w.downcase!
          result[w] += 1 unless CommonWords.include?(w)
          result
        end.select { |k,n| n >= minFreq }
    end
    

    Test Rig

    require 'net/http'
    
    keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
    keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
    

    Test Results

    code x 4 times
    declarations x 4 times
    each x 3 times
    execution x 3 times
    expression x 4 times
    function x 5 times
    keywords x 3 times
    language x 3 times
    languages x 3 times
    new x 3 times
    operators x 4 times
    programming x 3 times
    statement x 7 times
    statements x 4 times
    such x 3 times
    types x 3 times
    variables x 3 times
    which x 4 times
    
        5
  •  3
  •   Noldorin    15 years ago

    C# 3.0 (with LINQ)

    Here's my solution. It uses some rather nice features of LINQ/extension methods to keep the code short.

    public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
    {
        var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
            "for", "by", "an", "be", "may", "has", "can", "its"};
        var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
        var occurrences = words.Distinct().Except(commonWords).Select(w =>
            new { Word = w, Count = words.Count(s => s == w) });
        return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
            .ToDictionary(wo => wo.Word, wo => wo.Count);
    }
    

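    However, this is far from the most efficient method: it is O(n^2) in the number of words (each distinct word triggers a full words.Count scan), rather than O(n), which I believe is optimal here. I'll see if I can create a slightly longer method that is more efficient.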
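    To illustrate the O(n) alternative mentioned here, a single pass with a hash-based counter is enough. The Python sketch below is an assumed equivalent, not the follow-up the author promised; `get_keywords` mirrors the C# method name, and `COMMON_WORDS` copies the word list from the code above.

```python
from collections import Counter

COMMON_WORDS = {"and", "is", "the", "as", "of", "to", "or", "in",
                "for", "by", "an", "be", "may", "has", "can", "its"}

def get_keywords(words, min_count, min_length, common=COMMON_WORDS):
    """Count every word in one pass, then filter the distinct words."""
    counts = Counter(words)  # one hash update per word: O(n) expected
    return {w: n for w, n in counts.items()
            if n >= min_count and len(w) >= min_length and w not in common}

words = "a statement is a statement and code is code is code".split()
print(get_keywords(words, 2, 2))
```

    The filtering step only touches each distinct word once, so the total work stays linear in the input size.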

    Here are the results of running the function on the sample text (min occurrences: 3, min length: 2).

      3 x such
      4 x code
      4 x which
      4 x declarations
      5 x function
      4 x statements
      3 x new
      3 x types
      3 x keywords
      7 x statement
      3 x language
      3 x expression
      3 x execution
      3 x programming
      4 x operators
      3 x variables
    

    My test program:

    static void Main(string[] args)
    {
        string sampleText;
        using (var client = new WebClient())
            sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
        var keywords = GetKeywords(sampleText, 3, 2);
        foreach (var entry in keywords)
            Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
        Console.ReadKey(true);
    }
    
        6
  •  3
  •   Alex Feinman    7 years ago
    #! perl
    use strict;
    use warnings;
    
    my %words;    # word => number of occurrences (must be declared under strict)
    
    while (<>) {
      for my $word (split) {
        $words{$word}++;
      }
    }
    for my $word (keys %words) {
      print "$word occurred $words{$word} times.\n";
    }
    

    That's the simple form. If you want sorting, filtering, etc.:

    while (<>) {
      for my $word (split) {
        $words{$word}++;
      }
    }
    for my $word (keys %words) {
      if (length($word) >= $MINLEN && $words{$word} >= $MIN_OCCURRENCE) {
        print "$word occurred $words{$word} times.\n";
      }
    }
    

    You can also sort the output pretty easily:

    ...
    my @output;
    for my $word (keys %words) {
      if (length($word) >= $MINLEN && $words{$word} >= $MIN_OCCURRENCE) {
        push @output, "$word occurred $words{$word} times.\n";
      }
    }
    my $re = qr/occurred (\d+) /;
    print sort {
      # capture the counts in list context; assigning to $a/$b would clobber @output
      my ($x) = $a =~ $re;
      my ($y) = $b =~ $re;
      $x <=> $y
    } @output;
    

    A true Perl hacker could easily get each of these onto a line or two, but I went for readability.



    Brad

    Edit: this is how I would rewrite this last example

    ...
    for my $word (
      sort { $words{$a} <=> $words{$b} } keys %words
    ){
      next unless length($word) >= $MINLEN;
      # the list is sorted by ascending count, so skip low counts rather than stop
      next unless $words{$word} >= $MIN_OCCURRENCE;
    
      print "$word occurred $words{$word} times.\n";
    }
    

    Or if I needed it to run faster, I might even write it like this:

    for my $word_data (
      sort {
        $a->[1] <=> $b->[1] # numerical sort on count
      } grep {
        # remove values that are out of bounds
        length($_->[0]) >= $MINLEN &&      # word length
        $_->[1] >= $MIN_OCCURRENCE # count
      } map {
        # [ word, count ]
        [ $_, $words{$_} ]
      } keys %words
    ){
      my( $word, $count ) = @$word_data;
      print "$word occurred $count times.";
    }
    

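    It uses map to pair each word with its count up front (so the hash is not consulted again during the sort), grep to remove the extra elements, and sort to do the sorting, of course. (The list operations run in that order: map, then grep, then sort.)

    This is the Schwartzian transform.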
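    The same decorate, filter, sort, undecorate shape can be sketched in Python as below. This is only an illustration of the idea; `sort_by_count` and its parameters are hypothetical names, and in everyday Python `sorted(..., key=...)` would usually be preferred over explicit decoration.

```python
def sort_by_count(words, min_len, min_occurrence):
    """words maps word -> count; return (word, count) pairs sorted by count."""
    # Decorate (the Perl map step): pair each word with its sort key once.
    decorated = [(count, word) for word, count in words.items()]
    # Filter (the Perl grep step): drop entries that are out of bounds.
    kept = [(count, word) for count, word in decorated
            if len(word) >= min_len and count >= min_occurrence]
    # Sort on the precomputed key only; list.sort is stable.
    kept.sort(key=lambda pair: pair[0])
    # Undecorate: return the pairs in (word, count) order.
    return [(word, count) for count, word in kept]

counts = {"statement": 7, "be": 7, "code": 4, "goto": 1}
print(sort_by_count(counts, 3, 2))  # [('code', 4), ('statement', 7)]
```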

        7
  •  2
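  •   gooli    15 years ago

    Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python of 134 chars that computes the whole thing in a single expression.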

    x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
    from itertools import groupby as gb
    d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
        gb(sorted(open("c.txt").read().lower().split())))
        if l>x and len(w)>y and w not in W)
    

    A much longer version, with plenty of comments for your reading pleasure:

    # Thresholds: minimum occurrence count (x) and minimum word length (y);
    # both are exclusive, i.e. a word needs count > x and length > y.
    x = 3
    y = 2
    
    # Common words string split into a list by spaces.
    Words = "and is the as of to or in for by an be may has can its".split()
    
    # A special function that groups runs of equal strings in a list into
    # (string, grouper) pairs. Grouper is an iterator of occurrences (see below).
    from itertools import groupby
    
    # Reads the entire file, converts it to lower case and splits on whitespace 
    # to create a sorted list of words
    sortedWords = sorted(open("c.txt").read().lower().split())
    
    # Using the groupby function, groups equal words together.
    # Since grouper is an iterator of occurrences we need to use len(list(grouper)) 
    # to get the word count by first converting the iterator to a list and then
    # getting the length of the list.
    wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
    
    # Filters the words by number of occurrences and common words using yet another 
    # generator expression.
    filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
    
    # Creates a dictionary from the (word, count) tuples.
    result = dict(filteredWordCounts)
    
    print(result)
    

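    The main trick here is using the itertools.groupby function to count the occurrences in a sorted list. I don't know whether it really saves characters, but it does allow all the processing to happen in a single expression.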
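    A stripped-down view of the groupby trick, as a small sketch with a made-up helper name (`count_sorted`): groupby only merges adjacent equal items, which is why the list has to be sorted first.

```python
from itertools import groupby

def count_sorted(words):
    """Count occurrences by grouping runs of equal items in a sorted list."""
    # sum(1 for _ in group) counts the run without building a list.
    return {word: sum(1 for _ in group)
            for word, group in groupby(sorted(words))}

print(count_sorted("b a b c a b".split()))

# Without sorting, equal items that are not adjacent form separate groups:
print([key for key, _ in groupby("abab")])  # ['a', 'b', 'a', 'b']
```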

    Results:

    {'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
    
        8
  •  2
  •   Kamarey    14 years ago

    C# code:

    IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
    {
        // common words, that will be ignored
        var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
        // regular expression to find quoted text
        var regex = new Regex("\"[^\"]+\"", RegexOptions.Compiled);
    
        return
            // remove quoted text (it will be processed later)
            regex.Replace(text, "")
            // remove case dependency
            .ToLower()
            // split text by all these chars
            .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
            // add quoted text
            .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
            // group words by the word and count them
            .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
            // apply filter(min word count and word length) and remove common words 
            .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
    }
    

    Output of the ProcessText(text, 3, 2) call:

    3 x languages
    3 x such
    4 x code
    4 x which
    3 x based
    3 x each
    4 x declarations
    5 x function
    4 x statements
    3 x new
    3 x types
    3 x keywords
    3 x variables
    7 x statement
    4 x expression
    3 x execution
    3 x programming
    3 x operators
    
        9
  •  1
  •   leppie    15 years ago

    In C#:

    1. Use LINQ, specifically group by, then filter by group count and return a flattened (SelectMany) list.

    2. Use LINQ, filter by length.

    3. Use LINQ, filter with 'badwords'.Contains.

        10
  •  1
  •   Gregory Higley    15 years ago

    REBOL

    Verbose, perhaps, so definitely not a winner, but it gets the job done.

    min-length: 0
    min-count: 0
    
    common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
    
    add-word: func [
        word [string!]
        /local
            count
            letter
            non-letter
            temp
            rules
            match
    ][    
        ; Strip out punctuation
        temp: copy {}
        letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
        non-letter: complement letter
        rules: [
            some [
                copy match letter (append temp match)
                |
                non-letter
            ]
        ]
        parse/all word rules
        word: temp
    
        ; If we end up with nothing, bail
        if 0 == length? word [
            exit
        ]
    
        ; Check length
        if min-length > length? word [
            exit
        ]
    
        ; Ignore common words
        ignore: 
        if find common-words word [
            exit
        ]
    
        ; OK, its good. Add it.
        either found? count: select words word [
            words/(word): count + 1
        ][
            repend words [word 1]
        ]
    ]
    
    rules: [
        some [
            {"}
            copy word to {"} (add-word word)
            {"}
            |
            copy word to { } (add-word word)
            { }
        ]
        end
    ]
    
    words: copy []
    parse/all read %c.txt rules
    
    result: copy []
    foreach word words [
        if string? word [
            count: words/:word
            if count >= min-count [
                append result word
            ]
        ]
    ]
    
    sort result
    foreach word result [ print word ]
    

    The output is:

    act
    actions
    all
    allows
    also
    any
    appear
    arbitrary
    arguments
    assign
    assigned
    based
    be
    because
    been
    before
    below
    between
    braces
    branches
    break
    builtin
    but
    C
    C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
    call
    called
    calls
    can
    care
    case
    char
    code
    columnbased
    comma
    Comments
    common
    compiler
    conditional
    consisting
    contain
    contains
    continue
    control
    controlflow
    criticized
    Cs
    curly brackets
    declarations
    define
    definitions
    degree
    delimiters
    designated
    directly
    dowhile
    each
    effect
    effects
    either
    enclosed
    enclosing
    end
    entry
    enum
    evaluated
    evaluation
    evaluations
    even
    example
    executed
    execution
    exert
    expression
    expressionExpressions
    expressions
    familiarity
    file
    followed
    following
    format
    FORTRAN
    freeform
    function
    functions
    goto
    has
    high
    However
    identified
    ifelse
    imperative
    include
    including
    initialization
    innermost
    int
    integer
    interleaved
    Introduction
    iterative
    Kernighan
    keywords
    label
    language
    languages
    languagesAlthough
    leave
    limit
    lineEach
    loop
    looping
    many
    may
    mimicked
    modify
    more
    most
    name
    needed
    new
    next
    nonstructured
    normal
    object
    obtain
    occur
    often
    omitted
    on
    operands
    operator
    operators
    optimization
    order
    other
    perhaps
    permits
    points
    programmers
    programming
    provides
    rather
    reinitialization
    reliable
    requires
    reserve
    reserved
    restrictions
    results
    return
    Ritchie
    say
    scope
    Sections
    see
    selects
    semicolon
    separate
    sequence
    sequence point
    sequential
    several
    side
    single
    skip
    sometimes
    source
    specify
    statement
    statements
    storage
    struct
    Structured
    structuresAs
    such
    supported
    switch
    syntax
    testing
    textlinebased
    than
    There
    This
    turn
    type
    types
    union
    Unlike
    unspecified
    use
    used
    uses
    using
    usually
    value
    values
    variable
    variables
    variety
    which
    while
    whitespace
    widespread
    will
    within
    writing
    
        11
  •  1
  •   Jeremy Mullin    15 years ago

    Python (258 chars, including 66 chars for the first line and 30 chars to remove punctuation):

    W="and is the as of to or in for by an be may has can its".split()
    x=3;y=2;d={}
    for l in open('c.txt') :
        for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
            if w not in W: d[w]=d.get(w,0)+1
    for w,n in d.items() :
        if n>y and len(w)>x : print n,w
    

    Output:

    4 code
    3 keywords
    3 languages
    3 execution
    3 each
    3 language
    4 expression
    4 statements
    3 variables
    7 statement
    5 function
    4 operators
    4 declarations
    3 programming
    4 which
    3 such
    3 types
    
        12
  •  0
  •   Kuroki Kaze    15 years ago

    Here is my variant, in PHP:

    $str = implode(file('c.txt'));
    
    $splitters = '\s.,\(\)\{\};?:'; // string splitters
    $array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
    
    // count each token once; repeating the split-and-count pass would double every count
    $res = array();
    foreach($array as $key) {
        $res[$key] = isset($res[$key]) ? $res[$key] + 1 : 1;
    }
    
    unset($res['the']);
    unset($res['and']);
    unset($res['to']);
    unset($res['of']);
    unset($res['by']);
    unset($res['a']);
    unset($res['as']);
    unset($res['is']);
    unset($res['in']);
    unset($res['']);
    
    arsort($res);
    //var_dump($res); // concordance
    foreach ($res AS $word => $rarity)
        echo $word . ' <b>x</b> ' . $rarity . '<br/>';
    
    foreach ($array as $word) { // words longer than n (=5)
    //    if(strlen($word) > 5)echo $word.'<br/>';
    }
    

    Output:

    statement x 7
    be x 7
    C x 5
    may x 5
    for x 5
    or x 5
    The x 5
    as x 5
    expression x 4
    statements x 4
    code x 4
    function x 4
    which x 4
    an x 4
    declarations x 3
    new x 3
    execution x 3
    types x 3
    such x 3
    variables x 3
    can x 3
    languages x 3
    operators x 3
    end x 2
    programming x 2
    evaluated x 2
    functions x 2
    definitions x 2
    keywords x 2
    followed x 2
    contain x 2
    several x 2
    side x 2
    most x 2
    has x 2
    its x 2
    called x 2
    specify x 2
    reinitialization x 2
    use x 2
    either x 2
    each x 2
    all x 2
    built-in x 2
    source x 2
    are x 2
    storage x 2
    than x 2
    effects x 1
    including x 1
    arguments x 1
    order x 1
    even x 1
    unspecified x 1
    evaluations x 1
    operands x 1
    interleaved x 1
    However x 1
    value x 1
    branches x 1
    goto x 1
    directly x 1
    designated x 1
    label x 1
    non-structured x 1
    also x 1
    enclosing x 1
    innermost x 1
    loop x 1
    skip x 1
    There x 1
    within x 1
    switch x 1
    Expressions x 1
    integer x 1
    variety x 1
    see x 1
    below x 1
    will x 1
    on x 1
    selects x 1
    case x 1
    executed x 1
    based x 1
    calls x 1
    from x 1
    because x 1
    many x 1
    widespread x 1
    familiarity x 1
    C's x 1
    mimicked x 1
    Although x 1
    reliable x 1
    obtain x 1
    results x 1
    needed x 1
    other x 1
    syntax x 1
    often x 1
    Introduction x 1
    say x 1
    Programming x 1
    Language x 1
    C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
    Ritchie x 1
    Kernighan x 1
    been x 1
    criticized x 1
    For x 1
    example x 1
    care x 1
    more x 1
    leave x 1
    return x 1
    call x 1
    && x 1
    || x 1
    entry x 1
    include x 1
    next x 1
    before x 1
    sequence point x 1
    sequence x 1
    points x 1
    comma x 1
    operator x 1
    but x 1
    compiler x 1
    requires x 1
    programmers x 1
    exert x 1
    optimization x 1
    object x 1
    This x 1
    permits x 1
    high x 1
    degree x 1
    occur x 1
    Structured x 1
    using x 1
    struct x 1
    union x 1
    enum x 1
    define x 1
    Declarations x 1
    file x 1
    contains x 1
    Function x 1
    turn x 1
    assign x 1
    perhaps x 1
    Keywords x 1
    char x 1
    int x 1
    Sections x 1
    name x 1
    variable x 1
    reserve x 1
    usually x 1
    writing x 1
    type x 1
    Each x 1
    line x 1
    format x 1
    rather x 1
    column-based x 1
    text-line-based x 1
    whitespace x 1
    arbitrary x 1
    FORTRAN x 1
    77 x 1
    free-form x 1
    allows x 1
    restrictions x 1
    Comments x 1
    C99 x 1
    following x 1
    // x 1
    until x 1
    */ x 1
    /* x 1
    appear x 1
    between x 1
    delimiters x 1
    enclosed x 1
    braces x 1
    supported x 1
    if x 1
    -else x 1
    conditional x 1
    Unlike x 1
    reserved x 1
    sequential x 1
    provides x 1
    control-flow x 1
    identified x 1
    do-while x 1
    while x 1
    any x 1
    omitted x 1
    break x 1
    continue x 1
    expressions x 1
    testing x 1
    iterative x 1
    looping x 1
    separate x 1
    initialization x 1
    normal x 1
    modify x 1
    control x 1
    structures x 1
    As x 1
    imperative x 1
    single x 1
    act x 1
    sometimes x 1
    curly brackets x 1
    limit x 1
    scope x 1
    language x 1
    uses x 1
    evaluation x 1
    assigned x 1
    values x 1
    To x 1
    effect x 1
    semicolon x 1
    actions x 1
    common x 1
    consisting x 1
    used x 1
    

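    The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.

    For the supplied file this code finishes in 0.047 seconds, though larger files will consume a lot of memory (because the file function reads the whole file at once).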
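    The memory concern can be avoided by counting line by line instead of loading the whole file. The Python sketch below shows the idea under that assumption; `count_stream` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real file so the example runs on its own.

```python
import io
from collections import Counter

def count_stream(lines):
    """Count words from any iterable of lines, such as an open file."""
    counts = Counter()
    for line in lines:  # only the current line is held in memory
        counts.update(line.lower().split())
    return counts

# With a real file this would be: with open("c.txt") as f: count_stream(f)
sample = io.StringIO("to be\nor not to be\n")
print(count_stream(sample).most_common(2))
```

    Memory use then grows with the number of distinct words rather than with the size of the file.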

        13
  •  0
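  •   Sinan Ünür    14 years ago

    This is not going to win any golfing awards, but it does keep quoted phrases together and takes stop words into account (and it leverages the CPAN modules Lingua::StopWords and Text::ParseWords).

    In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.

    You might also want to look at Lingua::CollinsParser.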

    #!/usr/bin/perl
    
    use strict; use warnings;
    
    use Lingua::EN::Inflect::Number qw( to_S );
    use Lingua::StopWords qw( getStopWords );
    use Text::ParseWords;
    
    my $stop = getStopWords('en');
    
    my %words;
    
    while ( my $line = <> ) {
        chomp $line;
        next unless $line =~ /\S/;
        next unless my @words = parse_line(' ', 1, $line);
    
        ++ $words{to_S $_} for
            grep { length and not $stop->{$_} }
            map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
            @words;
    }
    
    print "=== only words appearing 4 or more times ===\n";
    print "$_ : $words{$_}\n" for sort {
        $words{$b} <=> $words{$a}
    } grep { $words{$_} > 3 } keys %words;
    
    print "=== only words that are 12 characters or longer ===\n";
    print "$_ : $words{$_}\n" for sort {
        $words{$b} <=> $words{$a}
    } grep { 11 < length } keys %words;
    

    Output:

    === only words appearing 4 or more times ===
    statement : 11
    function : 7
    expression : 6
    may : 5
    code : 4
    variable : 4
    operator : 4
    declaration : 4
    c : 4
    type : 4
    === only words that are 12 characters or longer ===
    reinitialization : 2
    control-flow : 1
    sequence point : 1
    optimization : 1
    curly brackets : 1
    text-line-based : 1
    non-structured : 1
    column-based : 1
    initialization : 1