代码之家 › 专栏 › 技术社区 › NoahD

如何使用Perl计算文件中的字符、单词和行数?

counting file perl

NoahD · 技术社区 · 16 年前

使用Perl(不使用wc)计算文本文件的字符、单词和行数的好/最好方法是什么?

10 回复 | 直到 10 年前

bmdhacks 16 年前

这是Perl代码。计算单词可能有点主观,但我只是说,它是不是空白的任何字符串。

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

Nic Gibson 16 年前

bmdhacks答案的一个可能产生更好结果的变体是使用\s+(甚至更好的\w+)作为分隔符。考虑字符串“The Quick Brown Fox”(如果不明显,则添加空格)。使用单个空白字符的分隔符将给出6而不是4的字数。因此,尝试:

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

使用\w+作为分隔符将停止标点符号(以及其他内容)作为单词计数。

TStamper 16 年前

这个 Word Count tool 计算文本文件中的字符、单词和行数

tchrist 14 年前

在这里。试试这个熟悉Unicode的wc程序版本。

它跳过非文件参数(管道、目录、套接字等)。
它采用UTF-8文本。
它将任何Unicode空白作为单词分隔符进行计数。
如果存在一个 .ENCODING 在文件名末尾,比如 foo.cp1252 , foo.latin1 , foo.utf16 等。
它还可以处理以各种格式压缩的文件。
它给出了 段落、线条、文字、字形、字符 和 字节。
它了解所有Unicode换行符序列。
它警告有换行错误的损坏的文本文件。

这里是一个运行它的例子:

Â Â Â ParasÂ Â Â Â LinesÂ Â Â Â WordsÂ Â Â GraphsÂ Â Â Â CharsÂ Â Â Â BytesÂ File
Â Â Â Â Â Â Â 2Â Â Â Â Â 2270Â Â Â Â 82249Â Â Â 504169Â Â Â 504333Â Â Â 528663Â /tmp/ap
Â Â Â Â Â Â Â 1Â Â Â Â Â 2404Â Â Â Â 11163Â Â Â Â 63164Â Â Â Â 63164Â Â Â Â 66336Â /tmp/b3
    uwc:Â missingÂ linebreakÂ atÂ endÂ ofÂ corruptedÂ textfiileÂ /tmp/bad
Â Â Â Â Â Â 1*Â Â Â Â Â Â Â 2*Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 19Â Â Â Â Â Â Â 19Â Â Â Â Â Â Â 19Â /tmp/bad
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â 14Â Â Â Â Â Â Â 52Â Â Â Â Â Â 273Â Â Â Â Â Â 273Â Â Â Â Â Â 293Â /tmp/es
Â Â Â Â Â Â 57Â Â Â Â Â Â 383Â Â Â Â Â 1369Â Â Â Â 11997Â Â Â Â 11997Â Â Â Â 12001Â /tmp/funny
Â Â Â Â Â Â Â 1Â Â Â 657068Â Â 3175429Â 31205970Â 31209138Â 32633834Â /tmp/lw
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â /tmp/nf.cp1252
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 34Â /tmp/nf.euc-jp
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â /tmp/nf.latin1
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â /tmp/nf.macroman
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 54Â /tmp/nf.ucs2
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 56Â /tmp/nf.utf16
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 54Â /tmp/nf.utf16be
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 54Â /tmp/nf.utf16le
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â 112Â /tmp/nf.utf32
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â 108Â /tmp/nf.utf32be
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â 108Â /tmp/nf.utf32le
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 39Â /tmp/nf.utf7
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 27Â Â Â Â Â Â Â 31Â /tmp/nf.utf8
Â Â Â Â Â Â Â 1Â Â Â Â 26906Â Â Â 101528Â Â Â 635841Â Â Â 636026Â Â Â 661202Â /tmp/o2
     131Â Â Â Â Â Â 346Â Â Â Â Â 1370Â Â Â Â Â 9590Â Â Â Â Â 9590Â Â Â Â Â 4486Â /tmp/perl5122delta.pod.gz
     291Â Â Â Â Â Â 814Â Â Â Â Â 3941Â Â Â Â 25318Â Â Â Â 25318Â Â Â Â Â 9878Â /tmp/perl51310delta.pod.bz2
Â Â Â Â Â Â Â 1Â Â Â Â Â 2551Â Â Â Â Â 5345Â Â Â 132655Â Â Â 132655Â Â Â 133178Â /tmp/tailsort-pl.utf8
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â 89Â Â Â Â Â Â 334Â Â Â Â Â 1784Â Â Â Â Â 1784Â Â Â Â Â 2094Â /tmp/til
Â Â Â Â Â Â Â 1Â Â Â Â Â Â Â Â 4Â Â Â Â Â Â Â 18Â Â Â Â Â Â Â 88Â Â Â Â Â Â Â 88Â Â Â Â Â Â 106Â /tmp/w
Â Â Â Â Â 276Â Â Â Â Â 1736Â Â Â Â Â 5773Â Â Â Â 53782Â Â Â Â 53782Â Â Â Â 53804Â /tmp/www

来吧:

#!/usr/bin/env perl 
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################

use 5.10.0;

use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];

use Carp;

$SIG{__WARN__}  = sub {
    confess("FATALIZED WARNING: @_")  unless $^S;
};

$SIG{__DIE__}  = sub {
    confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

process_input(\&countem);

sub countem { 
    my ($_, $file) = @_;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount, 
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+  }x }; 
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;
        #$grafcount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $paracount = eval { @paras = split m{ \R{2,} }x }; 
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        } 
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    } 
    my $mask = "%8s " x 6 . "%s\n";
    printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) } 
                                $paracount, $linecount, 
                                $wordcount, $grafcount, 
                                $charcount, $bytecount,
                       ), $file;
} 

sub show_undef {
    my $value = shift;
    return defined($value)
             ? $value
             : "undef";
} 

END { 
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}

sub process_input {

    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle

        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! âº
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        } 
    } 

    return $path;

}

Community CDub 8 年前

我在搜索字符计数解决方案时偶然发现了这个问题。不可否认,我对Perl几乎一无所知,所以其中一些可能是不基本的,但下面是我对Newt解决方案的一些调整。

首先,有一个内置的行计数变量,所以我只是使用它。我想这可能更有效一点。事实上,字符数包括换行字符,这可能不是您想要的,所以我选择了$uuu。 Perl还抱怨split()的执行方式(隐式split,请参见: Why does Perl complain "Use of implicit split to @_ is deprecated"? )所以我调整了一下。我的输入文件是UTF-8,所以我打开了它们。这可能有助于在包含非ASCII字符的输入文件中获得正确的字符计数。

代码如下:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";

Schwern 10 年前

有 Perl Power Tools 其目标是重建所有unix bin实用程序的项目,主要是为那些缺少unix的操作系统上的实用程序。是的,他们做到了。 wc . 实现是多余的,但它是 POSIX compliant .

当您看到符合GNU的 true .

Paul Tomblin 16 年前

非严肃回答:

system("wc foo");

ephemient 16 年前

以固定大小的块读取文件可能比逐行读取更有效。这个 wc Binary会这样做。

#!/usr/bin/env perl

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        continue;
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += /\s+/g;
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;

    print "\t$lines\t$words\t$chars\t$file\n";
}

ÐÐµÑÐ¾Ð² 16 年前

要能够计算字符而不是字节,请考虑以下情况:
(尝试使用以utf8格式保存的中文或西里尔文字母和文件)

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;
use bytes;
print length $txt,$/;

Jassi 11 年前

这可能对Perl初学者有所帮助。我尝试模拟MS字数统计功能,并添加了一个在Linux中没有使用wc显示的功能。

行数
词数
带空格的字符数
不带空格的字符数(wc不会在其输出中给出此值,但Microsoft Word会显示此值。)

以下是网址: Counting words,characters and lines in a file