How do I sort the lines of a text file in Perl?

sorting scripting perl

Aakash Goel · 14 years ago

I have a couple of text files (A.txt and B.txt) that look like this (each may have around 10000 rows):

processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521

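I need to check that every row in A.txt is also present in B.txt (B.txt may have more rows, which is fine).

The problem is that the rows can be in any order in the two files, so I am thinking of sorting both files in O(n log n) and then matching each line of A.txt against the following lines of B.txt in O(n). I could build a hash, but the files are big and this comparison happens only once, after which the files are regenerated, so I don't think that is a good idea.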

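What is the best way to sort the files in Perl? Any ordering will do, as long as it is a consistent ordering.

For example, in dictionary order this would be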

processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521

As mentioned above, any ordering is fine as long as Perl does it quickly.

I want to do this from within Perl code, after opening the file like so

open (FH, "<A.txt");

Any comments, ideas, etc. would be helpful.

6 answers | latest 14 years ago
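For reference, the sort-then-scan plan described in the question could look roughly like the sketch below. It assumes both files fit in memory; `sorted_lines` and `missing_from` are hypothetical helper names, and plain string order (`sort` with no comparator) stands in for "any ordering".

```perl
use strict;
use warnings;

# Read a file and return its lines (newlines stripped) in string order.
sub sorted_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    my @sorted = sort @lines;
    return @sorted;
}

# Return the lines of @$a_ref that are absent from @$b_ref.
# Both lists must already be sorted in the same (string) order.
sub missing_from {
    my ($a_ref, $b_ref) = @_;
    my @missing;
    my $j = 0;
    for my $line (@$a_ref) {
        # Skip the B lines that sort before the current A line.
        $j++ while $j < @$b_ref && $b_ref->[$j] lt $line;
        push @missing, $line
            unless $j < @$b_ref && $b_ref->[$j] eq $line;
    }
    return @missing;
}

# Possible usage:
#   my @a = sorted_lines('A.txt');
#   my @b = sorted_lines('B.txt');
#   my @missing = missing_from(\@a, \@b);
#   print "Every row of A.txt is in B.txt\n" unless @missing;
```

The index into B only ever moves forward, so the scan after sorting is linear in the combined number of lines.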

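zigdon · 14 years ago

To sort a file within your script you still have to load the whole thing into memory. If you are doing that anyway, I'm not sure what advantage sorting has over loading it into a hash?

Something like this would work: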

my %seen;
open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
    $seen{$_}=1;
}
close A;

open(B, "<B.txt") or die "Can't read B: $!";
while(<B>) {
  delete $seen{$_};
}
close B;

print "Lines found in A, missing in B:\n";
print keys %seen;   # each key still carries its trailing newline

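FMc TLP · 14 years ago

Here is another way to do it. The idea is to build a flexible data structure that lets you answer many kinds of questions easily with grep.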

use strict;
use warnings;

my ($fileA, $fileB) = @ARGV;

# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;

# Do whatever you need.
my @all_lines = keys %h;
my @in_both   = grep {     keys %{$h{$_}} == 2       } keys %h;
my @in_A      = grep {     exists $h{$_}{$fileA}     } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB}     } @in_A;
my @in_A_mult = grep {            $h{$_}{$fileA} > 1 } @in_A;

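recursive9 · 14 years ago

Well, I routinely parse very large (600MB) Apache log files with Perl and store the information in a hash. I have also gone through about 30 of these files in a single script run, using the same hash. As long as you have enough memory, it is no big deal.

DVK · 14 years ago

May I ask why this has to be done in native Perl? If the cost of a system call or three is not an issue (for example, you do this rarely and not in a tight loop), why not simply do: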

my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
    print "Stuff in A exists that aren't in B\n";
}

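Note that the comm options may differ depending on exactly what you want.

cjm · 14 years ago

As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I have never had occasion to try either one, so I don't know which would suit you better.

Another possibility is to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I can't say whether a DBM file would be faster or slower than sorting, but the code would probably be simpler.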
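A minimal sketch of the DBM idea, assuming the core AnyDBM_File and Fcntl modules and short lines (some DBM backends limit key size). `missing_via_dbm` and its `$dbm_path` scratch-file argument are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;
use Fcntl;          # O_RDWR, O_CREAT
use AnyDBM_File;    # uses whichever DBM backend is installed

# Return the lines of $file_a that do not occur in $file_b.
# The index of B lives in a DBM file at $dbm_path instead of in memory.
sub missing_via_dbm {
    my ($file_a, $file_b, $dbm_path) = @_;

    tie my %seen, 'AnyDBM_File', $dbm_path, O_RDWR | O_CREAT, 0644
        or die "Can't tie $dbm_path: $!";

    # Store every line of B as a key in the disk-backed hash.
    open my $fh_b, '<', $file_b or die "Can't read $file_b: $!";
    while (my $line = <$fh_b>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $fh_b;

    # Collect the lines of A that have no matching key.
    my @missing;
    open my $fh_a, '<', $file_a or die "Can't read $file_a: $!";
    while (my $line = <$fh_a>) {
        chomp $line;
        push @missing, $line unless exists $seen{$line};
    }
    close $fh_a;

    untie %seen;
    return @missing;
}

# Possible usage:
#   my @missing = missing_via_dbm('A.txt', 'B.txt', '/tmp/seen_b');
```

The DBM file persists between runs, so a stale one left over from an earlier comparison should be removed first, or its old keys will still count as present.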

HerbN · 14 years ago

Test whether A.txt is a subset of B.txt

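use strict;
use warnings;

open my $fh_b, '<', 'B.txt' or die "Can't read B.txt: $!";
open my $fh_a, '<', 'A.txt' or die "Can't read A.txt: $!";

# Index every row of B by its three comma-separated fields.
my %bFile;

while (<$fh_b>) {
   chomp;
   my ($process, $id1, $id2) = split /,/;
   $bFile{$process}{$id1}{$id2}++;
}

my $missingRows = 0;

while (<$fh_a>) {
   chomp;
   my ($process, $id1, $id2) = split /,/;
   $missingRows++ unless $bFile{$process}{$id1}{$id2};
   # One miss means they aren't all verified, so stop reading
   last if $missingRows;
}

# 1 if every row of A.txt was found in B.txt, otherwise 0
my $is_Atxt_Subset_Btxt = $missingRows ? 0 : 1;

This gives you a test of whether all the lines in A are in B: it reads every line of B into a hash once, then checks each line of A against that hash as A is read.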