
How do I split a text file using PowerShell?

  •  58
  • Ralph Shillington  · 15 years ago

    I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; say, 100 files of 5 MB each would be fine.

    I would think this should be a walk in the park for PowerShell. How can I do it?

    14 replies  |  as of 15 years ago
        1
  •  40
  •   thomasb DaveRead    6 years ago

    This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and to use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

    $upperBound = 50MB # calculated by Powershell
    $ext = "log"
    $rootName = "log_"
    
    $reader = new-object System.IO.StreamReader("C:\Exceptions.log")
    $count = 1
    $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    while(($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line
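        # once the current chunk reaches the size limit, switch to the next file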
        if((Get-ChildItem -path $fileName).Length -ge $upperBound)
        {
            ++$count
            $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
        }
    }
    
    $reader.Close()
    
        2
  •  46
  •   Hugo Buff    8 years ago

    A word of warning about some of the existing answers: they run very slowly on really big files. For a 1.6 GB log file I gave up after a couple of hours, having realized it would not finish before I returned to work the next day.

    Two problems: the call to Add-Content opens, seeks to the end of, and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines slows things down too, but my guess is that Add-Content is the main culprit.

    The following variation produces slightly less pleasant output (it will split files in the middle of lines), but it splits my 1.6 GB log in under a minute:

    $from = "C:\temp\large_log.txt"
    $rootName = "C:\temp\large_log_chunk"
    $ext = "txt"
    $upperBound = 100MB
    
    
    $fromFile = [io.file]::OpenRead($from)
    $buff = new-object byte[] $upperBound
    $count = $idx = 0
    try {
        do {
            "Reading $upperBound"
            $count = $fromFile.Read($buff, 0, $buff.Length)
            if ($count -gt 0) {
                $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
                $toFile = [io.file]::OpenWrite($to)
                try {
                    "Writing $count to $to"
                    $tofile.Write($buff, 0, $count)
                } finally {
                    $tofile.Close()
                }
            }
            $idx ++
        } while ($count -gt 0)
    }
    finally {
        $fromFile.Close()
    }
    
        3
  •  29
  •   Ivan    10 years ago

    A simple one-liner to split based on the number of lines (100 in this case):

    $i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
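
    Note that Out-File in Windows PowerShell writes UTF-16 by default, which roughly doubles the size of an ASCII log on disk. A minimal variation of the same one-liner that keeps the chunks in UTF-8 (everything else, including the elided file name, is unchanged from above):

    $i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt -Encoding utf8}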
    
        4
  •  27
  •   Vincent De Smet    9 years ago

    Same as all of the answers here, but using a StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.

    Note: I do very little error checking, so I can't guarantee this will work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split into chunks of 100,000 lines each, in 95 seconds).

    #split test
    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()
    $filename = "C:\Users\Vincent\Desktop\test.txt"
    $rootName = "C:\Users\Vincent\Desktop\result"
    $ext = ".txt"
    
    $linesperFile = 100000#100k
    $filecount = 1
    $reader = $null
    try{
        $reader = [io.file]::OpenText($filename)
        try{
            "Creating file number $filecount"
            $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
            $filecount++
            $linecount = 0
    
            while($reader.EndOfStream -ne $true) {
                "Reading $linesperFile"
                while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                    $writer.WriteLine($reader.ReadLine());
                    $linecount++
                }
    
                if($reader.EndOfStream -ne $true) {
                    "Closing file"
                    $writer.Dispose();
    
                    "Creating file number $filecount"
                    $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                    $filecount++
                    $linecount = 0
                }
            }
        } finally {
            $writer.Dispose();
        }
    } finally {
        $reader.Dispose();
    }
    $sw.Stop()
    
    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
    

    Output splitting a 1.7 GB file:

    ...
    Creating file number 45
    Reading 100000
    Closing file
    Creating file number 46
    Reading 100000
    Closing file
    Creating file number 47
    Reading 100000
    Closing file
    Creating file number 48
    Reading 100000
    Split complete in  95.6308289 seconds
    
        5
  •  15
  •   Josh    15 years ago

    I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3), and it does the trick.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file contains a maximum number of lines.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Count
    # (Or -c) The maximum number of lines in each file.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv 3000 -rc 1
    #
    #.LINK 
    # Out-TempFile
    ##############################################################################
    function Split-File {
    
        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
    
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,
    
            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,
    
            [Alias('c')]
            [Parameter(Position=2,Mandatory=$true)]
            [Int32]$Count,
    
            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',
    
            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
    
        )
    
        process {
    
            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }
    
            Resolve-Path @ResolveArgs | %{
    
                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)
    
                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
    
                # get the input file in manageable chunks
    
                $Part = 1
                Get-Content $_ -ReadCount:$Count | %{
    
                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
    
                # In the first iteration the header will be
                # copied to the output file as usual;
                # on subsequent iterations we have to do it ourselves
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }
    
                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $_
    
                    $Part += 1
    
                }
    
            }
    
        }
    
    }
    
        6
  •  14
  •   user202448    14 years ago

    I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did, based on Lee's code. I had to look up how to create a new StreamReader object, and to change null to $null.

    $reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
    $count = 1
    $filename = "C:\Contacts\{0}.vcf" -f ($count) 
    
    while(($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line
    
        if($line -eq "END:VCARD")
        {
            ++$count
            $filename = "C:\Contacts\{0}.vcf" -f ($count)
        }
    }
    
    $reader.Close()
    
        7
  •  6
  •   Community    7 years ago

    These answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files with roughly equal line counts.

    I found some of the previous answers, the ones that use Add-Content, to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

    I didn't try Typhlosaurus's answer, but it only does splits by file size, not by line count.

    The following suited my purposes.

    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()
    Write-Host "Reading source file..."
    $lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
    $totalLines = $lines.Length
    
    Write-Host "Total Lines :" $totalLines
    
    $skip = 0
    $count = 100000; # Number of lines per file
    
    # File counter, with sort friendly name
    $fileNumber = 1
    $fileNumberString = $filenumber.ToString("000")
    
    while ($skip -lt $totalLines) {   # -lt avoids an empty extra file when the line count divides evenly
        $upper = $skip + $count - 1
        if ($upper -gt ($lines.Length - 1)) {
            $upper = $lines.Length - 1
        }
    
        # Write the lines
        [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])
    
        # Increment counters
        $skip += $count
        $fileNumber++
        $fileNumberString = $filenumber.ToString("000")
    }
    
    $sw.Stop()
    
    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
    

    For a 54 MB file, I get this output…

    Reading source file...
    Total Lines : 910030
    Split complete in  1.7056578 seconds
    

    Hopefully others looking for a simple, line-based splitting script that matches my requirements will find this useful.

        8
  •  3
  •   Peter Mortensen icecrime    9 years ago

    Another quick (and somewhat dirty) one-liner:

    $linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
    

    You can adjust the number of lines per batch by changing the hard-coded 3000 value.

        9
  •  2
  •   Santiago Fernandez    15 years ago

    I've made a small modification to Split-File so that it splits files based on the size of each part.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file contains a maximum number of lines.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Size
    # (Or -s) The maximum size of each file. Size must be expressed in MB.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv -s 20 -rc 1
    #
    #.LINK 
    # Out-TempFile
    ##############################################################################
    function Split-File {
    
        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
    
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,
    
            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,
    
            [Alias('s')]
            [Parameter(Position=2,Mandatory=$true)]
            [Int32]$Size,
    
            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',
    
            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
    
        )
    
        process {
    
            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }
    
            Resolve-Path @ResolveArgs | %{
    
                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt  = [IO.Path]::GetExtension($_)
    
                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }
    
                # read the input file one line at a time
                $Part = 1
                $buffer = ""
                Get-Content $_ -ReadCount:1 | %{
    
                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
    
                    # In the first iteration the header will be
                    # copied to the output file as usual;
                    # on subsequent iterations we have to do it ourselves
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }
    
                    # accumulate lines, dumping the buffer once it exceeds the requested size
                    $buffer += $_ + "`r"
                    if ($buffer.length -gt ($Size * 1MB)) {
                        # write this chunk to the output file
                        Write-Host "Writing $OutputFile"
                        Add-Content $OutputFile $buffer
                        $Part += 1
                        $buffer = ""
                    }
                }
    
                # write out whatever is left after the last full chunk
                if ($buffer.length -gt 0) {
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer
                }
            }
        }
    }
    
        10
  •  2
  •   Shantanu Gupta    8 years ago

    Do this:

    FILE 1

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
    

    FILE 2

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII
    

    FILE 3

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII
    

    And so on.
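
    The same pattern generalizes into a loop; here is a rough sketch (paths and the 5000-line batch size follow the example above; note that each pass re-reads the source file from the top, so this stays slow on very large inputs):

    $src = "C:\TEMP\DATA\split\splitme.txt"
    $batch = 5000
    $total = (Get-Content $src).Count   # total number of lines
    $i = 1
    for ($skip = 0; $skip -lt $total; $skip += $batch) {
        Get-Content $src | Select -Skip $skip | Select -First $batch |
            Out-File "C:\temp\file$i.txt" -Encoding ASCII
        $i++
    }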

        11
  •  1
  •   NicolasG    8 years ago

    This sounds like a job for the UNIX command split:

    split MyBigFile.csv
    

    It just split my 55 GB csv file into 21k chunks in less than 10 minutes.

    It's not native to PowerShell, though, but it comes with, for example, the Git for Windows package: https://git-scm.com/download/win
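
    By default, split cuts after every 1,000 lines and names the pieces xaa, xab, and so on. For the sizes in the question, the usual GNU coreutils flags apply (a sketch, assuming the coreutils build bundled with Git for Windows):

    # ~5 MB chunks with numeric suffixes (may cut in the middle of a line):
    split -b 5m -d MyBigFile.csv chunk_
    
    # or whole-line chunks of 100000 lines each:
    split -l 100000 -d MyBigFile.csv chunk_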

        12
  •  0
  •   Covenant    9 years ago

    My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data. They're really big, so I need to split them into manageable parts (whilst preserving the header row).

    So, I reverted to my classic VBScript approach and bashed together a small .vbs script that can be run on any Windows computer (it gets executed automatically by the WScript.exe script host engine on Windows).

    The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines); processing took about 2 minutes and it never went over 3 MB of memory used at any point.

    The one caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF), as the text stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

    Option Explicit
    
    Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  
    Private Const REPEAT_HEADER_ROW = True                
    Private Const LINES_PER_PART = 500000                 
    
    Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart
    
    sStart = Now()
    
    sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
    iLineCounter = 0
    iOutputFile = 1
    
    Set oFileSystem = CreateObject("Scripting.FileSystemObject")
    Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
    Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
    
    If REPEAT_HEADER_ROW Then
        iLineCounter = 1
        sHeaderLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sHeaderLine)
    End If
    
    Do While Not oInputFile.AtEndOfStream
        sLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sLine)
        iLineCounter = iLineCounter + 1
        If iLineCounter Mod LINES_PER_PART = 0 Then
            iOutputFile = iOutputFile + 1
            Call oOutputFile.Close()
            Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
            If REPEAT_HEADER_ROW Then
                Call oOutputFile.WriteLine(sHeaderLine)
            End If
        End If
    Loop
    
    Call oInputFile.Close()
    Call oOutputFile.Close()
    Set oFileSystem = Nothing
    
    Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
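
    To run it from a console instead of double-clicking it, the console script host works as well (the script file name here is just a placeholder; the final MsgBox still appears as a dialog):

    cscript //nologo split_big_file.vbs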
    
        13
  •  0
  •   GMasucci    8 years ago

    As the lines can be of variable length in logs, I thought it best to take a number-of-lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:

    $sourceFile = "c:\myfolder\mylargeTextyFile.csv"
    $partNumber = 1
    $batchSize = 500000
    $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
    
    [System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one
    
    $fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None") 
    $streamIn=New-Object System.IO.StreamReader($fs, $enc)
    $streamout = new-object System.IO.StreamWriter $pathAndFilename
    
    $line = $streamIn.readline()
    $counter = 0
    while ($line -ne $null)
    {
        $streamout.writeline($line)
        $counter +=1
        if ($counter -eq $batchsize)
        {
            $partNumber+=1
            $counter =0
            $streamOut.close()
            $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
            $streamout = new-object System.IO.StreamWriter $pathAndFilename
    
        }
        $line = $streamIn.readline()
    }
    $streamin.close()
    $streamout.close()
    

    It could easily be turned into a function or a script file with parameters to make it more versatile. It uses a StreamReader and a StreamWriter to achieve its speed and tiny memory footprint.
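
    For example, here is a minimal sketch of the same read/write loop wrapped as a parameterized function; the function and parameter names are illustrative, not taken from the snippet above:

    function Split-TextFile {
        param(
            [Parameter(Mandatory=$true)][string]$SourceFile,
            [Parameter(Mandatory=$true)][string]$DestinationPrefix,
            [int]$BatchSize = 500000
        )
    
        $reader = New-Object System.IO.StreamReader($SourceFile)
        $part = 1
        $counter = 0
        $writer = New-Object System.IO.StreamWriter("$DestinationPrefix part $part.txt")
        try {
            while (($line = $reader.ReadLine()) -ne $null) {
                $writer.WriteLine($line)
                $counter++
                if ($counter -eq $BatchSize) {
                    # close the full chunk and start the next one
                    $writer.Close()
                    $part++
                    $counter = 0
                    $writer = New-Object System.IO.StreamWriter("$DestinationPrefix part $part.txt")
                }
            }
        } finally {
            $reader.Close()
            $writer.Close()
        }
    }
    
    # Usage, mirroring the hard-coded values above (use absolute paths, since .NET
    # resolves relative paths against the process working directory):
    # Split-TextFile -SourceFile "c:\myfolder\mylargeTextyFile.csv" -DestinationPrefix "c:\myfolder\mylargeTextyFile" -BatchSize 500000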

        14
  •  0
  •   Bigdadda06    7 years ago

    Below is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1,000 lines each. It's not quick, but it does the job.

    $infile = "D:\Malcolm\Test\patch6.txt"
    $path = "D:\Malcolm\Test\"
    $lineCount = 1
    $fileCount = 1
    
    foreach ($computername in get-content $infile)
    {
        # append the current line to the current chunk file
        write $computername | out-file -Append "$path$fileCount.txt"
        $lineCount++
    
        if ($lineCount -gt 1000)   # -gt, not -eq, since the counter starts at 1
        {
            $fileCount++
            $lineCount = 1
        }
    }