
Effective way to find any file's encoding

  •  83
  • Fábio Antunes  · Tech Community  · 14 years ago

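    Yes, this is a very frequently asked question, but the topic is still vague to me because I don't know much about it.

    I would like a very precise way to find a file's encoding, as precise as Notepad++ is.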

    7 replies  |  last activity 7 years ago
        1
  •  160
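  •   2Toad    4 years ago

    Rather than relying on StreamReader.CurrentEncoding, the method below determines the encoding by analyzing the file's byte order mark (BOM). If the file has no BOM, it cannot determine the encoding.

    *Updated 4/08/2020 to include UTF-32LE detection and to return the correct encoding for UTF-32BE.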

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM).
    /// Defaults to ASCII when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    public static Encoding GetEncoding(string filename)
    {
        // Read the BOM
        var bom = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(bom, 0, 4);
        }
    
        // Analyze the BOM
        if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);  //UTF-32BE
    
        // We actually have no idea what the encoding is if we reach this point, so
        // you may wish to return null instead of defaulting to ASCII
        return Encoding.ASCII;
    }
    
        2
  •  49
  •   Simon Mourier    7 years ago

    This can be done with the StreamReader class:

      using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
      {
          reader.Peek(); // you need this!
          var encoding = reader.CurrentEncoding;
      }
    

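    The Peek call is the trick: without it, .NET has not done anything yet (it has not read the preamble, i.e. the BOM). Of course, any other ReadXXX call made before checking the encoding works too.

    If the file has no BOM, then defaultEncodingIfNoBom is used.

    I have successfully tested this with BOM-prefixed files for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.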

        3
  •  12
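  •   CodesInChaos    14 years ago

    1) Check whether there is a byte order mark

    2) Check whether the file is valid UTF8

    3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

    Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8.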
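
    The three steps above can be sketched as follows. This is only a minimal illustration, not the answerer's code: the class name EncodingGuess, the method Guess and the ansiFallback parameter are hypothetical, and the caller decides which encoding stands in for the local "ANSI" codepage. Step 2 relies on a UTF8Encoding created with throwOnInvalidBytes set to true, which raises an exception on malformed input.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: 'ansiFallback' stands in for the local "ANSI" codepage.
    public static Encoding Guess(byte[] data, Encoding ansiFallback)
    {
        // Step 1: look for a byte order mark. UTF-32 is tested before UTF-16
        // because the UTF-32LE BOM begins with the UTF-16LE BOM (FF FE).
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(true);
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(false, true);
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xFF, 0xFE)) return new UnicodeEncoding(false, true);
        if (StartsWith(data, 0xFE, 0xFF)) return new UnicodeEncoding(true, true);

        // Step 2: strict UTF-8 decoding throws on invalid byte sequences.
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return new UTF8Encoding(false);
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            // Step 3: not valid UTF-8, so fall back to the supplied "ANSI" codepage.
            return ansiFallback;
        }
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

    A possible call would be EncodingGuess.Guess(File.ReadAllBytes(path), Encoding.GetEncoding("iso-8859-1")). Note that pure ASCII input passes step 2 and is therefore reported as UTF8.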

        4
  •  11
  •   Alexei Agüero Alba    7 years ago

    Take a look at this.

    UDE

    It is a port of the Mozilla Universal Charset Detector, and you can use it like this:

    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename)) {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null) {
                Console.WriteLine("Charset: {0}, confidence: {1}", 
                     cdet.Charset, cdet.Confidence);
            } else {
                Console.WriteLine("Detection failed.");
            }
        }
    }
    
        5
  •  9
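  •   Berthier Lemieux    6 years ago

    1) Check whether there is a byte order mark

    2) Check whether the file is valid UTF8

    3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

    https://stackoverflow.com/a/4522251/867248 explains this strategy in more detail.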

    using System; using System.IO; using System.Text;
    
    // Using encoding from BOM or UTF8 if no BOM found,
    // check if the file is valid, by reading all lines
    // If decoding fails, use the local "ANSI" codepage
    
    public string DetectFileEncoding(Stream fileStream)
    {
        var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
               detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
        {
            string detectedEncoding;
            try
            {
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                }
                detectedEncoding = reader.CurrentEncoding.BodyName;
            }
            catch (DecoderFallbackException)
            {
                // Failed to decode the file using the BOM/UTF8.
                // Assume it's local ANSI
                detectedEncoding = "ISO-8859-1";
            }
            // Rewind the stream
            fileStream.Seek(0, SeekOrigin.Begin);
            return detectedEncoding;
       }
    }
    
    
    [Test]
    public void Test1()
    {
        Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
        var detectedEncoding = DetectFileEncoding(fs);
    
        using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
        {
           // Consume your file
            var line = reader.ReadLine();
            ...
    
        6
  •  4
  •   Pacurar Stefan    5 years ago

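    .NET is not very helpful here, but you can try the following algorithm:

    1. Try to find the encoding from the BOM (byte order mark)... most likely it will not be found
    2. Try parsing the file with different encodings

    Here is the call: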

    var encoding = FileHelper.GetEncoding(filePath);
    if (encoding == null)
        throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
    

    Here is the code:

    public class FileHelper
    {
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
        /// Returns null when none of the candidate encodings can decode the file.
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding or null.</returns>
        public static Encoding GetEncoding(string filename)
        {
            var encodingByBOM = GetEncodingByBOM(filename);
            if (encodingByBOM != null)
                return encodingByBOM;
    
            // BOM not found :(, so try to parse characters into several encodings
            var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
            if (encodingByParsingUTF8 != null)
                return encodingByParsingUTF8;
    
            var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
            if (encodingByParsingLatin1 != null)
                return encodingByParsingLatin1;
    
            var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
            if (encodingByParsingUTF7 != null)
                return encodingByParsingUTF7;
    
            return null;   // no encoding found
        }
    
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding.</returns>
        private static Encoding GetEncodingByBOM(string filename)
        {
            // Read the BOM
            var byteOrderMark = new byte[4];
            using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                file.Read(byteOrderMark, 0, 4);
            }
    
            // Analyze the BOM
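            if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
            if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
            // UTF-32LE must be tested before UTF-16LE because its BOM also starts with FF FE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
            if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
            if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE (Encoding.UTF32 is little-endian)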
    
            return null;    // no BOM found
        }
    
        private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
        {            
            var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    
            try
            {
                using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
                {
                    while (!textReader.EndOfStream)
                    {                        
                        textReader.ReadLine();   // in order to increment the stream position
                    }
    
                    // all text parsed ok
                    return textReader.CurrentEncoding;
                }
            }
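            catch (Exception) { }   // decoding failed, so this is not the right encoding
    
            return null;    // the file could not be decoded with this encoding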
        }
    }
    
        7
  •  2
  •   SedJ601    8 years ago

    Look here for C#:

    https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

    string path = @"path\to\your\file.ext";
    
    using (StreamReader sr = new StreamReader(path, true))
    {
        while (sr.Peek() >= 0)
        {
            Console.Write((char)sr.Read());
        }
    
        //Test for the encoding after reading, or at least
        //after the first read.
        Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
        Console.ReadLine();
        Console.WriteLine();
    }
    
        8
  •  1
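  •   Enzojz    7 years ago

    The following is my PowerShell code for determining whether some cpp, h or ml files are encoded in ISO-8859-1 (Latin-1) or in UTF-8 without BOM; if they are neither, it assumes GB18030. I am a Chinese person working in France. MSVC saves files as Latin-1 on French computers and as GB on Chinese computers, so this helps me avoid encoding problems when exchanging source files between my systems and my colleagues.

    The method is simple. If all characters are in the range x00-x7E, then ASCII, UTF-8 and Latin-1 are all the same. If I read a non-ASCII file as UTF-8 and the replacement character � shows up, the file is not UTF-8, so I try reading it as Latin-1. In Latin-1, the range \x7F-\xAF is mostly empty (\x7F-\x9F are control characters and \xA0-\xAF are symbols that rarely appear in source code), whereas GB uses the full range x00-xFF, so if I find any character in that range, the file is not Latin-1.

    The code is written in PowerShell but uses .NET, so it is easy to translate into C# or F#.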

    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
    foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
        $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
        $contentUTF = $openUTF.ReadToEnd()
        [regex]$regex = '�'
        $c=$regex.Matches($contentUTF).count
        $openUTF.Close()
        if ($c -ne 0) {
            $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
            $contentLatin1 = $openLatin1.ReadToEnd()
            $openLatin1.Close()
            [regex]$regex = '[\x7F-\xAF]'
            $c=$regex.Matches($contentLatin1).count
            if ($c -eq 0) {
                [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
                $i.FullName
            } 
            else {
                $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
                $contentGB = $openGB.ReadToEnd()
                $openGB.Close()
                [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
                $i.FullName
            }
        }
    }
    Write-Host -NoNewLine 'Press any key to continue...';
    $null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
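
    As a rough illustration of how the classification part could look in C#, here is a minimal sketch. It is not a line-by-line translation: the names SourceEncodingHeuristic and Classify are hypothetical, it works on raw bytes, it uses strict UTF-8 decoding instead of searching for the replacement character, and it only returns the guessed encoding name without rewriting any file.

```csharp
using System;
using System.Text;

public static class SourceEncodingHeuristic
{
    // Hypothetical helper: returns "utf-8", "iso-8859-1" or "gb18030".
    public static string Classify(byte[] data)
    {
        try
        {
            // Strict decoding throws on byte sequences that are not valid UTF-8.
            new UTF8Encoding(false, true).GetString(data);
            return "utf-8"; // also covers pure ASCII
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            // Not UTF-8: continue with the Latin-1 / GB check below.
        }

        // Bytes in 0x7F-0xAF are treated as a sign that the file is not Latin-1.
        foreach (byte b in data)
        {
            if (b >= 0x7F && b <= 0xAF) return "gb18030";
        }
        return "iso-8859-1";
    }
}
```

    The returned name can then be passed to Encoding.GetEncoding to read the file. On .NET Core and later, GB18030 is probably only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.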
    
        9
  •  0
  •   deHaar    5 years ago

    This may be useful:

    string path = @"address/to/the/file.extension";
    
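    using (StreamReader sr = new StreamReader(path))
    { 
        sr.Peek(); // force a first read so the BOM is examined; CurrentEncoding is not updated before that
        Console.WriteLine(sr.CurrentEncoding);
    }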