
Using java.util.Scanner to read files with different character encodings

  •  4
  • plaidshirt  · 6 years ago

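    I read a list of files in Java. Some of them have a different encoding, ANSI instead of UTF-8. java.util.Scanner cannot read these files, and I get an empty output string. I tried another approach: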

    FileInputStream fis = new FileInputStream(my_file);
    BufferedReader br = new BufferedReader(new InputStreamReader(fis));
    InputStreamReader isr = new InputStreamReader(fis);
    isr.getEncoding();
    

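    I don't know how to change the character encoding for the ANSI files. UTF-8 and ANSI files are mixed in the same folder. I tried Apache Tika for this. After getting the file's encoding I pass it to Scanner, but I still get empty output.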

    Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
    line = scanner.nextLine();
    
    3 answers  |  up to 6 years ago
        1
  •  1
  •   Friwi    6 years ago

    There is a library called juniversalchardet that helps you guess the correct encoding. It was updated recently and is currently hosted on GitHub:

    https://github.com/albfernandez/juniversalchardet

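    However, there is no fail-safe tool for detecting an encoding, because there are many unknowns:

    1. Is this file text or a PNG?
    2. Is it stored in a (1, ..., k, ..., n)-bit encoding?
    3. Which k-bit encoding is used?

    Some guessing can be done by counting the number of uncommon control characters. When a file decoded with a given encoding contains many control characters, you most likely chose the wrong encoding. (Then try the next one.)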
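    The control-character heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how juniversalchardet works: the class name ControlCharHeuristic and its methods are hypothetical, and the sketch also counts the replacement character U+FFFD, which new String(bytes, charset) produces for bytes that are invalid in the chosen encoding.

```java
import java.nio.charset.Charset;
import java.util.List;

/** Hypothetical helper: picks the candidate charset whose decoding looks least broken. */
final class ControlCharHeuristic {

    /** Counts control characters other than tab, CR and LF, plus the replacement character U+FFFD. */
    static int suspiciousChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean harmlessWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c == '\uFFFD' || (Character.isISOControl(c) && !harmlessWhitespace)) {
                count++;
            }
        }
        return count;
    }

    /** Decodes the bytes with every candidate and returns the one with the lowest count.
     *  On a tie the earlier candidate wins, so list the preferred charset first. */
    static Charset guess(byte[] data, List<Charset> candidates) {
        Charset best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            int score = suspiciousChars(new String(data, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

    A call such as ControlCharHeuristic.guess(bytes, Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)) would prefer UTF-8 unless the bytes are invalid as UTF-8. The heuristic is weak at telling single-byte encodings apart, which is one reason a dedicated detector is usually the better choice.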

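    juniversalchardet tries several more successful approaches to determine the encoding (even for Chinese encodings). It also provides convenience methods for opening a reader on a file with the correct encoding already selected:

    (Snippet taken from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding and adapted)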

    import org.mozilla.universalchardet.ReaderFactory;
    import java.io.File;
    import java.io.IOException;
    import java.io.BufferedReader;
    
    public class TestCreateReaderFromFile {
    
        public static void main (String[] args) throws IOException {
            if (args.length != 1) {
                System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
                System.exit(1);
            }
    
            // readLine() is declared on BufferedReader, not on Reader
            BufferedReader reader = null;
            try {
                File file = new File(args[0]);
                reader = ReaderFactory.createBufferedReader(file);
    
                String line;
                while((line=reader.readLine())!=null){
                    System.out.println(line); //Print each line to console
                }
            }
            finally {
                if (reader != null) {
                    reader.close();
                }
            }
    
        }
    
    }
    

    Edit: added a ScannerFactory

    /*
    (C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
    Adapted by Fritz Windisch 2018-11-15
    The contents of this file are subject to the Mozilla Public License Version
    1.1 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
    http://www.mozilla.org/MPL/
    Software distributed under the License is distributed on an "AS IS" basis,
    WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
    for the specific language governing rights and limitations under the
    License.
    Alternatively, the contents of this file may be used under the terms of
    either the GNU General Public License Version 2 or later (the "GPL"), or
    the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
    in which case the provisions of the GPL or the LGPL are applicable instead
    of those above. If you wish to allow use of your version of this file only
    under the terms of either the GPL or the LGPL, and not to allow others to
    use your version of this file under the terms of the MPL, indicate your
    decision by deleting the provisions above and replace them with the notice
    and other provisions required by the GPL or the LGPL. If you do not delete
    the provisions above, a recipient may use your version of this file under
    the terms of any one of the MPL, the GPL or the LGPL.
    */
    
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Objects;
    import java.util.Scanner;
    import org.mozilla.universalchardet.UniversalDetector;
    import org.mozilla.universalchardet.UnicodeBOMInputStream;
    
    /**
     * Create a scanner from a file with correct encoding
     */
    public final class ScannerFactory {
    
        private ScannerFactory() {
            throw new AssertionError("No instances allowed");
        }
        /**
         * Create a scanner from a file with correct encoding
         * @param file The file to read from
         * @param defaultCharset charset to use if the encoding cannot be determined
         * @return Scanner for the file with the correct encoding
         * @throws java.io.IOException if some I/O error occurs
         */
    
        public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
            Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must be not null");
            String detectedEncoding = UniversalDetector.detectCharset(file);
            if (detectedEncoding != null) {
                cs = Charset.forName(detectedEncoding);
            }
            if (!cs.toString().contains("UTF")) {
                return new Scanner(file, cs.name());
            }
            Path path = file.toPath();
            return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
        }
        /**
         * Create a scanner from a file with correct encoding. If charset cannot be determined,
         * it uses the system default charset.
         * @param file The file to read from
         * @return Scanner for the file with the correct encoding
         * @throws java.io.IOException if some I/O error occurs
         */
        public static Scanner createScanner(File file) throws IOException {
            return createScanner(file, Charset.defaultCharset());
        }
    }
    
        2
  •  0
  •   Vicky Singh    6 years ago

    Your approach will not give you the correct encoding.

     FileInputStream fis = new FileInputStream(my_file);
     BufferedReader br = new BufferedReader(new InputStreamReader(fis));
     InputStreamReader isr = new InputStreamReader(fis);
     isr.getEncoding();
    

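    This returns the encoding used by the InputStreamReader itself (read the javadoc), not the encoding of the characters written in the file (my_file in your case). If the encoding is wrong, Scanner cannot read the file correctly.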
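    A small sketch may make this concrete. It opens two readers over the same bytes and shows that getEncoding() simply echoes the charset each reader was constructed with. The class GetEncodingDemo and its method are hypothetical names used only for illustration; the sketch also assumes that the historical name returned by getEncoding() (for example "UTF8") resolves through Charset.forName, which holds for the two charsets used here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

/** Hypothetical demo: getEncoding() reports the reader's charset, not the content's. */
final class GetEncodingDemo {

    /** Opens a reader with readerCharset over the bytes and returns the charset it reports. */
    static Charset reportedCharset(byte[] content, Charset readerCharset) throws IOException {
        try (InputStreamReader isr =
                     new InputStreamReader(new ByteArrayInputStream(content), readerCharset)) {
            // getEncoding() returns a historical name; resolve it back to a Charset for comparison
            return Charset.forName(isr.getEncoding());
        }
    }
}
```

    Whatever bytes are passed in, the result matches readerCharset, so the call tells you nothing about how the file was actually written.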

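    In fact, and please correct me if I am wrong, there is no way to determine the encoding of a given file with 100% accuracy. A few projects have a better success rate at guessing the encoding, but none is 100% accurate. On the other hand, if you know which encoding was used, you can use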

    Scanner scanner = new Scanner(my_file, "charset");
    scanner.nextLine();
    

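    Also, find out the correct Java charset name for your "ANSI" files. "ANSI" usually means the Windows system code page, so it is typically windows-1252 (Cp1252) on Western European systems or windows-1251 (Cp1251) on Cyrillic ones; plain 7-bit files are simply US-ASCII.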
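    When the encoding is known, reading the file could look like the sketch below. The class KnownCharsetReader and its method are hypothetical names; the only library call that matters is the Scanner(File, String) constructor.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper: reads all lines of a file whose encoding is already known. */
final class KnownCharsetReader {

    static List<String> readLines(File file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        // Pass the charset name explicitly so the platform default is never used
        try (Scanner scanner = new Scanner(file, charset.name())) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }
}
```

    For a Western European "ANSI" file the call would be something like KnownCharsetReader.readLines(my_file, Charset.forName("windows-1252")), assuming that charset is available in your JVM.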

    Whichever way you go, watch out for any IOException; it may point you in the right direction.
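    One detail worth knowing here: Scanner does not throw an IOException raised while reading. It stops producing input and keeps the exception available through ioException(). The sketch below (ScannerErrorCheck is a hypothetical name) reads a file to the end and returns whatever error Scanner kept to itself, which is one way to see why the output was empty.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

/** Hypothetical helper: exposes the read error that Scanner otherwise hides. */
final class ScannerErrorCheck {

    /** Returns the IOException Scanner swallowed while reading, or null if the file was read cleanly. */
    static IOException firstReadError(File file, String charsetName) throws IOException {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
            }
            // Non-null when the underlying reader failed, for example on bytes invalid in charsetName
            return scanner.ioException();
        }
    }
}
```

    If an ANSI file containing accented characters is opened as "UTF-8", this should return a decoding error instead of null, which matches the empty output described in the question.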

        3
  •  0
  •   Oleg Cherednik    6 years ago

    To make Scanner work with a different encoding, you have to pass the correct encoding to the Scanner constructor.

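    To determine the file encoding it is best to use an external library (for example https://github.com/albfernandez/juniversalchardet ), but if you know for certain which encodings are possible, you can check manually according to Wikipedia: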

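    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    
    public class EncodingAwareReader {
    
        public static void main(String... args) throws IOException {
            List<String> lines = readLinesFromFile(new File("d:/utf8.txt"));
        }
    
        public static List<String> readLinesFromFile(File file) throws IOException {
            try (Scanner scan = new Scanner(file, getCharsetName(file))) {
                List<String> lines = new ArrayList<>();
    
                // hasNextLine() pairs with nextLine(); hasNext() would drop trailing blank lines
                while (scan.hasNextLine())
                    lines.add(scan.nextLine());
    
                return lines;
            }
        }
    
        // Checks only for the UTF-8 byte order mark (EF BB BF); anything else is treated as US-ASCII
        private static String getCharsetName(File file) throws IOException {
            try (InputStream in = new FileInputStream(file)) {
                if (in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)
                    return StandardCharsets.UTF_8.name();
                return StandardCharsets.US_ASCII.name();
            }
        }
    }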
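
    The manual check above looks only at the UTF-8 byte order mark. If UTF-16 files are also possible, the same idea could be extended as in the sketch below. BomSniffer and its methods are hypothetical names, the caller has to supply the fallback charset for files without a BOM, and UTF-32 marks are deliberately not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: maps a leading byte order mark to a charset. */
final class BomSniffer {

    /** Returns the charset indicated by a BOM at the start of head, or fallback if none is present. */
    static Charset fromBom(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

    Many UTF-8 files are written without a BOM, so a missing mark does not prove that a file is ANSI; the fallback is only a guess.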