
Reading large xls and xlsx files

  •  7
  • abr  · 6 years ago

    I know there are similar posts around, and I have made several attempts to reach my goal, which I'll detail below:

    I have a .zip / .rar archive that contains multiple xls & xlsx files.

    Each Excel file contains up to tens of thousands of rows and roughly 90 columns (any given file may have more or fewer columns).

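    I've built a Java WindowBuilder application in which I select the .zip / .rar file, choose where to extract its files, and write each one out using FileOutputStream. After each file is saved, I read its contents.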
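
    A minimal sketch of the extraction step described above, assuming a plain .zip archive (java.util.zip cannot read .rar). The names UnzipSketch and unzip are illustrative and are not taken from the actual application. Each entry is streamed straight to disk, and entry names that would land outside the target directory are rejected.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch {

    /** Extracts every regular file in the archive into targetDir and returns the extracted paths. */
    public static List<Path> unzip(Path zipFile, Path targetDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Guard against entries such as "../../evil.xlsx"
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the current entry; the zip stream stays open for the next one
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```

    The returned list can then be iterated to decide, per file, whether to parse it as xls or xlsx.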

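    So far, so good. After several attempts at avoiding OOM (OutOfMemory) and speeding things up, I've arrived at a "final version" (which is pretty bad, but it will do until I figure out the right way to read these files), which I'll explain: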

    File file = new File("certainFile.xlsx"); // or .xls; for example purposes
    Workbook wb;
    Sheet sheet;
    /*
    There is a ton of other code up to this point that I don't consider relevant, as it's related to unzipping, renaming, etc.
    This runs inside a loop.

    In every zip file there are at least 1 or 2 files that still cause an OOM when they reach
    WorkbookFactory.create(), because it recognizes that the file has
    a bit over a million rows, meaning it's a 2007-format file (according to our friend Google.com), or so I believe.
    When I open the xlsx file, it is indeed 10-20 MB in size and has thousands of empty rows. When I save it again,
    it is 1 MB and has a couple of thousand rows. After many attempts to read it as an InputStream or a File, or to
    re-save it automatically, I resorted to converting it to a CSV and reading that instead,
    hence this 'solution'. If parseAsXLS is true, I apply my regular logic
    per row per cell; otherwise I parse the CSV.
    */
    if (file.getName().contains("xlsx")) {
        this.parseAsXLS = false;
        OPCPackage pkg = OPCPackage.open(file);
        // The CSV is read later on; it gets overwritten every time an xlsx file comes by.
        // try-with-resources makes sure the CSV is flushed and closed before it is read.
        try (PrintStream ps = new PrintStream(
                new FileOutputStream(this.filePath + File.separator + "excel.csv"))) {
            XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
            xlsxCsvConverter.process();
        } catch (Exception e) {
            // I've added a counter to the XLSX2CSV class that limits the number of rows fetched
            // and throws an exception on purpose once the limit is reached
            System.out.println("Limited the file at 60k rows");
        }
    } else {
        this.parseAsXLS = true;
        this.wb = WorkbookFactory.create(file);
        this.sheet = wb.getSheetAt(0);
    }
    

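    What happens now is that one .xlsx file (from a .zip archive that also contains several other .xls and .xlsx files) has a certain character in one of its rows that XLSX2CSV treats as an end of row, which results in incorrect output.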

    Here is an example: imagelink

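    Note: the goal is to take only a specific set of columns that the Excel files have in common (or may have; it isn't mandatory) and put them into a new Excel file. The email column (which contains multiple emails separated by commas) has what I believe is an "enter" before the emails, because if I remove it manually, the problem goes away. However, the goal is not to open every Excel file by hand and fix it; otherwise I would simply open each one and copy and paste the columns I need. In this example, I need the columns: 菲尔达, 菲尔达格, 菲尔达尔 and 菲尔丹.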
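
    If the stray character really is a line break inside the cell (an assumption based on the description above), the symptom fits how cell() in the converter below writes non-numeric values: it wraps them in quotes but escapes nothing. A quoted line break is legal CSV, yet it breaks any reader that splits the file on line ends, and an embedded quote breaks the field outright. The sketch below shows two possible remedies; CsvQuote, quote and flatten are hypothetical names, not part of POI.

```java
public class CsvQuote {

    /** Returns the value as a quoted CSV field, with embedded quotes doubled (RFC 4180 style). */
    public static String quote(String value) {
        if (value == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '"') {
                sb.append('"'); // a literal quote is written as two quotes
            }
            sb.append(c);
        }
        sb.append('"');
        return sb.toString();
    }

    /** Replaces every line break with a space so that one record stays on one physical line. */
    public static String flatten(String value) {
        return value == null ? "" : value.replaceAll("\\R", " ").trim();
    }
}
```

    With quote alone, the CSV must be read back by a parser that honors quoted line breaks. Combining the two, for example output.append(CsvQuote.quote(CsvQuote.flatten(formattedValue))), keeps a line-by-line reader working at the cost of losing the original line breaks.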

    XLSX2CSV.java (I'm not the creator of this file; I only adapted it to my needs)

    import java.awt.List;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PrintStream;
    
    import javax.xml.parsers.ParserConfigurationException;
    
    import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.openxml4j.opc.PackageAccess;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.util.CellAddress;
    import org.apache.poi.ss.util.CellReference;
    import org.apache.poi.util.SAXHelper;
    import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
    import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
    import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
    import org.apache.poi.xssf.model.StylesTable;
    import org.apache.poi.xssf.usermodel.XSSFComment;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    
    /**
     * A rudimentary XLSX -> CSV processor modeled on the
     * POI sample program XLS2CSVmra from the package
     * org.apache.poi.hssf.eventusermodel.examples.
     * As with the HSSF version, this tries to spot missing
     *  rows and cells, and output empty entries for them.
     * <p>
     * Data sheets are read using a SAX parser to keep the
     * memory footprint relatively small, so this should be
     * able to read enormous workbooks.  The styles table and
     * the shared-string table must be kept in memory.  The
     * standard POI styles table class is used, but a custom
     * (read-only) class is used for the shared string table
     * because the standard POI SharedStringsTable grows very
     * quickly with the number of unique strings.
     * <p>
     * For a more advanced implementation of SAX event parsing
     * of XLSX files, see {@link XSSFEventBasedExcelExtractor}
     * and {@link XSSFSheetXMLHandler}. Note that for many cases,
     * it may be possible to simply use those with a custom 
     * {@link SheetContentsHandler} and no SAX code needed of
     * your own!
     */
    public class XLSX2CSV {
        /**
         * Uses the XSSF Event SAX helpers to do most of the work
         *  of parsing the Sheet XML, and outputs the contents
         *  as a (basic) CSV.
         */
        private class SheetToCSV implements SheetContentsHandler {
            private boolean firstCellOfRow;
            private int currentRow = -1;
            private int currentCol = -1;
            private int maxrows = 60000;
    
    
    
            private void outputMissingRows(int number) {
    
                for (int i=0; i<number; i++) {
                    for (int j=0; j<minColumns; j++) {
                        output.append(',');
                    }
                    output.append('\n');
                }
            }
    
            @Override
            public void startRow(int rowNum) {
                // If there were gaps, output the missing rows
                outputMissingRows(rowNum-currentRow-1);
                // Prepare for this row
                firstCellOfRow = true;
                currentRow = rowNum;
                currentCol = -1;
    
                if (rowNum == maxrows) {
                        throw new RuntimeException("Force stop at maxrows");
                }
            }
    
            @Override
            public void endRow(int rowNum) {
                // Ensure the minimum number of columns
                for (int i=currentCol; i<minColumns; i++) {
                    output.append(',');
                }
                output.append('\n');
            }
    
            @Override
            public void cell(String cellReference, String formattedValue,
                    XSSFComment comment) {
                if (firstCellOfRow) {
                    firstCellOfRow = false;
                } else {
                    output.append(',');
                }            
    
                // gracefully handle missing CellRef here in a similar way as XSSFCell does
                if(cellReference == null) {
                    cellReference = new CellAddress(currentRow, currentCol).formatAsString();
                }
    
                // Did we miss any cells?
                int thisCol = (new CellReference(cellReference)).getCol();
                int missedCols = thisCol - currentCol - 1;
                for (int i=0; i<missedCols; i++) {
                    output.append(',');
                }
                currentCol = thisCol;
    
                // Number or string?
                try {
                    //noinspection ResultOfMethodCallIgnored
                    Double.parseDouble(formattedValue);
                    output.append(formattedValue);
                } catch (NumberFormatException e) {
                    output.append('"');
                    output.append(formattedValue);
                    output.append('"');
                }
            }
    
            @Override
            public void headerFooter(String arg0, boolean arg1, String arg2) {
                // TODO Auto-generated method stub
    
            }
        }
    
    
        ///////////////////////////////////////
    
        private final OPCPackage xlsxPackage;
    
        /**
         * Number of columns to read starting with leftmost
         */
        private final int minColumns;
    
        /**
         * Destination for data
         */
        private final PrintStream output;
    
        /**
         * Creates a new XLSX -> CSV converter
         *
         * @param pkg        The XLSX package to process
         * @param output     The PrintStream to output the CSV to
         * @param minColumns The minimum number of columns to output, or -1 for no minimum
         */
        public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
            this.xlsxPackage = pkg;
            this.output = output;
            this.minColumns = minColumns;
        }
    
        /**
         * Parses and shows the content of one sheet
         * using the specified styles and shared-strings tables.
         *
         * @param styles The table of styles that may be referenced by cells in the sheet
         * @param strings The table of strings that may be referenced by cells in the sheet
         * @param sheetHandler The handler that receives the row and cell events for the sheet
         * @param sheetInputStream The stream to read the sheet-data from.
         *
         * @exception java.io.IOException An IO exception from the parser,
         *            possibly from a byte stream or character stream
         *            supplied by the application.
         * @throws SAXException if parsing the XML data fails.
         */
        public void processSheet(
                StylesTable styles,
                ReadOnlySharedStringsTable strings,
                SheetContentsHandler sheetHandler, 
                InputStream sheetInputStream) throws IOException, SAXException {
            DataFormatter formatter = new DataFormatter();
            InputSource sheetSource = new InputSource(sheetInputStream);
            try {
                XMLReader sheetParser = SAXHelper.newXMLReader();
                ContentHandler handler = new XSSFSheetXMLHandler(
                      styles, null, strings, sheetHandler, formatter, false);
                sheetParser.setContentHandler(handler);
                sheetParser.parse(sheetSource);
             } catch(ParserConfigurationException e) {
                throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
             }
        }
    
        /**
         * Initiates the processing of the XLSX workbook file to CSV.
         *
         * @throws IOException If reading the data from the package fails.
         * @throws SAXException if parsing the XML data fails.
         */
        public void process() throws IOException, OpenXML4JException, SAXException {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
            XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
            StylesTable styles = xssfReader.getStylesTable();
            XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
            int index = 0;
            while (iter.hasNext()) {
                try (InputStream stream = iter.next()) {
                    processSheet(styles, strings, new SheetToCSV(), stream);
                }
                ++index;
            }
        }
    } 
    

    I'm looking for different ways to achieve my goal.

    Thank you for your time.

    3 replies  |  as of 6 years ago
        1
  •  1
  •   micael cunha    6 years ago

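    Well, I tried replicating your Excel file, and I threw XLSX2CSV out the window entirely. I don't think converting xlsx to csv is the right approach because, depending on how the XLSX is formatted, the converter can read all of the empty rows (you probably know this, since you set the row counter to 60k). On top of that, when it comes to the fields, special characters may or may not produce incorrect output, as in your problem.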
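
    One way to avoid the hard 60k cutoff, sketched under the assumption that the unwanted rows are blank rows at the end of the sheet: count blank rows as they arrive and emit them only when a later non-blank row proves they are interior gaps. TrailingBlankFilter and its methods are hypothetical names. In a SAX-based reader, row would be called from endRow with the cells collected for that row. The parser still scans the whole sheet; the filter only keeps the trailing blanks out of the output.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

public class TrailingBlankFilter {
    private final Consumer<List<String>> sink;
    private int pendingBlankRows = 0;

    public TrailingBlankFilter(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    /** Feeds one row; blank rows are held back until a non-blank row follows them. */
    public void row(List<String> cells) {
        if (isBlank(cells)) {
            pendingBlankRows++; // only a counter, so memory use stays constant
            return;
        }
        // A non-blank row arrived: the held-back blanks were interior gaps, so emit them
        for (; pendingBlankRows > 0; pendingBlankRows--) {
            sink.accept(Collections.<String>emptyList());
        }
        sink.accept(cells);
    }

    private static boolean isBlank(List<String> cells) {
        for (String c : cells) {
            if (c != null && !c.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

    Blank rows still pending when the sheet ends are never emitted, which is the intended effect.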

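    What I did was use this library, https://github.com/davidpelfree/sjxlsx, to read and re-write the file. It's very straightforward, and the newly generated xlsx file has the fields corrected.

    I suggest you try this approach (perhaps not with this lib): re-write the file in order to correct it.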

        2
  •  2
  •   Yy--    6 years ago

    How about this:

    // get the zip stream

    // billWater is presumably the archive as a java.io.File; entry names are decoded as GBK
    ZipFile zipFile = new ZipFile(billWater, Charset.forName("gbk"));


    // alternative: read the same archive as a stream
    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(billWater), Charset.forName("gbk"));
    //ZipEntry zipEntry;
    // use openCsv to map each CSV entry in the zip to a list of beans
    public static <T> List<T> processCSVFileByZip(ZipFile zipFile, ZipEntry zipEntry, Class<? extends T> clazz, Charset charset) throws IOException {
        Reader in = new InputStreamReader(zipFile.getInputStream(zipEntry), charset);
        return processCSVFile(in, clazz, charset, ',');
    }

    public static <T> List<T> processCSVFile(Reader in, Class<? extends T> clazz, Charset charset, char sep) {
        // skip the header line and trim leading whitespace in each field
        CsvToBean<T> csvToBean = new CsvToBeanBuilder<T>(in)
                .withType(clazz).withSkipLines(1)
                .withIgnoreLeadingWhiteSpace(true).withSeparator(sep)
                .build();
        return csvToBean.parse();
    }
    

    // it seems to depend on the xlsx file format

        3
  •  1
  •   Rich    6 years ago

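    I think there are at least two open issues here:

    1. Out of memory in WorkbookFactory.create() when opening old-style xls files that are sparse

    2. XLSX2CSV is corrupting new-style XLSX files, possibly because of "a certain character [that is incorrectly treated as] endRow"

    For (1), I think you need to find a Java xls library that either handles sparse files without allocating space for the empty cells, or processes the file in a streaming fashion rather than the batch approach taken by WorkbookFactory.

    For (2), you need to find a Java XLSX library that doesn't corrupt your data.

    I don't know of any good Java libraries for (1) or (2), sorry.

    However, I suggest you write this script in Excel rather than in Java. Excel has an excellent built-in scripting language, Excel VBA, which can handle opening multiple files, extracting data from them, and so on. Moreover, you can be confident that a script running in Excel VBA won't have the problems with Excel features, such as sparse sheets or xlsx parsing, that you're running into in Java.

    (If this is a one-off job, you may also want to step back and weigh how long it would take to do this task manually against how long it would take to script it.)

    Good luck!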