Apache POI: Stream-reading large xlsx-type excel spreadsheet

We tried to load a large spreadsheet file of type .xlsx (hundreds of MBs) for an analytics project using Apache POI’s XSSFWorkbook and we were constantly getting Out of Memory exception. We realized POI read the entire spreadsheet in one go resulting in higher memory usage and hence the exception. We looked for POI interfaces that’d help solve – SXSSFWorkbook looked promising but it only allows to write very large files without running out of memory, not reading.

Apache POI Home Page
Apache POI Home Page

Deeper search took us to a project at https://github.com/monitorjbl/excel-streaming-reader – exactly what we needed. It was amazing!

Here’s how you can use it:

Pick the latest version of library from Maven repo: http://mvnrepository.com/artifact/com.monitorjbl/xlsx-streamer. As of writing this, it’s 1.0.0.

Here’s how you can use this library (to print content of all cells, for example)

This example uses the try-with-resources statement introduced in Java 7. The try-with-resources statement ensures that each resource is closed at the end of the statement. Any object that implements java.lang.AutoCloseable, which includes all objects implementing java.io.Closeable, can be used as a resource.

Additional reading:

You can look at how this library is implemented on top of POI at https://github.com/monitorjbl/excel-streaming-reader#implementation-details

This library will take a provided InputStream and output it to the file system. The stream is piped safely through a configurable-sized buffer to prevent large usage of memory. Once the file is created, it is then streamed into memory from the file system.

The reason for needing the stream being outputted in this manner has to do with how ZIP files work. Because the XLSX file format is basically a ZIP file, it’s not possible to find all of the entries without reading the entire InputStream.

This is a problem that can’t really be gotten around for POI, as it needs a complete list of ZIP entries. The default implementation of reading from an InputStream in POI is to read the entire stream directly into memory. This library works by reading out the stream into a temporary file. As part of the auto-close action, the temporary file is deleted.

Copy Protected by Chetan's WP-Copyprotect.