XLSX and XLS to CSV in Apache POI

XLSX and XLS to CSV in Apache POI

How to convert XLSX and XLS to CSV using Apache POI Library the easy way

For Java Developers, dealing with XLSX and XLS Excel files has been closer to a nightmare. I have struggled with this too. There seems to be no straightforward way to deal with such files, especially XLSX files when our application is processing them. In this short demo, we will fix that, once and for all.

First things first:

The default limit byte size of the Apache POI library is limited to 512 MB, therefore, while reading larger files, you might encounter an "OutofMemoryError" exception.To avoid that, we start our code by overriding the limit and setting it to a maximum so that the library is allocated sufficient resources to handle our Excel file.

We do that by using the IOUtils' setByteArrayMaxOverride to maximum:

//start by setting the max byte array. This is apache poi's internal dependencies therefore you must import it from apache poi 
IOUtils.setByteArrayMaxOverride(1000000000);

Moving on:

We create an input stream from FileInputStream that takes in our xlsx/xls filePath as an argument:

//declare the file path
String xlsxFilePath = "LargeInputFile.xlsx";   
//initialize the FileInputStream from the specified file     
FileInputStream inputStream = new FileInputStream(xlsxFilePath);

Declare the WorkBook:

We use the WorkBookFactory.create() constructor as opposed to declaring XSSFWorkBook and HSSFWorkBook because the WokBookFactory.create() detects the type of file in use from the input stream, that is, whether XLS or XLSX.

//initialize the workbook 
Workbook workbook = WorkbookFactory.create(inputStream);
//get the first sheet
 Sheet sheet = workbook.getSheetAt(0);
//create the FileWriter to write to our output file
 FileWriter writer = new FileWriter(csvFilePath);

Most Importantly:

The output will be deformed as per our expectation, given that normally, cells containing data like Dates and time will be altered during writing, and blank cells will be skipped/removed. Removing empty cells (which is not our expectation), will alter the structure of the output CSV file. Thanks to DataFormatter class, we can retain our original values by formatting the cell values and setting the MissingCellPolicy to CREATE_NULL_AS_BLANK.

//See: DataFormatter dataFormatter = new DataFormatter();
//See also: Cell cell = row.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);

//loop through the sheet to get rows
 for (Row row : sheet) {
//loop through the rows to obtain the cells
   for (int i = 0; i < row.getLastCellNum(); i++) {
      //create a black cell when cell has null values
      Cell cell = row.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
      //switch case that adds the cell values according to their types
      switch (cell.getCellType()) {
//handle string values           
                        case STRING:
                 //handle string values
           writer.append("\"").append(cell.getStringCellValue().replace("\"", "\"\"")).append("\",");
                        break;
//handle numeric values
                       case NUMERIC:
                           //handle date values to remain as they are from the input using dataFormatter
                            if (DateUtil.isCellDateFormatted(cell)) {
                                String cellValue =   dataFormatter.formatCellValue(cell);
                                writer.append(cellValue).append(",");
                            } else {
            writer.append(String.valueOf(cell.getNumericCellValue())).append(",");
                            }
                            break;
//Handle boolens cel values
                        case BOOLEAN:                 writer.append(String.valueOf(cell.getBooleanCellValue())).append(",");
                            break;
                        default:
                            writer.append(",");
                    }
                }

                writer.append("\n");
            }

Finally, relinquish the resources:

writer.flush();
writer.close();
workbook.close();
inputStream.close();

Summary:

Do not forget to wrap the code in a try-catch block to handle the resulting IOException error.

Basically, this is what you will need (I use maven so below is the dependency I used):

  • Apache Poi dependency.
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>5.2.3</version>
        </dependency>

You're welcome.