Parallel Zip on JVM

Zipping tons of files on one core in a multicore/SSD/cloud era is a massive waste of time.

A zip file is just a sequence of entries followed by a central directory at the end of the file.
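
To make that layout concrete, here is a minimal sketch that builds a one-entry zip in memory with the standard `ZipOutputStream` and checks the signatures at both ends. The class and method names (`ZipLayout`, `tinyZip`, `hasSignature`) are made up for illustration, and the fixed 22-byte trailer assumes a small archive with no archive comment and no Zip64 records.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipLayout {
    /** Builds a one-entry zip in memory; closing the stream appends the central directory. */
    static byte[] tinyZip() {
        var out = new ByteArrayOutputStream();
        try (var zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** True if the four bytes at pos are 'P', 'K', a, b. */
    static boolean hasSignature(byte[] bytes, int pos, int a, int b) {
        return bytes[pos] == 'P' && bytes[pos + 1] == 'K'
            && bytes[pos + 2] == a && bytes[pos + 3] == b;
    }

    public static void main(String[] args) {
        byte[] zip = tinyZip();
        // An entry starts with the local file header signature "PK\3\4".
        System.out.println("local file header at start: " + hasSignature(zip, 0, 3, 4));
        // With no archive comment, the end-of-central-directory record is the last 22 bytes ("PK\5\6").
        System.out.println("end of central directory at end: " + hasSignature(zip, zip.length - 22, 5, 6));
    }
}
```

Everything between those two signatures is entry data followed by one central directory record per entry, which is why the entries themselves can be produced independently.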

We cannot write to a zip file in parallel, but we can compress data in parallel in memory.

Last but not least, nobody wants to reimplement zip logic from scratch or use an unsupported third-party zip library. We reuse the standard java.util.zip.ZipOutputStream in the presented approach.

Algorithm

Collect all zip entries and their bytes for each input file in parallel.
For each input file:
- Get a ByteArrayOutputStream and a ZipOutputStream on top of it
- Write the entry to the zip stream and close the entry, but do not close the stream itself, so that no unneeded central directory is written
- Get the bytes from the byte stream

      var zipEntries = new ConcurrentHashMap<ZipEntry, byte[]>();

      // for each input file in parallel:
      var out = new ByteArrayOutputStream();
      var zipEntry = new ZipEntry(filePathRelativeToZipRoot);
      var zip = new ZipOutputStream(out);
      try (var fileStream = Files.newInputStream(filePath)) {
        zip.putNextEntry(zipEntry);
        fileStream.transferTo(zip);
        zip.closeEntry();
      }
      zipEntries.put(zipEntry, out.toByteArray());
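
A self-contained sketch of this step, assuming in-memory payloads instead of files so that it runs without any setup. `ParallelCompress`, `compressEntry`, and `compressAll` are illustrative names rather than part of the project, and a parallel stream is only one of several ways to fan the work out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ParallelCompress {
    /** Compresses one payload into a standalone chunk: local header, deflated data, data descriptor. */
    static byte[] compressEntry(ZipEntry entry, byte[] content) {
        var out = new ByteArrayOutputStream();
        // Intentionally never closed: close() would append a central directory to this chunk.
        var zip = new ZipOutputStream(out);
        try {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry(); // flushes the deflater and records sizes and CRC on the entry
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Compresses all payloads in parallel; keys of the input map are paths inside the zip. */
    static Map<ZipEntry, byte[]> compressAll(Map<String, byte[]> inputs) {
        var zipEntries = new ConcurrentHashMap<ZipEntry, byte[]>();
        inputs.entrySet().parallelStream().forEach(input -> {
            var entry = new ZipEntry(input.getKey());
            zipEntries.put(entry, compressEntry(entry, input.getValue()));
        });
        return zipEntries;
    }
}
```

After `closeEntry()` each `ZipEntry` carries its uncompressed size, compressed size, and CRC, which is the information the central directory needs later.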

Write all entries and bytes sequentially to a target zip file:
- Get a FileOutputStream and a ZipOutputStream on top of it
- Write bytes of all entries to a file stream updating zip stream state
- Write the central directory by closing the zip stream

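    // Pseudocode: xEntries and offset stand for ZipOutputStream's private state,
    // which has no public API and is reached via reflection.
    try (var os = Files.newOutputStream(zipFile)) {
      var zip = new ZipOutputStream(os);
      var offset = 0L;
      for (Map.Entry<ZipEntry, byte[]> chunk : zipEntries.entrySet()) {
        var zipEntry = chunk.getKey();
        var bytes = chunk.getValue();
        // register the entry at its offset so it lands in the central directory
        zip.xEntries.add(new XEntry(zipEntry, offset)); // via reflection
        os.write(bytes); // bypass the zip stream: the bytes are already compressed
        offset += bytes.length;
      }
      zip.offset = offset; // via reflection
      zip.close(); // writes the central directory
    }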

Notes

- Java Reflection is used to work around missing Java API. To avoid that in the future, we must request such an API.
- The algorithm takes roughly the same amount of memory as the target zip file. We can start writing to disk when new zip entries are ready, applying backpressure to control memory consumption.
- It's the compression that takes most of the time. We can generate already compressed data in parallel in various data generation tasks. Then, saving it to disk will take very little time.
- We can merge zip files without repacking using the same technique.
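
The backpressure idea from the notes can be sketched as a bounded window of in-flight tasks: a new chunk is submitted only after the oldest one has been written out. This is one possible approach, not the project's implementation. `BoundedPipeline`, `run`, `window`, and the `compress` function are hypothetical names, and the sink is a plain `OutputStream` to keep the example independent of the reflection step.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BoundedPipeline {
    /**
     * Compresses inputs on a worker pool and writes the results to sink in input order.
     * At most `window` chunks are in flight (queued, running, or finished but unwritten).
     * Returns the total number of bytes written.
     */
    static long run(List<byte[]> inputs, Function<byte[], byte[]> compress, OutputStream sink, int window) {
        var pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
        long written = 0;
        try {
            for (byte[] input : inputs) {
                if (inFlight.size() == window) {
                    written += writeOldest(inFlight, sink); // backpressure: wait before submitting more
                }
                Callable<byte[]> task = () -> compress.apply(input);
                inFlight.add(pool.submit(task));
            }
            while (!inFlight.isEmpty()) {
                written += writeOldest(inFlight, sink);
            }
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("compression task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Blocks until the oldest submitted chunk is ready, writes it, and returns its length. */
    private static long writeOldest(Deque<Future<byte[]>> inFlight, OutputStream sink)
            throws IOException, ExecutionException, InterruptedException {
        byte[] chunk = inFlight.remove().get();
        sink.write(chunk);
        return chunk.length;
    }
}
```

Writing in submission order keeps the output deterministic at the cost of occasionally waiting on a slow chunk. A completion-order variant would write sooner but needs the entry offsets to be tracked as chunks arrive.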

Results

Zipping 175,866 items (12.06 GB) into a 1.14 GB zip file on a MacBook M2 Max; the parallel mode is roughly 8.4 times faster:

Mode	Seconds
Sequential	151
Parallel	18

A fully functional parallel zip in pure Java (source):

    gradle runJava <out.zip> <file-or-dir> ..

A fully functional parallel zip in Kotlin (source):

    gradle runKotlin <out.zip> <file-or-dir> ..

Sequential zipping for comparison in pure Java (source):

    gradle runSequential <out.zip> <file-or-dir> ..
