dwarfs

A fast high compression read-only file system for Linux, Windows and macOS

GPL-3.0 License



dwarfs - dwarfs-0.9.9 Latest Release

Published by mhx 6 months ago

Bugfixes

  • A bug introduced by an optimization to skip hashing of large files if they already differ in the first 4 KiB could, under rare circumstances, lead to an unexpected "inode has no file" exception after the scanning phase. This bug did not cause any file system inconsistency issues; mkdwarfs either crashes with the exception, or its output will be correct. Fixes github #217 (see that issue for more details).

Features

  • A sequential access detector was added to the block cache, which can trigger a prefetch of blocks assumed to be read in the future. This improves sequential read throughput roughly by a factor of two. Random access should typically be unaffected. Can be configured / disabled using -o seq_detector.

  • Added tracing support in FUSE driver and dwarfsextract, which allows simple performance analysis using chrome://tracing. Traces can be enabled using -o perfmon_trace and --perfmon-trace.

  • Added performance monitoring and tracing support for the block cache.
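
The idea behind the sequential access detector can be sketched in a few lines of Python. This is only an illustration and not the DwarFS implementation: the class name, the `threshold` parameter and the "prefetch the next block" policy are assumptions made for this example, and the arguments accepted by -o seq_detector are not described here.

```python
from collections import deque
from typing import Optional


class SequentialAccessDetector:
    """Suggests a prefetch once the last `threshold` accesses were consecutive."""

    def __init__(self, threshold: int = 4) -> None:
        self.threshold = threshold  # hypothetical tuning knob
        self.recent = deque(maxlen=threshold)

    def on_access(self, block_no: int) -> Optional[int]:
        """Record an access; return the block to prefetch, or None."""
        self.recent.append(block_no)
        if len(self.recent) < self.threshold:
            return None
        blocks = list(self.recent)
        is_sequential = all(b - a == 1 for a, b in zip(blocks, blocks[1:]))
        return block_no + 1 if is_sequential else None
```

A random access pattern never fills the window with consecutive block numbers, so no prefetch is suggested, which matches the expectation that random access is unaffected.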

Performance

  • Significantly improved the speed of dwarfsck --checksum.

dwarfs - dwarfs-0.9.8

Published by mhx 6 months ago

Bugfixes

  • Build custom version of libcrypto to link with the release binaries in order for them to run properly on FIPS-enabled setups. Fixes github #210.

  • When mounting a DwarFS image on macOS and viewing the volume in Finder, only the directories were shown, but no files. The root cause was that a non-existent extended attribute is reported via a different error code in macOS (ENOATTR) compared to Linux (ENODATA) and the wrong error code was returned for certain Finder-related attributes. Fixes github #211.

  • macOS builds using jemalloc were crashing when calling mallctl("version", ...). The root cause of the crash is still unclear, but as a workaround, the jemalloc version is compiled in from a preprocessor constant rather than using mallctl.
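
The error-code mismatch behind the Finder fix above can be illustrated as follows. This is a sketch, not the driver code: the function name is hypothetical, and the fallback value 93 for ENOATTR is an assumption that is only used when Python's errno module does not define the constant on the current platform.

```python
import errno
import sys


def missing_xattr_errno(platform: str = sys.platform) -> int:
    """Return the errno to report for a non-existent extended attribute."""
    if platform == "darwin":
        # Python only defines ENOATTR on platforms that have it;
        # 93 is assumed here as the macOS value when it is missing.
        return getattr(errno, "ENOATTR", 93)
    # Linux reports a missing attribute as ENODATA.
    return errno.ENODATA
```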

dwarfs - dwarfs-0.9.7

Published by mhx 6 months ago

Bugfixes

  • Handle root uid correctly in access() implementation. Fixes github #204.

Features

  • Show and track library dependencies. Dependencies will be displayed in the command line help; they will also be tracked in the history metadata of a DwarFS image. See also github #207.

Documentation

  • Describe nilsimsa ordering algorithm more accurately.

Performance

  • Reorder branches to improve ricepp speed with real world data.
  • Some tweaks to improve segmenter speed.

dwarfs - dwarfs-0.9.7

Published by mhx 8 months ago

Bugfixes

  • Add workaround for new glog release breaking the folly build. Fixes github #201.

Performance

  • Improve ricepp decoding speed by about 25% on x86 and arm, and up to 100% on Windows. Also improve encoding speed on Windows by 25%. No more need for special hybrid Clang build on Windows.

dwarfs - dwarfs-0.9.5

Published by mhx 8 months ago

Bugfixes

  • Windows path handling was wrong and didn't work properly for e.g. network shares. This is hopefully fixed for all tools now.

dwarfs - dwarfs-0.9.4

Published by mhx 8 months ago

Bugfixes

  • Prevent installation of ricepp headers/libs. Fixes github #195.

  • Don't fetch googletest in ricepp build if the targets are already available. Fixes github #194.

Features

  • Added blocksize option to the FUSE driver, which allows the st_blksize value to be configured for the mounted file system. This can be used to optimize throughput.

  • Added experimental readahead option to the FUSE driver. This can potentially increase throughput when performing sequential reads.

dwarfs - dwarfs-0.9.3

Published by mhx 8 months ago

Bugfixes

  • v0.8.0 removed the implementation of the null decompressor under the assumption that it was no longer used; it was, however, still used when recompressing an image with null-compressed blocks. The change to remove the implementation was reverted and a new test case was added. Fixes github #193.

Performance

  • Some more ricepp compression speed improvements. Also, the universal binaries for x86_64 now automatically choose a ricepp version based on CPU capabilities.

  • For Windows, there's an experimental -ricepp package/binary. This contains a "hybrid" build where the ricepp library was built using clang and everything else using cl. This binary offers significantly faster ricepp compression. Decompression speeds are similar to the regular package/binary. If you don't care about compressing large amounts of FITS files on Windows, just stick to the regular package/binary.

dwarfs - dwarfs-0.9.2

Published by mhx 8 months ago

Bugfixes

  • v0.9.0 introduced an optimization where large files of equal size were only fully hashed for deduplication if the first 4K of their contents also produced the same hash. This introduced a bug causing an exception to be thrown when processing large hard-linked files. The root cause was that the data structure intended to be used for exactly this case was just never populated, and the fix was adding a single line to fill the data structure. The test cases didn't cover large hard-linked files, so this slipped through into the release. A new test case has been added as well.

  • On Windows, when using PowerShell, the error message dialog for a missing WinFsp DLL was not shown when running dwarfs.exe. The workaround is to use the same delayed loading mechanism that's already used for the universal binary and show the error in the terminal. See also the discussion on github #192.
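
The two-stage hashing optimization mentioned in the first fix above can be sketched roughly like this. It is a simplified illustration, not the mkdwarfs code: files are held in memory, SHA-256 stands in for whatever hash is configured, hard links are not modelled, and all names are made up for the example.

```python
import hashlib
from collections import defaultdict

PREFIX_SIZE = 4096  # the "first 4K" from the description above


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group names of files with identical contents; `files` maps name to content."""
    # Stage 1: cheap key made of the size and a hash of the first 4 KiB only.
    by_prefix = defaultdict(list)
    for name, data in files.items():
        key = (len(data), hashlib.sha256(data[:PREFIX_SIZE]).digest())
        by_prefix[key].append(name)

    duplicates = []
    for names in by_prefix.values():
        if len(names) < 2:
            continue  # unique size/prefix: the full hash is never computed
        # Stage 2: full hash, only for files that still look identical.
        by_full = defaultdict(list)
        for name in names:
            by_full[hashlib.sha256(files[name]).digest()].append(name)
        duplicates.extend(sorted(g) for g in by_full.values() if len(g) > 1)
    return sorted(duplicates)
```

Files that differ early are filtered out in the first stage, so only candidates with matching size and prefix pay for a full read.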

Features

  • Added a --list option to dwarfsck. This lists all files in the file system image. When used with --verbose, the list also shows permissions, size, uid/gid and symbolic link information. Fixes github #192.

  • Added a --checksum option to dwarfsck. This produces output similar to the *sum programs from coreutils and can be used to check the contents of a DwarFS image against local files.
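
For reference, the line format used by the coreutils *sum programs is a hex digest, two spaces, and the path. The sketch below produces such lines for in-memory data. It only illustrates that format; the exact output of dwarfsck --checksum is not shown in these notes, and the function name is hypothetical.

```python
import hashlib


def checksum_lines(files: dict[str, bytes], algorithm: str = "sha256") -> list[str]:
    """Format one '<hex digest>  <path>' line per file, as sha256sum does."""
    lines = []
    for path in sorted(files):
        digest = hashlib.new(algorithm, files[path]).hexdigest()
        lines.append(f"{digest}  {path}")
    return lines
```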

dwarfs - dwarfs-0.9.1

Published by mhx 9 months ago

Bugfixes

  • Invalid UTF-8 characters in file paths would crash mkdwarfs if these paths were displayed in the progress output. A possible workaround was to disable progress output. This fix replaces any invalid characters before displaying them. Fixes github #191.

  • The CMakeLists.txt would bail out as soon as it discovered --as-needed in the linker flags. However, --as-needed is only a problem when combined with BUILD_SHARED_LIBS=ON. The check has been changed to only trigger if both conditions are met.
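
The approach taken by the UTF-8 fix above, replacing invalid sequences before a path is displayed, looks like this in Python. The function name is made up for the example, and mkdwarfs itself may choose a different replacement character.

```python
def displayable_path(raw: bytes) -> str:
    """Decode a path for display, replacing invalid UTF-8 with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

Only the displayed string is affected in this sketch; the raw bytes of the path stay untouched.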

Other

  • Minor speed improvements in ricepp compression.

dwarfs - dwarfs-0.9.0

Published by mhx 9 months ago

Only two weeks since the last release, but another major milestone: DwarFS now runs on all major platforms, including macOS. There are no macOS binaries available for download, but the installation procedure is relatively simple and I'm really hoping for a Homebrew formula to be added soon.

The only other change since v0.8.0 is the addition of the ricepp compression algorithm and the fits categorizer, both of which are intended to be used together for efficiently compressing raw data in astrophotography.

dwarfs - dwarfs-0.8.0

Published by mhx 9 months ago

After more than 600 commits, it's time for another major release. In addition to a long list of fixes, there are quite a few new features, most notably a categorization framework that allows identifying different categories of files and treating them differently. Right now, there are only two categorizers — pcmaudio and incompressible — but there are hopefully more to come. Along with the pcmaudio categorizer, support for FLAC compression has been added. This allows for large collections of uncompressed audio files to be archived efficiently, and also accessed efficiently: the DwarFS FUSE driver can decode a large audio file using multiple cores, something that cannot be done with a single compressed FLAC file.

The project code is now tested much more thoroughly; various new abstractions allow the command line interfaces to actually be covered by the unit tests.

Also, unlike many previous releases, images produced by this release will be compatible with older releases as long as they don't use new features like FLAC compression or history sections, which are unsupported by older releases. The 0.7.3 and later releases will even deal with unknown sections and compression algorithms. Going forward, use of new features will be tracked by feature flags, so older releases can determine if the feature set used by a file system image is fully or partially supported.

Last but not least, the binaries can now be built with manual pages built-in. This is particularly useful on Windows, where man is not a thing, but also with the universal binaries if you don't have a full install and need to quickly check the manual. The manuals can be read using the --man option.

New Features

  • Categorizer framework. Initially supported categorizers are pcmaudio (detect audio data & metadata and provide context for FLAC compressor) and incompressible (detects "incompressible" data). Enabled using the --categorize option.

  • Multiple segmenters can now run in parallel and write to the same filesystem image in a fully deterministic way. Currently, a segmenter instance will be used per category/subcategory. This can make segmenting multi-threaded in cases where there are multiple categories. The number of segmenter worker threads can be configured using --num-segmenter-workers.

  • The segmenter now supports different "granularities". The granularity is determined by the categorizer. For example, when segmenting the audio data in a 16-bit stereo PCM file, the granularity is 4 (bytes). This ensures that the segmenter will only produce chunks that start/end on a sample boundary.

  • The segmenter now also features simple "repeating sequence detection". Previously, such sequences could, under certain conditions, cause the segmenter to slow down dramatically. See github #161 for details.

  • FLAC compression. This can only be used along with the pcmaudio categorizer. Due to the way data is spread across different blocks, both FLAC compression and decompression can likely make use of multiple CPU cores for large audio files, meaning that loading a .wav file from a DwarFS image using FLAC compression will likely be much faster than loading the same data from a single FLAC file.

  • Completely new similarity ordering implementation that supports multi-threaded and fully deterministic nilsimsa ordering. Also, nilsimsa options are now ever so slightly more user friendly.

  • The --recompress feature of mkdwarfs has been largely rewritten. It now ensures the input filesystem is checked before an attempt is made to recompress it. Decompression is now using multiple threads. Also, recompression can be applied only to a subset of categories and compression options can be selected per category.

  • mkdwarfs now stores a history block in the output image by default. The history block contains information about the version of mkdwarfs, all command line arguments, and a time stamp. A new history entry will be added whenever the image is altered (i.e. by using --recompress). The history can be displayed using dwarfsck. History timestamps can be disabled using --no-history-timestamps for bit-identical images. History creation can also be completely disabled using --no-history.

  • All tools now come with built-in manual pages. This is valuable especially on Windows, which doesn't have man at all, or for the universal binaries, which are usually not installed alongside the manual pages. Running each tool with --man will show the manual page for the tool, using the configured pager. On Windows, if less.exe is in the PATH, it'll also be used as a pager.

  • New verbose logging level (between info and debug).

  • Logging now properly supports multi-line strings.

  • Show compression library versions as part of the --help output. For dwarfsextract, also show libarchive version.

  • --set-time now supports time strings in different formats (e.g. 20240101T0530).

  • mkdwarfs can now write the filesystem image to stdout, making it possible to directly stream the output image to e.g. netcat.

  • Progress display for mkdwarfs has been completely overhauled. Different components (e.g. hashing, categorization, segmenting, ...) can now display their own progress in addition to a "global" progress.

  • mkdwarfs now supports ordering by "reverse path" with --order=revpath. This is like path ordering, but with the path components reversed (i.e. foo/bar/baz.xyz will be ordered as if it were baz.xyz/bar/foo).

  • It is now possible to configure larger bloom filters in mkdwarfs.

  • The mkdwarfs segmenter can now be fully disabled using -W 0.

  • mkdwarfs now adds "feature sets" to the filesystem metadata. These can be used to introduce new features without necessarily breaking compatibility with older tools. As long as a filesystem image doesn't actively use the new features, it can still be read by old tools. Addresses github #158.

  • dwarfsck has a new --quiet option that will only report errors.

  • dwarfsck with --print-header will exit with a special exit code (2) if the image has no header. In all other cases, the exit code will be 0 (no error) or 1 (error).

  • The --json option of dwarfsck now outputs filesystem information in JSON format.

  • dwarfsck has a new --no-check option that skips checking all block hashes. This is useful for quickly accessing filesystem information.

  • The FUSE driver exposes a new dwarfs.inodeinfo xattr on Linux that contains a JSON object with information about the inode, e.g. a list of chunks and associated categories.

  • Don't enable readlink in the FUSE driver if filesystem has no symlinks. This is mainly useful for Windows where symlink support increases the number of getattr calls issued by WinFsp.

  • As an experimental feature, CPU affinity for each worker group can be configured via the DWARFS_WORKER_GROUP_AFFINITY environment variable. This works for all tools, but is really only useful if you have different types of cores (e.g. performance and efficiency cores) and would like to e.g. always run the segmenter on a performance core.

  • The universal binaries are now compressed with a different upx compression level, making them slightly bigger but much faster to decompress.
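
Two of the features above, "reverse path" ordering and segmenter granularity, can be illustrated with small helper functions. Both are sketches under assumptions: the helper names are invented, "/" is assumed as the path separator, and the real mkdwarfs ordering may compare paths differently in edge cases.

```python
def revpath_key(path: str) -> list[str]:
    """Sort key that compares paths by their components in reverse order."""
    return path.split("/")[::-1]


def align_down(offset: int, granularity: int) -> int:
    """Round an offset down to the nearest multiple of the granularity."""
    return offset - offset % granularity


paths = ["foo/bar/baz.xyz", "a/z.txt", "b/a.txt"]
ordered = sorted(paths, key=revpath_key)
```

With a granularity of 4, as in the 16-bit stereo PCM example, align_down keeps a candidate chunk boundary on a sample boundary.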

Bugfixes

  • Allow version override for nixpkgs. Fixes github #155.

  • Resize progress bar when terminal size changes. Fixes github #159.

  • Add Extended Attributes section to README. Fixes github #160.

  • Support 32-bit uid/gid/mode. Also support more than 65536 uids/gids/modes in a filesystem image. Fixes github #173.

  • Add workaround for broken utf8cpp release. Fixes github #182.

  • Don't call check_section() in filesystem ctor, as it renders the section index useless. Also add regression test to ensure this won't be accidentally reintroduced. Fixes github #183.

  • Ensure timely exit in progress dtor. This could occasionally block command line tools for a few seconds before exiting.

  • --set-owner and --set-group did not work properly with non-zero ids. There were two distinct issues: (1) when building a DwarFS image with --set-owner and/or --set-group, the single uid/gid was stored in place of the index and the respective lookup vectors were left empty and (2) when reading such a DwarFS image, the uid/gid was always set to zero. The issue with (1) is not only that it's a special case, but it also wastes metadata space by repeatedly storing a potentially wide integer value. This fix addresses both issues. The uid/gid information is now stored more efficiently and, when reading an image using the old representation, the correct uid/gid will be reported. Unit tests were added to ensure both old and new formats are read correctly.

  • mkdwarfs is now much better at handling inaccessible or vanishing files. In particular on Windows, where a successful access() call doesn't necessarily mean it'll be possible to open a file, this will make it possible to create a DwarFS file system from hierarchies containing inaccessible files. On other platforms, this means mkdwarfs can now handle files that are vanishing while the file system is being built.

  • mkdwarfs progress updates are now "atomic", i.e. one update is always written with a single system call. This didn't make much of a difference on Linux, but the notoriously slow Windows terminal, along with somewhat interesting thread scheduling, would sometimes make the updates look like a typewriter in slow-motion.

  • utf8_truncate() didn't handle zero-width characters properly. This could cause issues when truncating certain UTF-8 strings.

  • A race condition in simple progress mode was fixed.

  • A race condition in filesystem_writer was fixed.

  • The --no-create-timestamp option in mkdwarfs was always enabled and thus useless.

  • Common options (like --log-level) were inconsistent between tools.

  • Progress was incorrect when mkdwarfs was copying sections with --recompress.

  • Treat NTFS junctions like directories.

  • Fix canonical path on Windows when accessing mounted DwarFS image.

  • Fix slow sorting in file_scanner due to path comparison.

  • On Windows, don't crash with an assertion if the input path for mkdwarfs is not found.
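
The index plus lookup vector representation referred to in the --set-owner / --set-group fix above can be sketched as follows. This shows the general technique only; the actual DwarFS metadata layout is not described in these notes, and the function name is hypothetical.

```python
def build_id_table(ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (lookup, indices) so that lookup[indices[i]] == ids[i]."""
    lookup = []      # each distinct uid/gid is stored exactly once
    position = {}    # uid/gid -> its index in `lookup`
    indices = []     # one small index per inode instead of a wide integer
    for value in ids:
        if value not in position:
            position[value] = len(lookup)
            lookup.append(value)
        indices.append(position[value])
    return lookup, indices
```

When every inode has the same owner, the lookup vector has a single entry and every index is 0, which avoids repeatedly storing a potentially wide integer.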

Removed Features

  • Python scripting support has been completely removed.

Documentation

  • Add mkdwarfs sequence diagram.

  • Document known issues with WinFsp.

  • Update README with extended attributes information.

  • Add script to check if all options are documented in manpage.

Building

  • Factor out repetitive thrift library code in CMakeLists.txt.

  • Use FetchContent for both fmt and googletest.

  • Use mold for linking when available.

  • The CI workflow now uploads coverage information to codecov.io with every commit.

Testing

  • A ton of tests were added (from 4 kLOC to more than 10 kLOC) and, unsurprisingly, a number of bugs were found in the process.

  • Introduced I/O abstraction layer for all *_main() functions. This allows testing of almost all tool functionality without the need to start the tool as a subprocess. It also makes it easier to inject errors and to change properties such as the terminal size.
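
The general pattern of such an I/O abstraction is that the entry point only talks to the streams it is handed. The Python sketch below shows the pattern, not the DwarFS C++ interface; tool_main and run_in_process are hypothetical names.

```python
import io
from typing import TextIO


def tool_main(argv: list[str], out: TextIO, err: TextIO) -> int:
    """Hypothetical tool entry point that only uses the streams it is given."""
    if len(argv) < 2:
        err.write(f"usage: {argv[0]} <name>\n")
        return 1
    out.write(f"hello, {argv[1]}\n")
    return 0


def run_in_process(argv: list[str]) -> tuple[int, str, str]:
    """Run tool_main without a subprocess and capture what it wrote."""
    out, err = io.StringIO(), io.StringIO()
    code = tool_main(argv, out, err)
    return code, out.getvalue(), err.getvalue()
```

A test can then assert on the exit code and the captured output directly, and a failing stream can be substituted to inject errors.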

dwarfs - dwarfs-0.7.5

Published by mhx 9 months ago

Bugfixes

  • Fix crash in the FUSE driver on Windows when tools like Notepad++ try to access a file like a directory (presumably because this works in cases where the file is an archive). This is a Windows-only issue because the Linux FUSE driver uses the inode-based API, whereas the Windows driver uses the string-based API. While parsing a path in the string-based API, there was no check whether a path component was a directory before trying to descend further.
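
The missing check can be illustrated with a small path lookup over a nested dictionary. This is a sketch of the idea only: the tree representation (dict for a directory, bytes for a file) and the "/" separator are assumptions for the example and do not reflect the driver's data structures.

```python
from typing import Optional, Union

Node = Union[dict, bytes]  # dict = directory, bytes = file contents


def lookup(root: dict, path: str) -> Optional[Node]:
    """Resolve a '/'-separated path; return None if it cannot be resolved."""
    node = root
    for component in filter(None, path.split("/")):
        if not isinstance(node, dict):
            return None  # a file cannot be descended into
        if component not in node:
            return None
        node = node[component]
    return node
```

Without the isinstance check, a request such as /docs/readme.txt/inner would try to index into a file, which corresponds to the crash described above.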

Other

  • The universal binaries have been compressed using a different compression level (-9 instead of --best --ultra-brute) in upx. The compression ratio is slightly worse, but the decompression speed is significantly faster.

dwarfs - dwarfs-0.7.4

Published by mhx 10 months ago

Bugfixes

  • Fix regression that broke section index optimization introduced in v0.7.3. Fixes github #183.
  • Add workaround for broken utf8cpp release. Fixes github #182.

dwarfs - dwarfs-0.7.3

Published by mhx 11 months ago

This is a small incremental update over the 0.7.2 release adding a single new feature: forward compatibility. This means that the 0.7.3 release will be able to handle DwarFS file system images created with newer releases as long as these images don't use features that are not understood by the older binaries. Up until now, support for new features often triggered a file system version increment, rendering the images unusable with older binaries even if the features weren't actually used in the image. This fixes #158.
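
The compatibility rule described here boils down to a set comparison: an image is readable as long as it uses no feature the reader does not know. The sketch below uses made-up feature names and function names and says nothing about how DwarFS actually encodes this information.

```python
SUPPORTED_FEATURES = frozenset({"feature-a", "feature-b"})  # hypothetical names


def unsupported_features(image_features: set[str],
                         supported: frozenset[str] = SUPPORTED_FEATURES) -> set[str]:
    """Return the features an image uses that this reader does not understand."""
    return set(image_features) - supported


def can_read(image_features: set[str]) -> bool:
    """An image is readable if it uses no unsupported feature."""
    return not unsupported_features(image_features)
```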

dwarfs - dwarfs-0.7.2

Published by mhx about 1 year ago

Bugfixes

  • Fix locale fallback if user-default locale cannot be set. Fixes github #156.

dwarfs - dwarfs-0.7.1

Published by mhx about 1 year ago

Bugfixes

  • Fix potential division by zero crash in speedometer.

Other

  • New tool header.

  • Source code cleanups.

  • Updated static build procedure (see README).

dwarfs - dwarfs-0.7.0

Published by mhx over 1 year ago

This release took much longer than anticipated, but comes with a rather big surprise (for me, at least): Windows support! I didn't expect this to happen just yet, especially given that I haven't really used Windows over the past two decades. My biggest worries were all the dependencies, but fortunately I came across vcpkg and all of a sudden, porting DwarFS to Windows seemed feasible. So here we are, and all the different tools (mkdwarfs, dwarfsck, dwarfsextract and the FUSE driver dwarfs) are now working on Windows.

As of this release, in addition to the "classic" statically linked binaries, DwarFS is also available as a universal binary for each platform. The universal binaries bundle the four main tools (mkdwarfs, dwarfsck, dwarfsextract, dwarfs) in a single, compressed binary that is between 2.5 and 4 MiB in size, a fraction of the size of the standalone binaries. The tools can be accessed either by passing the --tool=<name> option as the first argument, or, more conveniently, by creating symbolic links to the universal binary using the name of the respective tool.
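
The two ways of selecting a tool can be sketched as a small dispatch function. This is an illustration only: the precedence between --tool=<name> and the invoked name, and the handling of unknown names, are assumptions rather than documented behaviour of the universal binary.

```python
import os
from typing import Optional

TOOLS = ("mkdwarfs", "dwarfsck", "dwarfsextract", "dwarfs")


def select_tool(argv: list[str]) -> Optional[str]:
    """Pick the tool from --tool=<name> (first argument) or the invoked name."""
    if len(argv) > 1 and argv[1].startswith("--tool="):
        name = argv[1][len("--tool="):]
        return name if name in TOOLS else None
    # Otherwise fall back to the name the binary was started as,
    # e.g. through a symbolic link; a file extension is ignored.
    name = os.path.splitext(os.path.basename(argv[0]))[0]
    return name if name in TOOLS else None
```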

New Features

  • Windows support. All tools are fully working on Windows, including features such as hard links, symbolic links, and Unicode file names. Thanks to WinFsp, the FUSE driver is also working, albeit with a few quirks (1, 2, 3, 4) compared to the Linux version.

  • Universal binaries that bundle all tools in a single binary. On Windows, the universal binary supports delayed loading of WinFsp DLL. This makes the mkdwarfs, dwarfsck and dwarfsextract tools usable without the WinFsp DLL.

  • Added support for Brotli compression. This is generally much slower at compression than ZSTD or LZMA, but faster than LZMA at decompression, while offering a compression ratio better than ZSTD. Fixes github #76.

  • Added --filter option to support simple (rsync-like) filter rules. This resulted from a discussion on github #6.

  • Added --compress-niceness option to mkdwarfs. This lowers the priority of the compression worker threads, which has two advantages: a system running mkdwarfs will generally be more responsive, and the compression threads won't starve themselves by taking processing power away from the segmenter.

  • Added --stdout-progress option to dwarfsextract for use with tools such as yad. Fixes github #117.

  • Added --chmod option to mkdwarfs. Fixes github #7.

  • Added --input-list option to support reading a list of input files from a file or stdin. At least partially fixes github #6.

  • Added support for choosing the file hashing algorithm using the --file-hash option. This allows you to pick a secure hash instead of the default XXH3 hash. Also fixes github #92.

  • Added --max-similarity-size option to prevent similarity hashing of huge files. This saves scanning time, especially on slow file systems, while it shouldn't affect compression ratio too much.

  • Added --num-scanner-workers option.

  • Added support for extracting corrupted file systems with dwarfsextract. This is enabled using the --continue-on-error and, if really needed, --disable-integrity-check options. Fixes github #51.

  • Show throughput in the scanning and segmenting phases in mkdwarfs.

  • Show how much of a file has been consumed in the segmenting phase in mkdwarfs. Useful primarily for large files.

  • New metadata format (v2.5). The only change is the addition of a "preferred path separator". This is used to correctly interpret symbolic links, as this is the only place where path separators are stored in DwarFS at all.

  • dwarfs and dwarfsextract now have options to enable performance monitoring. This can provide insight into the latency of various file system operations.

  • Unreadable files are now added as empty files instead of being ignored. Fixes github #40.

  • Honour user locale settings when formatting numbers.

Performance improvements

  • Added a small offset cache to improve random access as well as sequential read latency for large, fragmented files. This gave a 100x higher throughput for a case where DwarFS was used to compress raw file system images. The DwarFS FUSE driver is now capable of achieving read throughput of more than 6 GB/s on a Xeon(R) E-2286M machine.

  • Bypass the block cache for uncompressed blocks. This saves copying block data to memory unnecessarily and allows us to keep all uncompressed blocks accessible directly through the memory mapping. Partially addresses github #139.

  • Improved de-duplication algorithm to only hash files with the same size. File hashing is delayed until at least one more file with the same size is discovered. This happens automatically and should improve scanning speed, especially on slow file systems.
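
The delayed hashing described in the last item can be sketched like this. It is a simplified model, not the mkdwarfs scanner: file contents are kept in memory instead of being re-read from disk, SHA-256 stands in for the configured file hash, and the class is invented for the example.

```python
import hashlib
from typing import Optional


class LazyDeduplicator:
    """Hashes a file only once a second file of the same size shows up."""

    def __init__(self) -> None:
        self.first_of_size = {}  # size -> (name, data) of the first file, or None once hashed
        self.by_hash = {}        # (size, digest) -> name of the first file with that content
        self.hash_count = 0      # number of hash computations, for illustration

    def _hash(self, data: bytes) -> bytes:
        self.hash_count += 1
        return hashlib.sha256(data).digest()

    def add(self, name: str, data: bytes) -> Optional[str]:
        """Return the name of an earlier identical file, or None."""
        size = len(data)
        if size not in self.first_of_size:
            self.first_of_size[size] = (name, data)  # unique size so far: no hashing
            return None
        pending = self.first_of_size[size]
        if pending is not None:
            # A second file of this size appeared: hash the delayed first one now.
            first_name, first_data = pending
            self.by_hash[(size, self._hash(first_data))] = first_name
            self.first_of_size[size] = None
        key = (size, self._hash(data))
        if key in self.by_hash:
            return self.by_hash[key]
        self.by_hash[key] = name
        return None
```

Files whose size is unique are never hashed at all, which is where the scanning time is saved.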

Bugfixes

  • Use folly::hardware_concurrency(). Fixes github #130.

  • Handle ARCHIVE_FAILED status from libarchive, which could be triggered by trying to write long path names to old archive formats (e.g. USTAR, which has a limit of at most 255 characters).

  • Properly handle Unicode path truncation.

  • Support LZ4 compression levels above 9.

  • Fix heap-use-after-free in dwarfsextract due to missing archive_write_close() call.

  • Fix heap-use-after-free in brotli decompressor due to re-allocation of the decompressed block data.

  • Default FUSE driver debuglevel to warn in background mode. Fixes github #113.

  • Fixed extract_block.py, which was incorrectly using printf instead of print.

Documentation

Testing

  • Lots of new tools tests.

  • Removed dependency on tar and diff binaries, mainly driven by their unavailability on Windows.

  • Added GitHub workflow based CI pipeline to avoid regressions and simplify builds.

Other

  • The compression code has been made more modular. This should make it much easier to add support for more compression algorithms in the future.

  • Started using C++20 features.

  • Versioning files are no longer written to the git source tree.

dwarfs - dwarfs-0.7.0-RC6

Published by mhx over 1 year ago

Features

  • Support delayed loading of WinFsp DLL for universal binary. This makes the mkdwarfs, dwarfsck and dwarfsextract tools of the universal binary usable without the WinFsp DLL.

Performance

  • Optimized the offset cache to improve random read latency as well as sequential read latency. This gave a 100x higher throughput for a case where DwarFS was used to compress raw file system images. Fixes github #142.

Bugfixes

  • Fixed building with make instead of ninja. Also fix building in Debug mode. Fixes github #146.
  • Fixed ninja clean.
  • Fixed symlink creation for mount.dwarfs/mount.dwarfs2.

Other

  • Added CI pipeline.
  • Don't write versioning files to source tree.

dwarfs - dwarfs-0.7.0-RC5

Published by mhx over 1 year ago

Features

  • Windows support. All tools can now be built and run on Windows, including the FUSE driver, which makes use of WinFsp. Also fixes github #85.
  • Build a "universal" binary that combines mkdwarfs, dwarfsck, dwarfsextract and dwarfs in a single binary. This binary can be used either through symbolic links with the proper names of the tool, or by passing --tool=<name> as the first argument on the command line.
  • Bypass the block cache for uncompressed blocks. This saves copying block data to memory unnecessarily and allows us to keep all uncompressed blocks accessible directly through the memory mapping. Partially addresses github #139.
  • Show throughput in the scanning and segmenting phases in mkdwarfs.
  • Show how much of a file has been consumed in the segmenting phase. Useful primarily for large files.
  • dwarfs and dwarfsextract now have options to enable performance monitoring. This can give insight into the latency of various file system operations.
  • Added inode offset cache, which improves read() latency for very fragmented files.

Bugfixes

  • Use folly::hardware_concurrency(). Fixes github #130.
  • Handle ARCHIVE_FAILED status from libarchive, which could be triggered by trying to write long path names to old archive formats.
  • Properly handle Unicode path truncation.

Documentation

  • Update file system format documentation to cover headers and section indices.

Testing

  • Lots of new tools tests.
  • Remove dependency on tar and diff binaries.

Other

  • Switch to C++20.

dwarfs - dwarfs-0.7.0-RC4

Published by mhx almost 2 years ago

Features

  • Add --compress-niceness option to mkdwarfs.