dwarfs

dwarfs - dwarfs-0.7.0-RC3

Published by mhx almost 2 years ago

Bugfixes

Fix heap-use-after-free in dwarfsextract.
Fix dwarfs benchmark binary.

Features

Add --stdout-progress option to dwarfsextract. Fixes github #117.

Other

Reduce amount of test data to speed up compiles and avoid timeouts on travis.

dwarfs - dwarfs-0.7.0-RC2

Published by mhx almost 2 years ago

Bugfixes

Fix linking against compression libs. Fixes github #112.
Default FUSE driver debuglevel to warn in background mode. Fixes github #113.

Features

Add --chmod option. Fixes github #7.
Add unreadable files as empty files. Fixes github #40.

Documentation

Document how to produce bit-identical images
Update internal operation section of mkdwarfs manpage
Add more documentation details for --file-hash option

Other

Test image reproducibility for path and similarity ordering

dwarfs - dwarfs-0.7.0-RC1

Published by mhx almost 2 years ago

Bugfixes

Fixed extract_block.py, which was incorrectly using printf instead of print.
Support LZ4 compression levels above 9.

Features

Added --filter option to support simple (rsync-like) filter rules. This was driven by a discussion on github #6.
Added --input-list option to support reading a list of input files from a file or stdin. At least partially fixes github #6.
The compression code has been made more modular. This should make it much easier to add support for more compression algorithms in the future.
Added support for Brotli compression. This is generally much slower at compression than ZSTD or LZMA, but faster than LZMA at decompression, while offering a compression ratio better than ZSTD. Fixes github #76.
Added support for choosing the file hashing algorithm using the --file-hash option. This allows you to pick a secure hash instead of the default XXH3. Also fixes github #92.
Improved de-duplication algorithm to only hash files with the same size. File hashing is delayed until at least one more file with the same size is discovered. This happens automatically and should improve scanning speed, especially on slow file systems.
Added --max-similarity-size option to prevent similarity hashing of huge files. This saves scanning time, especially on slow file systems, while it shouldn't affect compression ratio too much.
Honour user locale when formatting numbers.
Added --num-scanner-workers option.
Added support for extracting corrupted file systems with dwarfsextract. This is enabled using the --continue-on-error and, if really needed, --disable-integrity-check options. Fixes github #51.
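
The size-first de-duplication described above can be sketched roughly as follows. This is an illustrative sketch only, not the mkdwarfs implementation: the function name `find_duplicates` and the in-memory `files` mapping are hypothetical, and SHA-256 from the Python standard library stands in for whichever hash is selected with --file-hash.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group files with identical content, hashing only when sizes collide.

    `files` maps a name to its content as bytes (a stand-in for real files).
    Returns (groups, hashed): lists of duplicate names, and the names hashed.
    """
    first_of_size = {}             # size -> first file of that size, not hashed yet
    by_digest = defaultdict(list)  # (size, digest) -> names with that content
    hashed = []

    def digest(name):
        hashed.append(name)
        return hashlib.sha256(files[name]).hexdigest()

    for name, data in files.items():
        size = len(data)
        if size not in first_of_size:
            # Unique size so far: remember the file and delay hashing.
            first_of_size[size] = name
            continue
        pending = first_of_size[size]
        if pending is not None:
            # A second file of this size appeared: hash the delayed file now.
            by_digest[(size, digest(pending))].append(pending)
            first_of_size[size] = None
        by_digest[(size, digest(name))].append(name)

    groups = [names for names in by_digest.values() if len(names) > 1]
    return groups, hashed
```

A file whose size is unique in the input is never read for hashing at all, which is where the scanning speedup on slow file systems would come from.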

Other

Added unit tests for progress class.
Lots of internal cleanups.

dwarfs - dwarfs-0.6.2

Published by mhx almost 2 years ago

Bugfixes

Fix #91: image creation reproducibility. Add --no-create-timestamp option, produce deterministic inode numbers and fix fsst bug that causes symbol tables to be non-deterministic. Images built while omitting create timestamps will now be bit-identical.
Fix #93: only overwrite existing output file when --force option given on command line.
Fix #104: extracting large files was causing dwarfsextract to OOM. This was fixed by extracting large files in chunks rather than all at once.
Fix #105: handle strrchr() return NULL.
Fix out-of-bounds access (PR #106).
Fix swapped-out cached block detection (PR #107).
Fix data race in cached block that was triggered by statistics collection and could cause the process to crash.
Fix heap-use-after-free when writing section index.
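
The idea behind the fix for #104 (extracting in chunks rather than all at once) can be illustrated with a minimal sketch. The helper name `copy_in_chunks` and the 1 MiB default chunk size are assumptions for illustration, not taken from dwarfsextract.

```python
import io


def copy_in_chunks(src, dst, chunk_size=1 << 20):
    """Copy src to dst, holding at most chunk_size bytes in memory at a time."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of input
        dst.write(chunk)
        total += len(chunk)
    return total


# Usage: peak memory is bounded by chunk_size, not by the file size.
source = io.BytesIO(b"x" * 2500)
target = io.BytesIO()
copied = copy_in_chunks(source, target, chunk_size=1024)
```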

dwarfs - dwarfs-0.6.1

Published by mhx over 2 years ago

Bugfixes

Fix binary installation. This caused the 0.6.0 binary release to contain test binaries as well as duplicate binaries.
The fuse2 driver (dwarfs2) was also missing in the 0.6.0 binary release.

dwarfs - dwarfs-0.6.0

Published by mhx over 2 years ago

Features

Add support for cache tidying, which releases cache memory when the mounted file system is unused.
Section index support for speeding up mount times (fixes #48).

Bugfixes

Fix and simplify static builds as much as possible. Document how to set up a static build environment. This also fixes #75 and #54. Huge shoutout to Maxim Samsonov (@maxirmx) for implementing most of this!
Fix #71: driver hangs when unmounting
Fix #67: dwarfs I/O hangs if call to fuse_reply_iov fails
Fix #86: block size bits config issues
Various build fixes.

dwarfs - dwarfs-0.5.6

Published by mhx over 3 years ago

Bugfixes

Build fixes for gcc-11 (fixes #52)
Use REALPATH in version.cmake to fix building in symbolically linked repositories (fixes #47).

dwarfs - dwarfs-0.5.5

Published by mhx over 3 years ago

Features

If a filesystem block cannot be compressed to less than the uncompressed size, it will be stored uncompressed. This feature actually fixes the bug described below.
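
A minimal sketch of this fallback, using Python's standard lzma module. The function names `pack_block` and `unpack_block` and the boolean flag are hypothetical; the actual DwarFS block format is not shown here.

```python
import lzma
import os


def pack_block(data):
    """Return (is_compressed, payload) for one block of bytes."""
    compressed = lzma.compress(data)
    if len(compressed) >= len(data):
        # Compression did not help: store the block as-is.
        return False, data
    return True, compressed


def unpack_block(is_compressed, payload):
    """Reverse pack_block."""
    return lzma.decompress(payload) if is_compressed else payload


# Usage: repetitive data is compressed, high-entropy data is stored raw.
text = b"abc" * 10000
noise = os.urandom(4096)
text_flag, text_payload = pack_block(text)
noise_flag, noise_payload = pack_block(noise)
```

With this rule, a stored block is never larger than its input, so the worst-case size is known without relying on the compressor's own bound.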

Bugfixes

When building a filesystem from high entropy input data (e.g. already compressed files), and when using LZMA compression with block sizes >= 25, the LZMA algorithm could be unable to pack a block into the worst-case allocated size. This behaviour was not expected and crashed mkdwarfs, and seems to me like a bug in LZMA's lzma_stream_buffer_bound() function. The issue has been fixed by not compressing blocks at all if the compressed size matches or exceeds the uncompressed size. This fixes part of github #45.
Filesystems created such that after segmenting the total data size was a multiple of the block size (i.e. the last block was completely filled) had the last block written to the image twice. Such a filesystem image is perfectly usable, but the repeated block uses space unnecessarily. This is highly unlikely to happen with real data.
Filesystems created with -P shared_files, but no shared files in the source tree, were created correctly, but could not be loaded. This has been fixed and the filesystems can now be loaded correctly.

Other

Added tests for binaries and FUSE driver.
Minor code cleanups.

dwarfs - dwarfs-0.5.4

Published by mhx over 3 years ago

Bugfixes

Fixed a hang in the FUSE driver when accessing files while the driver was not started in foreground or debug mode. This bug was present in both the 0.5.2 and 0.5.3 releases. Fixes github #44.

dwarfs - dwarfs-0.5.3

Published by mhx over 3 years ago

Bugfixes

Add PREFER_SYSTEM_GTEST for distributions (like Gentoo) that have a gtest package. (fixes github #42)
Make sure the source tarball can be built inside a git repo. The version file generation code would attempt to pull information from any outside git repository without checking if it's actually the DwarFS repo. This issue came up when building Arch Linux packages.

dwarfs - dwarfs-0.5.2

Published by mhx over 3 years ago

Bugfixes

Make FUSE driver exit with non-zero exit code if filesystem cannot be mounted. Fixes github #41.

dwarfs - dwarfs-0.5.1

Published by mhx over 3 years ago

Bugfixes

fsst library was built with -march=native, which caused the static binaries not to work on non-AVX platforms. The fsst library is now being built with no extra flags.

dwarfs - dwarfs-0.5.0

Published by mhx over 3 years ago

New Features

New metadata format (v2.3). This includes a number of changes:
- Correct hardlink preservation. With older metadata formats, all duplicate files would appear hardlinked. The new format preserves hardlinked files exactly as present in the input data, and performs additional deduplication at a lower level.
- The new format offers a lot of customization for additional packing of metadata. You can use these to trade off metadata size, mounting speed, etc. Especially for filesystems with millions of files, the metadata size can be reduced significantly.
- In particular, filename and symlink data can be stored in a format that reduces the size by roughly a factor of two, but still allows for random access, so the compressed data can be mapped into memory and decompressed on the fly.
DwarFS now directly supports images using a custom header. The header can be completely arbitrary. mkdwarfs can write, replace or remove such headers, and all other tools can either skip to a specified offset, or determine this offset automatically. This fixes github #38.
dwarfsck has been improved to perform extensive metadata checks.
dwarfsck now shows a detailed breakdown of metadata memory usage, which can be used to optimize metadata packing options.
Added ENABLE_COVERAGE cmake option.

Performance improvements

Scanning has been significantly optimized and is now up to three times faster on average.
Digest computation has been parallelized in both mkdwarfs and dwarfsck giving better performance on multi-core systems.
A set of micro-benchmarks has been added to evaluate the performance of different filesystem operations. This can be built by enabling the -DWITH_BENCHMARKS=1 cmake option.
Zstd contexts are now reused during compression, which seems to give some minor speedup.

Bugfixes

Disable multiversioning on non-x86 platforms, which broke the ARM build.
Due to a bug in the bloom filter code, only half of each 64-bit block in the bloom filter was utilized, which reduced the efficiency of the filter. The bug was spotted thanks to ubsan. With the fixed filter being twice as effective, the default size of the bloom filter has now been halved.
When exporting metadata using --export-metadata, dwarfsck was not truncating the output file, which could lead to a corrupt metadata export.

Other

Compatibility testing with older filesystem versions has been improved.
A new test suite has been added to check detection of corrupted DwarFS images.
Added some high level internals documentation for mkdwarfs.
Documented the filesystem and metadata formats.
Lots of internal cleanups.

dwarfs - dwarfs-0.4.1

Published by mhx over 3 years ago

Performance improvements

Binaries built with gcc have traditionally been much slower than those built with clang, but it was unclear why that was the case. It turns out the reason is simply that CMake defaults to -O3 optimization, which is known to cause performance regressions in some cases. The build has been changed to always build with -O2 when doing an optimized GCC build. The Clang build is unaffected. (fixes github #14)
The segmenting code now uses a bloom filter to discard unsuccessful matches as early and quickly as possible. While this only gives a minor speedup when using a single lookback block, speed is now barely affected as you increase the number of lookback blocks, whereas before it would slow down significantly. The bloom filter size (relative to the number of values) can be tuned by using --bloom-filter-size, though increasing it any further from the default is unlikely to make a difference.
nilsimsa similarity computation has been improved to make use of different instruction sets depending on CPU architecture, speeding up the process of ordering files by similarity by almost a factor of 2.
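
To illustrate why a bloom filter lets the segmenter reject most non-matches cheaply, here is a small generic sketch. It is not the DwarFS implementation: the class name, the use of SHA-256 to derive bit positions, and the default of two hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # at most 4 with the 32-byte digest below
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, value):
        digest = hashlib.sha256(value).digest()
        for i in range(self.num_hashes):
            chunk = digest[8 * i: 8 * i + 8]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent", so the expensive lookup is skipped.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Usage: only values that pass the filter need a full hash-table lookup.
seen = BloomFilter(num_bits=1024)
seen.add(b"segment-1")
```

Because a negative answer is always correct, the cost of a failed match stays roughly constant no matter how many lookback blocks contribute values to the filter.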

Bugfixes

Linking against libarchive was fixed so that it also works for shared library builds. (fixes github #36)
mkdwarfs didn't catch certain exceptions correctly, which would cause a stack trace instead of a simple error message. This has been fixed.
The statically linked executables were unable to handle any exceptions at all due to duplicate stack unwinding code. This has (hopefully) been fixed now.

dwarfs - dwarfs-0.4.0

Published by mhx over 3 years ago

Up to twice as fast and up to 10% better compression

The segmenting algorithm has been completely rewritten and is now much cleaner, uses much less memory, is significantly faster and detects a lot more duplicate segments. At the same time it's easier to configure (just a single window size instead of a list).

As a result, mkdwarfs speed has been significantly improved. The 47 GiB worth of Perl installations can now be turned into a DwarFS image in less than 6 minutes, about 30% faster than with the 0.3.1 release. Using lzma compression, it actually takes less than 4 minutes now, almost twice as fast as 0.3.1.

At the same time, compression ratio also significantly improved, mostly due to the new segmenting algorithm. With the 0.3.1 release, using the default configuration, the 47 GiB of Perl installations compressed down to 471.6 MiB. With the 0.4.0 release, this has dropped to 426.5 MiB, a 10% improvement. Using lzma compression (-l9), the size of the resulting image went from 319.5 MiB to 300.9 MiB, about 5% better. More importantly, though, the uncompressed file system size dropped from about 7 GiB to 4 GiB thanks to improved segmenting, which means fewer blocks need to be decompressed on average when using the file system.
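
For reference, the percentages quoted above follow from the image sizes as a simple relative reduction. The helper name `reduction_percent` is only for illustration.

```python
def reduction_percent(before, after):
    """Relative size reduction, in percent of the original size."""
    return (before - after) / before * 100


default_gain = round(reduction_percent(471.6, 426.5), 1)  # default configuration
lzma_gain = round(reduction_percent(319.5, 300.9), 1)     # lzma, -l9
print(default_gain, lzma_gain)  # prints "9.6 5.8"
```

These round to the "10%" and "about 5%" figures given in the text.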

New `dwarfsextract` tool

The new tool allows extracting a file system image directly to disk without having to use the FUSE driver. It also allows conversion of the file system image directly into a standard archive format (e.g. tar or cpio). Extracting a DwarFS image can be significantly faster than extracting an equivalent compressed archive.

Options have been cleaned up

The --blockhash-window-sizes and --blockhash-increment-shift options were replaced by --window-size and --window-step, respectively. The new --window-size option takes only a single window size instead of a list. There's also a new option --max-lookback-blocks that allows duplicate segments to be detected across multiple blocks, which can result in significantly better compression when using small file system blocks.

Bugfixes

The rewrite of the segmenting algorithm was triggered by a "bug" (github #35) that caused excessive memory consumption in mkdwarfs. It wasn't really a bug, though, more like a bad algorithm that used memory proportional to the file size. This issue has now been fully solved.
Scanning of large files would excessively grow the RSS of mkdwarfs. The memory would sooner or later have been reclaimed by the kernel, but the code now actively releases the memory while scanning.
The project can now be built to use the system installed zstd and xxHash libraries. (fixes github #34)
The project can now be built without the legacy FUSE driver. (fixes github #32)

dwarfs - dwarfs-0.3.1

Published by mhx almost 4 years ago

Bugfix release

This fixes a couple of minor compilation issues mostly related to issue #31.

dwarfs - dwarfs-0.3.0

Published by mhx almost 4 years ago

Even better compression than before

Mostly thanks to a new ordering algorithm that is now enabled by default, I've seen a 15% improvement in achievable compression ratio. In my standard test of packing 48 GiB of Perl installations, the resulting DwarFS image size reduced from 556 MiB to 472 MiB without any regression in compression speed.

More memory efficient FUSE driver

By switching to jemalloc, the FUSE driver has become much more memory efficient, using up to ten times less memory than with the standard glibc allocator.

Python scripting support

The Lua scripting interface has been fully replaced by a new Python interface. I've been looking for a luabind replacement, but none of the candidates seemed to be well maintained or reasonably easy to integrate. Python is much more approachable for most people and boost::python seems well maintained. The new interface also has a lot more features. You can find an example script in the distribution.

Fix for file system images created with versions before dwarfs-0.2.3

If you've created DwarFS images with the 0.2.0, 0.2.1 or 0.2.2 releases, symbolic links were stored in a way that the FUSE driver in the 0.2.x releases could not read them back correctly. With the new 0.3.0 release, these old images, including the symbolic links, can now be read again, so there's no need to rebuild your old images.

Improved file system format

The file system format has been updated with the 0.3.0 release to include integrity checking via SHA2-512/256 hashes as well as features that should make recovery easier in case of file system image corruption. In addition to the SHA hashes, the extremely fast xxHash library is used to store a second hash that is checked every time any part of the file system is used. While there are currently no recovery features implemented, having this data in the file system already should be really valuable. You can convert an old image to the new format using:

mkdwarfs -i old.dwarfs -o new.dwarfs --recompress none

Statically linked 64-bit Linux binaries available

Given the long list of dependencies, building DwarFS might not be an option for you. In that case, you can now download the binary distribution that should work fine on most 64-bit Linux distributions. FUSE drivers are included for both FUSE2 and FUSE3.

Lots of smaller fixes & changes

See the Change Log for a full list of changes.

dwarfs - dwarfs-0.3.0-RC1

Published by mhx almost 4 years ago

dwarfs - dwarfs-0.2.4

Published by mhx almost 4 years ago

Fix --set-owner and --set-group options, which caused an exception to be thrown at the end of creating a file system. (fixes github #24)

dwarfs - dwarfs-0.2.3

Published by mhx almost 4 years ago
