thunderdup - Fast concurrent Linux file deduplicator
How to use:
$ time thunderdup
Scanning ...
2024-05-03 06:58:10:
unique: 276 173 MiB
duplicate: 2 213 KiB
queue length: 0
currently working workers: 0/192
Deduplicating ...
deduplicating: docs/examples/example-folder/ipfs.paper.draft3.pdf
- docs/examples/example-folder/test-dir/ipfs.paper.draft3.pdf
deduplicating: .git/hooks/pre-rebase.sample
- test/sharness/lib/sharness/.git/hooks/pre-rebase.sample
total dedupped: 426 KiB
dedupping errors: 0
________________________________________________________
Executed in 73.64 millis fish external
usr time 179.56 millis 621.00 micros 178.94 millis
sys time 83.87 millis 80.00 micros 83.79 millis
This is a non-incremental file deduplicator, tested on btrfs.
With go installed:
go install github.com/Jorropo/thunderdup@latest
Or run as a one-shot script:
go run github.com/Jorropo/thunderdup@latest
thunderdup vs bees
I was using bees, but it wasn't fitting my use case very well.
Advantages over bees:
btrfs fi defrag.
Disadvantages over bees:
:
bees uses a probabilistic hash table, which lets it use a fixed amount of memory at the cost of deduplication accuracy. thunderdup stores every scanned file and its hash in memory, so it will crash if you have too many files for your amount of RAM.
bees on previous kernels.
bees can dedup files which only have partial overlaps.
*needs investigation to make sure this doesn't work by accident; I tried it once and it worked properly.
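To illustrate the memory trade-off described above, here is a minimal Go sketch of an exact in-memory index: every scanned path is kept under the hash of its content, so memory use grows with the number of files. The names hashFile, index, add and duplicates are hypothetical, and SHA-256 is an assumption; this is not thunderdup's actual code.

```go
package main

import (
	"crypto/sha256"
	"io"
	"os"
	"sort"
)

// digest is the map key. SHA-256 is an assumption made for this sketch.
type digest [sha256.Size]byte

// hashFile opens path read-only and returns the SHA-256 of its content.
func hashFile(path string) (digest, error) {
	var d digest
	f, err := os.Open(path) // os.Open never opens for writing
	if err != nil {
		return d, err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return d, err
	}
	copy(d[:], h.Sum(nil))
	return d, nil
}

// index maps each content hash to every path seen with that content.
// Nothing is ever evicted, so memory grows with the number of files.
type index map[digest][]string

// add records that path has content hash d.
func (ix index) add(path string, d digest) {
	ix[d] = append(ix[d], path)
}

// duplicates returns every group of two or more paths that share a hash.
// Paths are sorted inside each group, and groups are sorted by first path.
func (ix index) duplicates() [][]string {
	var groups [][]string
	for _, paths := range ix {
		if len(paths) < 2 {
			continue
		}
		g := append([]string(nil), paths...)
		sort.Strings(g)
		groups = append(groups, g)
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i][0] < groups[j][0] })
	return groups
}
```

A scanner built on this would call hashFile for each file, add the result to the index, and then hand each group returned by duplicates to the deduplication step.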
thunderdup is written in a memory-safe language (Go) and opens all files in read-only mode. Deduplication happens through Linux's FileDedupeRange call (the FIDEDUPERANGE ioctl), which atomically compares file content in the kernel.
This creates reflinks, which are copy-on-write: the files share the same on-disk storage, and when one of them is modified the changed regions are written to a new location, so the other files are not affected.
Assuming there are no bugs in the kernel, the worst that can happen is that a dedup does not happen where it should have; it can't corrupt or change the content of your files.
It is also possible to have a bug in Go or thunderdup itself, but that is less likely.
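As a rough sketch of the deduplication step described above, the following Go code issues the FIDEDUPERANGE ioctl on two open files using only the standard library. The names dedupeRange, dedupeRequest and dedupeInfo are hypothetical, and the struct layouts are transcribed from <linux/fs.h>. A real tool would loop, because the kernel may deduplicate fewer bytes than requested. This is an illustration under those assumptions, not thunderdup's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"runtime"
	"syscall"
	"unsafe"
)

// fiDedupeRange is the FIDEDUPERANGE request number from <linux/fs.h>:
// _IOWR(0x94, 54, struct file_dedupe_range).
const fiDedupeRange = 0xC0189436

// Per-destination status values reported by the kernel.
const (
	rangeSame    = 0 // FILE_DEDUPE_RANGE_SAME
	rangeDiffers = 1 // FILE_DEDUPE_RANGE_DIFFERS
)

var errDiffers = errors.New("file contents differ, nothing deduplicated")

// dedupeInfo mirrors struct file_dedupe_range_info.
type dedupeInfo struct {
	destFd       int64
	destOffset   uint64
	bytesDeduped uint64
	status       int32
	reserved     uint32
}

// dedupeRequest mirrors struct file_dedupe_range followed by exactly
// one destination entry.
type dedupeRequest struct {
	srcOffset uint64
	srcLength uint64
	destCount uint16
	reserved1 uint16
	reserved2 uint32
	info      dedupeInfo
}

// requestSize is the size in bytes of the request passed to the kernel.
const requestSize = unsafe.Sizeof(dedupeRequest{})

// dedupeRange asks the kernel to compare length bytes at offset in src
// and dst and, only if they are identical, make dst share src's extents.
// It returns how many bytes the kernel actually deduplicated.
// Whether a read-only dst is accepted depends on the kernel version and
// on file ownership (assumption: a recent kernel, files owned by caller).
func dedupeRange(src, dst *os.File, offset, length uint64) (uint64, error) {
	req := dedupeRequest{
		srcOffset: offset,
		srcLength: length,
		destCount: 1,
		info:      dedupeInfo{destFd: int64(dst.Fd()), destOffset: offset},
	}
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, src.Fd(),
		fiDedupeRange, uintptr(unsafe.Pointer(&req)))
	// Keep both files open until the ioctl has returned.
	runtime.KeepAlive(src)
	runtime.KeepAlive(dst)
	if errno != 0 {
		return 0, fmt.Errorf("FIDEDUPERANGE: %w", errno)
	}
	switch {
	case req.info.status == rangeDiffers:
		return 0, errDiffers
	case req.info.status < 0:
		// A negative status is a negated errno for this destination.
		return 0, fmt.Errorf("FIDEDUPERANGE: %w", syscall.Errno(-req.info.status))
	}
	return req.info.bytesDeduped, nil
}
```

To try it, open both files with os.Open and call dedupeRange(src, dst, 0, size). On a filesystem without reflink support the call returns an error and both files are left untouched. The golang.org/x/sys/unix package provides IoctlFileDedupeRange as a ready-made wrapper for the same ioctl; the sketch avoids it only to stay dependency-free.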