RecursiveExtractor is a cross-platform .NET Standard 2.0 library and command-line tool for parsing archive files and disk images, including any nested combination of the supported formats. It is released under the MIT License.
Supported formats: 7zip+, ar, bzip2, deb, dmg**, gzip, iso, rar^, tar, vhd, vhdx, vmdk, wim*, xzip, and zip+.
dotnet tool install -g Microsoft.CST.RecursiveExtractor.Cli
This adds RecursiveExtractor to your path so you can run it directly from your shell.
Basic usage is: RecursiveExtractor --input archive.ext --output outputDirectory
For example, to extract only ".cs" files:
RecursiveExtractor --input archive.ext --output outputDirectory --allow-globs **/*.cs
Run RecursiveExtractor --help for more details.
Recursive Extractor is available on NuGet as Microsoft.CST.RecursiveExtractor. Recursive Extractor targets netstandard2.0+ and the latest .NET, currently .NET 6.0, .NET 7.0 and .NET 8.0.
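For library consumers, the package can be added with the standard NuGet tooling, for example from the command line:

```shell
dotnet add package Microsoft.CST.RecursiveExtractor
```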
The most basic usage is to enumerate through all the files in the archive provided and do something with their contents as a Stream.
using Microsoft.CST.RecursiveExtractor;
var path = "path/to/file";
var extractor = new Extractor();
foreach(var file in extractor.Extract(path))
{
doSomething(file.Content); //Do Something with the file contents (a Stream)
}
using Microsoft.CST.RecursiveExtractor;
var extractor = new Extractor();
var extractorOptions = new ExtractorOptions()
{
ExtractSelfOnFail = true,
};
extractor.ExtractToDirectory("outputDirectory", "path/to/archive", extractorOptions);
var path = "/Path/To/Your/Archive";
var extractor = new Extractor();
try {
IAsyncEnumerable<FileEntry> results = extractor.ExtractFileAsync(path);
await foreach(var found in results)
{
Console.WriteLine(found.FullPath);
}
}
catch(OverflowException)
{
// This means Recursive Extractor has detected a Quine or Zip Bomb
}
public Stream Content { get; }
public string FullPath { get; }
public string Name { get; }
public FileEntry? Parent { get; }
public string? ParentPath { get; }
public DateTime CreateTime { get; }
public DateTime ModifyTime { get; }
public DateTime AccessTime { get; }
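As a sketch of how these properties might be used (assuming the Microsoft.CST.RecursiveExtractor package is referenced; the archive path is illustrative):

```csharp
using System;
using System.IO;
using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
foreach (var entry in extractor.Extract("path/to/archive"))
{
    // FullPath includes the paths of any containing archives
    Console.WriteLine($"{entry.FullPath} (modified {entry.ModifyTime})");

    // Content is a Stream; wrap it in a reader if the entry is text
    using var reader = new StreamReader(entry.Content);
    Console.WriteLine(reader.ReadLine());
}
```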
var path = "/Path/To/Your/Archive";
var extractor = new Extractor();
try {
IEnumerable<FileEntry> results = extractor.ExtractFile(path, new ExtractorOptions()
{
Passwords = new Dictionary<Regex, List<string>>()
{
{ new Regex(@"\.zip"), new List<string>(){ "PasswordForZipFiles" } },
{ new Regex(@"\.7z"), new List<string>(){ "PasswordFor7zFiles" } },
{ new Regex(".*"), new List<string>(){ "PasswordForAllFiles" } }
}
});
foreach(var found in results)
{
Console.WriteLine(found.FullPath);
}
}
catch(OverflowException)
{
// This means Recursive Extractor has detected a Quine or Zip Bomb
}
RecursiveExtractor protects against ZipSlip, Quines, and Zip Bombs.
Calls to Extract will throw an OverflowException when a Quine or Zip Bomb is detected, and a TimeoutException if EnableTiming is set and the specified time period has elapsed before completion. Otherwise, invalid files found while crawling will emit a logger message and be skipped. You can also enable ExtractSelfOnFail to return the original archive file on an extraction failure.
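A minimal sketch of guarding an extraction with these options and exceptions (assuming Timeout is the companion TimeSpan setting to EnableTiming on ExtractorOptions):

```csharp
using System;
using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
var opts = new ExtractorOptions()
{
    EnableTiming = true,
    Timeout = TimeSpan.FromMinutes(5), // assumed companion option to EnableTiming
    ExtractSelfOnFail = true
};
try
{
    foreach (var entry in extractor.Extract("path/to/archive", opts))
    {
        Console.WriteLine(entry.FullPath);
    }
}
catch (OverflowException)
{
    // Quine or Zip Bomb detected
}
catch (TimeoutException)
{
    // Extraction exceeded the configured time period
}
```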
You should not iterate the enumeration returned from the Extract and ExtractAsync interfaces multiple times. If you need to do so, convert the enumeration to an in-memory collection first.
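For example, one way to safely reuse the results is to materialize them once up front (a sketch; ToList is standard LINQ):

```csharp
using System.Linq;
using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
// Materialize once so the archive is only extracted a single time
var entries = extractor.Extract("path/to/archive").ToList();
var count = entries.Count;                // first pass
var names = entries.Select(e => e.Name);  // further passes over the list are safe
```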
If you want to enumerate the output with parallelization, you should use a batching mechanism. For example:
var extractedEnumeration = extractor.Extract(fileEntry, opts);
using var enumerator = extractedEnumeration.GetEnumerator();
using var cts = new CancellationTokenSource(); // used to cancel the parallel batch processing below
ConcurrentBag<FileEntry> entryBatch = new();
bool moreAvailable = enumerator.MoveNext();
while (moreAvailable)
{
entryBatch = new();
for (int i = 0; i < BatchSize; i++)
{
entryBatch.Add(enumerator.Current);
moreAvailable = enumerator.MoveNext();
if (!moreAvailable)
{
break;
}
}
if (entryBatch.Count == 0)
{
break;
}
// Run your parallel processing on the batch
Parallel.ForEach(entryBatch, new ParallelOptions() { CancellationToken = cts.Token }, entry =>
{
// Do something with each FileEntry
});
}
If you are working with a very large archive or in a particularly constrained environment, you can reduce memory and file handle usage for the Content streams in each FileEntry by disposing of them as you iterate.
var results = extractor.Extract(path);
foreach(var file in results)
{
using var theStream = file.Content;
// Do something with the stream.
_ = theStream.ReadByte();
// The stream is disposed here by the using statement
}
If you have any issues or feature requests (for example, supporting other formats) you can open a new Issue.
If you are having trouble parsing a specific archive in one of the supported formats, it is helpful to include a sample archive with your report that demonstrates the issue.
Recursive Extractor aims to provide a unified interface to extract arbitrary archives and relies on a number of libraries to parse the archives.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.