Question

Is there a library in .NET that does multithreaded compression of a stream? I'm thinking of something like the built-in System.IO.GZipStream, but using multiple threads to perform the work (and thereby utilizing all the CPU cores).

I know that 7-Zip, for example, compresses using multiple threads, but the C# SDK that they've released doesn't seem to do that.


Solution

If you're using a non-parallelized algorithm, I think your best bet is to split the data stream into equal-sized parts yourself and launch threads to compress each part separately in parallel. Afterwards, a single thread concatenates the results into a single stream (you can write a stream class that continues reading from the next part's stream when the current one ends).

You may wish to take a look at SharpZipLib which is somewhat better than the intrinsic compression streams in .NET.

EDIT: You will need a header to tell where each new stream begins, of course. :)
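A minimal sketch of this approach using only the built-in GZipStream (the helper names `CompressChunk`, `CompressParallel`, and `Decompress` are made up for illustration): each chunk is compressed independently on the thread pool, and a 4-byte length prefix acts as the per-chunk header mentioned above, telling the reader where each sub-stream begins.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

// Compress one slice of the input independently of the others.
static byte[] CompressChunk(byte[] data, int offset, int count)
{
    using var ms = new MemoryStream();
    using (var gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
        gz.Write(data, offset, count);
    return ms.ToArray();
}

static byte[] CompressParallel(byte[] data, int chunkSize)
{
    int chunkCount = (data.Length + chunkSize - 1) / chunkSize;
    var tasks = Enumerable.Range(0, chunkCount)
        .Select(i => Task.Run(() => CompressChunk(
            data, i * chunkSize, Math.Min(chunkSize, data.Length - i * chunkSize))))
        .ToArray();
    Task.WaitAll(tasks);

    // Concatenate the pieces, prefixing each with its compressed length
    // so the decompressor knows where every sub-stream begins.
    using var outStream = new MemoryStream();
    foreach (var t in tasks)
    {
        byte[] chunk = t.Result;
        outStream.Write(BitConverter.GetBytes(chunk.Length), 0, 4);
        outStream.Write(chunk, 0, chunk.Length);
    }
    return outStream.ToArray();
}

static byte[] Decompress(byte[] compressed)
{
    using var outStream = new MemoryStream();
    int pos = 0;
    while (pos < compressed.Length)
    {
        int len = BitConverter.ToInt32(compressed, pos);
        pos += 4;
        using var gz = new GZipStream(
            new MemoryStream(compressed, pos, len), CompressionMode.Decompress);
        gz.CopyTo(outStream);
        pos += len;
    }
    return outStream.ToArray();
}

var input = new byte[1 << 20];
new Random(42).NextBytes(input);
var roundTripped = Decompress(CompressParallel(input, 256 * 1024));
Console.WriteLine(input.SequenceEqual(roundTripped)); // prints True
```

Note that each chunk carries its own gzip header and cannot reference redundancy in the other chunks, so the total output is slightly larger than single-threaded compression of the whole stream would be.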

OTHER TIPS

Found this library: http://www.codeplex.com/sevenzipsharp

Looks like it wraps the unmanaged 7z.dll, which does support multithreading. Obviously it's not ideal to have to wrap unmanaged code, but it looks like this is currently the only option out there.

I recently found a compression library that supports multithreaded bzip2 compression: DotNetZip. The nice thing about this library is that the ParallelBZip2OutputStream class derives from System.IO.Stream and takes a System.IO.Stream as output. This means you can create a chain of classes derived from System.IO.Stream like:

  • ICSharpCode.SharpZipLib.Tar.TarOutputStream
  • Ionic.BZip2.ParallelBZip2OutputStream (from the DotNetZip library)
  • System.Security.Cryptography.CryptoStream (for encryption)
  • System.IO.FileStream

In this case we create a .tar.bz2 file, encrypt it (perhaps with AES), and write it directly to a file.
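The same layering can be sketched with framework types alone. Here GZipStream stands in for the Tar and ParallelBZip2 stages (they plug in the same way, since all of these classes derive from System.IO.Stream and accept a Stream as output), and a MemoryStream stands in for the final FileStream:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;

// Compress, then encrypt, writing through a chain of Stream-derived classes.
byte[] Pack(byte[] payload, Aes aes)
{
    using var file = new MemoryStream();   // stands in for a FileStream
    using (var crypto = new CryptoStream(file, aes.CreateEncryptor(), CryptoStreamMode.Write))
    using (var gzip = new GZipStream(crypto, CompressionLevel.Optimal))
        gzip.Write(payload, 0, payload.Length);
    return file.ToArray();
}

// Reverse the chain: decrypt, then decompress.
byte[] Unpack(byte[] packed, Aes aes)
{
    using var file = new MemoryStream(packed);
    using var crypto = new CryptoStream(file, aes.CreateDecryptor(), CryptoStreamMode.Read);
    using var gzip = new GZipStream(crypto, CompressionMode.Decompress);
    using var result = new MemoryStream();
    gzip.CopyTo(result);
    return result.ToArray();
}

using var cipher = Aes.Create();  // random key + IV for this run
var payload = Enumerable.Repeat((byte)7, 100_000).ToArray();
var restored = Unpack(Pack(payload, cipher), cipher);
Console.WriteLine(restored.SequenceEqual(payload)); // prints True
```

Because every stage only sees a Stream, swapping GZipStream for Ionic.BZip2.ParallelBZip2OutputStream (or adding a TarOutputStream on top) is just another link in the chain.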

A compression format (though not necessarily the algorithm) needs to be aware of the fact that you can use multiple threads. Or rather, not so much that you use multiple threads, but that you're compressing the original data in multiple steps, parallel or otherwise.

Let me explain.

Most compression algorithms compress data sequentially: new data is compressed using information learned from the data that has already been compressed. For instance, if you're compressing a book by a bad author who uses the same words, clichés, and sentences over and over, by the time the compression algorithm reaches the second and later occurrences of those things, it will usually compress them better than it compressed the first occurrence.

However, a side effect of this is that you can't really splice two compressed files together without decompressing both and recompressing them as one stream. The knowledge gained from one file wouldn't match the other file.

The solution, of course, is to tell the decompression routine: "Hey, I just switched to an altogether new data stream, please start building up knowledge about the data from scratch."

If the compression format supports such a reset marker, you can easily compress multiple parts at the same time.

For instance, a 1 GB file could be split into four 256 MB parts; you compress each part on a separate core and then splice the results together at the end.

If you're building your own compression format, you can of course build support for this yourself.

Whether .zip, .rar, or any of the other well-known compression formats can support this, I don't know, but I do know that the .7z format can.

Normally I would suggest trying Intel Parallel Studio, which lets you develop code specifically targeted at multi-core systems, but for now it supports C/C++ only. Maybe create the library in C/C++ and call it from your C# code?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow