o Split the compression across several worker threads. By default, "several"
matches the number of CPUs, capped at 24 for sanity when running on very
large hardware. Provide an option to set that number manually;
o Fix a bug inherited from mkulzma (R.I.P.) which degraded the already slow
LZMA compression even further by calling the function that releases the
compression state after processing each block.
Releasing the state per block is neither documented as required nor
actually required by the LZMA library. It caused a spree of system calls to
release memory and then map it again for every block. LZMA compression is
more than 2x faster after this change alone;
o Record the time it takes to do the compression and report the throughput
achieved;
o Add a simple first-level 256-entry hash table to the de-dup code, so that
it does not become a bottleneck on big files.
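
For illustration only, a first-level 256-entry table of this kind might look
roughly like the sketch below. It is not the code from this change: the
names (blk_table, blk_entry, blk_lookup, blk_insert), the 16-byte digest
size, and the choice of the first digest byte as the bucket index are all
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGEST_LEN 16   /* assumed digest size; hypothetical */
#define NBUCKETS   256  /* one bucket per value of the first digest byte */

/* One previously seen block (hypothetical layout). */
struct blk_entry {
	uint8_t digest[DIGEST_LEN];
	uint64_t offset;		/* where that block was written */
	struct blk_entry *next;		/* chain within the same bucket */
};

struct blk_table {
	struct blk_entry *bucket[NBUCKETS];
};

/* Return the entry whose digest matches, or NULL if there is none. */
static struct blk_entry *
blk_lookup(const struct blk_table *t, const uint8_t *digest)
{
	struct blk_entry *e;

	assert(t != NULL && digest != NULL);
	/* Only the chain selected by the first byte is scanned. */
	for (e = t->bucket[digest[0]]; e != NULL; e = e->next) {
		if (memcmp(e->digest, digest, DIGEST_LEN) == 0)
			return (e);
	}
	return (NULL);
}

/* Record a new block; returns 0 on success, -1 if allocation fails. */
static int
blk_insert(struct blk_table *t, const uint8_t *digest, uint64_t offset)
{
	struct blk_entry *e;

	assert(t != NULL && digest != NULL);
	e = malloc(sizeof(*e));
	if (e == NULL)
		return (-1);
	memcpy(e->digest, digest, DIGEST_LEN);
	e->offset = offset;
	/* Push onto the head of the bucket's chain. */
	e->next = t->bucket[digest[0]];
	t->bucket[digest[0]] = e;
	return (0);
}
```

With a table like this, a lookup walks only the entries that share the same
first digest byte instead of every block seen so far, which for evenly
distributed digests cuts the average scan length by roughly a factor of 256.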