Five months have passed since our last installment of “latest developments in unblob”, and five months is a lot for a project this young and active!
First, we wanted to let you know that we will be present in Singapore this week for BlackHat Asia Arsenal. Don’t hesitate to pass by the Arsenal area on Friday morning to see unblob in action! The team will also be in Vienna on June 1st for a one-hour workshop on unpacking firmware with unblob. Expect more of those workshops in the future, as we would like to train people to write handlers quickly and efficiently.
We also wanted to thank our very first external contributors. It’s an important milestone for us as we really want to involve the infosec community in general with this project.
Let’s now dive into the biggest changes that happened in unblob over the last few months.
Publishing to PyPI, so that users can simply run pip install unblob, was one of our top priorities for the first quarter. First, we had to add versioning to unblob so that users would not get lost. We agreed on calendar versioning since it’s something we already use for both our external and internal projects. We then had a merge request ready for publication to PyPI using poetry. That’s when things went downhill.
See, you can’t release a package to PyPI if any of your dependencies are pulled from a git repository. Three of unblob’s dependencies (jefferson, ubi-reader, and yaffshiv) were pulled this way because we needed to fork them and fix security issues.
With those three previously git-based dependencies out of the way, we were ready to publish on PyPI, and we did!
We had some initial issues with building wheels for different versions of Python while embedding the Rust part, so we moved it to unblob-native in order to have a pure Python package. We’ve also moved away from python-lzo due to the requirement for the installer to have a C compiler and liblzo headers. This is probably as lean as we can get.
This means everything is now available as pre-built wheels for Linux and macOS (on x86-64 and AArch64 architectures), with support for Python 3.8 to 3.11!
Antoine joined the unblob development team for a 15-week internship and did not disappoint! With an average of one format released per week, he added handlers and extractors for the following formats: Netgear CHK, Netgear TRX, D-Link SHRS, D-Link encrpted_img (sic), Xiaomi HDR1, QNAP QTS, Instar BNEG, HP BDL, HP IPKG, and Instar HD. He’s currently working on the Expert Witness Compression Format used by forensics tools.
We’re sure Antoine has a bright future in this field and can’t wait to attend his thesis defense!
We mentioned Pyperscan in our last unblob blog post. It’s a Rust library with Python bindings for Vectorscan (the port of Hyperscan with multi-architecture support) built by our very own vlaci. I wanted to get back to it and mention that version 0.2.2 is now available, with a bunch of improvements driven by unblob’s needs.
Check it out if you ever need to perform extremely fast pattern matching in Python.
Multi-volume handling is something that was recently added to our roadmap. Multi-volume means a file (archived or compressed) that is spread across several files. It could be an archive split into chunks of fixed size (as offered by 7-zip with the multipart/split option), or a file format made of an index file referencing other files.
Since unblob used to operate on single files, multi-volume handling is a significant architectural change. We’re planning on doing this by offering a DirectoryHandler class that acts in a similar way to a Handler. Instead of a bytes pattern, a directory handler defines a filename pattern. Whenever unblob traverses a new directory within the extraction directory, it will check whether any of the files match one of the directory handler patterns. If there is a match, these files are fed to the directory handler so that it can perform the reconstruction and extraction/decompression. If the operation is successful, these files are not scanned by unblob in the traditional way.
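To make the idea more concrete, here is a minimal sketch of what filename-based matching and reconstruction could look like for an archive split into numbered parts. It is not unblob’s actual DirectoryHandler API: the function names (group_volumes, reassemble, handle_directory) and the ".001, .002, ..." naming scheme are assumptions made purely for illustration.

```python
import re
from pathlib import Path
from typing import Dict, List, Set

# Hypothetical filename pattern: parts named "firmware.7z.001", "firmware.7z.002", ...
# It plays the role that a bytes pattern plays for a regular handler.
PART_PATTERN = re.compile(r"^(?P<stem>.+)\.(?P<index>\d{3})$")


def part_index(path: Path) -> int:
    """Return the numeric part index encoded in a matching filename."""
    match = PART_PATTERN.match(path.name)
    assert match is not None
    return int(match["index"])


def group_volumes(directory: Path) -> Dict[str, List[Path]]:
    """Group the files of `directory` that match the pattern, ordered by part index."""
    groups: Dict[str, List[Path]] = {}
    for path in directory.iterdir():
        if path.is_file() and PART_PATTERN.match(path.name):
            stem = PART_PATTERN.match(path.name)["stem"]
            groups.setdefault(stem, []).append(path)
    for parts in groups.values():
        parts.sort(key=part_index)
    return groups


def reassemble(parts: List[Path], output: Path) -> None:
    """Rebuild the original file by concatenating the parts in order."""
    with output.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())


def handle_directory(directory: Path, out_dir: Path) -> Set[Path]:
    """Reassemble every multi-volume group found in `directory`.

    Returns the set of files that were consumed, so that a caller could
    skip them instead of scanning each part individually.
    """
    consumed: Set[Path] = set()
    for stem, parts in group_volumes(directory).items():
        reassemble(parts, out_dir / stem)
        consumed.update(parts)
    return consumed
```

In this sketch, the returned set stands in for the "not scanned in the traditional way" behaviour described above: anything the directory handler consumed successfully would be excluded from the regular per-file scan.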
You can follow our progress on multi-volume handling here.
When we initially developed the SquashFS handler for unblob, we chose sasquatch as third-party extractor because it was the de facto standard, probably thanks to its use by binwalk. As time went by, we came to the realization that sasquatch is cursed:
I’m personally unsure whether we should keep pushing with sasquatch, especially with projects like backhand around the corner. Only time will tell.
Yes, we found a way to make unblob even faster. Previously, the scanner would run through the whole file and if a pattern was identified within an already identified and validated chunk, it would simply be discarded.
Discarding those chunks was resource intensive, especially with formats holding plaintext data like ExtFS, so we decided to use another approach: when a valid chunk is identified, we stop the scanner and make it start again right after the valid chunk.
Performance gains from this change will mostly be visible in files holding different chunks of concatenated data (e.g. a filesystem followed by a signature or fingerprint).
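The restart-after-valid-chunk strategy can be sketched as follows. This is an illustration only: unblob scans with Hyperscan patterns through pyperscan, whereas this sketch swaps in Python’s re module, and the toy chunk format (a "BLOB" magic followed by a one-byte payload length) as well as the scan and validate_toy_chunk names are assumptions, not unblob code.

```python
import re
from typing import Callable, List, Optional, Tuple

Chunk = Tuple[int, int]  # (start offset, end offset), end exclusive

MAGIC = b"BLOB"  # hypothetical magic of a toy format: MAGIC + 1-byte length + payload


def validate_toy_chunk(data: bytes, start: int) -> Optional[int]:
    """Return the end offset of a valid toy chunk starting at `start`, else None."""
    length_offset = start + len(MAGIC)
    if length_offset >= len(data):
        return None  # truncated header
    end = length_offset + 1 + data[length_offset]
    return end if end <= len(data) else None


def scan(
    data: bytes,
    pattern: re.Pattern,
    validate: Callable[[bytes, int], Optional[int]],
) -> List[Chunk]:
    """Find valid chunks, restarting the search right after each valid chunk."""
    chunks: List[Chunk] = []
    pos = 0
    while pos < len(data):
        match = pattern.search(data, pos)
        if match is None:
            break
        end = validate(data, match.start())
        if end is None:
            # False positive: resume just after the start of this match.
            pos = match.start() + 1
        else:
            chunks.append((match.start(), end))
            # Matches inside the validated chunk are never even produced,
            # so there is nothing to discard afterwards.
            pos = end
    return chunks


# Two concatenated chunks; the first payload itself contains the magic bytes,
# and the trailing magic is a truncated (invalid) chunk.
data = b"xx" + MAGIC + bytes([8]) + b"BLOBabcd" + MAGIC + bytes([1]) + b"z" + MAGIC
found = scan(data, re.compile(re.escape(MAGIC)), validate_toy_chunk)
```

Here the magic bytes embedded in the first payload never reach the validation step, because the search resumes at the end of the chunk that contains them.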
Of course, those big changes should not overshadow all the smaller improvements we added to the framework over the last five months:
_extract directories
As mentioned earlier, multi-volume handling will probably land soon, along with features that we initially mentioned in our previous blog post, such as human-readable output, metadata reporting, and chunk auto-identification (e.g. auto-detecting padding).
We will continue working hard on unblob to keep it stable, fast, and safe to use while expanding the formats it supports. This is probably what led the EMBA maintainers to completely ditch binwalk in favor of unblob. This is a bold move, and it really means a lot to us!
We hope to see you in Singapore or in Vienna. Until then, we wish you a very pleasant Spring 🙂