Five months have passed since our last installment of “latest developments in unblob”, and five months is a lot for a project this young and active !
First, we wanted to let you know that we will be in Singapore this week for BlackHat Asia Arsenal. Don’t hesitate to stop by the Arsenal area on Friday morning to see unblob in action ! The team will also be in Vienna on June 1st for a one-hour workshop on unpacking firmware with unblob. Expect more of those workshops in the future, as we would like to train people to write handlers quickly and efficiently.
Let’s now dive into the biggest changes that happened in unblob over the last few months.
The PyPI Job
Publication to PyPI, so that users could simply run pip install unblob, was one of our top priorities for the first quarter. First, we had to add versioning to unblob so that users would not get lost. We agreed on calendar versioning since it’s something we already use for both our external and internal projects. We then had a merge request ready for publication to PyPI using poetry. That’s when things went south.
See, you can’t release a package to PyPI if any of your dependencies are pulled from a git repository. Three of unblob’s dependencies (jefferson, ubi-reader, and yaffshiv) were pulled from git because we needed to fork them and fix security issues.
- Jefferson (JFFS2 extractor) was never published to a package repository, so we had to adapt the code to do it. We took over Jefferson maintenance from our fork, with approval from the original author, and moved everything so that it uses Poetry as package manager and implements the same code checkers and linters that we use in unblob for code quality. There’s still a long way to go in terms of test coverage and overall quality, but at least the tooling is there if we want to do it. Thanks to those changes, Jefferson is now available on PyPI.
- ubi-reader, the UBI/UBIFS extractor, was already on PyPI but was missing some improvements we made within our fork. After some productive discussions with the maintainer, they merged our modifications upstream and we modified unblob so that it would rely on upstream ubi-reader. If you’re interested in open source dynamics and how things get solved when everyone is chill, look here 🙂
- yaffshiv, the YAFFS extractor, was our biggest roadblock to publication on PyPI. Upstream had not (and still has not) fixed the path traversal that we fixed in our fork and submitted upstream 6 months ago. We had two options: either port the extraction part to unblob, or publish our fork to PyPI as “yaffshiv-ng”. After a deeper look at yaffshiv, it turned out that it does not support YAFFS version 1 at all, and misses a lot of very important things when it comes to sound extraction (e.g. chunk serial in YAFFSv1 / chunk sequence ID in YAFFSv2). We decided to implement our own YAFFS extractor that supports YAFFS v1 and v2, demonstrating that with unblob, a YAFFS extractor can be written in less than 500 lines of code.
With those three previously git-based dependencies out of the way, we were ready to publish on PyPI and we did !
We had some initial issues with building wheels for different versions of Python while embedding the Rust part, so we moved it to unblob-native in order to have a pure Python package. We’ve also moved away from python-lzo due to the requirement for the installer to have a C compiler and liblzo headers. This is probably as lean as we can get.
This means everything is now available as pre-built wheels for Linux and macOS (on x86-64 and AARCH64 architectures), with support for Python 3.8 to 3.11 !
Antoine joined the unblob development team for a 15-week internship and did not disappoint ! With an average of one format released per week, he added handlers and extractors for the following formats: Netgear CHK, Netgear TRX, D-Link SHRS, D-Link encrpted_img (sic), Xiaomi HDR1, QNAP QTS, Instar BNEG, HP BDL, HP IPKG, and Instar HD. He’s currently working on the Expert Witness Compression Format used by forensics tools.
We’re sure Antoine has a bright future in this field and can’t wait to attend his thesis defense !
We mentioned Pyperscan in our last unblob blog post. It’s a Rust library with Python bindings for Vectorscan (the port of Hyperscan with multi-architecture support) built by our very own vlaci. I wanted to get back to it and mention that version 0.2.2 is now available, with a bunch of improvements driven by unblob’s needs.
Check it out if you ever need to perform extremely fast pattern matching in Python.
Multi-volume handling is something that was recently added to our roadmap. Multi-volume means a file (archived or compressed) that is spread across several files. It could be an archive file split into chunks of fixed size (as offered by 7-zip with the multipart/split option), or a file format made of an index file referencing other files.
Since unblob used to operate on single files, multi-volume handling is a significant architectural change. We’re planning on doing this by offering a DirectoryHandler class that acts similarly to a Handler. Instead of a bytes pattern, a directory handler defines a filename pattern. Whenever unblob traverses a new directory within the extraction directory, it will check if any of the files match one of the directory handler patterns. If there is a match, these files are fed to the directory handler so that it can perform the reconstruction and extraction/decompression. If the operation is successful, these files are not scanned by unblob in the traditional way.
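To make the idea more concrete, here is a minimal sketch of what a directory handler for fixed-size split files could look like. It is not unblob’s actual API: the class name SplitArchiveDirectoryHandler, its match and reconstruct methods, the process_directory helper, and the name.001 / name.002 naming scheme are all assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Dict, List, Set, Tuple


class SplitArchiveDirectoryHandler:
    """Hypothetical handler for files split as name.001, name.002, ..."""

    # A filename pattern instead of a bytes pattern: "<stem>.<three digits>"
    PATTERN = re.compile(r"^(?P<stem>.+)\.(?P<index>\d{3})$")

    def match(self, directory: Path) -> List[Path]:
        """Return the ordered parts of the first split set found, or []."""
        groups: Dict[str, List[Tuple[int, Path]]] = {}
        for path in directory.iterdir():
            m = self.PATTERN.match(path.name)
            if m and path.is_file():
                groups.setdefault(m["stem"], []).append((int(m["index"]), path))
        for stem in sorted(groups):
            parts = sorted(groups[stem], key=lambda item: item[0])
            if len(parts) > 1:  # a single part is not a multi-volume set
                return [path for _, path in parts]
        return []

    def reconstruct(self, parts: List[Path], outdir: Path) -> Path:
        """Concatenate the parts, in order, into a single file in outdir."""
        outfile = outdir / parts[0].with_suffix("").name
        with outfile.open("wb") as out:
            for part in parts:
                out.write(part.read_bytes())
        return outfile


def process_directory(directory: Path, handlers, outdir: Path) -> Set[Path]:
    """Return the files consumed by directory handlers.

    The caller would scan every other file in the traditional way.
    """
    consumed: Set[Path] = set()
    for handler in handlers:
        parts = handler.match(directory)
        if not parts:
            continue
        try:
            handler.reconstruct(parts, outdir)
        except OSError:
            continue  # reconstruction failed: leave the files to the scanner
        consumed.update(parts)
    return consumed
```

In this sketch, a failed reconstruction leaves the matched files untouched so that they still go through the regular per-file scan, which mirrors the "if the operation is successful" condition described above.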
You can follow our progress on multi-volume handling here.
Sasquatch is Cursed
When we initially developed the SquashFS handler for unblob, we chose sasquatch as third-party extractor because it was the de facto standard, probably thanks to its use by binwalk. As time went by, we came to the realization that sasquatch is cursed:
- In order to add support for SquashFS version 2, we had to fix a weird bug that probably appeared due to an impedance mismatch between squashfs-tools and the sasquatch patches.
- Since sasquatch is based on squashfs-tools version 4.3.x, we figured that we were leaving unblob users vulnerable to CVE-2021-41072. We therefore decided to rebase our sasquatch fork onto version 4.5.1.
- With this rebase, another impedance mismatch bug appeared: squashfs-tools returns a non-zero exit code on non-fatal exceptions since version 4.4, and sasquatch raises non-fatal exceptions when it enumerates the different compression implementations, especially with LZMA adaptive, it appears. The combination of both behaviors made it break on some D-Link firmware (thanks to the EMBA team for the heads-up). We implemented a fix in our fork, limiting the impact of not reporting non-fatal exceptions to a minimum.
I’m personally unsure whether we should keep pushing with sasquatch, especially with projects like backhand around the corner. Only time will tell.
Yes, we found a way to make unblob even faster. Previously, the scanner would run through the whole file and if a pattern was identified within an already identified and validated chunk, it would simply be discarded.
Discarding those chunks was resource intensive, especially with formats holding plaintext data like ExtFS, so we decided to use another approach: when a valid chunk is identified, we stop the scanner and make it start again right after the valid chunk.
Performance gains from this change will mostly be visible in files holding different chunks of concatenated data (e.g. a filesystem followed by a signature or fingerprint).
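The restart-after-chunk idea can be sketched as follows. This is a simplified illustration and not unblob’s scanner: it uses Python’s re module instead of Hyperscan, and the scan function, the validate callback, and the toy "BLOB" format (a 4-byte magic followed by a 1-byte payload length) are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

Chunk = Tuple[int, int]  # (start offset, end offset), end is exclusive


def scan(
    data: bytes,
    pattern: re.Pattern,
    validate: Callable[[bytes, int], Optional[int]],
) -> List[Chunk]:
    """Find valid chunks, resuming the search right after each valid chunk."""
    chunks: List[Chunk] = []
    offset = 0
    while offset < len(data):
        match = pattern.search(data, offset)
        if match is None:
            break
        start = match.start()
        end = validate(data, start)
        if end is not None and end > start:
            chunks.append((start, end))
            # Matches inside the validated chunk are never even looked at.
            offset = end
        else:
            # False positive: resume one byte after the rejected match.
            offset = start + 1
    return chunks


MAGIC = re.compile(rb"BLOB")


def validate_blob(data: bytes, start: int) -> Optional[int]:
    """Toy validator: magic, then one length byte, then the payload."""
    header_end = start + 5
    if header_end > len(data):
        return None
    end = header_end + data[start + 4]
    return end if end <= len(data) else None
```

With the previous approach, a magic found inside the payload of a valid chunk would have been matched first and discarded afterwards; here the search simply never visits those offsets.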
One Last Thing
Of course those big improvements should not overshadow all the nice improvements we added to the framework over the last 5 months:
- improved documentation & processes
- added StatReport for every Task
- return of non-POSIX path processing
- we moved to ruff as our linter (do try it, it’s great !)
- we changed the naming convention of extracted chunks
- fixed a regression in the ELF kernel initramfs extraction
- added more defensive programming to our safe tarfile implementation
- added a handler and extractor for Engenius firmwares
- delete the empty
- improved entropy calculation twice
What’s Next ?
As mentioned earlier, multi-volume handling will probably land soon, along with features that we initially mentioned in our previous blog post such as human readable output, metadata reporting, and chunk auto-identification (e.g. auto-detect padding).
We will continue working hard on unblob to keep it stable, fast, and safe to use while expanding the formats it supports. That is probably what led the EMBA maintainers to completely ditch binwalk in favor of unblob. This is a bold move, and it really means a lot to us !
We hope to see you in Singapore or in Vienna. Until then, we wish you a very pleasant Spring 🙂