Happy new year and best wishes from the unblob maintainer team! It’s been too long since we last talked about unblob, so it’s about time we introduce you to the new features, bug fixes, and new formats that landed in unblob in the second half of 2023.
We had the opportunity to present unblob at Black Hat Asia 2023 in Singapore and Black Hat Europe 2023 in London, where we also gave a workshop on “Hands-on Firmware Extraction, Exploration, and Emulation”. If you’re interested, the slides and accompanying material are available on GitHub.
Speaking of GitHub, we’re very close to 2,000 stars on the unblob repository. Don’t hesitate to star the project, since it helps it gain visibility!
Let’s now dive into the biggest changes that happened in unblob over the last few months.
We’ve worked on more than a hundred unblob handlers (both public and private), including dedicated extractors written in Python. Since we’re well aware of the risk of path traversal, we kept writing the same protective boilerplate code over and over again.
This led to the introduction of the FileSystem class in unblob core. FileSystem is a class that you can use in an unblob Extractor so that you don’t have to think about potential path traversals.
The idea is that you instantiate a FileSystem object by giving it a Path to the directory you want to restrict file operations to. From there, you can use the different FileSystem methods to create directories, files, symlinks, or hardlinks. The API takes care of checking that everything is right and stores errors that you can consume later on.
An unblob extractor relying on the FileSystem object cannot be coerced by a malicious file into creating files or directories outside of the extraction directory, or links pointing outside of it. If a malicious file attempts to do so, the FileSystem object creates and stores a report in its problems attribute, which unblob later consumes to generate the JSON report and to display the problems on the console.
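To make the idea concrete, here is a minimal sketch of the kind of check such a sandbox performs. It is not unblob's implementation: the SandboxedWriter class, its write_bytes method, and the string-based problem records are hypothetical names chosen for illustration. Only the general technique (resolve the target path, then verify it is still under the root) is what the paragraphs above describe.

```python
from pathlib import Path
from typing import List, Optional


class SandboxedWriter:
    """Hypothetical helper (not unblob's FileSystem): confines writes to a root directory."""

    def __init__(self, root: Path):
        # Resolve once so later comparisons are done on canonical paths.
        self.root = root.resolve()
        self.problems: List[str] = []

    def _safe_path(self, relative: Path) -> Optional[Path]:
        # Joining with an absolute path discards the root, and ".." segments
        # can climb out of it, so resolve first and compare afterwards.
        candidate = (self.root / relative).resolve()
        if candidate.is_relative_to(self.root):  # Python 3.9+
            return candidate
        self.problems.append(f"path traversal attempt: {relative}")
        return None

    def write_bytes(self, relative: Path, content: bytes) -> bool:
        target = self._safe_path(relative)
        if target is None:
            return False  # the problem was recorded, nothing is written
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        return True
```

An extractor built on a helper like this never needs its own traversal checks: it calls write_bytes with whatever name the archive provides and inspects problems once it is done.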
Outside of the security implications, having a single and easy-to-use interface to interact with the filesystem really helps in simplifying extractor code. For example, here’s the IPKG extractor code moving away from boilerplate to using FileSystem:
     def extract(self, inpath: Path, outdir: Path):
         entries = []
+        fs = FileSystem(outdir)
         with File.from_path(inpath) as file:
             header = self._struct_parser.parse("ipkg_header_t", file, Endian.LITTLE)
             file.seek(header.toc_offset, io.SEEK_SET)
@@ -64,28 +77,18 @@ class HPIPKGExtractor(Extractor):
                 entry_path = Path(snull(entry.name).decode("utf-8"))
                 if entry_path.parent.name:
                     raise InvalidInputFormat("Entry name contains directories.")
-                if not is_safe_path(outdir, entry_path):
-                    logger.warning(
-                        "Path traversal attempt, discarding.",
-                        outdir=outdir,
-                    )
-                    continue
                 entries.append(
                     (
-                        outdir.joinpath(outdir / entry_path.name),
-                        Chunk(
-                            start_offset=entry.offset,
-                            end_offset=entry.offset + entry.size,
-                        ),
+                        Path(entry_path.name),
+                        entry.offset,
+                        entry.size,
                     )
                 )
-            for carve_path, chunk in entries:
-                carve_chunk_to_file(
-                    file=file,
-                    chunk=chunk,
-                    carve_path=carve_path,
-                )
+            for carve_path, start_offset, size in entries:
+                fs.carve(carve_path, file, start_offset, size)
+
+        return ExtractResult(reports=list(fs.problems))
The code is easier to understand and if we ever spot a bug in our sandboxing implementation, there’s just one place where we need to fix it, rather than a multitude of handlers.
Of course, there’s no obligation to rely on the FileSystem API if you write an Extractor. However, we have a policy that merge requests to unblob with an Extractor that does not use FileSystem won’t be approved.
We initially developed unblob with our specific use case in mind, that is, a firmware extractor running in a pipeline where we need visibility. This meant unblob printed lots of logs to the console, which threw early users off.
The lack of human-readable output was one of the points that came up the most when talking with unblob users, so we listened.
Now, if you run unblob without verbose flags (-v), it displays a progress bar as it walks through the different layers, and then provides you with a nice summary of the extraction process.
Yes, the progress bar can go backwards in a very Windows XP-like manner. This is because unblob does not know in advance what will be in the next layer 🙂
One of the core benefits of unblob is the ability to identify unknown chunks of data. A good part of these unknown chunks is usually padding made of the same byte (e.g., 0x00 or 0xFF) between two valid chunks (a kernel and a filesystem, for example). Padding is not really relevant for analysts, so we wanted a way for unblob to inspect those unknown chunks and label them as padding.
So now, whenever you extract something, if an unknown chunk is actually just padding, the carved chunk’s extension is set to “padding” and the chunk is reported as such in the JSON report.
Of course, the introduction of this feature paves the way for identifying other types of unknown chunks. We could, for example, check if an unknown chunk is actually the MD5 hash of the previous chunk of data.
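The padding check itself is simple to picture. The sketch below is an illustration only, with hypothetical function names (is_padding, label_unknown_chunk); it assumes padding means a non-empty chunk made of one repeated byte value, as described above, and does not reflect how unblob implements or names the check.

```python
def is_padding(chunk: bytes) -> bool:
    """Return True if the chunk is non-empty and made of a single repeated byte."""
    # bytes.count accepts an integer byte value, so this counts how many
    # bytes are equal to the first one.
    return len(chunk) > 0 and chunk.count(chunk[0]) == len(chunk)


def label_unknown_chunk(chunk: bytes) -> str:
    """Pick a label for an unknown chunk (hypothetical label names)."""
    return "padding" if is_padding(chunk) else "unknown"
```

For example, label_unknown_chunk(b"\xff" * 4096) yields "padding", while a chunk that mixes byte values stays "unknown".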
Some users in the community do not need the full features of unblob and just want to “scan” the file, that is, have unblob report what it identifies in the file and stop there. We’ve introduced a few changes in unblob so that the extraction directory is not created in scan-only mode, and the UI shows a nice summary of what was identified.
Weird file formats have been a constant for the team since the inception of unblob, and the second half of 2023 also brought its fair share of them 🙂
We encountered two challenges with tar archives: archives with sparse files, and the v7 tar format (aka Unix-compatible tar files).
The initial tar handler assumed that none of the files were sparse. A sparse file is stored in an optimized representation where some of its “empty” content (runs of null bytes) is removed. This is problematic because a sparse file has two sizes: the original file size and the stored (sparse) size. Computing the end offset of a tar archive from the original file size overshoots the end of the archive, since the archive holds fewer bytes than the original file.
v7 tar file headers do not have the ustar magic that we match on for modern tar files, which is a pretty easy magic to match on. To match those v7 archives, we had to build a regular expression that matches the mode, uid, gid, mtime, and size fields based on their properties, such as the ASCII encoding of octal values. This was not easy, but we managed to find a way to do it with less than 2% impact on processing speed!
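As a rough illustration of matching on field properties rather than on a magic, the sketch below checks the numeric fields that follow the 100-byte name field in a tar header. The pattern and the looks_like_v7_header function are assumptions made for this example and are not the rule unblob ships. The field layout (mode, uid, and gid on 8 bytes each, then size and mtime on 12 bytes each) is the standard tar header layout.

```python
import re

HEADER_SIZE = 512
NAME_SIZE = 100
NUMERIC_FIELDS_SIZE = 8 + 8 + 8 + 12 + 12

# Hypothetical pattern: each numeric field holds ASCII octal digits,
# optionally space padded, closed by a space or NUL terminator.
V7_NUMERIC_FIELDS = re.compile(
    rb"[0-7 ]{7}[ \x00]"    # mode, 8 bytes
    rb"[0-7 ]{7}[ \x00]"    # uid, 8 bytes
    rb"[0-7 ]{7}[ \x00]"    # gid, 8 bytes
    rb"[0-7 ]{11}[ \x00]"   # size, 12 bytes
    rb"[0-7 ]{11}[ \x00]"   # mtime, 12 bytes
)


def looks_like_v7_header(block: bytes) -> bool:
    """Heuristic check that a 512-byte block could be a v7 tar header."""
    if len(block) < HEADER_SIZE:
        return False
    if block[0] == 0:
        return False  # a header needs a non-empty file name
    numeric = block[NAME_SIZE:NAME_SIZE + NUMERIC_FIELDS_SIZE]
    return V7_NUMERIC_FIELDS.fullmatch(numeric) is not None
```

A heuristic like this produces false positives on its own, so a real handler would typically confirm a candidate further, for instance by verifying the header checksum.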
We fixed a small bug in the extfs extraction implementation and also identified an endless loop bug in e2fsprogs. We reported the bug to the e2fsprogs maintainers through different channels, but it is still not fixed at the time of writing. That’s why we forked e2fsprogs, fixed the bug, and moved unblob to our fork.
A FAT filesystem is made of the following four parts: the Reserved Sectors, the FAT Region, the Root Directory Region, and the Data Region.
Since the first three parts have a fixed size for a given FAT filesystem, unblob just needs to compute the Data Region size in order to find the end offset of the filesystem. The Data Region contains a set number of clusters, which themselves contain a set number of sectors. The number of clusters in the Data Region, the number of sectors per cluster, and the sector size are all stored in the filesystem information sector within the Reserved Sector.
When writing the handler, we assumed that a FAT filesystem image would always contain all of its clusters, even the free ones. Keep in mind that when you generate, say, a 4GB FAT32 filesystem image with a single root directory, most clusters are free and stay that way until you fill the disk with new files and directories. But the disk is still 4GB, with lots of clusters full of null bytes ready to be written to.
Seems like a fair assumption, right? Well, we recently got some FAT samples that were truncated so that the Data Region only contained the clusters in use. It’s like the initial image minus the free clusters. This meant that we were computing an end offset way past the actual end offset of our FAT chunk.
We had to switch our approach for end offset calculation to parsing the File Allocation Table. Starting from the end of the FAT, we find the last cluster index whose entry is not free (different from 0). This is the last cluster holding data within the filesystem. We then translate that cluster index into an offset, which corresponds to the end offset.
We’re relying on pyfatfs for FAT image parsing, since we quickly discovered that writing a FAT parser in Python is a non-trivial task due to the need to support FAT12, FAT16, FAT32, and all the edge cases coming from extensions added to these formats. The exact implementation details are visible here.
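The backward scan over the File Allocation Table can be sketched as follows. This is a simplified illustration that does not use pyfatfs: the fat_end_offset function and its parameters are hypothetical, and the FAT is assumed to be already decoded into a list of integer entries. It relies on two standard FAT facts: a free cluster has an entry of 0, and data clusters are numbered starting at 2.

```python
from typing import Sequence


def fat_end_offset(
    fat_entries: Sequence[int],
    data_region_offset: int,
    bytes_per_sector: int,
    sectors_per_cluster: int,
) -> int:
    """Return the offset just past the last cluster that is in use."""
    cluster_size = bytes_per_sector * sectors_per_cluster
    # Entries 0 and 1 are reserved, so stop the backward scan at index 2.
    last_used = 1
    for index in range(len(fat_entries) - 1, 1, -1):
        if fat_entries[index] != 0:
            last_used = index
            break
    # Clusters 2..last_used inclusive occupy (last_used - 1) clusters.
    used_clusters = last_used - 1
    return data_region_offset + used_clusters * cluster_size
```

With a table of nine entries where cluster 6 is the last one in use, 2048-byte clusters, and a Data Region starting at offset 65536, the function returns 75776, whereas assuming all seven data clusters are present would give 79872.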
Since the introduction of multi-volume handling in unblob (the ability for handlers to handle formats split across several files, such as split 7z archives), we only ever publicly released one handler taking advantage of that feature.
We recently had a customer upload with a split gzip compressed stream, which we were able to quickly add support for thanks to multi-volume support. The handler supports both “gzip then split” and “split then gzip”.
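The difference between the two layouts is easy to show with Python's standard gzip module. The sketch below is only an illustration of the two cases, with hypothetical helper names. It is not the unblob handler, which also has to discover and order the volumes on disk.

```python
import gzip
from typing import List


def split_bytes(data: bytes, size: int) -> List[bytes]:
    """Cut data into consecutive parts of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def join_gzip_then_split(parts: List[bytes]) -> bytes:
    """One gzip stream was cut into parts: reassemble first, then decompress."""
    return gzip.decompress(b"".join(parts))


def join_split_then_gzip(parts: List[bytes]) -> bytes:
    """Each part is its own gzip stream: decompress each, then concatenate."""
    return b"".join(gzip.decompress(part) for part in parts)
```

In the first layout, no individual part after the first is a valid gzip file, so the parts only make sense together. In the second, every part decompresses on its own.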
We’ve also improved the way users can tell unblob to skip files based on their magic, and added the ability to skip files based on their extension. Adding a new magic value to skip does not clear the default list of magics to skip but rather extends it. If you want to start from a clean slate, just use --clear-skip-magics. To skip processing files with a specific extension, you can use --skip-extension. Use one per extension you want to skip.
Of course, we also had to fix a few bugs along the way.
The FileSystem API is already a big step towards filesystem sandboxing, which we plan to further lock down with the help of Landlock. Landlock is an unprivileged access control API exposed by the Linux kernel that a process can use to restrict itself at the filesystem level. This is being worked on both in unblob-native, through Rust’s landlock crate, and in unblob, which will call the unblob-native sandbox interface.
Another important next step is metadata reporting. Initially, metadata reporting will be limited to file header values (see here), but we plan on improving it to the point where unblob can report on extracted files’ metadata, such as a file’s initial permissions and ownership if the format preserves them. We’re already documenting which formats preserve that information at https://unblob.org/formats/.
As always, we strongly recommend you try unblob during your next embedded device assessment. If you encounter any kind of bug, just open an issue. If you happen to develop a really cool handler, don’t hesitate to open a pull request. If you need help when developing a handler or extractor, just open a discussion.
Have a wonderful year, everyone!
The unblob maintainer team.