Blog: Hardware Hacking

Fuzzy matching with Ghidra BSim, a guide

Adam Bromiley 05 Aug 2024

TL;DR

  • BSim, Ghidra’s new built-in plugin, is a game-changer for reversing firmware and other stripped binaries.
  • Rapidly identify and annotate functions from known libraries.
  • Fuzzy matching works with unknowns, like exact library versions and compiler options.
  • Automatically define custom variable types and structures in your project.

Background

Oh no! You’re stuck disassembling yet another firmware blob stripped of symbols and lacking any handy reference strings.

You soon find yourself squirrelling down countless function call trees. The once-distinct line between application code and library code has blurred. DAT_0ff53d20, which you somewhat confidently labelled as perhaps_main_config_struct two weeks ago, has just appeared in a function that almost certainly performs low-level synchronisation of tasks in the RTOS.

If only there was a way to identify such functions and datatypes from the outset…

Ghidra 11.0 introduced BSim, the NSA’s 2023 Christmas gift to the reverse engineering community. Fuzzy festive feelings were complemented with fuzzy function matching.

BSim is a native plugin for finding equivalent functions across analysed binaries whilst accounting for possible variations like compiler optimisations, instruction set differences, and light feature patches.

Fuzzy matching with known code is particularly helpful when reversing RTOS and bare-metal firmware, since:

  • Libraries are compiled statically into the firmware blob, making it difficult to distinguish them from application code.
  • Symbols and debug symbols are often stripped.
  • Lacking a memory map, strings and static structures that would otherwise identify a function may have broken references.

Ghidra has long had Function ID (FID), but it operates on exact matches. A change in processor target, compiler flags, preprocessor macros, etc. requires a new FID database. With firmware dumps, we rarely know these specifics.

BSim instead compares functions with a “similarity” and “confidence” score:

  • Similarity ranges from 0.0 to 1.0, where 1.0 is an exact match.
  • Confidence is unbounded. Greater confidence means the functions share rare / unique features or have a large quantity of matching features. In Ghidra’s words: “the meatiness of a match.”
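To make the two scores less abstract, here is a toy Python sketch. It is not Ghidra’s implementation: as I understand it, BSim compares weighted feature vectors using a cosine similarity, and I’ve mimicked that with plain bags of made-up feature names. The toy_confidence() function and the rarity weights are purely my own illustration of why a score can be unbounded, not BSim’s formula.

```python
import hashlib
import math
from collections import Counter


def exact_hash(features):
    """Exact fingerprint: changing any single feature changes the digest."""
    return hashlib.sha256("|".join(features).encode()).hexdigest()


def cosine_similarity(a, b):
    """Cosine similarity between two bags of features, from 0.0 to 1.0."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[f] * vb[f] for f in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def toy_confidence(a, b, rarity):
    """Illustrative only: sum a rarity weight for every shared feature (unbounded)."""
    shared = Counter(a) & Counter(b)
    return sum(rarity.get(f, 1.0) * n for f, n in shared.items())


# Hypothetical feature names standing in for BSim's hashed features.
library_fn = ["load", "add", "cmp", "branch", "call", "store", "ret"]
firmware_fn = ["load", "add", "cmp", "branch", "call", "move", "ret"]  # one feature differs
rarity = {"call": 3.0, "branch": 2.0}  # made-up weights; unlisted features weigh 1.0

print(exact_hash(library_fn) == exact_hash(firmware_fn))     # False
print(round(cosine_similarity(library_fn, firmware_fn), 3))  # 0.857
print(toy_confidence(library_fn, firmware_fn, rarity))       # 9.0
```

One changed feature is enough to break the exact hash, whereas the similarity only drops to 6/7. The toy confidence keeps growing as more (or rarer) features are shared, which is the intuition behind a “meatier” match.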

BSim is a new feature under active development, so functionality may be added or changed in future Ghidra versions. This guide used version 11.1.1.

Our firmware

Presented to the disassembler is an ESP32 firmware dump from a recent research project.

We’ve already done light reversing work. The bare minimum. Sans coffee.

We have a function, aptly labelled unknown_function(). It doesn’t reference any known strings, functions, or globals. It also doesn’t implement a particularly identifying algorithm or use any magic numbers.

The context isn’t providing much help either: the calling function runs ip4addr_aton() beforehand (identified from some sort of assertion), and if we work our way up the call stack we end up at http_server_task() (manually identified off-camera). Aside from that, we’re lost.

Installation

Let’s start with the basics:

  1. Install Ghidra.
  2. Create or open any existing project.
  3. Open CodeBrowser and go to File > Configure.
  4. Under BSim’s entry, hit Configure and enable the “BSimSearchPlugin”. Close both dialogue boxes.

Creating a BSim database

We need a database to store the BSim signatures in. Ghidra offers three backends:

  1. H2 – a standalone local file requiring zero dependencies.
  2. PostgreSQL – uses the custom PostgreSQL distribution shipped with Ghidra.
  3. Elasticsearch – requires an existing Elasticsearch server.

The latter two options are more complex to set up but necessary for Ghidra server instances. They also index data, so may be quicker for large signature collections.

This guide uses H2. I’m lazy. One database per project on a personal machine will not present any noticeable slowdown.

Staying in CodeBrowser, go to Window > Script Manager and run CreateH2BSimDatabaseScript.java.

Set the database parameters to anything. I’ve named mine “Signatures”. The database template can be nosize, 32, or 64, depending on whether you want the database to be architecture-agnostic or to cater specifically for 32-bit or 64-bit architectures, respectively.

It will run and create the database file in your chosen directory.

Sample preparation

Now we want to fill the database with signatures of library functions we suspect are used by the firmware.

Analysing strings, you can often figure out characteristics like the RTOS or TCP / IP stack used by a device, at a minimum.

It wasn’t hard to identify the firmware as using the ESP-IDF framework, version 4.4.7. I could also get a list of components used (e.g., lwIP).

The goal is to therefore use signatures of known ESP-IDF 4.4.7 functions to identify code in the disassembled firmware. For this, we need to compile the library (or find a precompiled release) and ensure its symbols are not stripped.

I downloaded, set up, and built ESP-IDF by following the official guide (done in a Linux VM for ease):

https://docs.espressif.com/projects/esp-idf/en/stable/esp32/get-started/linux-macos-setup.html

Note that the point of fuzzy matching is that we don’t need exactly version 4.4.7. Matches will still be made with any other ESP-IDF 4.x release, albeit with less confidence; this is useful if the firmware lacks clear version indicators. I’m also compiling this with the default configuration, i.e., despite knowing the device is IPv4-only, I can compile ESP-IDF with IPv6 support and still get adequate matches.

ESP-IDF builds as distinct components rather than a standalone binary. Each component is located in build/esp-idf/.
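Gathering those components into one folder for import can be scripted. The Python sketch below assumes each component’s static library sits at build/esp-idf/<component>/lib<name>.a, which is what I’d expect from a CMake-based ESP-IDF build but is worth checking against your own build tree. The function name and the example paths are mine, not part of ESP-IDF.

```python
import shutil
from pathlib import Path


def collect_static_libs(build_dir, dest_dir):
    """Copy every lib*.a found under build_dir/esp-idf/<component>/
    into dest_dir/<component>/ and return the copied paths."""
    src_root = Path(build_dir) / "esp-idf"
    copied = []
    for lib in sorted(src_root.glob("*/lib*.a")):
        target = Path(dest_dir) / lib.parent.name / lib.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(lib, target)
        copied.append(target)
    return copied


# Hypothetical usage; both paths are placeholders:
# collect_static_libs("hello_world/build", "ESP-IDF")
```

Keeping one subfolder per component preserves where each library came from, which helps later when judging whether a match is plausible.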

Populating the BSim database

To populate the BSim database, we first need to import and auto-analyse the binaries. I’m doing this in a separate Ghidra project (named ESP-IDF Project) to avoid clutter.

Note that since the ESP-IDF build process generates static libraries, they are imported into Ghidra as batches of object files.

For each object file, open in CodeBrowser and analyse with the default settings. Then, from the Script Manager, run AddProgramToH2BSimDatabaseScript.java and set it to use the database file created earlier.

Scripted signature generation

There are 994 object files across the ESP-IDF components; manually importing them via the GUI is a chore. Instead, BSim offers a CLI for programmatically populating the database.

Having copied out the libraries to a folder (here, …\ESP-IDF\) and closed Ghidra to avoid project locks…

  1. Find the support scripts in the Ghidra install directory:
    > cd C:\Users\adam\Desktop\Tools\Ghidra\support\
  2. Batch import each library file into the new project (omit -recursive if you don’t need the import to recurse) and analyse them with the default options:
    > mkdir 'C:\Users\adam\BSim Demo\ESP-IDF Project'
    > .\analyzeHeadless `
        'C:\Users\adam\BSim Demo\ESP-IDF Project' `
        'ESP-IDF Project' `
        -import 'C:\Users\adam\BSim Demo\ESP-IDF\*' `
        -recursive
  3. For all imported binaries, generate and commit function signatures to the previously created BSim database:
    > .\bsim `
        generatesigs `
        'ghidra:/C:/Users/adam/BSim Demo/ESP-IDF Project/ESP-IDF Project.gpr' `
        --bsim 'file:/C:\Users\adam\BSim Demo\Signatures.mv.db' `
        --commit

The command line also offers ways to script database creation, save the intermediate signature files, and manage the database.
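If you rebuild signature databases often, the two commands above can be wrapped in a small script. This is a minimal Python sketch that only assembles the same arguments used in this guide; the helper names are mine, the paths in the usage comment are placeholders, and I pass a folder to -import rather than a wildcard to sidestep shell expansion differences. It defaults to a dry run that just prints the commands.

```python
import subprocess
from pathlib import Path


def headless_import_cmd(support_dir, project_dir, project_name, import_path):
    """Argument list for the analyzeHeadless batch import used in this guide."""
    # On Windows the launcher is a .bat file, so the name may need that suffix.
    return [
        str(Path(support_dir) / "analyzeHeadless"),
        str(project_dir),
        project_name,
        "-import", str(import_path),
        "-recursive",
    ]


def generatesigs_cmd(support_dir, ghidra_url, bsim_url):
    """Argument list for 'bsim generatesigs ... --bsim ... --commit'."""
    return [
        str(Path(support_dir) / "bsim"),
        "generatesigs", ghidra_url,
        "--bsim", bsim_url,
        "--commit",
    ]


def run_all(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical usage with placeholder paths:
# run_all([
#     headless_import_cmd("ghidra/support", "projects/esp-idf", "ESP-IDF Project", "ESP-IDF"),
#     generatesigs_cmd("ghidra/support",
#                      "ghidra:/projects/esp-idf/ESP-IDF Project.gpr",
#                      "file:/projects/Signatures.mv.db"),
# ])
```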

Finding function matches

Now we can go back to the original firmware project.

  1. If not done already, import and analyse the firmware in CodeBrowser.
  2. Go to BSim > Manage Servers and add the H2 database file.

We’re now ready to start identifying functions. Right-click any function and search for it in the BSim database.

The interface is relatively self-explanatory. A lower similarity or confidence threshold means more false positives but could help match functions with more significant differences.

Small functions like getters, setters, cleanup routines, and those returning constant values will exhibit high similarity but low confidence in a match.

Conversely, larger functions, despite having more chance for variation and thus lower similarity, may possess a greater quantity of equivalent points. This results in a “meatier” match (i.e., holding more significance) and an increased confidence value.

In this particular project, dropping the similarity threshold to as low as 0.2 yielded useful results. In such cases, I cross-reference the decompilation with the library function’s source code. Obviously, for closed-source projects that only release compiled binaries, this may not be possible.

With a threshold of 0.3, we get two matches from sockets.c in lwIP. This result roughly lines up with the context being an HTTP thread with neighbouring networking routines; I’d be suspicious, say, of matches to ESP-IDF’s USB driver.
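The interplay between the two thresholds is easier to see on a toy result set. The Python sketch below filters and ranks candidate matches; the Match fields, function names, and every score are invented for illustration and are not output from this project or from BSim.

```python
from dataclasses import dataclass


@dataclass
class Match:
    library_function: str
    similarity: float   # 0.0 to 1.0
    confidence: float   # unbounded


def filter_matches(matches, min_similarity, min_confidence):
    """Keep matches at or above both thresholds, most confident first."""
    kept = [
        m for m in matches
        if m.similarity >= min_similarity and m.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda m: (m.confidence, m.similarity), reverse=True)


# Invented candidates: a large function, a smaller one, an unrelated hit, a trivial getter.
candidates = [
    Match("big_parser", 0.41, 18.2),
    Match("medium_helper", 0.33, 11.7),
    Match("unrelated_driver", 0.22, 2.1),
    Match("tiny_getter", 1.00, 0.9),
]

ranked = filter_matches(candidates, min_similarity=0.3, min_confidence=5.0)
print([m.library_function for m in ranked])  # ['big_parser', 'medium_helper']
```

The trivial getter has perfect similarity but is dropped on confidence, while the large function survives a low similarity because its match is “meatier”.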

Function comparison

Right-click a match and choose Compare Functions to bring up the decompilation of both functions side by side. The disassembly can be compared from the Listing View tab at the top, too.

There are a lot of differences (cyan), but we can attribute these to relocations, failed function references, and the fact that Ghidra hasn’t auto-analysed several variables in our project as structures. Ultimately, the core functionality, code flow, scalar values, etc. all correlate.

Since lwIP is open source, we can compare it to the actual source code.

Looks good to me!

Right-click anywhere and apply the library’s function signature and data types to the project.

This is one of BSim’s more powerful features. Not only will it rename matching functions, but it will auto-create structs and typedefs for you. Just note that blind trust is not advised, and manual intervention may be required. E.g., sockaddr_storage has an extra member if compiled with IPv6 support, so applying that to a structure in an IPv4-only project may cause local decompilation quirks unless you redefine the datatype.
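The sockaddr_storage case can be modelled with Python’s ctypes to show how much the layout shifts. The field list below is my recollection of lwIP’s sockets.h, with the IPv6-only member being s2_data3, so treat it as an assumption and check it against the lwIP version in your firmware.

```python
import ctypes


def make_sockaddr_storage(ipv6_enabled):
    """Build a ctypes model of lwIP's sockaddr_storage (layout assumed, see above)."""
    fields = [
        ("s2_len", ctypes.c_uint8),
        ("ss_family", ctypes.c_uint8),
        ("s2_data1", ctypes.c_char * 2),
        ("s2_data2", ctypes.c_uint32 * 3),
    ]
    if ipv6_enabled:
        # Extra member only present when the library is built with IPv6 support.
        fields.append(("s2_data3", ctypes.c_uint32 * 3))

    class sockaddr_storage(ctypes.Structure):
        _fields_ = fields

    return sockaddr_storage


ipv4_only = make_sockaddr_storage(ipv6_enabled=False)
with_ipv6 = make_sockaddr_storage(ipv6_enabled=True)
print(ctypes.sizeof(ipv4_only), ctypes.sizeof(with_ipv6))  # 16 28
```

Under these assumptions the IPv6 build is 12 bytes larger, so applying it to a stack variable in an IPv4-only binary would make the decompiler swallow neighbouring locals into the structure.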

Close BSim and notice that unknown_function() has a shiny new name and sane datatypes. We can go one step further and, with access to the accompanying source code, identify every lwip_bind() callee too.

All your signature are belong to us

Let’s go one step further: select the entire program disassembly (Ctrl-A in the Listing pane) and click BSim > Search Functions… to match all functions to signatures in the database.

This is particularly helpful in quickly identifying swathes of functionality from the get-go by targeting matches with high confidence and high similarity.

Also try BSim > Perform Overview…, which provides a table containing how many signature-matches each function has.

A function with many matches is typically a simple routine comprising a few generic lines of code. For example, one function in this firmware just returns -1, which explains why BSim matches it to 454 library functions and reports a low self-significance.

More interesting are functions with fewer matches. They typically contain more complex routines that act as a unique fingerprint, making it trivial to validate the BSim comparison and apply the correct function name and variable datatypes.
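The same triage can be expressed in a few lines of Python if you export the overview table. In this sketch the (firmware function, library function) pairs, the names, and the cut-off of 10 are all made up; pick a cut-off that suits your own data.

```python
from collections import Counter


def count_matches(pairs):
    """Count how many library signatures each firmware function matched."""
    return Counter(firmware_fn for firmware_fn, _library_fn in pairs)


def triage(counts, generic_cutoff=10):
    """Split functions into distinctive (few matches) and generic (many matches)."""
    distinctive = sorted(
        (fn for fn, n in counts.items() if n < generic_cutoff),
        key=lambda fn: (counts[fn], fn),
    )
    generic = sorted(
        (fn for fn, n in counts.items() if n >= generic_cutoff),
        key=lambda fn: (-counts[fn], fn),
    )
    return distinctive, generic


# Made-up match pairs: one unique hit, one stub matching 12 signatures, one with 2 hits.
pairs = [("FUN_400d1000", "lib_parse_header")]
pairs += [("FUN_400d2000", f"lib_stub_{i}") for i in range(12)]
pairs += [("FUN_400d3000", "lib_open"), ("FUN_400d3000", "lib_open_v2")]

distinctive, generic = triage(count_matches(pairs))
print(distinctive)  # ['FUN_400d1000', 'FUN_400d3000']
print(generic)      # ['FUN_400d2000']
```

Working through the distinctive list first gives the quickest wins, since each of those functions has only a handful of candidates to validate.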

Conclusion

This guide outlined the process to install and use the BSim plugin in Ghidra, concluding with using it to fuzzy match functions en masse from a complex development framework.

I’ve found it significantly enhances my reverse engineering workflow, allowing for quick and accurate annotation of functions and datatypes.

To summarise the steps:

  1. Enable BSim.
  2. Create an empty signature database with CreateH2BSimDatabaseScript.java.
  3. Build or obtain a precompiled release of a library known to be used by your program. Ensure symbols aren’t stripped.
  4. Import and auto-analyse the library in Ghidra (in the same or a separate project).
  5. Create BSim signatures for library functions with AddProgramToH2BSimDatabaseScript.java.
  6. Open the target program’s Ghidra project.
  7. For a suspected library function in the program’s disassembly, search for BSim matches.
  8. No results? Decrease similarity threshold.
  9. Results? Compare disassemblies and decompilations, with reference to source code if available.
  10. Apply function name and / or variable data types.
  11. Repeat.
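
The search, back off, and repeat part of that loop can be sketched in Python. Here query is a hypothetical callable standing in for however you run a BSim search at a given similarity threshold; it is not a Ghidra API. The starting value of 0.7 and the floor of 0.2 are assumptions chosen to mirror the thresholds discussed in this guide.

```python
def search_with_backoff(query, start=0.7, floor=0.2, step=0.1):
    """Call query(threshold) with a falling similarity threshold until it
    returns matches. Returns (threshold, matches), or (None, []) if the
    floor is reached without any result."""
    steps = round((start - floor) / step)
    for i in range(steps + 1):
        threshold = round(start - i * step, 2)  # avoid floating-point drift
        matches = query(threshold)
        if matches:
            return threshold, matches
    return None, []


def fake_query(threshold):
    """Stand-in search: pretends matches only appear at 0.3 or below."""
    return ["candidate_function"] if threshold <= 0.3 else []


print(search_with_backoff(fake_query))  # (0.3, ['candidate_function'])
```

Whatever threshold the loop stops at, the result still needs the manual comparison step before any names or datatypes are applied.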