Kestrel provides a layer of abstraction for composing hunt-flows from standard hunt steps that run across many data sources and data types. This blog post gives an overview of how data is retrieved, processed, and stored in Kestrel, and explains the roughly 10x data retrieval performance improvement across Kestrel 1.5, 1.6, and 1.7.
Hunting is the process of finding a string of needles (attack steps) in a haystack (the gigantic pool of monitored data). Among the different hunt steps/commands in a hunt-flow (shown in Figure 1), the retrieval steps, e.g., GET, pull data from data sources into the Kestrel internal data store (firepit). During the execution of a retrieval step, data is transmitted, translated, and ingested into the internal data store.
The Kestrel internal data store implements an entity-relation view of the data on top of a relational database: each type of entity resides in one table, and the relations between entities reside in a separate table. Kestrel variables are realized as views of entity tables. Most hunt steps, such as transformation and inspection, are read-only or projection steps. The exceptions are retrieval and enrichment steps (Kestrel analytics), which can write new rows/columns in the tables or even create tables, e.g., for new entities from retrieval.
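To make the layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and view name are assumptions chosen for illustration; they are not firepit's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per entity type (hypothetical columns, not firepit's schema).
cur.execute("CREATE TABLE process (id TEXT PRIMARY KEY, name TEXT, pid INTEGER)")
cur.execute('CREATE TABLE "ipv4-addr" (id TEXT PRIMARY KEY, value TEXT)')

# Relations between entities live in a separate table.
cur.execute("CREATE TABLE relation (source_id TEXT, target_id TEXT, kind TEXT)")

# A retrieval step writes new rows.
cur.executemany(
    "INSERT INTO process VALUES (?, ?, ?)",
    [("p1", "curl", 101), ("p2", "bash", 102)],
)
cur.execute('INSERT INTO "ipv4-addr" VALUES (?, ?)', ("a1", "127.0.0.1"))
cur.execute("INSERT INTO relation VALUES (?, ?, ?)", ("p1", "a1", "connects-to"))

# A variable is a view over an entity table, so read-only steps copy no data.
cur.execute("CREATE VIEW curl_procs AS SELECT * FROM process WHERE name = 'curl'")
rows = cur.execute("SELECT id, name, pid FROM curl_procs").fetchall()

# Following a relation is a join through the relation table.
peers = cur.execute(
    'SELECT a.value FROM curl_procs p '
    'JOIN relation r ON r.source_id = p.id '
    'JOIN "ipv4-addr" a ON a.id = r.target_id'
).fetchall()
```

In this sketch, the view plays the role of a hunt variable: defining it writes nothing to the entity tables, which matches the read-only nature of most hunt steps.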
In real-world hunts, data retrieval (Figure 2) currently goes through stix-shifter, which enables hunters to connect to over 30 data sources (EDR, SIEM, log management systems, data lakes, etc.). In Figure 2, most phases before the bottom three-box chain are lightweight and fast, while the most time-consuming phases are the three boxes at the bottom: transmission, translation, and ingestion.
In theory, all phases of data retrieval should be executed in sequence (shown in Figure 2). A simplified view is shown in Figure 3.
In reality, the first complication is pagination: big chunks of data may not be retrievable at once, so many data sources, such as Elasticsearch, provide multi-page or multi-round retrieval for large results. For instance, Elasticsearch has a default page size of 10,000, and any larger result must be fetched in multiple rounds with the search_after API.
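The multi-round pattern can be sketched as follows. This is an illustration only: RECORDS and search() are hypothetical in-memory stand-ins for a data source, not the Elasticsearch client or the stix-shifter interface. The idea is that each round passes the sort key of the last record seen as a cursor for the next round.

```python
from typing import Iterator, List, Optional

# Hypothetical in-memory stand-in for a data source, sorted by "ts".
RECORDS = [{"ts": i, "value": f"10.0.0.{i}"} for i in range(25)]


def search(size: int, search_after: Optional[int] = None) -> List[dict]:
    """Return up to `size` records whose sort key is greater than the cursor."""
    matching = [r for r in RECORDS if search_after is None or r["ts"] > search_after]
    return matching[:size]


def fetch_all(page_size: int) -> Iterator[List[dict]]:
    """Yield pages one by one until the data source is exhausted."""
    cursor = None
    while True:
        page = search(page_size, cursor)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means there is nothing left
        cursor = page[-1]["ts"]  # sort key of the last record seen


pages = list(fetch_all(page_size=10))
page_sizes = [len(p) for p in pages]
```

Because each round depends on the cursor returned by the previous one, the pages of a single query are necessarily fetched one after another.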
In Kestrel 1.5.8, we add pagination support in the stix-shifter data source interface as illustrated in Figure 4.
In Kestrel 1.5.10, we realize fast translation (a firepit function) as an alternative to stix-shifter result translation (shown in Figure 5). Without fast translation, stix-shifter translates raw data from data sources to STIX in JSON (text) before ingestion. With fast translation enabled (configurable in stixshifter.yaml), the raw data is translated into a Pandas DataFrame (a Python-native data structure), which is more performant than JSON text.
In Kestrel 1.6.0, we implement an async producer-consumer model between transmission and translation/ingestion to take advantage of the async support in stix-shifter v5. Unfortunately, this does not improve performance as expected: the CPU-bound translation (implemented in stix-shifter v5 async mode) always causes the I/O-bound transmission to time out, and the transmission restarts after the translation completes.
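The underlying effect can be reproduced with a small, self-contained sketch. It is not the stix-shifter or Kestrel code: heartbeat() stands in for an I/O-bound task that needs regular turns on the event loop, and time.sleep() stands in for CPU-bound translation that never yields control. Both names and all timings are assumptions for illustration.

```python
import asyncio
import time


async def heartbeat(ticks: list, count: int) -> None:
    """I/O-style task: wants to run every 10 ms."""
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_translation() -> None:
    """Stand-in for CPU-bound work running inside the event loop."""
    await asyncio.sleep(0.02)  # let the heartbeat start first
    time.sleep(0.2)            # blocks the whole loop; nothing else can run


async def main() -> float:
    ticks: list = []
    await asyncio.gather(heartbeat(ticks, 10), blocking_translation())
    # Largest gap between two consecutive heartbeat ticks, in seconds.
    return max(b - a for a, b in zip(ticks, ticks[1:]))


max_gap = asyncio.run(main())
```

The heartbeat asks to run every 10 ms, yet one of its gaps stretches to the full length of the blocking call. An I/O operation with a timeout shorter than that stall would expire in the same way.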
In Kestrel 1.7.0, we skip asyncio and move to multiprocessing to deal with both I/O-bound and CPU-bound phases. Figure 6 illustrates the data retrieval procedure with multiple processes. The key operations are as follows:
… stixshifter.yaml).
… n native data source queries.
… n transmission worker processes, each of which contacts the data source to execute one native query and retrieve the results. If the result of a query has multiple pages, the same transmission worker retrieves them one by one.
We evaluate the performance improvement from Kestrel 1.5 to Kestrel 1.7 against three Elasticsearch data sources: a single-small-node testbed, a single-large-node testbed, and a multi-node enterprise testbed.
The single-hunt-step hunts we run against all three data sources are like the one below:
ips = GET ipv4-addr FROM stixshifter://datasourceX
WHERE value = '127.0.0.1'
LAST 24 HOURS
Different IP addresses and time ranges are used against the different data sources so that each hunt retrieves 50k-150k records. We take the best of 5 runs to rule out slow-network outliers and compute the throughputs in the table below:
Data source | Kestrel 1.5 | Kestrel 1.7 | Speedup |
Single-small-node Elasticsearch Testbed | 136 rec/sec | 1655 rec/sec | 12x |
Single-large-node Elasticsearch Testbed | 221 rec/sec | 2399 rec/sec | 11x |
Multi-node Enterprise Elasticsearch Testbed | 796 rec/sec | 3894 rec/sec | 5x |
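The speedup column follows directly from the two throughput columns. A short check, using only the numbers in the table (the dictionary layout and the 100,000-record example size are choices made for this illustration):

```python
# Throughput in records per second: (Kestrel 1.5, Kestrel 1.7), from the table.
throughput = {
    "Single-small-node": (136, 1655),
    "Single-large-node": (221, 2399),
    "Multi-node enterprise": (796, 3894),
}

# Speedup is the ratio of new to old throughput, rounded to a whole number.
speedups = {name: round(new / old) for name, (old, new) in throughput.items()}

# Time to retrieve a hypothetical 100,000-record result at each throughput.
records = 100_000
seconds = {
    name: (records / old, records / new) for name, (old, new) in throughput.items()
}
```

On the single-small-node testbed, for example, a 100,000-record retrieval drops from about 12 minutes to about 1 minute.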
After this series of performance upgrades, Kestrel v1.7 finally reaches a reasonably good design for data retrieval performance. In wall-clock terms, the entire data retrieval procedure (Figure 2) in Kestrel v1.7 takes 1x-2x the time spent on transmission alone. Since the transmission time is largely determined by the speed of the data source API, the network bandwidth, and the serialized pagination procedure, there is limited room for further improvement.
Beyond data retrieval performance, the Kestrel team continues to work on improvements and upgrades to make Kestrel an enterprise-grade hunting tool. We will introduce new syntax in Kestrel v1.7.1 to enable hunting over very large data sets (returning sampled results without waiting for the full result). Happy hunting, and join us on Slack!
Dr. Xiaokui Shu is a Senior Research Scientist at IBM Research and the Technical Steering Committee Chair of the Open Cybersecurity Alliance (OCA).