Explore detailed experimental results, setup information, and supplementary materials related to Provexa.
Attacks based on commonly used exploits.
Multi-host multi-step intrusive attacks
The attacker used an external host (Command & Control (C2) server) to perform initial penetration, distribute malware, and exfiltrate data. The first host compromised by the attacker is named as Host 1, which is a starting point to perform lateral movement and other malicious actions to compromise more hosts.
Real-world malware cases
We obtained a dataset of free Windows malware samples from VirusSign. We focused on 5 largest categories (i.e., Trojan.Autorun, Trojan.Danger, Virus.Hijack, Virus.Infector, Virus.Sysbot), and we randomly selected 1 malware sample for each category.
DARPA TC attack cases
We selected 3 attack cases from the DARPA Transparent Computing (TC) Engagement #5 data release. Specifically,
the dataset consists of the captured system audit logs of six performer systems (i.e., ClearScope, FiveDirections,
THEIA, TRACE, CADETS, and MARPLE) under the attack of the red team using different attack strategies.
The audit logs include both benign system activities and attack activities. The dataset also includes a ground-truth
report with attack descriptions and setups for the cases. We first retrieved the logs for the five performers with desktop
OS's (excluding ClearScope that runs Android). After examining the logs, we found that the logs for CADETS lack key
attributes (e.g., file name), making us unable to confirm the attack ground truth to conduct evaluations. In MARPLE,
the attack failed. In TRACE, we found that forward tracking one step can reveal the attack sequence. Thus, we do not
consider CADETS, MARPLE, and TRACE cases. Nevertheless, similar attacks were already performed for other performer systems
and their logs are covered. For the other two performer systems, the attack cases for the same performer system are largely
similar, and thus we selected at least one attack case from each system to include in our evaluation benchmark. To use the logs
for evaluation, we developed a tool to parse the released logs and loaded the parsed system entities and system events into Provexa's databases.
ATLASv2 attack cases
We selected 4 attack cases from the ATLASv2 Attack Engagement dataset. This dataset builds on the first version
of the ATLAS dataset and includes audit logs recorded from various sources to increase the benign activities
recorded to simulate a more realistic dataset with enough noise data to accompany the malicious system events.
The dataset also includes a ground-truth report with attack descriptions and setups for the cases. In this dataset,
two host machines were used for daily usage by two researchers to record benign activies for 4 days. In the fifth day,
the malicious attacks were carried out while the victim hosts continued serving the daily use of the researchers.
This captures realistic system logs that closely resemble real-world attack logs. We select 4 attack cases that involve
different vulnerabilities in commonly used software - Adobe Flash and Microsoft Word.
We provide query examples written in different languages: ProvQL, SQL, Cypher, and Splunk SPL (search query only).
We select malicious_ssh_theft
as an example. Two queries (a backward tracking query, and a search query) are shown.
Query 1: Backward Tracking From the POI Events
In the first query, the security analyst performs backward tracking from the POI event whose cmdline attribute contains authorized_keys.
bg_query1 = back track where (cmdline like 'authorized_keys', type=process) from db(malicious_ssh_theft);
WITH RECURSIVE allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
nodes AS (SELECT * FROM allnodes),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
edges AS (SELECT e.* FROM alledges e INNER JOIN nodes n1 ON e.srcid = n1.id
INNER JOIN nodes n2 ON e.dstid = n2.id),
graph (id, srcid, dstid, starttime, endtime, hostname, optype, amount, threshold, step) AS (
SELECT *, edges.endtime, 1 FROM edges
WHERE dstid IN (SELECT id FROM nodes WHERE (cmdline like authorized_keys) AND (type = process))
UNION SELECT edges.*, LEAST(edges.endtime, graph.threshold), graph.step
FROM edges JOIN graph ON edges.dstid = graph.srcid
WHERE edges.starttime <= graph.threshold)
SELECT DISTINCT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM graph;
MATCH ()-[r]->(root)
WHERE (root.cmdline =~ authorized_keys) AND (root.type = process)
SET r.threshold = r.endtime, r.marked=true
WITH root
MATCH p = ()-[*..3]->(root)
WITH DISTINCT relationships(p) as r
FOREACH (i IN reverse(range(0, size(r)-2))
| FOREACH (n1 IN [r[i]]
| FOREACH (n2 IN [r[i+1]]
| FOREACH (edge IN
CASE
WHEN n1.starttime <= n2.threshold THEN [n1]
ELSE []
END | SET edge.marked=true
SET n1.threshold=CASE
WHEN n1.endtime > n2.threshold THEN n2.threshold
ELSE n1.endtime END))))
WITH DISTINCT r
MATCH (sn)-[rr]->(en)
WHERE (rr IN r AND rr.marked=true)
RETURN DISTINCT sn, rr, en
Query 2: Search for the Entry Nodes on the Backward Dependency Graph
In the second query, the security analyst searches for the attack entry nodes on the backward dependency graph.
search from bg_query1 where e1{exename like "ssh",type=process},
e2{type=process}
with e1->e2
return * as entry;
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
nodes AS (SELECT * FROM allnodes), alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
edges AS (SELECT e.* FROM alledges e),
event1 (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
SELECT edges.* FROM
nodes n1 INNER JOIN edges ON n1.id = edges.srcid
INNER JOIN nodes n2 ON edges.dstid = n2.id
WHERE ((n1.exename like ssh) AND (n1.type = process)) AND (n2.type = process) AND (edges.optype != null)),
result AS (SELECT event1.* FROM event1 WHERE true)
SELECT * FROM result;
MATCH (e1)-[event0]->(e2)
WHERE (e1.exename =~ ssh) AND
(e1.type = process) AND
e2.type = process AND true
RETURN e1, e2, event0
| search index=process exename="ssh" type="process"
| fields id
| join type=inner id
[search index=processevent
| fields srcid, dstid, *]
| join type=inner dstid
[search index=process type="process"
| fields id, *
| rename id as dstid]
| table *
Detailed survey questions and results are presented below.
We carry out different optimization settings to get performance improvement by tuning the various config settings of PostgreSQL. For this experiments, we randomly selected 3 sample attack cases to evaluate various optimization settings. We average our results over 10 runs to account for variability.
We experimented with 4 categories of tuning: default, light tuning, moderate tuning and full tuning. These experiments are summarized as follows:
Stage | Configuration Focus | Purpose |
---|---|---|
A. Default (Vanilla) | No changes | Establish baseline |
B. Light tuning | Planner + memory settings only | Emulates a reasonable engineer tuning PG casually |
C. Moderate tuning | Light + cache + parallelism | Represents best-effort user tuning |
D. Full tuning | All settings incl. WAL/JIT/autovac | Maximize DB performance, but with more system impact |
This is our baseline. The default PostgreSQL configuration is unoptimized and tends to be conservative. By including this category in our experiment, we establish our baseline and serves a meaningful starting point to compare the optimizations' improvements.
Results:
Experiment Category | Mode | Avg. Cost | Avg. Execution Time |
---|---|---|---|
Vanilla - No Provexa | track | 197,851,757,322,145.12 | 550 |
Vanilla - No Provexa | search | 9,095.40 | 4,522.42 |
Vanilla - Provexa | track | 484.26 | 6.76 |
Vanilla - Provexa | search | 8,312.13 | 4,494.48 |
We introduce light tuning targeting memory, cache, and planner cost estimates.
OPTIMIZED_SETTINGS_LIGHT = (
"SET random_page_cost = 1.1; "
"SET cpu_tuple_cost = 0.005; "
"SET cpu_index_tuple_cost = 0.0025; "
"SET effective_cache_size = '12GB'; "
"SET work_mem = '128MB'; "
"SET default_statistics_target = 200; "
)
Results:
Experiment Category | Mode | Avg. Cost | Avg. Execution Time |
---|---|---|---|
Optimized - No Provexa | track | 72,316,505,058,575.20 | 498.34 |
Optimized - No Provexa | search | 3,882.48 | 3,841.44 |
Optimized - Provexa | track | 346.00 | 6.10 |
Optimized - Provexa | search | 3,391.91 | 4,362.55 |
In addition to the light settings, we add parallelism and collapse control
OPTIMIZED_SETTINGS_MEDIUM = (
OPTIMIZED_SETTINGS_LIGHT +
"SET join_collapse_limit = 12; "
"SET from_collapse_limit = 12; "
"SET parallel_setup_cost = 100; "
"SET parallel_tuple_cost = 0.01; "
)
Results:
Experiment Category | Mode | Avg. Cost | Avg. Execution Time |
---|---|---|---|
Optimized - No Provexa | track | 72,316,505,058,575.20 | 486.95 |
Optimized - No Provexa | search | 3,882.48 | 3,949.99 |
Optimized - Provexa | track | 346.00 | 5.73 |
Optimized - Provexa | search | 3,391.91 | 4,397.55 |
For this experiment, in addition to the medium settings, we update the following settings in the postgresql.conf
config file and restart the database engine for the change to take effect.
shared_buffers = 4GB
temp_buffers = 64MB
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
maintenance_work_mem = 1GB
checkpoint_timeout = 15min
max_wal_size = 4GB
wal_buffers = 16MB
jit = on
jit_above_cost = 50000
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
Results:
Experiment Category | Mode | Avg. Cost | Avg. Execution Time |
---|---|---|---|
Optimized - No Provexa | track | 72,316,505,058,575.20 | 497.65 |
Optimized - No Provexa | search | 3,882.48 | 4,046.09 |
Optimized - Provexa | track | 346.00 | 6.65 |
Optimized - Provexa | search | 3,391.91 | 4,774.89 |
We observe that the differences across optimization levels are minimal. We manually tuned a combination of light and medium settings to identify the best-performing configuration.
OPTIMIZED_SETTINGS_FINAL = (
"SET random_page_cost = 1.1; "
"SET cpu_tuple_cost = 0.005; "
"SET cpu_index_tuple_cost = 0.0025; "
"SET effective_cache_size = '8GB'; "
"SET work_mem = '64MB'; "
"SET join_collapse_limit = 8; "
"SET from_collapse_limit = 8; "
)
Results:
Experiment Category | Mode | Avg. Cost | Avg. Execution Time |
---|---|---|---|
Optimized - No Provexa | track | 72,316,505,213,136.03 | 246.45 |
Optimized - No Provexa | search | 3,882.48 | 2,556.05 |
Optimized - Provexa | track | 346.00 | 4.79 |
Optimized - Provexa | search | 3,391.91 | 3,046.11 |
We explored multiple levels of PostgreSQL tuning, ranging from the default configuration to a fully optimized setup that adjusts memory, planner cost estimates, parallelism, and JIT compilation. The default PostgreSQL settings, while functional, are known to be intentionally conservative, particularly to ensure stability on low-resource environments. As expected, our baseline results under this configuration exhibited very high planner costs and longer execution times, especially for tracking queries.
As we introduced progressively more aggressive tuning - light, moderate, and full tuning we observed some reduction in execution time, particularly for large search queries. However, even with full tuning, the improvements over the default were modest and inconsistent. No configuration level substantially reduced planner cost or execution time across the board.
In contrast, Provexa's decomposition and scheduling strategy consistently delivered significant performance improvements, both in execution time and query cost, regardless of the database tuning level. This highlights that our approach addresses structural inefficiencies in recursive and multi-step forensic queries that traditional database tuning cannot eliminate.
These results confirm that Provexa offers benefits orthogonal to backend tuning. It complements traditional optimizations but outperforms them even under conservative defaults, reinforcing the robustness and general applicability of our design.
This section provides a detailed analysis of Provexa's multi-query splitting strategy, which decomposes complex recursive graph traversal queries into smaller, more efficient subqueries. We demonstrate how this approach significantly outperforms traditional recursive SQL queries by reducing intermediate result sets, improving execution times, and enabling better database optimization. Through concrete examples using real attack cases, we show the step-by-step query decomposition process and compare performance metrics between recursive and split execution strategies.
WITH RECURSIVE ...
) to traverse the graph in the database.For each step, the code generates a query like:
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process
)
SELECT id FROM allnodes WHERE id IN (...);
This is repeated for each set of nodes discovered at each step, with the id IN (...)
clause updated as the traversal progresses.
bg = back track where
(cmdline like 'freemem.sh', type=process)
from db(case2_supply_chain);
Goal: Backtrack all processes with cmdline
like 'freemem.sh'
in the provenance graph.
This query is split into multiple (4) subqueries as follows:
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE (cmdline like '%freemem.sh%') AND (type = 'process')
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (7,11,12)
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (779,780,781,531,532,533,535,536,537,538,539,540,541,798,543,799,800,544,801,545,546,548,549,550,806,807,808,809,810,555,556,557,558,559,560,561,562,818,819,563,564,820,565,566,575,577,579,581,587,588,589,591,592,593,594,599,600,601,607,608,609,610,623,624,625,626,627,629,635,636,637,638,639,640,641,642,643,645,651,652,653,655,656,657,658,659,660,661,662,663,664,665,666,671,672,673,674,675,676,677,683,684,685,687,688,689,691,692,693,694,695,696,697,699,700,701,702,703,704,705,706,719,726,727,729,735,736,737,738,739,740,741,743,744,745,747,749,751,752,753)
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (38,30)
Miniquery 1 Output: [10]
cmdline like 'freemem.sh'
and type=process
).10
matches.Miniquery 2 Output: [11, 12, 7]
10
.11
, 12
, and 7
.Miniquery 3 Output: A large set of IDs:
[806, 645, 657, 799, 691, ..., 541] // (over 100 node IDs)
11, 12, 7
), the system finds their parents.Miniquery 4 Output: [30, 38]
30
and 38
are found as parents.By splitting up the queries, and looking for the parents, Provexa reconstructs the full path of events that led to the process 'freemem.sh'
. This way we avoid using expensive recursive query and iteratively execute simple search queries.
Advantages:
Taking a sample case malware_infector
, the search query:
search from bg where
e1{path like 'Virus.Infector', type=file},
e2{pid=2688, exename like 'exe', type=process},
e3{type=file}
with e1->e2 &&[<100s]
e2->e3
return * as entry;
was split into the following queries using Provexa:
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE (allnodes.path like '%Virus.Infector%') AND (allnodes.type = 'file')),
node2 AS (SELECT * FROM allnodes WHERE ((allnodes.pid = 2688) AND (allnodes.exename like '%exe%')) AND (allnodes.type = 'process')),
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
SELECT edges.* FROM
node1 n1 INNER JOIN edges ON n1.id = edges.srcid
INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event
WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE allnodes.id IN (70)),
node2 AS (SELECT * FROM allnodes WHERE allnodes.type = 'file'),
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
SELECT edges.* FROM
node1 n1 INNER JOIN edges ON n1.id = edges.srcid
INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event
(exename like '%exe%' AND pid=2688)
srcid = 70
, leading to Bitmap Index Scan on fileevent_srcid
Conclusion: Provexa's decomposition allowed the database to apply smarter, localized plans with index-friendly constraints, avoiding expensive join cascades.
We examine recursive vs. split-query execution across backtracking and search queries on the same sample attack cases (case2_supply_chain
and malware_infector
).
Detailed results of the cost analysis are available here.
id IN (...)
with index scans or small sequential filtersThe recursive SQL query invokes complex sort-merge joins and processes large intermediate results even for small final outputs. In contrast, the Provexa approach breaks the task into well-scoped queries that exploit indexes, minimize intermediate state, and reduce execution time by an order of magnitude.
The multi-query (split) strategy offers performance advantages under the following structural and query characteristics:
cmdline like '%freemem.sh%'
), split queries target only relevant subgraphs at each step. This avoids the exhaustive materialization typical of recursive CTEs.Metric | Recursive SQL | Provexa Split Execution |
---|---|---|
Cost Estimate | 73 million | 22–73 per query |
Execution Time | ~31 ms | ~0.1–0.6 ms per query |
Intermediate Rows | 144M | 1–140 per query |
Join Strategy | Recursive, multi-way | Simple, indexed |
Memory Footprint | ~1MB+ | <100KB per step |
Application Control | None (SQL-only) | Fine-grained, flexible |
This cost-based analysis confirms that Provexa's multi-query execution model is effective in practice and plays to the strength of the databases: