Appendix

This appendix provides detailed experimental results, setup information, and supplementary materials related to Provexa.



Descriptions of Attack Cases Used in Evaluations

Attacks based on commonly used exploits

  • Wget Executable (Case malicious_wget): The attacker uses wget to download an executable script and executes it.
  • Illegal Storage (Case malicious_illegal_store): A server administrator uses wget to download suspicious files into other users' home directories.
  • Hide File (Case malicious_hide_file): The attacker downloads a malicious file and hides it by changing its file name and location.
  • Backdoor Download (Case malicious_backdoor_dl): A malicious insider uses wget to download the backdoor malware from a compromised server and hides it among normal files.
  • Annoying Server User (Case malicious_server_usr): An annoying user logs into other users' accounts on a vulnerable server and downloads a script that writes garbage data into their home directories.
  • Password-Free Theft (Case malicious_ssh_theft): The attacker logs into another user's server via SSH and appends her public key to the ~/.ssh/authorized_keys file to leave a persistent backdoor.
  • Wget-GCC-Crash (Case malicious_gcc_crash): The attacker logs into another user's server via SSH, uses wget to download a C source code file, and uses GCC to compile it. The attacker then executes the compiled malicious executable to exhaust memory and crash the system.
  • Scan and Login (Case malicious_scan_login): The attacker uses Nmap to discover accessible servers whose port 22 is open for SSH connections. The attacker then uses a script to automatically test whether password-less SSH-key logins to the discovered servers exist.
  • Password Reuse (Case malicious_pwd_reuse): An administrator cracks password hashes in the /etc/shadow file with the John the Ripper password cracker, and uses a cracked password to log in as another user.
  • Cheating Student (Case malicious_student): A cheating student hacks into her school's Apache server and downloads midterm scores.


Multi-host multi-step intrusive attacks
The attacker uses an external host (a command-and-control (C2) server) to perform the initial penetration, distribute malware, and exfiltrate data. The first host compromised by the attacker is named Host 1; it serves as the starting point for lateral movement and other malicious actions that compromise more hosts.

  • Shellshock Penetration (Case multistep_penetration): The attacker exploits the Shellshock vulnerability to execute arbitrary code on Host 1.
  • Password Crack (Case multistep_password_crack): After the initial penetration of Host 1 in the previous attack, the attacker connects to Dropbox and downloads an image in which the C2 server's IP address is encoded. Based on this IP, the attacker downloads malware from the C2 server to Host 1. When executed, the malware scans the SSH configuration file to locate other reachable hosts in the network. After this discovery phase, the malware downloads another script from the C2 server and sends it to the discovered hosts to steal passwords from them.
  • Data Leakage (Case multistep_data_leakage): After compromising the discovered hosts, the attacker steals sensitive data files by scanning the file system, packing the files into a compressed archive, and transferring it back to the C2 server.
  • Command-Line Injection (Case multistep_cmd_injection): The victim host runs the Kodi media player, which exposes a remote-control API as a web service. One of its input-sanitization routines fails to filter invalid input from the outside, which allows the attacker to inject an arbitrary command blended into one of its requests.
  • Supply Chain Attack (Case multistep_supply_chain): The attacker discovers that the GNU Wget version running on a victim server is vulnerable to arbitrary file writes via deceptive redirection of requests from HTTP to a compromised FTP server (CVE-2016-4971). The attacker then embeds a common remote access trojan (RAT) into an Ubuntu package and compromises one of the Ubuntu repositories. In this way, when the victim server uses wget to download Ubuntu packages, the requests are redirected to the attacker's FTP server, and the downloaded package contains the RAT. When the downloaded packages are installed on the victim server, the RAT is triggered, establishing a C2 channel.
  • Phishing Email (Case multistep_phishing_email): A malicious Trojan is downloaded as an email attachment, and the enclosed macro is triggered by Microsoft Word. This allows the attacker to run arbitrary code in the context of the current user (CVE-2017-11882). After the initial penetration via the phishing email, the attacker steals sensitive data files from the compromised host.
  • Netcat Backdoor (Case multistep_netcat_backdoor): A malicious Trojan is downloaded by a victim user from a compromised HTTP server via a deceptive URL (CVE-2018-14574). The attacker penetrates the victim host through the Trojan backdoor and uses the netcat utility to maintain a persistent backdoor.
  • WannaCry (Case multistep_wannacry): The attacker exploits the EternalBlue vulnerability (CVE-2017-0146) over the network to access machines, and then encrypts their data.


Real-world malware cases
We obtained a dataset of free Windows malware samples from VirusSign. We focused on the 5 largest categories (i.e., Trojan.Autorun, Trojan.Danger, Virus.Hijack, Virus.Infector, Virus.Sysbot) and randomly selected one malware sample from each category.

  • Case malware_autorun: Trojan.Autorun
  • Case malware_danger: Trojan.Danger
  • Case malware_hijack: Virus.Hijack
  • Case malware_infector: Virus.Infector
  • Case malware_sysbot: Virus.Sysbot


DARPA TC attack cases
We selected 3 attack cases from the DARPA Transparent Computing (TC) Engagement #5 data release. Specifically, the dataset consists of system audit logs captured from six performer systems (i.e., ClearScope, FiveDirections, THEIA, TRACE, CADETS, and MARPLE) under attack by a red team using different attack strategies. The audit logs include both benign system activities and attack activities. The dataset also includes a ground-truth report with attack descriptions and setups for the cases. We first retrieved the logs for the five performers with desktop OSes (excluding ClearScope, which runs Android). After examining the logs, we found that the logs for CADETS lack key attributes (e.g., file name), which prevented us from confirming the attack ground truth for evaluation. In MARPLE, the attack failed. In TRACE, we found that forward tracking a single step reveals the attack sequence. Thus, we do not consider the CADETS, MARPLE, and TRACE cases. Nevertheless, similar attacks were already performed on the other performer systems, whose logs are covered. For the remaining two performer systems, the attack cases for the same system are largely similar, and thus we selected at least one attack case from each system to include in our evaluation benchmark. To use the logs for evaluation, we developed a tool to parse the released logs and loaded the parsed system entities and system events into Provexa's databases.

  • Case tc_fivedirections_1: 05092019 1326 - FiveDirections 2 - Firefox Drakon APT Elevate Copykatz Sysinfo
  • Case tc_fivedirections_2: 05172019 1226 - FiveDirections 3 - Firefox DNS Drakon APT FileFilter-Elevate
  • Case tc_theia: 05152019 1448 - THEIA 1 - Firefox Drakon APT BinFmt-Elevate Inject


ATLASv2 attack cases
We selected 4 attack cases from the ATLASv2 Attack Engagement dataset. This dataset builds on the first version of the ATLAS dataset and adds audit logs recorded from various sources, increasing the recorded benign activity so that the malicious system events are accompanied by enough noise to simulate a more realistic dataset. The dataset also includes a ground-truth report with attack descriptions and setups for the cases. In this dataset, two host machines were used daily by two researchers for 4 days to record benign activities. On the fifth day, the malicious attacks were carried out while the victim hosts continued to serve the researchers' daily use. This captures realistic system logs that closely resemble real-world attack logs. We selected 4 attack cases that involve different vulnerabilities in commonly used software: Adobe Flash and Microsoft Word.

  • Case atlasv2_s1: CVE-2015-5122 - Adobe Flash Exploit
  • Case atlasv2_s2: CVE-2015-3105 - Adobe Flash Exploit
  • Case atlasv2_s3: CVE-2017-11882 - Microsoft Word Exploit
  • Case atlasv2_s4: CVE-2017-0199 - Microsoft Word Exploit

Example Queries

We provide query examples written in different languages: ProvQL, SQL, Cypher, and Splunk SPL (search query only). We select malicious_ssh_theft as an example. Two queries (a backward tracking query and a search query) are shown.

Query 1: Backward Tracking From the POI Events
In the first query, the security analyst performs backward tracking from the POI event whose cmdline attribute contains authorized_keys.

  1. ProvQL: The security analyst uses command-line and type constraints to specify the POI event. The result is bound to the variable bg_query1.
    bg_query1 = back track where (cmdline like 'authorized_keys', type=process) from db(malicious_ssh_theft);


  2. SQL: In SQL, the security analyst uses Common Table Expressions (CTEs) to create a recursive query. WITH RECURSIVE defines recursive CTEs that unify all nodes and edges from different tables. Then, the security analyst uses SELECT to perform the search.
    WITH RECURSIVE allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
    SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
    SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
    SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
    nodes AS (SELECT * FROM allnodes),
    alledges AS (
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
    edges AS (SELECT e.* FROM alledges e INNER JOIN nodes n1 ON e.srcid = n1.id
    INNER JOIN nodes n2 ON e.dstid = n2.id),
    graph (id, srcid, dstid, starttime, endtime, hostname, optype, amount, threshold, step) AS (
    SELECT *, edges.endtime, 1 FROM edges
    WHERE dstid IN (SELECT id FROM nodes WHERE (cmdline like '%authorized_keys%') AND (type = 'process'))
    UNION SELECT edges.*, LEAST(edges.endtime, graph.threshold), graph.step
    FROM edges JOIN graph ON edges.dstid = graph.srcid
    WHERE edges.starttime <= graph.threshold)
    SELECT DISTINCT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM graph;


  3. Cypher: In Cypher, the security analyst first matches all the paths. Then, the security analyst uses nested FOREACH loops to traverse all relationships and join the results.
    MATCH ()-[r]->(root)
    WHERE (root.cmdline =~ '.*authorized_keys.*') AND (root.type = 'process')
    SET r.threshold = r.endtime, r.marked=true
    WITH root
    MATCH p = ()-[*..3]->(root)
    WITH DISTINCT relationships(p) as r
    FOREACH (i IN reverse(range(0, size(r)-2))
    | FOREACH (n1 IN [r[i]]
    | FOREACH (n2 IN [r[i+1]]
    | FOREACH (edge IN
        CASE
        WHEN n1.starttime <= n2.threshold THEN [n1]
        ELSE []
        END | SET edge.marked=true
              SET n1.threshold=CASE 
                WHEN n1.endtime > n2.threshold THEN n2.threshold
                ELSE n1.endtime END))))
    WITH DISTINCT r
    MATCH (sn)-[rr]->(en)
    WHERE (rr IN r AND rr.marked=true)
    RETURN DISTINCT sn, rr, en



Query 2: Search for the Entry Nodes on the Backward Dependency Graph
In the second query, the security analyst searches for the attack entry nodes on the backward dependency graph.

  1. ProvQL: Using ProvQL, the security analyst searches over bg_query1, which is bound to a local in-memory graph, and specifies event relationships in a with clause. The result is bound to the variable entry.
    search from bg_query1 where e1{exename like "ssh",type=process},
    e2{type=process}
    with e1->e2
    return * as entry;


  2. SQL: Since SQL does not support in-memory management, the security analyst needs to union all tables again. Then, the security analyst uses the SELECT clause to perform the search.
    WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
    SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
    SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
    SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
    nodes AS (SELECT * FROM allnodes), alledges AS (
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
    edges AS (SELECT e.* FROM alledges e),
    event1 (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
        SELECT edges.* FROM
        nodes n1 INNER JOIN edges ON n1.id = edges.srcid
        INNER JOIN nodes n2 ON edges.dstid = n2.id
        WHERE ((n1.exename like '%ssh%') AND (n1.type = 'process')) AND (n2.type = 'process') AND (edges.optype IS NOT NULL)),
    result AS (SELECT event1.* FROM event1 WHERE true)
    SELECT * FROM result;


  3. Cypher: In Cypher, the security analyst uses MATCH to specify the pattern and uses WHERE to specify the constraints.
    MATCH (e1)-[event0]->(e2)
    WHERE (e1.exename =~ '.*ssh.*') AND
    (e1.type = 'process') AND
    (e2.type = 'process')
    RETURN e1, e2, event0


  4. Splunk SPL: In Splunk SPL, the security analyst searches different tables and uses join type=inner to join the results.
    | search index=process exename="ssh" type="process"
    | fields id
    | join type=inner id
        [search index=processevent
        | fields srcid, dstid, *]
    | join type=inner dstid
        [search index=process type="process"
        | fields id, *
        | rename id as dstid]
    | table *

Survey Questions and Results

Detailed survey questions and results are presented below.

Database Optimization Configurations

We evaluate different optimization settings to obtain performance improvements by tuning various PostgreSQL configuration parameters. For these experiments, we randomly selected 3 sample attack cases to evaluate the optimization settings, and we average our results over 10 runs to account for variability.

We experimented with 4 categories of tuning: default, light tuning, moderate tuning, and full tuning. These experiments are summarized as follows:

Stage                 Configuration Focus                 Purpose
A. Default (Vanilla)  No changes                          Establish baseline
B. Light tuning       Planner + memory settings only      Emulates a reasonable engineer tuning PG casually
C. Moderate tuning    Light + cache + parallelism         Represents best-effort user tuning
D. Full tuning        All settings incl. WAL/JIT/autovac  Maximize DB performance, but with more system impact

A) Default

This is our baseline. The default PostgreSQL configuration is unoptimized and tends to be conservative. Including this category in our experiment establishes the baseline and serves as a meaningful starting point against which to compare the optimizations' improvements.

Results:

Experiment Category   Mode    Avg. Cost                Avg. Execution Time
Vanilla - No Provexa  track   197,851,757,322,145.12   550
Vanilla - No Provexa  search  9,095.40                 4,522.42
Vanilla - Provexa     track   484.26                   6.76
Vanilla - Provexa     search  8,312.13                 4,494.48

B) Light Tuning

We introduce light tuning targeting memory, cache, and planner cost estimates.

OPTIMIZED_SETTINGS_LIGHT = (
    "SET random_page_cost = 1.1; "
    "SET cpu_tuple_cost = 0.005; "
    "SET cpu_index_tuple_cost = 0.0025; "
    "SET effective_cache_size = '12GB'; "
    "SET work_mem = '128MB'; "
    "SET default_statistics_target = 200; "
)
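
The snippet below is a minimal sketch of how such session-level settings can be applied, assuming psycopg2 and hypothetical connection parameters; since SET statements affect only the current session, they must be reissued on each new connection.

import psycopg2

# Hypothetical connection parameters; assumes OPTIMIZED_SETTINGS_LIGHT
# (the semicolon-separated SET statements above) is in scope.
conn = psycopg2.connect(dbname="provexa", user="analyst")
with conn.cursor() as cur:
    # psycopg2 accepts multiple semicolon-separated statements in one execute;
    # these SET commands apply only to the current session.
    cur.execute(OPTIMIZED_SETTINGS_LIGHT)
conn.commit()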

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   498.34
Optimized - No Provexa  search  3,882.48                3,841.44
Optimized - Provexa     track   346.00                  6.10
Optimized - Provexa     search  3,391.91                4,362.55

C) Moderate Tuning

In addition to the light settings, we add parallelism and join/from collapse control:

OPTIMIZED_SETTINGS_MEDIUM = (
    OPTIMIZED_SETTINGS_LIGHT +
    "SET join_collapse_limit = 12; "
    "SET from_collapse_limit = 12; "
    "SET parallel_setup_cost = 100; "
    "SET parallel_tuple_cost = 0.01; "
)

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   486.95
Optimized - No Provexa  search  3,882.48                3,949.99
Optimized - Provexa     track   346.00                  5.73
Optimized - Provexa     search  3,391.91                4,397.55

D) Full Tuning

For this experiment, in addition to the moderate settings, we update the following settings in the postgresql.conf config file and restart the database engine for the changes to take effect.

shared_buffers = 4GB
temp_buffers = 64MB
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
maintenance_work_mem = 1GB
checkpoint_timeout = 15min
max_wal_size = 4GB
wal_buffers = 16MB
jit = on
jit_above_cost = 50000
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   497.65
Optimized - No Provexa  search  3,882.48                4,046.09
Optimized - Provexa     track   346.00                  6.65
Optimized - Provexa     search  3,391.91                4,774.89

Final Chosen Config

We observe that the differences across optimization levels are minimal. We therefore manually tuned a combination of the light and moderate settings to identify the best-performing configuration.

OPTIMIZED_SETTINGS_FINAL = (
    "SET random_page_cost = 1.1; "
    "SET cpu_tuple_cost = 0.005; "
    "SET cpu_index_tuple_cost = 0.0025; "
    "SET effective_cache_size = '8GB'; "
    "SET work_mem = '64MB'; "
    "SET join_collapse_limit = 8; "
    "SET from_collapse_limit = 8; "
)

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,213,136.03   246.45
Optimized - No Provexa  search  3,882.48                2,556.05
Optimized - Provexa     track   346.00                  4.79
Optimized - Provexa     search  3,391.91                3,046.11

Key Observations

We explored multiple levels of PostgreSQL tuning, ranging from the default configuration to a fully optimized setup that adjusts memory, planner cost estimates, parallelism, and JIT compilation. The default PostgreSQL settings, while functional, are known to be intentionally conservative, particularly to ensure stability in low-resource environments. As expected, our baseline results under this configuration exhibited very high planner costs and longer execution times, especially for tracking queries.

As we introduced progressively more aggressive tuning (light, moderate, and full), we observed some reduction in execution time, particularly for large search queries. However, even with full tuning, the improvements over the default were modest and inconsistent. No configuration level substantially reduced planner cost or execution time across the board.

In contrast, Provexa's decomposition and scheduling strategy consistently delivered significant performance improvements, both in execution time and query cost, regardless of the database tuning level. This highlights that our approach addresses structural inefficiencies in recursive and multi-step forensic queries that traditional database tuning cannot eliminate.

These results confirm that Provexa offers benefits orthogonal to backend tuning. It complements traditional optimizations but outperforms them even under conservative defaults, reinforcing the robustness and general applicability of our design.

Multi-Query Splitting Analysis

This section provides a detailed analysis of Provexa's multi-query splitting strategy, which decomposes complex recursive graph traversal queries into smaller, more efficient subqueries. We demonstrate how this approach significantly outperforms traditional recursive SQL queries by reducing intermediate result sets, improving execution times, and enabling better database optimization. Through concrete examples using real attack cases, we show the step-by-step query decomposition process and compare performance metrics between recursive and split execution strategies.

Optimized Query Execution

1. How is the query optimized?

  • Unoptimized (Recursive): Uses a single large recursive SQL query (WITH RECURSIVE ...) to traverse the graph in the database.
  • Optimized (Non-Recursive): Instead of recursion, the traversal is performed in the application layer:
    • The code issues multiple, smaller SQL queries to fetch nodes and edges step-by-step.
    • Each step fetches only the relevant nodes/edges for that iteration, reducing the load on the database and avoiding deep recursion (see the sketch after this list).
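
The following is a minimal Python sketch of this application-layer traversal, not Provexa's actual implementation; fetch_parent_ids is a hypothetical callable that issues one small, non-recursive SQL query per step (a per-step query template is shown in the next subsection).

def backtrack(fetch_parent_ids, poi_ids, max_depth=100):
    # Minimal sketch of application-layer backtracking. fetch_parent_ids is a
    # hypothetical callable that takes a set of node IDs and returns the IDs
    # of their parents via one small, non-recursive SQL query.
    visited = set(poi_ids)
    frontier = set(poi_ids)
    for _ in range(max_depth):
        if not frontier:
            break                                  # no new nodes: traversal done
        parents = set(fetch_parent_ids(frontier))  # one scoped query per step
        frontier = parents - visited               # keep only newly discovered nodes
        visited |= frontier
    return visited                                 # full ancestry of the POI node(s)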

2. How is the equivalent query formed?

For each step, the code generates a query like:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process
)
SELECT id FROM allnodes WHERE id IN (...);

This is repeated for each set of nodes discovered at each step, with the id IN (...) clause updated as the traversal progresses.
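
As an illustration (a hedged sketch, not Provexa's actual code), the per-step query above can be templated in Python with psycopg2-style placeholders, filling the id IN (...) clause from the current frontier:

def build_step_query(node_ids):
    # Hypothetical helper: templates the per-step query shown above, filling
    # the id IN (...) clause with the current frontier of node IDs.
    # Assumes node_ids is non-empty.
    placeholders = ", ".join(["%s"] * len(node_ids))
    sql = (
        "WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, "
        "pid, exename, exepath, cmdline) AS ("
        "SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, "
        "NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION "
        "SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, "
        "srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION "
        "SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, "
        "NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process) "
        "SELECT id FROM allnodes WHERE id IN (" + placeholders + ")"
    )
    return sql, list(node_ids)

The returned SQL string and parameter list can then be passed directly to cursor.execute at each step of the traversal.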


Using a sample attack case (case2_supply_chain)

ProvQL Query

bg = back track where
(cmdline like 'freemem.sh', type=process)
from db(case2_supply_chain);

Goal: Backtrack all processes with cmdline like 'freemem.sh' in the provenance graph.

This query is split into 4 subqueries as follows:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE (cmdline  like  '%freemem.sh%') AND (type = 'process') 

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (7,11,12)

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (779,780,781,531,532,533,535,536,537,538,539,540,541,798,543,799,800,544,801,545,546,548,549,550,806,807,808,809,810,555,556,557,558,559,560,561,562,818,819,563,564,820,565,566,575,577,579,581,587,588,589,591,592,593,594,599,600,601,607,608,609,610,623,624,625,626,627,629,635,636,637,638,639,640,641,642,643,645,651,652,653,655,656,657,658,659,660,661,662,663,664,665,666,671,672,673,674,675,676,677,683,684,685,687,688,689,691,692,693,694,695,696,697,699,700,701,702,703,704,705,706,719,726,727,729,735,736,737,738,739,740,741,743,744,745,747,749,751,752,753) 

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (38,30)

Step 1: Find Initial Nodes (POI)

Miniquery 1 Output: [10]

  • The system finds all nodes matching the constraints (cmdline like 'freemem.sh' and type=process).
  • Only the node with ID 10 matches.

Step 2: Find Immediate Parents

Miniquery 2 Output: [11, 12, 7]

  • The system looks for all nodes that have outgoing edges to node 10.
  • These are the direct parents (previous steps in the provenance): nodes 11, 12, and 7.

Step 3: Expand to Next Layer of Parents

Miniquery 3 Output: A large set of IDs:

[806, 645, 657, 799, 691, ..., 541]  // (over 100 node IDs)

  • For each of the nodes found in Step 2 (11, 12, 7), the system finds their parents.
  • This step expands the search to all nodes that are one more step back in the provenance chain.
  • The result is a much larger set of nodes, representing all possible "grandparents" in the provenance graph.

Step 4: Continue Backtracking

Miniquery 4 Output: [30, 38]

  • The system continues backtracking one more step for all nodes found in Step 3.
  • Now, only nodes 30 and 38 are found as parents.

By splitting up the queries and repeatedly looking for parents, Provexa reconstructs the full path of events that led to the 'freemem.sh' process. This way, we avoid an expensive recursive query and instead iteratively execute simple search queries.

How Optimized Backtrack Works:

  • Each step issues a simple, non-recursive SQL query for the current set of nodes.
  • The application collects results and uses them as input for the next step.
  • This continues until no new nodes are found (or a limit is reached).
  • At the end, you have the full ancestry (provenance) of your original POI node(s).

Advantages:

  • Traversal is managed by the application, not the database.
  • Each query is small and scoped.
  • No expensive recursive materialization.
  • Scales well with graph size and depth.

Search Queries

  • Simple search queries are usually not split.
  • When multiple constraints or pattern relationships are involved, Provexa may split the query for better performance or clarity.
  • This is done selectively and only when beneficial.

Taking a sample case malware_infector, the search query:

search from bg where
e1{path like 'Virus.Infector', type=file},
e2{pid=2688, exename like 'exe', type=process},
e3{type=file}
with e1->e2 &&[<100s]
e2->e3
return * as entry;

was split into the following queries using Provexa:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE (allnodes.path  like  '%Virus.Infector%') AND (allnodes.type = 'file')), 
node2 AS (SELECT * FROM allnodes WHERE ((allnodes.pid = 2688) AND (allnodes.exename  like  '%exe%')) AND (allnodes.type = 'process')), 
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
    SELECT edges.* FROM
    node1 n1 INNER JOIN edges ON n1.id = edges.srcid
    INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE allnodes.id IN (70)), 
node2 AS (SELECT * FROM allnodes WHERE allnodes.type = 'file'), 
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
    SELECT edges.* FROM
    node1 n1 INNER JOIN edges ON n1.id = edges.srcid
    INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event

  • Single Query:
    • Execution Time: 241.5 ms
    • PostgreSQL cost estimate: 4954.
    • Involves deeply nested joins and multiple filters on unindexed columns.
    • Even though it returns 39 final rows, it materializes large intermediate result sets.
  • Split Queries (Provexa):
    • Execution Time: 114.9 ms + 6.9 ms = ~121.8 ms
    • PostgreSQL cost estimate: ~1.5x lower (2878 + 451 = 3329)
    • More efficient due to early pruning:
      • First query narrows down process nodes quickly with (exename like '%exe%' AND pid=2688)
      • Second query benefits from knowing srcid = 70, leading to Bitmap Index Scan on fileevent_srcid
    • Plans use index scans, materialization, and hashing over smaller data windows.

Conclusion: Provexa's decomposition allowed the database to apply smarter, localized plans with index-friendly constraints, avoiding expensive join cascades.


Why is the multi-query approach superior?

We examine recursive vs. split-query execution across backtracking and search queries on the same sample attack cases (case2_supply_chain and malware_infector). Detailed results of the cost analysis are summarized below.

Recursive (Single SQL)

  • Planner Cost Estimate: ~74 million
  • Execution Time: ~31 ms
  • Intermediate Rows: 144 million
  • Query Plan: Deep recursive union, multi-way joins, large working set materialized

Provexa (Split Execution)

  • Cost per Query: ~22–73
  • Execution Time per Query: ~0.1–0.6 ms
  • Working Set: Targeted IDs (e.g., 3–140 nodes)
  • Query Plan: Simple id IN (...) with index scans or small sequential filters

Key Observation

The recursive SQL query invokes complex sort-merge joins and processes large intermediate results even for small final outputs. In contrast, the Provexa approach breaks the task into well-scoped queries that exploit indexes, minimize intermediate state, and reduce execution time by an order of magnitude.


When Does Multi-Query Execution Outperform Recursive SQL?

The multi-query (split) strategy offers performance advantages under the following structural and query characteristics:

  • Deep Graph Traversals: Recursive SQL queries generate increasingly large intermediate join results as the depth of traversal increases. Provexa mitigates this by limiting each query to a shallow slice of the graph.
  • Selective Filtering: When node or edge filters are highly selective (e.g., cmdline like '%freemem.sh%'), split queries target only relevant subgraphs at each step. This avoids the exhaustive materialization typical of recursive CTEs.
  • Complex Search Patterns: Queries with multiple constraints, pattern relationships, or temporal conditions are more efficiently handled when decomposed into smaller subqueries. This avoids large Cartesian joins and enables independent optimization of each constraint.

Summary

Metric               Recursive SQL         Provexa Split Execution
Cost Estimate        73 million            22–73 per query
Execution Time       ~31 ms                ~0.1–0.6 ms per query
Intermediate Rows    144M                  1–140 per query
Join Strategy        Recursive, multi-way  Simple, indexed
Memory Footprint     ~1MB+                 <100KB per step
Application Control  None (SQL-only)       Fine-grained, flexible

Conclusion

This cost-based analysis confirms that Provexa's multi-query execution model is effective in practice and plays to the strengths of the database:

  • Reduces planner-estimated costs
  • Scales with graph depth and node count
  • Avoids recursive complexity and memory bloat
  • Maintains fine-grained control over graph exploration