Provexa: Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis

Overview

Provexa is a human-in-the-loop investigation platform designed to uncover sophisticated, multi-stage cyberattacks by analyzing massive volumes of system audit data. At its core lies ProvQL, a powerful domain-specific language tailored for security analysts working over system provenance graphs.

The architecture comprises lightweight system agents that collect OS-level events (file, process, and network interactions), which are then parsed and stored in graph or relational databases. These events are transformed into provenance graphs where nodes represent system entities and edges represent causal event relationships.
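The mapping from events to a provenance graph can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Event` record, its fields, and the `type:name` entity identifiers are hypothetical and do not reflect Provexa's actual schema or storage layer.

```python
from collections import namedtuple

# Hypothetical event record (assumption): src and dst are entity ids,
# op is the system operation, ts is a timestamp in seconds.
Event = namedtuple("Event", "src dst op ts")

def build_provenance_graph(events):
    """Return (nodes, edges). Nodes are system entities; each edge is one
    event, directed along the data flow, carrying its operation and time."""
    nodes = set()
    edges = []
    for ev in events:
        nodes.update((ev.src, ev.dst))
        edges.append((ev.src, ev.dst, {"op": ev.op, "ts": ev.ts}))
    return nodes, edges

# Toy input: a process reads a file, then writes to a network connection.
events = [
    Event("file:sensitive_data.tar", "proc:curl", "read", 10.0),
    Event("proc:curl", "net:remote", "write", 10.4),
]
nodes, edges = build_provenance_graph(events)
```

Directing each edge along the data flow (file to process for a read, process to network for a write) is a design choice made here so that causal tracking reduces to following edges backward or forward.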

Analysts interact with Provexa via a notebook-style UI that enables constructing queries, visualizing results, and progressively refining hypotheses. Two core primitives support investigation:

Attack Pattern Search: Search for suspicious multi-event behavior patterns (e.g., data exfiltration, remote access).
Causal Dependency Tracking: Uncover chains of events leading to or resulting from an attack indicator.

The domain-aware query engine schedules subqueries to optimize performance, and in-memory result management supports fast, iterative analysis. This architecture lets analysts stay focused on surfacing relevant attack behaviors without wading through irrelevant noise.

Motivation

Figure 1: System dependency graphs for a multi-stage, multi-host data leakage attack. The combined dependency graphs of the two victim hosts contain 100,524 nodes and 154,353 edges. The attack-relevant portion, highlighted in black, comprises only 20 nodes and 20 edges, which illustrates how much the investigation resembles finding a needle in a haystack.

Sophisticated cyberattacks such as advanced persistent threats (APTs) unfold over multiple hosts and stages, often blending into normal activity. Traditional forensic tools struggle to cope for three reasons:

Massive event volume: Millions of daily system events overwhelm analysts.
Dependency explosion: Causal tracing from a single alert brings in thousands of irrelevant nodes.
Lack of analyst control: Existing systems don't support iterative refinement or domain-informed filtering.

Provexa addresses these challenges. The six steps below show how ProvQL is used to investigate the data leakage case in the figure above.

1

We search Host 1 for a curl process that reads a *.tar file and, within one second, writes to a network connection. The search result confirms data exfiltration involving sensitive_data.tar and is stored in poi1 for further analysis.

                    poi1 = search from db(host1) where
                      e1{name="curl", type=process},
                      e2{name="*.tar", type=file},
                      e3{type=network}
                      with e2[read] → e1 && [<1s]
                      e1[write] → e3;

                    display poi1;
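To make the semantics of this pattern concrete, here is a rough Python sketch of the same idea: match a read of a *.tar file by a process, followed within a time window by a network write from that process. The `Event` record, the `(type, name)` entity tuples, and `find_read_then_send` are hypothetical names for illustration. This is not how the ProvQL engine evaluates the query.

```python
from collections import namedtuple
from fnmatch import fnmatch

# Hypothetical schema (assumption): entities are (type, name) tuples,
# and events are directed along the data flow.
Event = namedtuple("Event", "src dst op ts")

def find_read_then_send(events, proc_name, file_glob, max_gap):
    """Return (read, write) pairs where a file matching file_glob flows into
    proc_name and that process writes to a network node within max_gap seconds."""
    proc = ("proc", proc_name)
    reads = [e for e in events
             if e.op == "read" and e.dst == proc
             and e.src[0] == "file" and fnmatch(e.src[1], file_glob)]
    writes = [e for e in events
              if e.op == "write" and e.src == proc and e.dst[0] == "net"]
    return [(r, w) for r in reads for w in writes if 0 <= w.ts - r.ts < max_gap]

# Toy data: only the first read/write pair satisfies both the glob and the window.
events = [
    Event(("file", "sensitive_data.tar"), ("proc", "curl"), "read", 100.0),
    Event(("proc", "curl"), ("net", "remote"), "write", 100.3),
    Event(("file", "notes.txt"), ("proc", "curl"), "read", 50.0),
    Event(("proc", "curl"), ("net", "remote"), "write", 300.0),
]
poi1 = find_read_then_send(events, "curl", "*.tar", max_gap=1.0)
```

Hand-coding even this one pattern requires explicit filtering and joining, which is the kind of boilerplate the declarative query above hides from the analyst.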

2

We backtrack poi1 on Host 1 to find the origin of the sensitive_data.tar file, filtering out benign processes like vscode. The results reveal that an scp process created the file by copying it from Host 2, confirming remote data transfer.

                    g1 = back track poi1 from db(host1) exclude nodes where name like "vscode";

                    display g1;
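Backward tracking with a node filter can be approximated by the following Python sketch. It uses a common formulation of causal backtracking: an event is followed only if it happened no later than the event through which its target was reached. The edge format, the `back_track` function, and the toy data are assumptions for illustration. Provexa's actual traversal semantics may differ.

```python
from collections import deque

def back_track(edges, start, exclude=lambda node: False):
    """Collect edges that could have causally influenced `start`.
    edges: iterable of (src, dst, ts), directed along the data flow."""
    incoming = {}
    for src, dst, ts in edges:
        incoming.setdefault(dst, []).append((src, ts))
    bound = {start: float("inf")}  # latest time at which each node still matters
    kept = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src, ts in incoming.get(node, []):
            if ts > bound[node] or exclude(src):
                continue  # too late to be causal, or filtered out by the analyst
            kept.add((src, node, ts))
            if ts > bound.get(src, float("-inf")):
                bound[src] = ts  # revisit src if a later relevant event appears
                queue.append(src)
    return kept

# Toy data (made up): scp creates the archive; vscode activity is benign noise.
edges = [
    ("proc:scp", "file:sensitive_data.tar", 5),
    ("net:host2", "proc:scp", 4),
    ("proc:vscode", "file:sensitive_data.tar", 3),
    ("file:settings.json", "proc:vscode", 2),
    ("file:late.log", "proc:scp", 9),  # happens after the write, so not causal
]
g1 = back_track(edges, "file:sensitive_data.tar",
                exclude=lambda node: "vscode" in node)
```

Excluding a node also prunes everything reachable only through it (here, settings.json), which is how a single filter can remove a large benign subgraph.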

3

We backtrack the creation of sensitive_data.tar on Host 2. The query reveals that a tar process packed /etc/passwd and /etc/shadow into the archive, confirming the data collection phase of the attack.

                    g2 = back track "sensitive_data.tar" from db(host2)
                      exclude nodes where name like "vscode";

                    display g2;

4

We trace backward from the curl process on Host 1 to uncover the attack's entry point. Non-critical activity such as ping is filtered out, here through an include clause with a negated condition. The resulting graph, stored in g3, helps isolate the origin of the malicious process.

                    g3 = back track where exename like "curl" from db(host1)
                    include nodes where not path like "ping";

                    display g3;
                  

5

We search the in-memory graph g3 for events involving the attacker’s IP 20.69.152.188. The query reveals a lighttpd process that reads from this IP, suggesting the attacker exploited a web server vulnerability to compromise Host 1. The result is saved in poi2.

                    poi2 = search from g3 where
                      e1{srcip="20.69.152.188"},
                      e2{type=process}
                      with e1[read] → e2;

                    display poi2;

6

We merge g1 and g3 to build a comprehensive view of the attack on Host 1 (g4), then perform a forward trace from the entry point poi2 within this graph. Filtering out irrelevant nodes like cat, the resulting trace (g5) captures the critical path from entry to exfiltration.

                    g4 = g1 | g3;

                    g5 = forward track poi2 from g4
                    exclude nodes where name like "cat";

                    display g5;
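The merge-then-trace step can be pictured with a small Python sketch in which each in-memory graph is a set of timestamped edges, so merging is a set union and forward tracking is a time-respecting traversal. The function names, edge format, and process chain in the toy data are all made up for illustration. They are not the real contents of g1 or g3, nor Provexa's implementation.

```python
from collections import deque

def merge(graph_a, graph_b):
    """Union of two graphs stored as sets of (src, dst, ts) edges."""
    return graph_a | graph_b

def forward_track(graph, start, exclude=lambda node: False):
    """Collect edges that `start` could have causally influenced."""
    outgoing = {}
    for src, dst, ts in graph:
        outgoing.setdefault(src, []).append((dst, ts))
    bound = {start: float("-inf")}  # earliest time each node was influenced
    kept = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dst, ts in outgoing.get(node, []):
            if ts < bound[node] or exclude(dst):
                continue  # predates the influence, or filtered out
            kept.add((node, dst, ts))
            if ts < bound.get(dst, float("inf")):
                bound[dst] = ts  # revisit dst if influenced earlier than known
                queue.append(dst)
    return kept

# Toy stand-ins for the two in-memory graphs (made-up edges).
g1 = {("proc:sh", "proc:scp", 4),
      ("proc:scp", "file:sensitive_data.tar", 5),
      ("file:sensitive_data.tar", "proc:curl", 8)}
g3 = {("proc:lighttpd", "proc:sh", 2),
      ("proc:sh", "proc:curl", 7),
      ("proc:sh", "proc:cat", 3)}
g4 = merge(g1, g3)
g5 = forward_track(g4, "proc:lighttpd", exclude=lambda node: "cat" in node)
```

Because both inputs already live in memory, the union and the trace run without another round trip to the backend database.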

The dark paths shown in Fig. 1 visualize the output of g5, capturing the key attack steps with minimal noise. This showcases how ProvQL enables efficient, step-by-step investigation of complex, multi-stage attacks.

Key Takeaways

🔍 Massive Reduction in Noise: Provexa filters out irrelevant events and isolates the attack-relevant paths. In the data leakage example above, only 20 of 100,524 nodes and 20 of 154,353 edges are attack-relevant.
🧠 Human-Centric Querying: ProvQL empowers analysts to express high-level patterns and causal hypotheses with precision, while retaining control over filters, constraints, and graph traversal semantics.
🚀 Complementary Optimization Layer: Provexa acts as an intelligent pre-processing layer that reduces the query burden on the backend database. The system performs even better when the underlying engine is optimized, leading to lower execution time and reduced running cost.
🧪 User Study Results:
- Rated highest in usability, learnability, and task success.
- Users completed investigations with fewer iterations and more confidence than with SQL or Cypher.
- Provexa was preferred by nearly all participants for real-world use.
📊 Faster, Iterative Investigations: Provexa achieves faster query execution through in-memory result management, allowing reuse of intermediate results across queries, a feature not supported in general-purpose query languages. This leads to subsecond response times and supports fluid, iterative analyst workflows.
📈 Beyond Database Optimization: Our experiments show that Provexa improves execution time and cost by orders of magnitude beyond what can be achieved through backend database optimizations alone. It serves as a powerful optimization layer that amplifies performance even when the underlying database is fully tuned.

Want to Dive Deeper?

For more details on attack scenarios, example queries, and experimental results, see the full appendix page.

BibTeX

@misc{yang2024enablingefficientattackinvestigation,
    title={Enabling Efficient Attack Investigation via Human-in-the-Loop Security Analysis}, 
    author={Saimon Amanuel Tsegai and Xinyu Yang and Haoyuan Liu and Peng Gao},
    year={2024},
    eprint={2211.05403},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2211.05403},
}