Appendix

This appendix provides detailed experimental results, setup information, and supplementary materials related to Provexa.



Descriptions of Attack Cases Used in Evaluations

Attacks based on commonly used exploits

  • Wget Executable (Case malicious_wget): The attacker uses wget to download an executable script and executes it.
  • Illegal Storage (Case malicious_illegal_store): A server administrator uses wget to download suspicious files into other users' home directories.
  • Hide File (Case malicious_hide_file): The attacker downloads a malicious file and hides it by changing its file name and location.
  • Backdoor Download (Case malicious_backdoor_dl): A malicious insider uses wget to download the backdoor malware from a compromised server and hides it among normal files.
  • Annoying Server User (Case malicious_server_usr): An annoying user logs into other users' accounts on a vulnerable server and downloads a script that writes garbage data into their home directories.
  • Password-Free Theft (Case malicious_ssh_theft): The attacker logs into another user's server via SSH and appends her public key to the ~/.ssh/authorized_keys file to leave a persistent backdoor.
  • Wget-GCC-Crash (Case malicious_gcc_crash): The attacker logs into another user's server via SSH, uses wget to download a C source code file, and uses GCC to compile it. The attacker then executes the compiled malicious executable to exhaust memory and crash the system.
  • Scan and Login (Case malicious_scan_login): The attacker uses Nmap to discover accessible servers whose port 22 is open for SSH connections. The attacker then uses a script to automatically test whether password-less SSH-key logins to the discovered servers exist.
  • Password Reuse (Case malicious_pwd_reuse): An administrator cracks password hashes in the /etc/shadow file with the John the Ripper password cracker, and uses a cracked password to log in as another user.
  • Cheating Student (Case malicious_student): A cheating student hacks into her school's Apache server and downloads midterm scores.


Multi-host multi-step intrusive attacks
The attacker uses an external host (a command-and-control (C2) server) to perform the initial penetration, distribute malware, and exfiltrate data. The first host compromised by the attacker is named Host 1; it serves as the starting point for lateral movement and other malicious actions that compromise more hosts.

  • Shellshock Penetration (Case multistep_penetration): The attacker exploits the Shellshock vulnerability to execute arbitrary code on Host 1.
  • Password Crack (Case multistep_password_crack): After the initial penetration of Host 1 in the previous attack, the attacker connects to Dropbox and downloads an image in which the C2 server's IP address is encoded. Based on this IP, the attacker downloads malware from the C2 server to Host 1. When executed, the malware scans the SSH configuration file to locate other reachable hosts in the network. After this discovery phase, the malware downloads another script from the C2 server and sends it to the discovered hosts to steal passwords from them.
  • Data Leakage (Case multistep_data_leakage): After compromising the discovered hosts, the attacker steals sensitive data files by scanning the file system, packing the files into a compressed archive, and transferring it back to the C2 server.
  • Command-Line Injection (Case multistep_cmd_injection): The victim host runs the Kodi media player, which exposes a remote-control API as a web service. One of its input-sanitization routines fails to filter invalid input from the outside, which allows the attacker to inject an arbitrary command blended into one of its requests.
  • Supply Chain Attack (Case multistep_supply_chain): The attacker discovers that the GNU Wget version running on a victim server is vulnerable to arbitrary file writes via deceptive redirection of requests from HTTP to a compromised FTP server (CVE-2016-4971). The attacker then embeds a common remote access trojan (RAT) into an Ubuntu package and compromises one of the Ubuntu repositories. In this way, when the victim server uses wget to download Ubuntu packages, the requests are redirected to the attacker's FTP server, and the downloaded package contains the RAT. When the downloaded packages are installed on the victim server, the RAT is triggered, establishing a C2 channel.
  • Phishing Email (Case multistep_phishing_email): A malicious Trojan is downloaded as an email attachment, and the enclosed macro is triggered by Microsoft Word. This allows the attacker to run arbitrary code in the context of the current user (CVE-2017-11882). After the initial penetration via the phishing email, the attacker steals sensitive data files from the compromised host.
  • Netcat Backdoor (Case multistep_netcat_backdoor): A malicious Trojan is downloaded by a victim user from a compromised HTTP server via a deceptive URL (CVE-2018-14574). The attacker penetrates the victim host through the Trojan backdoor and uses the netcat utility to maintain a persistent backdoor.
  • WannaCry (Case multistep_wannacry): The attacker exploits the EternalBlue vulnerability (CVE-2017-0146) over the network to access machines, and then encrypts their data.


Real-world malware cases
We obtained a dataset of free Windows malware samples from VirusSign. We focused on the 5 largest categories (i.e., Trojan.Autorun, Trojan.Danger, Virus.Hijack, Virus.Infector, Virus.Sysbot) and randomly selected one malware sample from each category.

  • Case malware_autorun: Trojan.Autorun
  • Case malware_danger: Trojan.Danger
  • Case malware_hijack: Virus.Hijack
  • Case malware_infector: Virus.Infector
  • Case malware_sysbot: Virus.Sysbot


DARPA TC attack cases
We selected 3 attack cases from the DARPA Transparent Computing (TC) Engagement #5 data release. Specifically, the dataset consists of system audit logs captured from six performer systems (i.e., ClearScope, FiveDirections, THEIA, TRACE, CADETS, and MARPLE) under attack by a red team using different attack strategies. The audit logs include both benign system activities and attack activities. The dataset also includes a ground-truth report with attack descriptions and setups for the cases. We first retrieved the logs for the five performers with desktop OSes (excluding ClearScope, which runs Android). After examining the logs, we found that the logs for CADETS lack key attributes (e.g., file name), which prevented us from confirming the attack ground truth for evaluation. In MARPLE, the attack failed. In TRACE, we found that forward tracking a single step reveals the attack sequence. Thus, we do not consider the CADETS, MARPLE, and TRACE cases. Nevertheless, similar attacks were already performed on the other performer systems, whose logs are covered. For the remaining two performer systems, the attack cases for the same system are largely similar, and thus we selected at least one attack case from each system to include in our evaluation benchmark. To use the logs for evaluation, we developed a tool to parse the released logs and loaded the parsed system entities and system events into Provexa's databases.

  • Case tc_fivedirections_1: 05092019 1326 - FiveDirections 2 - Firefox Drakon APT Elevate Copykatz Sysinfo
  • Case tc_fivedirections_2: 05172019 1226 - FiveDirections 3 - Firefox DNS Drakon APT FileFilter-Elevate
  • Case tc_theia: 05152019 1448 - THEIA 1 - Firefox Drakon APT BinFmt-Elevate Inject


ATLASv2 attack cases
We selected 4 attack cases from the ATLASv2 Attack Engagement dataset. This dataset builds on the first version of the ATLAS dataset and adds audit logs recorded from various sources, increasing the recorded benign activity so that the malicious system events are accompanied by enough noise to simulate a more realistic dataset. The dataset also includes a ground-truth report with attack descriptions and setups for the cases. In this dataset, two host machines were used daily by two researchers for 4 days to record benign activities. On the fifth day, the malicious attacks were carried out while the victim hosts continued to serve the researchers' daily use. This captures realistic system logs that closely resemble real-world attack logs. We selected 4 attack cases that involve different vulnerabilities in commonly used software: Adobe Flash and Microsoft Word.

  • Case atlasv2_s1: CVE-2015-5122 - Adobe Flash Exploit
  • Case atlasv2_s2: CVE-2015-3105 - Adobe Flash Exploit
  • Case atlasv2_s3: CVE-2017-11882 - Microsoft Word Exploit
  • Case atlasv2_s4: CVE-2017-0199 - Microsoft Word Exploit

Example Queries

We provide query examples written in different languages: ProvQL, SQL, Cypher, and Splunk SPL (search query only). We select malicious_ssh_theft as an example. Two queries (a backward tracking query and a search query) are shown.

Query 1: Backward Tracking From the POI Events
In the first query, the security analyst performs backward tracking from the POI event whose cmdline attribute contains authorized_keys.

  1. ProvQL: The security analyst uses command-line and type constraints to specify the POI event. The result is bound to the variable bg_query1.
    bg_query1 = back track where (cmdline like 'authorized_keys', type=process) from db(malicious_ssh_theft);


  2. SQL: In SQL, the security analyst uses Common Table Expressions (CTEs) to create a recursive query. WITH RECURSIVE defines recursive CTEs that unify all nodes and edges from different tables. Then, the security analyst uses SELECT to perform the search.
    WITH RECURSIVE allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
    SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
    SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
    SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
    nodes AS (SELECT * FROM allnodes),
    alledges AS (
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
    edges AS (SELECT e.* FROM alledges e INNER JOIN nodes n1 ON e.srcid = n1.id
    INNER JOIN nodes n2 ON e.dstid = n2.id),
    graph (id, srcid, dstid, starttime, endtime, hostname, optype, amount, threshold, step) AS (
    SELECT *, edges.endtime, 1 FROM edges
    WHERE dstid IN (SELECT id FROM nodes WHERE (cmdline like '%authorized_keys%') AND (type = 'process'))
    UNION SELECT edges.*, LEAST(edges.endtime, graph.threshold), graph.step
    FROM edges JOIN graph ON edges.dstid = graph.srcid
    WHERE edges.starttime <= graph.threshold)
    SELECT DISTINCT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM graph;


  3. Cypher: In Cypher, the security analyst first matches all the paths. Then, the security analyst uses nested FOREACH loops to traverse all relationships and join the results.
    MATCH ()-[r]->(root)
    WHERE (root.cmdline =~ '.*authorized_keys.*') AND (root.type = 'process')
    SET r.threshold = r.endtime, r.marked=true
    WITH root
    MATCH p = ()-[*..3]->(root)
    WITH DISTINCT relationships(p) as r
    FOREACH (i IN reverse(range(0, size(r)-2))
    | FOREACH (n1 IN [r[i]]
    | FOREACH (n2 IN [r[i+1]]
    | FOREACH (edge IN
        CASE
        WHEN n1.starttime <= n2.threshold THEN [n1]
        ELSE []
        END | SET edge.marked=true
              SET n1.threshold=CASE 
                WHEN n1.endtime > n2.threshold THEN n2.threshold
                ELSE n1.endtime END))))
    WITH DISTINCT r
    MATCH (sn)-[rr]->(en)
    WHERE (rr IN r AND rr.marked=true)
    RETURN DISTINCT sn, rr, en



Query 2: Search for the Entry Nodes on the Backward Dependency Graph
In the second query, the security analyst searches for the attack entry nodes on the backward dependency graph.

  1. ProvQL: Using ProvQL, the security analyst searches over bg_query1, which is bound to a local in-memory graph, and specifies event relationships in a with clause. The result is bound to the variable entry.
    search from bg_query1 where e1{exename like "ssh",type=process},
    e2{type=process}
    with e1->e2
    return * as entry;


  2. SQL: Since SQL does not support in-memory management, the security analyst needs to union all tables again. Then, the security analyst uses the SELECT clause to perform the search.
    WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
    SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
    SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
    SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
    nodes AS (SELECT * FROM allnodes), alledges AS (
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
    SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent),
    edges AS (SELECT e.* FROM alledges e),
    event1 (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
        SELECT edges.* FROM
        nodes n1 INNER JOIN edges ON n1.id = edges.srcid
        INNER JOIN nodes n2 ON edges.dstid = n2.id
        WHERE ((n1.exename like '%ssh%') AND (n1.type = 'process')) AND (n2.type = 'process') AND (edges.optype IS NOT NULL)),
    result AS (SELECT event1.* FROM event1 WHERE true)
    SELECT * FROM result;


  3. Cypher: In Cypher, the security analyst uses MATCH to specify the pattern and uses WHERE to specify the constraints.
    MATCH (e1)-[event0]->(e2)
    WHERE (e1.exename =~ '.*ssh.*') AND
    (e1.type = 'process') AND
    (e2.type = 'process')
    RETURN e1, e2, event0


  4. Splunk SPL: In Splunk SPL, the security analyst searches different tables and uses join type=inner to join the results.
    | search index=process exename="ssh" type="process"
    | fields id
    | join type=inner id
        [search index=processevent
        | fields srcid, dstid, *]
    | join type=inner dstid
        [search index=process type="process"
        | fields id, *
        | rename id as dstid]
    | table *

Survey Questions and Results

Detailed survey questions and results are presented below.

Database Optimization Configurations

We evaluate different optimization settings to obtain performance improvements by tuning various PostgreSQL configuration parameters. For these experiments, we randomly selected 3 sample attack cases to evaluate the optimization settings, and we average our results over 10 runs to account for variability.

We experimented with 4 categories of tuning: default, light tuning, moderate tuning, and full tuning. These experiments are summarized as follows:

Stage                 Configuration Focus                 Purpose
A. Default (Vanilla)  No changes                          Establish baseline
B. Light tuning       Planner + memory settings only      Emulates a reasonable engineer tuning PG casually
C. Moderate tuning    Light + cache + parallelism         Represents best-effort user tuning
D. Full tuning        All settings incl. WAL/JIT/autovac  Maximize DB performance, but with more system impact

A) Default

This is our baseline. The default PostgreSQL configuration is unoptimized and tends to be conservative. Including this category in our experiment establishes the baseline and serves as a meaningful starting point against which to compare the optimizations' improvements.

Results:

Experiment Category   Mode    Avg. Cost                Avg. Execution Time
Vanilla - No Provexa  track   197,851,757,322,145.12   550
Vanilla - No Provexa  search  9,095.40                 4,522.42
Vanilla - Provexa     track   484.26                   6.76
Vanilla - Provexa     search  8,312.13                 4,494.48

B) Light Tuning

We introduce light tuning targeting memory, cache, and planner cost estimates.

OPTIMIZED_SETTINGS_LIGHT = (
    "SET random_page_cost = 1.1; "
    "SET cpu_tuple_cost = 0.005; "
    "SET cpu_index_tuple_cost = 0.0025; "
    "SET effective_cache_size = '12GB'; "
    "SET work_mem = '128MB'; "
    "SET default_statistics_target = 200; "
)
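
The snippet below is a minimal sketch of how such session-level settings can be applied, assuming psycopg2 and hypothetical connection parameters; since SET statements affect only the current session, they must be reissued on each new connection.

import psycopg2

# Hypothetical connection parameters; assumes OPTIMIZED_SETTINGS_LIGHT
# (the semicolon-separated SET statements above) is in scope.
conn = psycopg2.connect(dbname="provexa", user="analyst")
with conn.cursor() as cur:
    # psycopg2 accepts multiple semicolon-separated statements in one execute;
    # these SET commands apply only to the current session.
    cur.execute(OPTIMIZED_SETTINGS_LIGHT)
conn.commit()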

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   498.34
Optimized - No Provexa  search  3,882.48                3,841.44
Optimized - Provexa     track   346.00                  6.10
Optimized - Provexa     search  3,391.91                4,362.55

C) Moderate Tuning

In addition to the light settings, we add parallelism and join/from collapse control:

OPTIMIZED_SETTINGS_MEDIUM = (
    OPTIMIZED_SETTINGS_LIGHT +
    "SET join_collapse_limit = 12; "
    "SET from_collapse_limit = 12; "
    "SET parallel_setup_cost = 100; "
    "SET parallel_tuple_cost = 0.01; "
)

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   486.95
Optimized - No Provexa  search  3,882.48                3,949.99
Optimized - Provexa     track   346.00                  5.73
Optimized - Provexa     search  3,391.91                4,397.55

D) Full Tuning

For this experiment, in addition to the moderate settings, we update the following settings in the postgresql.conf config file and restart the database engine for the changes to take effect.

shared_buffers = 4GB
temp_buffers = 64MB
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
maintenance_work_mem = 1GB
checkpoint_timeout = 15min
max_wal_size = 4GB
wal_buffers = 16MB
jit = on
jit_above_cost = 50000
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,058,575.20   497.65
Optimized - No Provexa  search  3,882.48                4,046.09
Optimized - Provexa     track   346.00                  6.65
Optimized - Provexa     search  3,391.91                4,774.89

Final Chosen Config

We observe that the differences across optimization levels are minimal. We therefore manually tuned a combination of the light and moderate settings to identify the best-performing configuration.

OPTIMIZED_SETTINGS_FINAL = (
    "SET random_page_cost = 1.1; "
    "SET cpu_tuple_cost = 0.005; "
    "SET cpu_index_tuple_cost = 0.0025; "
    "SET effective_cache_size = '8GB'; "
    "SET work_mem = '64MB'; "
    "SET join_collapse_limit = 8; "
    "SET from_collapse_limit = 8; "
)

Results:

Experiment Category     Mode    Avg. Cost               Avg. Execution Time
Optimized - No Provexa  track   72,316,505,213,136.03   246.45
Optimized - No Provexa  search  3,882.48                2,556.05
Optimized - Provexa     track   346.00                  4.79
Optimized - Provexa     search  3,391.91                3,046.11

Key Observations

We explored multiple levels of PostgreSQL tuning, ranging from the default configuration to a fully optimized setup that adjusts memory, planner cost estimates, parallelism, and JIT compilation. The default PostgreSQL settings, while functional, are known to be intentionally conservative, particularly to ensure stability in low-resource environments. As expected, our baseline results under this configuration exhibited very high planner costs and longer execution times, especially for tracking queries.

As we introduced progressively more aggressive tuning (light, moderate, and full), we observed some reduction in execution time, particularly for large search queries. However, even with full tuning, the improvements over the default were modest and inconsistent. No configuration level substantially reduced planner cost or execution time across the board.

In contrast, Provexa's decomposition and scheduling strategy consistently delivered significant performance improvements, both in execution time and query cost, regardless of the database tuning level. This highlights that our approach addresses structural inefficiencies in recursive and multi-step forensic queries that traditional database tuning cannot eliminate.

These results confirm that Provexa offers benefits orthogonal to backend tuning. It complements traditional optimizations but outperforms them even under conservative defaults, reinforcing the robustness and general applicability of our design.

Multi-Query Splitting Analysis

This section provides a detailed analysis of Provexa's multi-query splitting strategy, which decomposes complex recursive graph traversal queries into smaller, more efficient subqueries. We demonstrate how this approach significantly outperforms traditional recursive SQL queries by reducing intermediate result sets, improving execution times, and enabling better database optimization. Through concrete examples using real attack cases, we show the step-by-step query decomposition process and compare performance metrics between recursive and split execution strategies.

Optimized Query Execution

1. How is the query optimized?

  • Unoptimized (Recursive): Uses a single large recursive SQL query (WITH RECURSIVE ...) to traverse the graph in the database.
  • Optimized (Non-Recursive): Instead of recursion, the traversal is performed in the application layer:
    • The code issues multiple, smaller SQL queries to fetch nodes and edges step-by-step.
    • Each step fetches only the relevant nodes/edges for that iteration, reducing the load on the database and avoiding deep recursion (see the sketch after this list).
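
The following is a minimal Python sketch of this application-layer traversal, not Provexa's actual implementation; fetch_parent_ids is a hypothetical callable that issues one small, non-recursive SQL query per step (a per-step query template is shown in the next subsection).

def backtrack(fetch_parent_ids, poi_ids, max_depth=100):
    # Minimal sketch of application-layer backtracking. fetch_parent_ids is a
    # hypothetical callable that takes a set of node IDs and returns the IDs
    # of their parents via one small, non-recursive SQL query.
    visited = set(poi_ids)
    frontier = set(poi_ids)
    for _ in range(max_depth):
        if not frontier:
            break                                  # no new nodes: traversal done
        parents = set(fetch_parent_ids(frontier))  # one scoped query per step
        frontier = parents - visited               # keep only newly discovered nodes
        visited |= frontier
    return visited                                 # full ancestry of the POI node(s)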

2. How is the equivalent query formed?

For each step, the code generates a query like:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process
)
SELECT id FROM allnodes WHERE id IN (...);

This is repeated for each set of nodes discovered at each step, with the id IN (...) clause updated as the traversal progresses.
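
As an illustration (a hedged sketch, not Provexa's actual code), the per-step query above can be templated in Python with psycopg2-style placeholders, filling the id IN (...) clause from the current frontier:

def build_step_query(node_ids):
    # Hypothetical helper: templates the per-step query shown above, filling
    # the id IN (...) clause with the current frontier of node IDs.
    # Assumes node_ids is non-empty.
    placeholders = ", ".join(["%s"] * len(node_ids))
    sql = (
        "WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, "
        "pid, exename, exepath, cmdline) AS ("
        "SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, "
        "NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION "
        "SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, "
        "srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION "
        "SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, "
        "NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process) "
        "SELECT id FROM allnodes WHERE id IN (" + placeholders + ")"
    )
    return sql, list(node_ids)

The returned SQL string and parameter list can then be passed directly to cursor.execute at each step of the traversal.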


Using a sample attack case (case2_supply_chain)

ProvQL Query

bg = back track where
(cmdline like 'freemem.sh', type=process)
from db(case2_supply_chain);

Goal: Backtrack all processes with cmdline like 'freemem.sh' in the provenance graph.

This query is split into 4 subqueries as follows:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE (cmdline  like  '%freemem.sh%') AND (type = 'process') 

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (7,11,12)

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (779,780,781,531,532,533,535,536,537,538,539,540,541,798,543,799,800,544,801,545,546,548,549,550,806,807,808,809,810,555,556,557,558,559,560,561,562,818,819,563,564,820,565,566,575,577,579,581,587,588,589,591,592,593,594,599,600,601,607,608,609,610,623,624,625,626,627,629,635,636,637,638,639,640,641,642,643,645,651,652,653,655,656,657,658,659,660,661,662,663,664,665,666,671,672,673,674,675,676,677,683,684,685,687,688,689,691,692,693,694,695,696,697,699,700,701,702,703,704,705,706,719,726,727,729,735,736,737,738,739,740,741,743,744,745,747,749,751,752,753) 

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, dstip::text, dstport, srcip::text, srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process)
SELECT id FROM allnodes WHERE id IN (38,30)

Step 1: Find Initial Nodes (POI)

Miniquery 1 Output: [10]

  • The system finds all nodes matching the constraints (cmdline like 'freemem.sh' and type=process).
  • Only the node with ID 10 matches.

Step 2: Find Immediate Parents

Miniquery 2 Output: [11, 12, 7]

  • The system looks for all nodes that have outgoing edges to node 10.
  • These are the direct parents (previous steps in the provenance): nodes 11, 12, and 7.

Step 3: Expand to Next Layer of Parents

Miniquery 3 Output: A large set of IDs:

[806, 645, 657, 799, 691, ..., 541]  // (over 100 node IDs)

  • For each of the nodes found in Step 2 (11, 12, 7), the system finds their parents.
  • This step expands the search to all nodes that are one more step back in the provenance chain.
  • The result is a much larger set of nodes, representing all possible "grandparents" in the provenance graph.

Step 4: Continue Backtracking

Miniquery 4 Output: [30, 38]

  • The system continues backtracking one more step for all nodes found in Step 3.
  • Now, only nodes 30 and 38 are found as parents.

By splitting up the queries and repeatedly looking for parents, Provexa reconstructs the full path of events that led to the 'freemem.sh' process. This way, we avoid an expensive recursive query and instead iteratively execute simple search queries.

How Optimized Backtrack Works:

  • Each step issues a simple, non-recursive SQL query for the current set of nodes.
  • The application collects results and uses them as input for the next step.
  • This continues until no new nodes are found (or a limit is reached).
  • At the end, you have the full ancestry (provenance) of your original POI node(s).

Advantages:

  • Traversal is managed by the application, not the database.
  • Each query is small and scoped.
  • No expensive recursive materialization.
  • Scales well with graph size and depth.

Search Queries

  • Simple search queries are usually not split.
  • When multiple constraints or pattern relationships are involved, Provexa may split the query for better performance or clarity.
  • This is done selectively and only when beneficial.

Taking a sample case malware_infector, the search query:

search from bg where
e1{path like 'Virus.Infector', type=file},
e2{pid=2688, exename like 'exe', type=process},
e3{type=file}
with e1->e2 &&[<100s]
e2->e3
return * as entry;

was split into the following queries using Provexa:

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE (allnodes.path  like  '%Virus.Infector%') AND (allnodes.type = 'file')), 
node2 AS (SELECT * FROM allnodes WHERE ((allnodes.pid = 2688) AND (allnodes.exename  like  '%exe%')) AND (allnodes.type = 'process')), 
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
    SELECT edges.* FROM
    node1 n1 INNER JOIN edges ON n1.id = edges.srcid
    INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event

WITH allnodes (type, id, name, path, dstip, dstport, srcip, srcport, pid, exename, exepath, cmdline) AS (
SELECT 'file', id, name, path, NULL::text, NULL::int, NULL::text, NULL::int, NULL::int, NULL::text, NULL::text, NULL::text FROM file UNION
SELECT 'network', id, NULL::text, NULL::text, CAST (dstip AS text), dstport, CAST (srcip AS text), srcport, NULL::int, NULL::text, NULL::text, NULL::text FROM network UNION
SELECT 'process', id, NULL::text, NULL::text, NULL::text, NULL::int, NULL::text, NULL::int, pid, exename, exepath, cmdline FROM process),
alledges AS (
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM fileevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, amount FROM networkevent UNION
SELECT id, srcid, dstid, starttime, endtime, hostname, optype, 0 AS amount FROM processevent ),
node1 AS (SELECT * FROM allnodes WHERE allnodes.id IN (70)), 
node2 AS (SELECT * FROM allnodes WHERE allnodes.type = 'file'), 
edges AS (SELECT alledges.* FROM alledges WHERE 'true' = 'true'),
event (id, srcid, dstid, starttime, endtime, hostname, optype, amount) AS (
    SELECT edges.* FROM
    node1 n1 INNER JOIN edges ON n1.id = edges.srcid
    INNER JOIN node2 n2 ON edges.dstid = n2.id
)
SELECT * FROM event

  • Single Query:
    • Execution Time: 241.5 ms
    • PostgreSQL cost estimate: 4954.
    • Involves deeply nested joins and multiple filters on unindexed columns.
    • Even though it returns 39 final rows, it materializes large intermediate result sets.
  • Split Queries (Provexa):
    • Execution Time: 114.9 ms + 6.9 ms = ~121.8 ms
    • PostgreSQL cost estimate: ~1.5x lower (2878 + 451 = 3329)
    • More efficient due to early pruning:
      • First query narrows down process nodes quickly with (exename like '%exe%' AND pid=2688)
      • Second query benefits from knowing srcid = 70, leading to Bitmap Index Scan on fileevent_srcid
    • Plans use index scans, materialization, and hashing over smaller data windows.

Conclusion: Provexa's decomposition allowed the database to apply smarter, localized plans with index-friendly constraints, avoiding expensive join cascades.


Why is the multi-query approach superior?

We examine recursive vs. split-query execution across backtracking and search queries on the same sample attack cases (case2_supply_chain and malware_infector). Detailed results of the cost analysis are summarized below.

Recursive (Single SQL)

  • Planner Cost Estimate: ~74 million
  • Execution Time: ~31 ms
  • Intermediate Rows: 144 million
  • Query Plan: Deep recursive union, multi-way joins, large working set materialized

Provexa (Split Execution)

  • Cost per Query: ~22–73
  • Execution Time per Query: ~0.1–0.6 ms
  • Working Set: Targeted IDs (e.g., 3–140 nodes)
  • Query Plan: Simple id IN (...) with index scans or small sequential filters

Key Observation

The recursive SQL query invokes complex sort-merge joins and processes large intermediate results even for small final outputs. In contrast, the Provexa approach breaks the task into well-scoped queries that exploit indexes, minimize intermediate state, and reduce execution time by an order of magnitude.


When Does Multi-Query Execution Outperform Recursive SQL?

The multi-query (split) strategy offers performance advantages under the following structural and query characteristics:

  • Deep Graph Traversals: Recursive SQL queries generate increasingly large intermediate join results as the depth of traversal increases. Provexa mitigates this by limiting each query to a shallow slice of the graph.
  • Selective Filtering: When node or edge filters are highly selective (e.g., cmdline like '%freemem.sh%'), split queries target only relevant subgraphs at each step. This avoids the exhaustive materialization typical of recursive CTEs.
  • Complex Search Patterns: Queries with multiple constraints, pattern relationships, or temporal conditions are more efficiently handled when decomposed into smaller subqueries. This avoids large Cartesian joins and enables independent optimization of each constraint.

Summary

Metric               Recursive SQL         Provexa Split Execution
Cost Estimate        73 million            22–73 per query
Execution Time       ~31 ms                ~0.1–0.6 ms per query
Intermediate Rows    144M                  1–140 per query
Join Strategy        Recursive, multi-way  Simple, indexed
Memory Footprint     ~1MB+                 <100KB per step
Application Control  None (SQL-only)       Fine-grained, flexible

Conclusion

This cost-based analysis confirms that Provexa's multi-query execution model is effective in practice and plays to the strengths of the database:

  • Reduces planner-estimated costs
  • Scales with graph depth and node count
  • Avoids recursive complexity and memory bloat
  • Maintains fine-grained control over graph exploration