Plans for the future
This is a kind of high-level TODO list of what needs to be done before the aggregator can be considered fully usable (in its first version). Some of these tasks (and some other smaller things) are described as issues on GitLab.
1 Switch to Suricata
We don't want to finish writing the Guts daemon; instead, we want to switch to Suricata to do the analysis. The switch will comprise these (bigger) tasks:
1.1 The Suricata data source
The Suricata IDS can be configured to output events describing what happens on the network into the so-called event log. Furthermore, it can be made to log the events through a unix-domain socket.
The data source will therefore listen on a unix-domain socket to which Suricata connects. It'll convert each JSON event into an update with appropriate tags and provide it in push mode.
Unknown event types should probably be ignored (or logged, but not treated as hard errors). We don't provide statistics from Suricata.
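A minimal sketch of such a data source follows, assuming Suricata's event log is pointed at a unix-stream socket that we create and listen on; the socket path and the plain printing of events are placeholders (the real implementation would parse the JSON and emit tagged updates in push mode).

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>
    #include <string>

    int main() {
        const char *path = "/var/run/suricata-events.sock"; // assumed socket path

        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        sockaddr_un addr = {};
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
        unlink(path); // remove a stale socket from a previous run
        if (bind(srv, reinterpret_cast<sockaddr *>(&addr), sizeof addr) < 0 || listen(srv, 1) < 0) {
            perror("bind/listen");
            return 1;
        }

        // Suricata (with its event log pointed at this socket) connects as a client.
        int conn = accept(srv, nullptr, nullptr);
        if (conn < 0) { perror("accept"); return 1; }

        std::string buffer;
        char chunk[4096];
        ssize_t len;
        while ((len = read(conn, chunk, sizeof chunk)) > 0) {
            buffer.append(chunk, len);
            size_t pos;
            while ((pos = buffer.find('\n')) != std::string::npos) {
                std::string event = buffer.substr(0, pos);
                buffer.erase(0, pos + 1);
                // Placeholder: the real data source would parse the JSON, look at
                // the event type and emit an update with the appropriate tags in
                // push mode; unknown types would be skipped (possibly logged).
                printf("event: %s\n", event.c_str());
            }
        }

        close(conn);
        close(srv);
        return 0;
    }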
1.2 The conntrack data source
We will turn on connection tracking and connection-tracking accounting in the kernel. Based on that, we then provide statistics and basic events (such as the creation of flows or their termination). This complements the data from Suricata.
The idea is that this data source is hybrid ‒ it would provide events (such as the end of a flow, including the statistics at that time) in push mode and read the statistics of all active flows in pull mode.
There are two general approaches to this. We can start an external program (conntrack) that outputs to its standard output. We then parse its output. Some experiments with parsing the output can be seen at https://gitlab.labs.nic.cz/mvaner/conntrack. There's parsing of both the XML and the textual output, in separate branches.
The other approach (cleaner, but probably slightly harder to do) is to open the netlink connection directly (or go through the corresponding library) and communicate with the kernel. This way we don't need to transcribe the data into textual form and parse it back (which adds a place for mistakes), and we likely need only a smaller subset of privileges to run (e.g. not full root, only CAP_NET_ADMIN).
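For illustration, here is a sketch of the external-program approach, assuming the conntrack tool is available; the exact flags and the output handling are placeholders, and the actual parsing (done for both output formats in the repository above) is omitted.

    #include <cstdio>

    int main() {
        // -E streams connection tracking events, -o xml asks for the XML output.
        FILE *pipe = popen("conntrack -E -o xml", "r");
        if (!pipe) {
            perror("popen");
            return 1;
        }

        char line[4096];
        while (fgets(line, sizeof line, pipe)) {
            // Placeholder: a real implementation would parse each event and turn
            // flow creations/destructions (with their counters) into push-mode
            // updates; the pull mode would run `conntrack -L` and read the
            // counters of all active flows instead.
            printf("conntrack event: %s", line);
        }

        return pclose(pipe) == 0 ? 0 : 1;
    }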
1.3 Assignment of DNS names
The Guts daemon caches DNS information and uses it to assign names to IP addresses in future flows. This is not currently done in Suricata. We therefore need to do this here, unless it gets added into Suricata.
Suricata does output the content of DNS queries and responses as events. We can cache these (similarly to how it is done in Guts now) and assign the names to IP addresses.
It would be implemented as an internal computation. It would hold the cache, listen for events containing DNS responses and store them. Also, for each flow that doesn't have a set of names assigned (not even an empty one), it would look the addresses up in the cache and add the names.
It is open for discussion whether we want to update the names of already existing flows when we get a new DNS name. Maybe only in case there's no name assigned yet (e.g. the set of names is empty).
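A small sketch of the cache part, assuming the internal computation receives already-parsed name/address pairs from Suricata's DNS response events; the names (DnsCache, store, lookup) are invented for the example and things like entry expiry are left out.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    class DnsCache {
    public:
        // Called for each DNS response event: remember which names resolved to
        // which address (a real cache would also expire old entries).
        void store(const std::string &name, const std::string &address) {
            names_by_address_[address].insert(name);
        }

        // Called for each flow that has no set of names yet: return the cached
        // names (possibly an empty set) for the given address.
        std::set<std::string> lookup(const std::string &address) const {
            auto it = names_by_address_.find(address);
            return it == names_by_address_.end() ? std::set<std::string>{} : it->second;
        }

    private:
        std::map<std::string, std::set<std::string>> names_by_address_;
    };

    int main() {
        DnsCache cache;
        cache.store("www.example.org", "192.0.2.10");       // from a dns event
        for (const auto &name : cache.lookup("192.0.2.10")) // flow towards 192.0.2.10
            std::cout << name << '\n';
    }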
2 Persistent data storage
Currently, the aggregator keeps data only in RAM. We want to do several things to support persistent storage.
2.1 Define an on-disk format
We want to have an on-disk format of a bucket. Some ideas are hinted at in internals.
We want to be able to write data into the bucket and to read them using the interfaces in libquery (Container and related ones).
When writing a bucket, it is first written into a temporary location; after it is fully written and synced, it is moved into its final place. This ensures that the storage doesn't contain any partial buckets.
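As a sketch of that write-out step (the paths and the payload are placeholders, since the real bucket serialization isn't decided here), using plain POSIX calls:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <string>

    bool write_bucket(const std::string &final_path, const std::string &data) {
        const std::string tmp_path = final_path + ".tmp"; // the temporary location

        int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return false;

        // Write the whole serialized bucket (partial-write handling omitted).
        bool ok = write(fd, data.data(), data.size()) == static_cast<ssize_t>(data.size());

        // Make sure the data is on disk before the rename makes it visible
        // (a fully robust version would also fsync the containing directory).
        ok = ok && fsync(fd) == 0;
        close(fd);

        // The rename is atomic, so readers either see the complete bucket or
        // nothing, never a partially written file.
        ok = ok && rename(tmp_path.c_str(), final_path.c_str()) == 0;
        if (!ok)
            unlink(tmp_path.c_str());
        return ok;
    }

    int main() {
        return write_bucket("/tmp/bucket-000001", "serialized bucket data") ? 0 : 1;
    }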
2.2 Define in-memory index
An index of what data can be found where needs to exist. It must be possible to update it as we modify the data, and also to construct it on startup by scanning the available files.
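A sketch of what such an index could look like; it assumes, purely for illustration, that a bucket file is named "<start>-<end>.bucket" with unix timestamps (the real naming and layout are still open).

    #include <filesystem>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    struct Entry {
        long start, end; // time span covered by the bucket
        fs::path file;
    };

    class Index {
    public:
        // Constructed on startup by scanning the available files.
        void scan(const fs::path &dir) {
            if (!fs::is_directory(dir))
                return; // nothing stored yet
            for (const auto &e : fs::directory_iterator(dir))
                add(e.path());
        }

        // Also called whenever a new bucket is written out (a matching remove()
        // would be called when one is dropped).
        void add(const fs::path &file) {
            const std::string name = file.stem().string(); // "<start>-<end>"
            const auto dash = name.find('-');
            if (dash == std::string::npos)
                return; // not a bucket file
            try {
                entries_.push_back({std::stol(name.substr(0, dash)),
                                    std::stol(name.substr(dash + 1)), file});
            } catch (const std::exception &) {
                // Malformed name, skip it.
            }
        }

        // Which buckets could contain data for the given time span?
        std::vector<Entry> candidates(long from, long to) const {
            std::vector<Entry> out;
            for (const auto &e : entries_)
                if (e.start < to && e.end > from)
                    out.push_back(e);
            return out;
        }

    private:
        std::vector<Entry> entries_;
    };

    int main() {
        Index index;
        index.scan("/srv/aggregator/buckets"); // assumed storage directory
        for (const auto &e : index.candidates(1600000000, 1600003600))
            std::cout << e.file << '\n';
    }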
2.3 Write out the journal
From time to time we want to append to the journal (probably with multiple slices in each batch). Each write-out would be a single bucket file, with a sequence number as the name.
2.4 Update queries to read bucket files
Currently, the query looks into the in-memory representation. We want to generalize it to first consult the index, decide what sources (bucket files and in-memory data) it wants to examine, and run on those. It'll be abstracted by the interfaces in libquery (these may need some tweaks).
Note that there may be multiple candidate buckets for each time span, and we need to choose just one.
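A sketch of that selection; the preference rule (simply the first candidate seen) is only a stand-in for whatever criterion we settle on, and the Candidate type is made up for the example.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Candidate {
        long start, end;    // time span covered
        std::string source; // a bucket file path, or "memory" for in-memory data
    };

    // Keep a single source per (start, end) span; with std::map::insert the first
    // candidate wins, which stands in for whatever preference rule we eventually
    // pick (e.g. preferring more aggregated buckets).
    std::vector<Candidate> choose_sources(const std::vector<Candidate> &candidates) {
        std::map<std::pair<long, long>, Candidate> chosen;
        for (const auto &c : candidates)
            chosen.insert({{c.start, c.end}, c});

        std::vector<Candidate> result;
        for (const auto &kv : chosen)
            result.push_back(kv.second);
        return result;
    }

    int main() {
        const auto sources = choose_sources({
            {0, 3600, "journal/000001.bucket"},
            {0, 3600, "hourly/000001.bucket"}, // overlapping candidate, ignored here
            {3600, 7200, "memory"},
        });
        return sources.size() == 2 ? 0 : 1;
    }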
2.5 Make the queries multi-threaded
Reading data from disk and processing potentially large amounts of it can take a long time. Therefore, we want to do the work on a thread pool and provide just a future for the query result. We may even split the work into multiple tasks (e.g. reading and processing each separate bucket may be a separate task, and these can run in parallel if there are enough CPUs).
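A sketch of that shape, with std::async standing in for the real thread pool; PartialResult and run_on_bucket are invented names and the per-bucket work is a placeholder.

    #include <future>
    #include <string>
    #include <utility>
    #include <vector>

    struct PartialResult {
        long matched = 0;
    };

    // Hypothetical per-bucket work: read one bucket file (or the in-memory data)
    // and evaluate the query on it.
    PartialResult run_on_bucket(const std::string &bucket) {
        PartialResult r;
        r.matched = static_cast<long>(bucket.size()); // placeholder work
        return r;
    }

    // The caller gets only a future; the actual work happens on other threads.
    std::future<PartialResult> run_query(std::vector<std::string> buckets) {
        return std::async(std::launch::async, [buckets = std::move(buckets)] {
            // Each bucket becomes its own task, so independent buckets can be
            // processed in parallel when enough CPUs are available.
            std::vector<std::future<PartialResult>> parts;
            for (const auto &b : buckets)
                parts.push_back(std::async(std::launch::async, run_on_bucket, b));

            // Merge the partial results into the final answer.
            PartialResult total;
            for (auto &p : parts)
                total.matched += p.get().matched;
            return total;
        });
    }

    int main() {
        auto result = run_query({"bucket-0001", "bucket-0002", "memory"});
        return result.get().matched > 0 ? 0 : 1;
    }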
2.6 Aggregation
Every time we write out the journal, we check its length against some limits. If the limits are reached, the journal is processed to form some other, aggregated buckets.
To do that, we generalize (and somewhat abuse) the query processing, so that it can produce multiple buckets on its output, can join neighbouring slices and can be restricted to a set of input buckets. When the output buckets are successfully written out and moved into place, the relevant part of the journal is removed. Note that because this can happen in the background, new journal files can appear after we have started, so we don't want to drop those. For the same reason, we need to run at most one aggregation at a time.
The next tier of aggregated data is then checked against its limits, and if they are reached, the same process is repeated there.
If the limit is reached at the last tier, the last bucket (or buckets) is simply dropped instead of being processed further.
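A sketch of the tier-by-tier check; the Tier type, the limits and aggregate_into() are invented for the example (the real aggregation reuses the generalized query processing and removes the source buckets only after the output is safely in place).

    #include <cstddef>
    #include <deque>
    #include <string>
    #include <vector>

    struct Tier {
        std::deque<std::string> buckets; // oldest first
        std::size_t limit;               // how many buckets the tier may hold
    };

    // Hypothetical aggregation step: consume the source tier's buckets and produce
    // one coarser bucket in the destination tier.
    void aggregate_into(Tier &from, Tier &to) {
        to.buckets.push_back("aggregate(" + std::to_string(from.buckets.size()) + ")");
        from.buckets.clear();
    }

    void check_tiers(std::vector<Tier> &tiers) {
        for (std::size_t i = 0; i < tiers.size(); ++i) {
            if (tiers[i].buckets.size() <= tiers[i].limit)
                continue;
            if (i + 1 < tiers.size())
                aggregate_into(tiers[i], tiers[i + 1]); // may trip the next tier's limit too
            else
                tiers[i].buckets.pop_front(); // last tier: drop instead of aggregating
        }
    }

    int main() {
        // The journal plus two aggregation tiers; the limits are made up.
        std::vector<Tier> tiers = {{{}, 4}, {{}, 8}, {{}, 16}};
        for (int i = 0; i < 6; ++i)
            tiers[0].buckets.push_back("journal-" + std::to_string(i));
        check_tiers(tiers); // the journal is over its limit, so it gets aggregated
        return 0;
    }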
2.7 Startup cleanup
After constructing the index at startup, we clean up the files in the temporary area. We also detect the situation where an aggregation was started but didn't finish before shutdown, and remove the extra files (maybe on the aggregated side, and start again, which is safer).
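The temporary-area part could look roughly like this (the directory path is an assumption; detecting an unfinished aggregation needs the index and is not shown).

    #include <filesystem>

    int main() {
        namespace fs = std::filesystem;
        const fs::path tmp_dir = "/srv/aggregator/tmp"; // assumed temporary area

        if (!fs::is_directory(tmp_dir))
            return 0; // nothing to clean up
        for (const auto &entry : fs::directory_iterator(tmp_dir))
            fs::remove_all(entry.path()); // leftovers of interrupted bucket writes
        return 0;
    }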
3 Analyses
These are mostly unsorted ideas of what we could further analyse in the data.
- Combining multiple name sources. If we have a DNS name, a Host header in HTTP or a CA name, we may want to choose one that is representative of the IP address.
- Parent connection and origin domain name. We want to somehow guess (based on the time between connections, the originator and some past Bayesian probabilities) whether a connection is secondary (e.g. a request to a CDN) and what the originating connection and domain are.
- Risk factors. Suricata can report various kinds of incidents (e.g. unencrypted connections to a mail server). We want to add tags for specific issues, but also want to derive some kind of „danger score“ (or maybe just a flag).