
Understanding and Managing Collector Overheads

Epoch's collectors are designed to minimize overheads while collecting real-time metrics. This guide explains the techniques and tradeoffs involved in managing those overheads.

Functional View of Overheads

The following is the functional view of overheads in Epoch Collectors:

  1. Packet Capture: This is achieved with the rpcap or sslsplit component. It has minimal CPU overhead and must be co-located with the host. Typical overhead is ~1% of CPU runtime on average and 5-10 MB of RAM.
  2. Stream Processing: This has the largest CPU and memory overheads (when doing layer7 analysis) but can be offloaded to another host. For layer7 protocol analysis, typical overhead is 3-8% of CPU runtime on average and 200-600 MB of RAM. For layer4 protocol analysis, typical overhead is less than 0.25% of CPU and less than 50 MB of RAM.
  3. Infrastructure Metrics Collection: This has a predictable and recurring overhead of about ~1% of CPU runtime on average and ~120 MB of RAM. It has to run co-located with the host.

Default Overheads

With the default collector settings (layer4 protocol analysis mode), the typical overheads are 1-2% of CPU time and 150-200 MB of memory.

With the layer7 protocol analysis settings, all the functional overheads are on the host, with no sampling or resource limits applied. Typical workloads incur on average 5-10% of CPU time for the Collectors; the actual overhead depends on the throughput of network transactions being processed. Typical memory overheads are between 300-700 MB.

The outgoing network bandwidth is between 5-20 KBps; the exact number depends on the number of unique series.

Controlling Overheads

The following techniques for controlling overheads can be used alone or in conjunction with one another.

Sampling Rate

This parameter controls sampling of network traffic at the TCP flow level for layer7 analysis; it does not affect layer4 analysis. The sampling rate is specified as a percentage of total traffic (1-100), so a sampling rate of 50 samples 50% of the flows. By default, sampling is turned off. The resulting metrics are normalized based on the sampling rate. This parameter can significantly reduce CPU overheads, but the tradeoff is lower fidelity of protocol metrics. Refer to Sampling Rate.
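As a hypothetical illustration (the flag name below is an assumption, not a documented option; see the Sampling Rate page for the actual setting), a rate of 50 means half of the flows are inspected and the resulting layer7 metrics are scaled back up by 100/50 = 2x:

# Hypothetical invocation -- the --sampling-rate flag name is a placeholder.
# With a rate of 50, half of the TCP flows are inspected and layer7 metrics
# are normalized by a factor of 100/50 = 2 to estimate totals.
epoch-collector --sampling-rate=50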

Nice Value

Run the collector under the standard Unix/Linux nice command with the value specified. nice ensures that higher-priority processes receive a larger share of CPU time than lower-priority ones (a higher nice value means lower priority). Refer to Nice Value.
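For example, a minimal sketch of launching the collector under nice; the collector command itself is a placeholder for however you normally start it:

# A nice value of 10 lowers the collector's priority relative to the default of 0.
# /path/to/collector is a placeholder for your actual collector start command.
nice -n 10 /path/to/collector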

OS Resource Limits

Collectors' overheads can also be limited via OS resource limits. The following are examples of limiting resources in some common environments. The Docker examples limit the Collector to a maximum of 50% of a single CPU and 1024 MB of memory; the Kubernetes example requests 0.5 CPU and 512 MiB of memory, with limits of 1 CPU and 1024 MiB.

Docker

Before running, ensure the runtime arguments of your collector container include the following parameters:

--cpu-period=100000 --cpu-quota=50000 --memory=1024m

If you use Docker 1.13 or higher, use --cpus instead:

--cpus=0.5
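For reference, these flags might be combined into a full docker run command along the following lines; the image name and any other arguments are placeholders, not the actual collector image:

# Sketch only: replace the image name and add your usual collector arguments.
docker run -d \
  --cpu-period=100000 --cpu-quota=50000 \
  --memory=1024m \
  example/epoch-collector:latest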

Kubernetes

Before deploying, ensure your collector manifest includes the following resource settings:

resources:
  requests:
    memory: "512Mi"
    cpu: "0.5"
  limits:
    memory: "1024Mi"
    cpu: "1.0"

Remote Stream Processing

The CPU and memory overhead is much smaller if stream processing is done outside the host where the collector is running; refer to the standalone stream processor. The outgoing bandwidth is higher in this mode but should not incur any costs if the stream processor(s) are running in the same VPC as the collectors.