Why do we monitor production applications? We want to know if the applications we run in production are healthy, of course.
Healthy is a matter of perspective
But what does healthy mean? Most of the time, healthy refers to a number of non-functional properties:
- performance: latency, response time
- security: vulnerabilities, data integrity, regulatory compliance
- reliability: number and kind of errors, availability
- capacity: resource consumption for given load, scaling headroom, constraints
- cost: per user, per app, per container, per transaction
Often, these different qualitative properties are monitored separately. Different teams are each responsible for only one of them, and they use their own isolated set of tools, which gives them a narrow view of the environment limited to their specific responsibility.
So we see that healthy can mean many things to different people, and applications can be healthy from one perspective, but unhealthy from another.
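To make this concrete, here is a minimal sketch of how one and the same application can pass one team's health check and fail another's. The metric names, the `Snapshot` type and all thresholds are hypothetical and chosen purely for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One point-in-time view of an application (all fields are illustrative)."""
    p99_latency_ms: float
    critical_vulnerabilities: int
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    cpu_utilization: float     # fraction of available CPU, 0.0 to 1.0
    cost_per_user_usd: float


# Hypothetical thresholds: each perspective has its own definition of "healthy".
CHECKS = {
    "performance": lambda s: s.p99_latency_ms <= 250,
    "security": lambda s: s.critical_vulnerabilities == 0,
    "reliability": lambda s: s.error_rate <= 0.01,
    "capacity": lambda s: s.cpu_utilization <= 0.80,
    "cost": lambda s: s.cost_per_user_usd <= 0.05,
}


def health_by_perspective(snapshot):
    """Return a verdict per perspective instead of a single healthy/unhealthy flag."""
    return {name: check(snapshot) for name, check in CHECKS.items()}


if __name__ == "__main__":
    app = Snapshot(
        p99_latency_ms=120,
        critical_vulnerabilities=3,
        error_rate=0.002,
        cpu_utilization=0.55,
        cost_per_user_usd=0.03,
    )
    for perspective, healthy in health_by_perspective(app).items():
        print(f"{perspective}: {'healthy' if healthy else 'unhealthy'}")
```

In this example the application looks fine to the performance team and worrying to the security team, which is exactly the kind of disagreement that isolated tooling tends to hide.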
The cloud-native visibility gap
This problem of monitoring and different forms of health is exacerbated by ephemeral containers and the often complex and large mesh of services working together to form an actual application. I’ll explain why.
Containers tend to be very short-lived, running for only seconds or minutes and then disappearing completely. How do you monitor the health of something that exists only temporarily, and is tied into any number of other containers? How do you even know which containers interact with each other to form a customer-facing application?
With different, isolated sets of monitoring tools, and with containers that are both numerous and ephemeral, monitoring becomes very difficult, and the way we observe security and performance events needs to account for this. How do you do forensics if your monitoring system does not keep track of historical containers and their telemetry?
Closing the gap
Observing from the kernel
To monitor container workloads that are both ephemeral and disparate, Sysdig takes an innovative approach to data collection. Instead of modifying application code (instrumentation, like OpenTracing) or doubling the container count (adding a sidecar to each container), they use eBPF, a Linux-native in-kernel virtual machine, to capture runtime telemetry (not unlike how a network analyzer captures network packets) and forward it to the Sysdig analytics platform.
This allows Sysdig to analyze thousands of microservices running in containers without missing any containers, and without missing any telemetry. They can reconstruct which containers make up an application by looking at container-to-container interaction, and analyze the inter-container traffic for performance and security issues.
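The idea of reconstructing applications from observed interactions can be sketched as a graph problem: treat every container as a node, every observed connection as an edge, and group the containers that are reachable from each other. The sketch below is a deliberate simplification under that assumption and is not Sysdig's actual algorithm; the function name and the container names are made up for illustration.

```python
from collections import defaultdict


def group_into_applications(connections):
    """Group containers into candidate applications.

    `connections` is an iterable of (container_a, container_b) pairs, each
    representing one observed container-to-container interaction. Containers
    that can reach each other through any chain of interactions end up in
    the same group (a connected component of the interaction graph).
    """
    graph = defaultdict(set)
    for a, b in connections:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    applications = []
    for start in sorted(graph):          # sorted for a deterministic result
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:                     # iterative depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        applications.append(sorted(component))
    return applications


if __name__ == "__main__":
    observed = [("web", "api"), ("api", "db"), ("batch", "queue")]
    print(group_into_applications(observed))
    # prints [['api', 'db', 'web'], ['batch', 'queue']]
```

A real platform would need more than this, for example to cope with shared infrastructure such as a database used by several applications, which would merge otherwise unrelated groups in this naive version.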
The analytics platform is built for cloud-native workloads, meaning that it understands that containers come and go in flocks, that they’re part of a bigger whole and that you need all that contextual information to do monitoring and post-mortem forensics.
Unifying the view
Sysdig closes the visibility gaps by offering a single, unified view of performance metrics, compliance dashboards and security events that can be used across teams. This helps teams highlight the events and containers with performance, security or compliance issues that need immediate attention.
Sysdig Monitor provides runtime monitoring, dashboarding, alerting and trace-driven troubleshooting. Monitor uses the open source sysdig tool.
Sysdig Secure is the security part of the equation with vulnerability management, compliance (250+ out-of-the-box compliance checks), runtime security, anomaly detection and forensics for post-mortem improvements. Sysdig Secure uses the open source Falco, Inspect and Anchore tools.
Sysdig is not just for production. It also helps developers catch faults (like known security vulnerabilities) before they even make it to production. This reduces production issues and lets developers treat their containers as artifacts.
Sysdig’s enterprise offering is a complete set of monitoring and security features, pulled into a single overview. Here, developers, security and operations teams share a common, complete view of reality to optimize performance, reduce risk and improve recovery time.