Monitoring competency involves identifying system metrics and logs, configuring thresholds and alerts, and retaining data for post-event triage and analysis. In addition, user behaviours and metrics may be tracked to help inform future delivery efforts. Metrics and alerts may be combined and used to identify unusual events which could indicate security breaches.
Building competence in monitoring involves regular review of the usefulness of metrics being monitored, suitability of alert thresholds, and lifecycle management of retained data.
“If the metrics you are looking at aren’t useful in optimizing your strategy - stop looking at them.” - Mark Twain
An ideal monitoring capability should have the following characteristics:
Monitored metrics should inform prioritisation of features. This may require the development of support for custom metrics in a system.
Monitoring is the eyes and ears of your deployed environments, constantly recording the health of your systems at a rate that no manual process ever could. It is important for this capability to reliably indicate unexpected conditions that need to be remediated. Reliability means avoiding “false positives” (where healthy systems are incorrectly reported as unhealthy) and “false negatives” (where unhealthy systems are incorrectly reported as healthy).
Allow root cause analysis
Retained logs and metrics should allow retrospective analysis of failure events. This learning activity helps increase system stability over the long-term.
To achieve the objectives outlined above, consider the following suggestions:
Tailored alert thresholds
Tailoring monitoring systems to reduce noise is essential, as a “noisy” alert system will result in “alert fatigue” among Service Desk personnel. This deserves sustained attention as part of continual improvement processes, because time is wasted chasing irrelevant warnings while service-impacting failures are lost in the sea of alerts.
- All alerts are followed up: no alerts are routinely ignored because they are “not relevant”
- Alerts deemed irrelevant have their alert thresholds or criteria changed, or they are removed
- Alerts are tailored on a per-system basis: 90% memory usage may be expected for some workloads, but an indication of failure in others
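As a rough illustration of per-system tailoring, the sketch below keeps a separate memory threshold for each system and only alerts on a sustained breach. The system names, limits, sample counts, and the `should_alert` helper are all hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Alert when a metric stays above `limit` for `sustained_samples` readings."""
    limit: float
    sustained_samples: int


# Hypothetical per-system settings: a cache is expected to run near full
# memory, while the same reading on a web server suggests a problem.
MEMORY_THRESHOLDS = {
    "cache-node": Threshold(limit=97.0, sustained_samples=6),
    "web-server": Threshold(limit=85.0, sustained_samples=3),
}
DEFAULT_THRESHOLD = Threshold(limit=90.0, sustained_samples=3)


def should_alert(system: str, readings: list[float]) -> bool:
    """Return True if the most recent readings all exceed the system's limit."""
    threshold = MEMORY_THRESHOLDS.get(system, DEFAULT_THRESHOLD)
    recent = readings[-threshold.sustained_samples:]
    if len(recent) < threshold.sustained_samples:
        return False  # not enough data to judge a sustained breach
    return all(value > threshold.limit for value in recent)


if __name__ == "__main__":
    same_readings = [88.0, 90.0, 91.0, 90.5, 92.0, 90.0]
    print(should_alert("cache-node", same_readings))  # normal for this workload
    print(should_alert("web-server", same_readings))  # sustained breach
```

Requiring several consecutive readings above the limit is one simple way to reduce noise from short spikes; whether it suits a given system is a judgement to revisit during alert reviews.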
Intentional data retention
Be proactive about deciding what data is logged and how long it is retained. Configure systems to meet your requirements rather than simply accepting the defaults you are given.
- Log and metric data is securely destroyed after a defined time period
- Log data does not include sensitive information
- Metrics are tracked on an appropriate basis (CPU may be appropriate to track every 10 seconds, but remaining disk space may be recorded every hour)
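The points above could be captured in a small, explicit policy. The following is a minimal sketch only: the metric names, retention periods, and redaction patterns are assumptions made for illustration, and the actual secure destruction of expired data is left to the storage system.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetentionPolicy:
    sample_interval: timedelta  # how often the value is recorded
    retain_for: timedelta       # how long records are kept before destruction


# Hypothetical policies: the sampling intervals mirror the examples in the
# text, while the retention periods are illustrative only.
POLICIES = {
    "cpu_percent": RetentionPolicy(timedelta(seconds=10), timedelta(days=14)),
    "disk_free_bytes": RetentionPolicy(timedelta(hours=1), timedelta(days=90)),
    "application_log": RetentionPolicy(timedelta(0), timedelta(days=30)),
}

# Simplistic patterns for illustration; real redaction needs a reviewed list.
SENSITIVE_PATTERNS = [
    (re.compile(r"(password|token)=\S+", re.IGNORECASE), r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(line: str) -> str:
    """Mask sensitive values before a log line is written."""
    for pattern, replacement in SENSITIVE_PATTERNS:
        line = pattern.sub(replacement, line)
    return line


def is_expired(kind: str, recorded_at: datetime, now: datetime) -> bool:
    """Return True once a record has outlived its retention period."""
    return now - recorded_at > POLICIES[kind].retain_for


if __name__ == "__main__":
    print(redact("login user=alice@example.com password=hunter2"))
    recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(is_expired("application_log", recorded, datetime.now(timezone.utc)))
```

Keeping the policy in one place makes it easier to review retention decisions deliberately instead of inheriting whatever each tool ships with.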
Onward processing of metrics
Metrics are utilised to inform wider service delivery, and potentially incorporated into new data sets to inform wider organisational decisions.
- Log and metric data is anonymised, sampled, and routed to organisational business intelligence tools
- Log and metric data is consumed by SIEM tools
- Log and metric data is used to understand whether SLAs/SLOs have been met
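As one example of onward processing, request counts taken from metrics can be compared against an availability target. This is a sketch under assumptions: the `evaluate_slo` function, the `SloReport` fields, and the 99.9% target are hypothetical, and real SLA/SLO definitions are often more involved (latency targets, rolling windows, exclusions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float       # fraction of successful requests, 0.0 to 1.0
    target: float
    met: bool
    error_budget_used: float  # fraction of the allowed failures consumed


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare observed availability over a period against a target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1 (exclusive)")
    availability = (total_requests - failed_requests) / total_requests
    # Number of failures the target tolerates over this period.
    allowed_failures = (1.0 - target) * total_requests
    return SloReport(
        availability=availability,
        target=target,
        met=availability >= target,
        error_budget_used=failed_requests / allowed_failures,
    )


if __name__ == "__main__":
    # Illustrative numbers only: 500 failures in a million requests.
    print(evaluate_slo(total_requests=1_000_000, failed_requests=500, target=0.999))
```

Reporting how much of the error budget has been consumed, rather than only a pass or fail result, gives earlier warning that a target is at risk.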
The following books are recommended to help develop competency in this area:
Alistair Croll, Benjamin Yoskovitz
Develop a data-driven mindset to product improvement with practical examples of the use and abuse of metrics in different organisational scenarios. Useful from a business and product management perspective on monitoring as part of a build/measure/learn cycle.
The content on this page is published under Open Source licenses via GitHub. To submit issues or provide feedback please visit the repository.