Commit 8da25ac

Netdata ai (#20309)
Co-authored-by: ilyam8 <ilya@netdata.cloud>
1 parent bc88f8b commit 8da25ac

5 files changed

+669
-146
lines changed

Lines changed: 45 additions & 10 deletions
@@ -1,19 +1,54 @@
1-
# Machine Learning and Anomaly Detection
1+
# Netdata AI
22

3-
Netdata includes advanced Machine Learning capabilities to help you detect and resolve anomalies in your infrastructure before they escalate into critical issues. These features provide real-time insights and proactive monitoring to improve system reliability.
3+
Boost your monitoring and troubleshooting capabilities with Netdata's AI-powered features.
44

5-
## Key Features
5+
Netdata AI helps you **detect anomalies, understand metric relationships, and resolve issues quickly** with intelligent assistance, all designed to make your infrastructure management smarter, faster, and bulletproof.
66

7-
### Anomaly Detection with K-Means Clustering
7+
## What Can Netdata AI Do For You?
88

9-
Netdata trains K-means clustering models to detect anomalies in your infrastructure. These models power the [Anomaly Advisor](/docs/dashboards-and-charts/anomaly-advisor-tab.md), which visually highlights anomalies on the dashboard, allowing you to quickly identify and investigate unexpected behavior.
9+
Netdata AI combines powerful machine learning capabilities with intuitive interfaces to help you:
1010

11-
### Metric Correlations
11+
1. **Detect anomalies automatically** before they escalate into critical issues
12+
2. **Understand relationships** between metrics during troubleshooting
13+
3. **Get expert guidance** when resolving alerts and performance problems
1214

13-
Netdata enables metric correlation analysis through the dashboard. This feature uses the [Two-sample Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test) and volume heuristic measures to help you understand relationships between different metrics and identify potential causes of anomalies.
15+
## Machine Learning and Anomaly Detection
1416

15-
### Netdata Assistant for Troubleshooting
17+
Our ML-powered anomaly detection works silently in the background, monitoring your metrics and identifying unusual patterns.
1618

17-
The [Netdata Assistant](/docs/netdata-assistant.md) provides AI-driven assistance for troubleshooting alerts and anomalies. You can interact with it directly to get explanations, recommendations, and next steps based on detected anomalies and system behavior.
19+
| Feature | What It Does For You |
20+
|----------------------------------|------------------------------------------------------------------------------|
21+
| **Unsupervised Learning** | Works automatically without requiring manual training or labeling of data |
22+
| **Multiple Model Consensus** | Reduces false positives by 99% by requiring agreement across multiple models |
23+
| **Real-time Anomaly Bits** | Flags unusual metrics instantly, with zero storage overhead |
24+
| **Anomaly Rate Visualization** | Highlights anomalous time periods in your dashboard for quick investigation |
25+
| **Node-Level Anomaly Detection** | Identifies when your entire system is behaving unusually |
26+
| **Metric Correlations** | Helps you find relationships between metrics to pinpoint root causes |
1827

19-
These Machine Learning features enhance observability and streamline incident response, helping you maintain system health with greater efficiency.
28+
Learn more in the [Machine Learning and Anomaly Detection](/src/ml/README.md) documentation.
29+
30+
## Netdata Assistant
31+
32+
When alerts trigger or anomalies emerge, Netdata Assistant serves as your AI-powered troubleshooting companion.
33+
34+
| Feature | What It Does For You |
35+
|----------------------------|-----------------------------------------------------------------------|
36+
| **Alert Context** | Explains what each alert means and why you should care about it |
37+
| **Guided Troubleshooting** | Offers step-by-step instructions tailored to your specific situation |
38+
| **Persistent Window** | Follows you throughout your dashboards as you investigate issues |
39+
| **Curated Resources** | Provides links to relevant documentation to deepen your understanding |
40+
| **Time-Saving** | Eliminates the need for searching documentation or online forums |
41+
42+
Learn more about [Netdata Assistant](/docs/netdata-assistant.md) and how it helps streamline your troubleshooting workflow.
43+
44+
## Getting Started
45+
46+
Netdata AI features are enabled by default with the standard installation. The machine learning capabilities require the `dbengine` database mode, which is the default setting.
47+
48+
To start exploring:
49+
50+
1. **Anomaly Detection**: Check the [Anomaly Advisor tab](/docs/dashboards-and-charts/anomaly-advisor-tab.md) to see detected anomalies
51+
2. **Metric Correlations**: Use the Metric Correlations button in the dashboard to analyze relationships between metrics
52+
3. **Netdata Assistant**: Click the Assistant button in the Alerts tab when troubleshooting alerts
53+
54+
These AI features work seamlessly with Netdata's other capabilities, enhancing your overall monitoring and troubleshooting experience without requiring any AI expertise.

docs/metric-correlations.md

Lines changed: 262 additions & 31 deletions
@@ -1,66 +1,297 @@
11
# Metric Correlations
22

3-
The Metric Correlations feature helps you quickly identify metrics and charts relevant to a specific time window of interest, allowing for faster root cause analysis.
3+
The **Metric Correlations** feature helps you quickly identify metrics and charts relevant to a specific time window of interest, allowing for faster root cause analysis.
44

5-
By filtering the standard Netdata dashboard to display only the most relevant charts, Metric Correlations makes it easier to pinpoint anomalies and investigate issues.
5+
:::tip
66

7-
Since it leverages every available metric in your infrastructure with up to 1-second granularity, Metric Correlations provides highly accurate insights.
7+
By filtering your standard Netdata dashboard to **display only the most relevant charts**, Metric Correlations makes it easier for you to pinpoint anomalies and investigate issues.
8+
9+
:::
10+
11+
Since it leverages every available metric in your infrastructure with up to 1-second granularity, **Metric Correlations provides you with highly accurate insights**.
812

913
## Using Metric Correlations
1014

1115
When viewing the [Metrics tab or a single-node dashboard](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md), you'll find the **Metric Correlations** button in the top-right corner.
1216

13-
To start:
17+
<details>
18+
<summary><strong>To start:</strong></summary><br/>
1419

1520
1. Click **Metric Correlations**.
16-
2. [Highlight a selection of metrics](/docs/dashboards-and-charts/netdata-charts.md#highlight) on a single chart. The selected timeframe must be at least 15 seconds.
17-
3. The menu displays details about the selected area and reference baseline. Metric Correlations compares the highlighted window to a reference baseline, which is four times its length and precedes it immediately.
18-
4. Click **Find Correlations**. This button is only active if a valid timeframe is selected.
19-
5. The process evaluates all available metrics and returns a filtered Netdata dashboard showing only the most changed metrics between the baseline and the highlighted window.
21+
2. Highlight a selection of metrics on a single chart. **The selected timeframe must be at least 15 seconds**.
22+
3. The menu displays details about your selected area and reference baseline. Metric Correlations compares your highlighted window to a reference baseline, which is four times its length and precedes it immediately.
23+
4. Click **Find Correlations**.
24+
25+
:::note
26+
27+
This button is only active if you've selected a valid timeframe.
28+
29+
:::
30+
31+
5. **The process evaluates all your available metrics and returns a filtered Netdata dashboard** showing only the most changed metrics between the baseline and your highlighted window.
2032
6. If needed, select another window and press **Find Correlations** again to refine your analysis.
2133

34+
</details>
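
The relationship between the highlighted window and its reference baseline in step 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above (a baseline four times the highlight length, ending where the highlight begins, and a 15-second minimum); the names `Window` and `baseline_for` are hypothetical and are not part of Netdata.

```python
from typing import NamedTuple

MIN_HIGHLIGHT_SECONDS = 15  # minimum selection length stated in step 2
BASELINE_MULTIPLIER = 4     # baseline is four times the highlight length


class Window(NamedTuple):
    """A time window in Unix seconds (hypothetical helper type)."""
    after: int   # start of the window
    before: int  # end of the window


def baseline_for(highlight: Window) -> Window:
    """Return the baseline window that immediately precedes the highlight."""
    length = highlight.before - highlight.after
    if length < MIN_HIGHLIGHT_SECONDS:
        raise ValueError(
            f"highlight must cover at least {MIN_HIGHLIGHT_SECONDS} seconds, got {length}"
        )
    # The baseline ends exactly where the highlight starts.
    return Window(highlight.after - BASELINE_MULTIPLIER * length, highlight.after)
```

For example, a 60-second highlight gets a 240-second baseline that ends at the highlight's start.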
35+
36+
## Integration with Anomaly Detection
37+
38+
You can combine Metric Correlations with Anomaly Detection for powerful troubleshooting:
39+
40+
:::tip
41+
42+
When you notice an anomaly in your system, use Metric Correlations with the **Anomaly Rate** data type to quickly identify which metrics are contributing to the anomalous behavior.
43+
44+
:::
45+
46+
### How to Use Together
47+
48+
```mermaid
flowchart TD
    %% Node styling
    classDef neutral fill:#f9f9f9,stroke:#000000,color:#000000,stroke-width:2px
    classDef success fill:#4caf50,stroke:#000000,color:#000000,stroke-width:2px
    classDef warning fill:#ffeb3b,stroke:#000000,color:#000000,stroke-width:2px
    classDef danger fill:#f44336,stroke:#000000,color:#000000,stroke-width:2px

    A[Spot a spike in the<br/>node anomaly rate chart] --> B[Highlight that<br/>time period]
    B --> C[Select Anomaly Rate<br/>as data type<br/>and Volume as method]
    C --> D[Click Find Correlations]
    D --> E[Review metrics with<br/>highest anomaly rates]
    E --> F[Examine these metrics<br/>in detail to determine<br/>root cause]

    %% Apply styles
    class A,B neutral
    class C,D warning
    class E success
    class F danger
```
68+
69+
:::tip
70+
71+
**This workflow helps you move from detecting** that *"something is wrong"* **to understanding** exactly which components are behaving abnormally, significantly reducing your troubleshooting time.
72+
73+
:::
74+
75+
## API Access
76+
77+
You can access anomaly detection data and use it with metric correlations through Netdata's API:
78+
79+
<details>
80+
<summary><strong>Querying Anomaly Bits</strong></summary><br/>
81+
82+
To get the anomaly bits for any metric, add the `options=anomaly-bit` parameter to your API query:
83+
84+
```
https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-60&options=anomaly-bit
```
87+
88+
Sample response:
89+
90+
```json
{
  "labels": ["time", "user"],
  "data": [
    [1684852570, 0],
    [1684852569, 0],
    [1684852568, 0],
    [1684852567, 0],
    [1684852566, 0],
    [1684852565, 0],
    [1684852564, 0],
    [1684852563, 0],
    [1684852562, 0],
    [1684852561, 0]
  ]
}
```
140+
141+
</details>
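
As a minimal sketch, the anomaly-bit query above could be built and fetched from Python with the standard library only. The helper names `anomaly_bit_url` and `fetch_json` are hypothetical, and `https://your-netdata-node` is the same placeholder host used in the example URL; the only API details assumed are the `/api/v1/data` path and the query parameters already shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def anomaly_bit_url(base_url: str, chart: str, dimension: str, after: int = -60) -> str:
    """Build a /api/v1/data query that asks for anomaly bits (hypothetical helper)."""
    query = urlencode(
        {
            "chart": chart,
            "dimensions": dimension,
            "after": after,            # negative values are relative to "now"
            "options": "anomaly-bit",  # return anomaly bits instead of raw values
        }
    )
    return f"{base_url.rstrip('/')}/api/v1/data?{query}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode the JSON body (requires a reachable Netdata node)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `anomaly_bit_url("https://your-netdata-node", "system.cpu", "user")` reproduces the example URL; passing the result to `fetch_json` should return a structure shaped like the sample response, assuming the node is reachable.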
142+
143+
<details>
144+
<summary><strong>Querying Anomaly Rates</strong></summary><br/>
145+
146+
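For anomaly rates over a time window, use the same parameter over a longer window and request fewer points (here, `after=-600` with `points=10`), so that each returned point aggregates the anomaly bits of its interval: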
147+
148+
```
https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-600&before=0&points=10&options=anomaly-bit
```
151+
152+
Sample response showing the percentage of time each metric was anomalous:
153+
154+
```json
{
  "labels": ["time", "user"],
  "data": [
    [1684852770, 0],
    [1684852710, 20],
    [1684852650, 0],
    [1684852590, 10],
    [1684852530, 0],
    [1684852470, 0],
    [1684852410, 30],
    [1684852350, 0],
    [1684852290, 0],
    [1684852230, 0]
  ]
}
```
204+
205+
</details>
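
To show how such a response might be consumed, here is a small Python sketch that picks out the intervals with a non-zero anomaly rate and computes the average rate. It uses the sample response above as input; the function names are hypothetical, and the code assumes only the `labels`/`data` layout shown in the sample.

```python
# Sample anomaly-rate response copied from the documentation above.
SAMPLE_RESPONSE = {
    "labels": ["time", "user"],
    "data": [
        [1684852770, 0], [1684852710, 20], [1684852650, 0], [1684852590, 10],
        [1684852530, 0], [1684852470, 0], [1684852410, 30], [1684852350, 0],
        [1684852290, 0], [1684852230, 0],
    ],
}


def anomalous_intervals(response: dict, dimension: str, threshold: float = 0.0) -> list:
    """Return (timestamp, rate) pairs whose anomaly rate exceeds the threshold."""
    col = response["labels"].index(dimension)  # column of the requested dimension
    return [(row[0], row[col]) for row in response["data"] if row[col] > threshold]


def mean_anomaly_rate(response: dict, dimension: str) -> float:
    """Average anomaly rate (in percent) across all returned points."""
    col = response["labels"].index(dimension)
    values = [row[col] for row in response["data"]]
    return sum(values) / len(values) if values else 0.0
```

With the sample data, three of the ten one-minute intervals show anomalies, and the mean rate over the ten-minute window is 6%.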
206+
207+
:::tip
208+
209+
You can programmatically access this data to build custom dashboards or alerts based on anomaly patterns in your infrastructure.
210+
211+
:::
212+
22213
## Metric Correlations Options
23214

24-
Metric Correlations offers adjustable parameters for deeper data exploration. Since different data types and incidents require different approaches, these settings allow for flexible analysis.
215+
Metric Correlations offers adjustable parameters for deeper data exploration. Since different data types and incidents require different approaches, **these settings allow for flexible analysis**.
25216

26-
### Method
217+
<details>
218+
<summary><strong>Method</strong></summary><br/>
27219

28220
Two algorithms are available for scoring metrics based on changes between the baseline and highlight windows:
29221

30-
- **`KS2` (Kolmogorov-Smirnov Test)**: A statistical method comparing distributions between the highlighted and baseline windows to detect significant changes. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L212).
31-
- **`Volume`**: A heuristic approach based on percentage change in averages, designed to handle edge cases. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L516).
222+
* **`KS2` (Kolmogorov-Smirnov Test)**: A statistical method comparing distributions between the highlighted and baseline windows to detect significant changes. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L212).
223+
* **`Volume`**: A heuristic approach based on percentage change in averages, designed to handle edge cases. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L516).
224+
225+
</details>
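
To make the two scoring ideas concrete, here is a simplified Python sketch. It is not Netdata's implementation (see the linked C source for that): `ks_statistic` computes the textbook two-sample Kolmogorov-Smirnov statistic, and `volume_change` computes a plain percentage change in averages without the edge-case handling the real heuristic has. Both function names are hypothetical.

```python
from bisect import bisect_right
from statistics import fmean


def ks_statistic(baseline: list, highlight: list) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not baseline or not highlight:
        raise ValueError("both windows need at least one sample")
    a, b = sorted(baseline), sorted(highlight)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def volume_change(baseline: list, highlight: list) -> float:
    """Percentage change of the highlight average relative to the baseline average."""
    base_avg, high_avg = fmean(baseline), fmean(highlight)
    if base_avg == 0:
        # Simplification: a metric that was silent and then became active.
        return 0.0 if high_avg == 0 else float("inf")
    return 100.0 * (high_avg - base_avg) / abs(base_avg)
```

A KS statistic near 0 means the two windows have similar distributions, while a value near 1 means they barely overlap. The volume score only looks at averages, so it cannot see a change in variance when the mean stays the same.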
226+
227+
<details>
228+
<summary><strong>Aggregation</strong></summary><br/>
32229

33-
### Aggregation
230+
To accommodate different window lengths, Netdata aggregates your raw data as needed. The default aggregation method is `Average`, but you can also choose `Median`, `Min`, `Max`, or `Stddev`.
231+
</details>
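
As an illustration of what aggregation means here, the following Python sketch reduces raw samples into fixed-size buckets using one of the listed methods. The function name, the bucket-based interface, and the choice of population standard deviation for `stddev` are assumptions made for the example, not a description of Netdata's internals.

```python
import statistics

# Map the option names from the text to standard-library reducers.
AGGREGATORS = {
    "average": statistics.fmean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "stddev": statistics.pstdev,  # assumption: population standard deviation
}


def aggregate(samples: list, bucket_size: int, method: str = "average") -> list:
    """Reduce consecutive groups of `bucket_size` samples to one value each."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    reducer = AGGREGATORS[method]
    return [
        reducer(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)  # last bucket may be shorter
    ]
```

For example, six per-second samples aggregated with a bucket size of 3 become two points, which lets windows of different lengths be compared on the same number of points.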
34232

35-
To accommodate different window lengths, Netdata aggregates raw data as needed. The default aggregation method is `Average`, but you can also choose `Median`, `Min`, `Max`, or `Stddev`.
233+
<details>
234+
<summary><strong>Data Type</strong></summary><br/>
36235

37-
### Data Type
236+
Netdata assigns an [Anomaly Bit](https://github.com/netdata/netdata/tree/master/src/ml#anomaly-bit) to each of your metrics in real-time, flagging whether it deviates significantly from normal behavior. You can analyze either raw data or anomaly rates:
38237

39-
Netdata assigns an [Anomaly Bit](https://github.com/netdata/netdata/tree/master/src/ml#anomaly-bit) to each metric in real-time, flagging whether it deviates significantly from normal behavior. You can analyze either raw data or anomaly rates:
238+
* **`Metrics`**: Runs Metric Correlations on your raw metric values.
239+
* **`Anomaly Rate`**: Runs Metric Correlations on anomaly rates for each of your metrics.
40240

41-
- **`Metrics`**: Runs Metric Correlations on raw metric values.
42-
- **`Anomaly Rate`**: Runs Metric Correlations on anomaly rates for each metric.
241+
</details>
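
The link between anomaly bits and anomaly rates can be sketched as follows: an anomaly rate is the percentage of samples in a window whose anomaly bit is set. The helper below is a hypothetical illustration that treats any non-zero value as a set bit.

```python
def anomaly_rate(bits: list) -> float:
    """Percentage of samples in the window that were flagged as anomalous."""
    if not bits:
        return 0.0
    flagged = sum(1 for bit in bits if bit)  # any non-zero value counts as set
    return 100.0 * flagged / len(bits)
```

With this definition, a window of ten samples in which two are flagged has an anomaly rate of 20%.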
43242

44243
## Metric Correlations on the Agent
45244

46245
Metric Correlations (MC) requests to Netdata Cloud are handled in two ways:
47246

48-
1. If MC is enabled on any node, the request is routed to the highest-level node (a Parent node or the node itself).
49-
2. If MC is not enabled on any node, Netdata Cloud processes the request by collecting data from nodes and computing correlations on its backend.
247+
1. **If MC is enabled** on any of your nodes, the request is routed to the highest-level node (a Parent node or the node itself).
248+
2. **If MC is not enabled** on any of your nodes, Netdata Cloud processes the request by collecting data from your nodes and computing correlations on its backend.
249+
250+
## Interpreting Combined Results
251+
252+
When you use Metric Correlations together with Anomaly Detection, you'll want to understand how to interpret the results:
253+
254+
:::tip
255+
256+
**High anomaly rates combined with significant metric changes** often indicate genuine issues rather than false positives.
257+
258+
:::
259+
260+
Here's how to interpret different scenarios:
261+
262+
| Anomaly Rate | Metric Correlation | Interpretation |
263+
|--------------|--------------------|------------------------------------------------------|
264+
| High | Strong | Likely a significant issue affecting system behavior |
265+
| High | Weak | Possible edge case or intermittent issue |
266+
| Low | Strong | Normal but significant change in system behavior |
267+
| Low | Weak | Likely normal system operation |
268+
269+
:::tip
270+
271+
By examining both the anomaly rate and the correlation strength, you can prioritize your troubleshooting efforts more effectively.
272+
273+
:::
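
The interpretation table can also be expressed as a small lookup, which may be convenient in scripts that post-process results. This is only a sketch: deciding what counts as a "high" anomaly rate or a "strong" correlation is left to the caller, because the documentation does not define thresholds, and the names used here are hypothetical.

```python
# Keys are (anomaly_rate_is_high, correlation_is_strong); values mirror the table.
INTERPRETATIONS = {
    (True, True): "Likely a significant issue affecting system behavior",
    (True, False): "Possible edge case or intermittent issue",
    (False, True): "Normal but significant change in system behavior",
    (False, False): "Likely normal system operation",
}


def interpret(anomaly_rate_is_high: bool, correlation_is_strong: bool) -> str:
    """Look up the interpretation for one combination from the table."""
    return INTERPRETATIONS[(bool(anomaly_rate_is_high), bool(correlation_is_strong))]
```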
50274

51275
## Usage Tips
52276

53-
- When running Metric Correlations from the [Metrics tab](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md) across multiple nodes, refine your results by grouping by node:
54-
1. Run MC on all nodes if you're unsure which ones are relevant.
55-
2. Group the most interesting charts by node to determine whether changes affect all nodes or just a subset.
56-
3. If a subset of nodes stands out, filter for those nodes and rerun MC to get more precise results.
277+
:::tip
278+
279+
When running Metric Correlations from the [Metrics tab](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md) across multiple nodes, refine your results by grouping by node:
280+
281+
1. Run MC on all your nodes if you're unsure which ones are relevant.
282+
2. Group the most interesting charts by node to determine whether changes affect all your nodes or just a subset.
283+
3. If a subset of your nodes stands out, filter for those nodes and rerun MC to get more precise results.
284+
285+
Choose the **`Volume`** algorithm for sparse metrics (e.g., request latency with few requests). Otherwise, use **`KS2`**.
286+
287+
- **`KS2`** is ideal for detecting complex distribution changes in your metrics, such as shifts in variance.
288+
- **`Volume`** is better for detecting metrics that were inactive and then spiked (or vice versa).
289+
290+
**Example:**
57291

58-
- Choose the **`Volume`** algorithm for sparse metrics (e.g., request latency with few requests). Otherwise, use **`KS2`**.
59-
- **`KS2`** is ideal for detecting complex distribution changes, such as shifts in variance.
60-
- **`Volume`** is better for detecting metrics that were inactive and then spiked (or vice versa).
292+
- `Volume` can highlight network traffic suddenly turning on in your system.
293+
- `KS2` can detect entropy distribution changes in your data that `Volume` misses.
61294

62-
**Example:**
63-
- `Volume` can highlight network traffic suddenly turning on.
64-
- `KS2` can detect entropy distribution changes missed by `Volume`.
295+
Combine **`Volume`** and **`Anomaly Rate`** to identify the most anomalous metrics within your selected timeframe. Expand the anomaly rate chart to visualize results more clearly.
65296

66-
- Combine **`Volume`** and **`Anomaly Rate`** to identify the most anomalous metrics within a timeframe. Expand the anomaly rate chart to visualize results more clearly.
297+
:::
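
The rule of thumb above (use `Volume` for sparse metrics, otherwise `KS2`) could be approximated in code as shown below. This is a rough sketch: the function name and the 20% activity threshold are hypothetical choices for illustration, and "sparse" is interpreted here as "mostly zero or missing samples".

```python
def suggest_method(samples: list, min_active_fraction: float = 0.2) -> str:
    """Suggest "Volume" for sparse data and "KS2" otherwise (illustrative heuristic)."""
    if not samples:
        return "Volume"  # nothing to compare distributions on
    active = sum(1 for value in samples if value not in (0, None))
    # min_active_fraction is a hypothetical cut-off, not a Netdata setting.
    return "Volume" if active / len(samples) < min_active_fraction else "KS2"
```

A metric that is zero for nine samples out of ten would be routed to `Volume`, while a continuously active metric would be routed to `KS2`.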
