|
1 | 1 | # Metric Correlations
|
2 | 2 |
|
3 |
| -The Metric Correlations feature helps you quickly identify metrics and charts relevant to a specific time window of interest, allowing for faster root cause analysis. |
| 3 | +The **Metric Correlations** feature helps you quickly identify metrics and charts relevant to a specific time window of interest, allowing for faster root cause analysis. |
4 | 4 |
|
5 |
| -By filtering the standard Netdata dashboard to display only the most relevant charts, Metric Correlations makes it easier to pinpoint anomalies and investigate issues. |
| 5 | +:::tip |
6 | 6 |
|
7 |
| -Since it leverages every available metric in your infrastructure with up to 1-second granularity, Metric Correlations provides highly accurate insights. |
| 7 | +By filtering your standard Netdata dashboard to **display only the most relevant charts**, Metric Correlations makes it easier for you to pinpoint anomalies and investigate issues. |
| 8 | + |
| 9 | +::: |
| 10 | + |
| 11 | +Since it leverages every available metric in your infrastructure with up to 1-second granularity, **Metric Correlations provides you with highly accurate insights**. |
8 | 12 |
|
9 | 13 | ## Using Metric Correlations
|
10 | 14 |
|
11 | 15 | When viewing the [Metrics tab or a single-node dashboard](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md), you'll find the **Metric Correlations** button in the top-right corner.
|
12 | 16 |
|
13 |
| -To start: |
| 17 | +<details> |
| 18 | +<summary><strong>To start:</strong></summary><br/> |
14 | 19 |
|
15 | 20 | 1. Click **Metric Correlations**.
|
16 |
| -2. [Highlight a selection of metrics](/docs/dashboards-and-charts/netdata-charts.md#highlight) on a single chart. The selected timeframe must be at least 15 seconds. |
17 |
| -3. The menu displays details about the selected area and reference baseline. Metric Correlations compares the highlighted window to a reference baseline, which is four times its length and precedes it immediately. |
18 |
| -4. Click **Find Correlations**. This button is only active if a valid timeframe is selected. |
19 |
| -5. The process evaluates all available metrics and returns a filtered Netdata dashboard showing only the most changed metrics between the baseline and the highlighted window. |
| 21 | +2. Highlight a selection of metrics on a single chart. **The selected timeframe must be at least 15 seconds**. |
| 22 | +3. The menu displays details about your selected area and reference baseline. Metric Correlations compares your highlighted window to a reference baseline, which is four times its length and precedes it immediately. |
| 23 | +4. Click **Find Correlations**. |
| 24 | + |
| 25 | +:::note |
| 26 | + |
| 27 | +This button is only active if you've selected a valid timeframe. |
| 28 | + |
| 29 | +::: |
| 30 | + |
| 31 | +5. **The process evaluates all your available metrics and returns a filtered Netdata dashboard** showing only the most changed metrics between the baseline and your highlighted window. |
20 | 32 | 6. If needed, select another window and press **Find Correlations** again to refine your analysis.
|
21 | 33 |
|
| 34 | +</details> |
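
The window arithmetic from step 3 can be sketched in Python. This is an illustrative sketch, not Netdata's code: the function name `baseline_window` and the epoch-second arguments are assumptions made here. Only the 15-second minimum and the "four times its length, immediately preceding" rule come from the steps above.

```python
MIN_HIGHLIGHT_SECONDS = 15  # minimum highlight length stated in the steps
BASELINE_MULTIPLIER = 4     # baseline is four times the highlight length


def baseline_window(highlight_after: int, highlight_before: int) -> tuple[int, int]:
    """Return (after, before) of the baseline that immediately precedes the highlight.

    Both arguments are epoch seconds (an assumption of this sketch).
    """
    length = highlight_before - highlight_after
    if length < MIN_HIGHLIGHT_SECONDS:
        raise ValueError("highlighted timeframe must be at least 15 seconds")
    # The baseline ends exactly where the highlight starts.
    return highlight_after - BASELINE_MULTIPLIER * length, highlight_after


# A 60-second highlight is compared against the 240 seconds before it.
print(baseline_window(1684852500, 1684852560))  # (1684852260, 1684852500)
```
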
| 35 | + |
| 36 | +## Integration with Anomaly Detection |
| 37 | + |
| 38 | +You can combine Metric Correlations with Anomaly Detection for powerful troubleshooting: |
| 39 | + |
| 40 | +:::tip |
| 41 | + |
| 42 | +When you notice an anomaly in your system, use Metric Correlations with the **Anomaly Rate** data type to quickly identify which metrics are contributing to the anomalous behavior. |
| 43 | + |
| 44 | +::: |
| 45 | + |
| 46 | +### How to Use Together |
| 47 | + |
| 48 | +```mermaid |
| 49 | +flowchart TD |
| 50 | + %% Node styling |
| 51 | + classDef neutral fill:#f9f9f9,stroke:#000000,color:#000000,stroke-width:2px |
| 52 | + classDef success fill:#4caf50,stroke:#000000,color:#000000,stroke-width:2px |
| 53 | + classDef warning fill:#ffeb3b,stroke:#000000,color:#000000,stroke-width:2px |
| 54 | + classDef danger fill:#f44336,stroke:#000000,color:#000000,stroke-width:2px |
| 55 | + |
| 56 | + A[Spot a spike in the<br/>node anomaly rate chart] --> B[Highlight that<br/>time period] |
| 57 | + B --> C[Select Anomaly Rate<br/>as data type<br/>and Volume as method] |
| 58 | + C --> D[Click Find Correlations] |
| 59 | + D --> E[Review metrics with<br/>highest anomaly rates] |
| 60 | + E --> F[Examine these metrics<br/>in detail to determine<br/>root cause] |
| 61 | + |
| 62 | + %% Apply styles |
| 63 | + class A,B neutral |
| 64 | + class C,D warning |
| 65 | + class E success |
| 66 | + class F danger |
| 67 | +``` |
| 68 | + |
| 69 | +:::tip |
| 70 | + |
| 71 | +**This workflow helps you move from detecting** that *"something is wrong"* **to understanding** exactly which components are behaving abnormally, significantly reducing your troubleshooting time. |
| 72 | + |
| 73 | +::: |
| 74 | + |
| 75 | +## API Access |
| 76 | + |
| 77 | +You can access anomaly detection data and use it with Metric Correlations through Netdata's API: |
| 78 | + |
| 79 | +<details> |
| 80 | +<summary><strong>Querying Anomaly Bits</strong></summary><br/> |
| 81 | + |
| 82 | +To get the anomaly bits for any metric, add the `options=anomaly-bit` parameter to your API query: |
| 83 | + |
| 84 | +``` |
| 85 | +https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-60&options=anomaly-bit |
| 86 | +``` |
| 87 | + |
| 88 | +Sample response: |
| 89 | + |
| 90 | +```json |
| 91 | +{ |
| 92 | + "labels": [ |
| 93 | + "time", |
| 94 | + "user" |
| 95 | + ], |
| 96 | + "data": [ |
| 97 | + [ |
| 98 | + 1684852570, |
| 99 | + 0 |
| 100 | + ], |
| 101 | + [ |
| 102 | + 1684852569, |
| 103 | + 0 |
| 104 | + ], |
| 105 | + [ |
| 106 | + 1684852568, |
| 107 | + 0 |
| 108 | + ], |
| 109 | + [ |
| 110 | + 1684852567, |
| 111 | + 0 |
| 112 | + ], |
| 113 | + [ |
| 114 | + 1684852566, |
| 115 | + 0 |
| 116 | + ], |
| 117 | + [ |
| 118 | + 1684852565, |
| 119 | + 0 |
| 120 | + ], |
| 121 | + [ |
| 122 | + 1684852564, |
| 123 | + 0 |
| 124 | + ], |
| 125 | + [ |
| 126 | + 1684852563, |
| 127 | + 0 |
| 128 | + ], |
| 129 | + [ |
| 130 | + 1684852562, |
| 131 | + 0 |
| 132 | + ], |
| 133 | + [ |
| 134 | + 1684852561, |
| 135 | + 0 |
| 136 | + ] |
| 137 | + ] |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +</details> |
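
As a rough sketch of how such a query could be assembled and read from Python: the helper names `build_data_url` and `anomalous_timestamps` are hypothetical, and treating any non-zero value as "anomalous" is an assumption of this sketch. The endpoint and query parameters are the ones shown in the example URL above.

```python
from urllib.parse import urlencode


def build_data_url(host: str, chart: str, dimension: str, after: int, **extra) -> str:
    """Build an /api/v1/data URL that asks for anomaly bits instead of raw values."""
    params = {
        "chart": chart,
        "dimensions": dimension,
        "after": after,
        **extra,  # optional parameters such as before= or points=
        "options": "anomaly-bit",
    }
    return f"https://{host}/api/v1/data?{urlencode(params)}"


def anomalous_timestamps(response: dict) -> list[int]:
    """Return the timestamps of rows where any dimension has a non-zero value."""
    return [row[0] for row in response["data"] if any(row[1:])]


print(build_data_url("your-netdata-node", "system.cpu", "user", -60))
```

Passing the decoded JSON body of the response to `anomalous_timestamps` returns an empty list for the sample above, because every bit in it is `0`.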
| 142 | + |
| 143 | +<details> |
| 144 | +<summary><strong>Querying Anomaly Rates</strong></summary><br/> |
| 145 | + |
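| 146 | +For anomaly rates over a time window, use the same parameter on an aggregated query. The example below covers the last 600 seconds (`after=-600&before=0`) and returns 10 points (`points=10`): |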
| 147 | + |
| 148 | +``` |
| 149 | +https://your-netdata-node/api/v1/data?chart=system.cpu&dimensions=user&after=-600&before=0&points=10&options=anomaly-bit |
| 150 | +``` |
| 151 | + |
| 152 | +Sample response showing the percentage of time each metric was anomalous: |
| 153 | + |
| 154 | +```json |
| 155 | +{ |
| 156 | + "labels": [ |
| 157 | + "time", |
| 158 | + "user" |
| 159 | + ], |
| 160 | + "data": [ |
| 161 | + [ |
| 162 | + 1684852770, |
| 163 | + 0 |
| 164 | + ], |
| 165 | + [ |
| 166 | + 1684852710, |
| 167 | + 20 |
| 168 | + ], |
| 169 | + [ |
| 170 | + 1684852650, |
| 171 | + 0 |
| 172 | + ], |
| 173 | + [ |
| 174 | + 1684852590, |
| 175 | + 10 |
| 176 | + ], |
| 177 | + [ |
| 178 | + 1684852530, |
| 179 | + 0 |
| 180 | + ], |
| 181 | + [ |
| 182 | + 1684852470, |
| 183 | + 0 |
| 184 | + ], |
| 185 | + [ |
| 186 | + 1684852410, |
| 187 | + 30 |
| 188 | + ], |
| 189 | + [ |
| 190 | + 1684852350, |
| 191 | + 0 |
| 192 | + ], |
| 193 | + [ |
| 194 | + 1684852290, |
| 195 | + 0 |
| 196 | + ], |
| 197 | + [ |
| 198 | + 1684852230, |
| 199 | + 0 |
| 200 | + ] |
| 201 | + ] |
| 202 | +} |
| 203 | +``` |
| 204 | + |
| 205 | +</details> |
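
A response like the one above could be filtered for windows whose anomaly rate exceeds a threshold. The following is a minimal sketch: `windows_above` is a hypothetical helper, the threshold of 15 is arbitrary, and `sample` is an abridged copy of the sample response.

```python
def windows_above(response: dict, dimension: str, threshold: float) -> list[tuple[int, float]]:
    """Return (timestamp, rate) pairs whose anomaly rate exceeds the threshold."""
    column = response["labels"].index(dimension)
    return [
        (row[0], row[column])
        for row in response["data"]
        if row[column] is not None and row[column] > threshold
    ]


# Abridged from the sample response above.
sample = {
    "labels": ["time", "user"],
    "data": [
        [1684852770, 0],
        [1684852710, 20],
        [1684852650, 0],
        [1684852590, 10],
        [1684852410, 30],
    ],
}

print(windows_above(sample, "user", 15))  # [(1684852710, 20), (1684852410, 30)]
```
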
| 206 | + |
| 207 | +:::tip |
| 208 | + |
| 209 | +You can programmatically access this data to build custom dashboards or alerts based on anomaly patterns in your infrastructure. |
| 210 | + |
| 211 | +::: |
| 212 | + |
22 | 213 | ## Metric Correlations Options
|
23 | 214 |
|
24 |
| -Metric Correlations offers adjustable parameters for deeper data exploration. Since different data types and incidents require different approaches, these settings allow for flexible analysis. |
| 215 | +Metric Correlations offers adjustable parameters for deeper data exploration. Since different data types and incidents require different approaches, **these settings allow for flexible analysis**. |
25 | 216 |
|
26 |
| -### Method |
| 217 | +<details> |
| 218 | +<summary><strong>Method</strong></summary><br/> |
27 | 219 |
|
28 | 220 | Two algorithms are available for scoring metrics based on changes between the baseline and highlight windows:
|
29 | 221 |
|
30 |
| -- **`KS2` (Kolmogorov-Smirnov Test)**: A statistical method comparing distributions between the highlighted and baseline windows to detect significant changes. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L212). |
31 |
| -- **`Volume`**: A heuristic approach based on percentage change in averages, designed to handle edge cases. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L516). |
| 222 | +* **`KS2` (Kolmogorov-Smirnov Test)**: A statistical method comparing distributions between the highlighted and baseline windows to detect significant changes. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L212). |
| 223 | +* **`Volume`**: A heuristic approach based on percentage change in averages, designed to handle edge cases. [Implementation details](https://github.com/netdata/netdata/blob/d917f9831c0a1638ef4a56580f321eb6c9a88037/database/metric_correlations.c#L516). |
| 224 | + |
| 225 | +</details> |
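
To make the difference between the two scores concrete, here is a simplified Python sketch of each idea. It is not Netdata's implementation (see the linked C source for that): `ks_statistic` computes the textbook two-sample Kolmogorov-Smirnov statistic, and `percent_change` is a plain percentage change in averages that returns `None` for a zero baseline instead of reproducing Netdata's edge-case handling. Both function names are assumptions of this sketch.

```python
from bisect import bisect_right
from statistics import fmean


def ks_statistic(baseline, highlight):
    """Largest gap between the empirical CDFs of the two windows (0.0 to 1.0)."""
    a, b = sorted(baseline), sorted(highlight)
    # The maximum gap between two step functions occurs at one of the samples.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def percent_change(baseline, highlight):
    """Percentage change of the highlight average relative to the baseline average."""
    base = fmean(baseline)
    if base == 0:
        return None  # undefined in this sketch; real implementations need a rule here
    return (fmean(highlight) - base) / abs(base) * 100


steady = [5, 5, 5, 5]
noisy = [1, 9, 1, 9]  # same average as `steady`, very different spread

print(percent_change(steady, noisy))  # 0.0
print(ks_statistic(steady, noisy))    # 0.5
```

In this example the averages are identical, so the average-based score sees no change, while the distribution-based score does.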
| 226 | + |
| 227 | +<details> |
| 228 | +<summary><strong>Aggregation</strong></summary><br/> |
32 | 229 |
|
33 |
| -### Aggregation |
| 230 | +To accommodate different window lengths, Netdata aggregates your raw data as needed. The default aggregation method is `Average`, but you can also choose `Median`, `Min`, `Max`, or `Stddev`. |
| 231 | +</details> |
34 | 232 |
|
35 |
| -To accommodate different window lengths, Netdata aggregates raw data as needed. The default aggregation method is `Average`, but you can also choose `Median`, `Min`, `Max`, or `Stddev`. |
| 233 | +<details> |
| 234 | +<summary><strong>Data Type</strong></summary><br/> |
36 | 235 |
|
37 |
| -### Data Type |
| 236 | +Netdata assigns an [Anomaly Bit](https://github.com/netdata/netdata/tree/master/src/ml#anomaly-bit) to each of your metrics in real-time, flagging whether it deviates significantly from normal behavior. You can analyze either raw data or anomaly rates: |
38 | 237 |
|
39 |
| -Netdata assigns an [Anomaly Bit](https://github.com/netdata/netdata/tree/master/src/ml#anomaly-bit) to each metric in real-time, flagging whether it deviates significantly from normal behavior. You can analyze either raw data or anomaly rates: |
| 238 | +* **`Metrics`**: Runs Metric Correlations on your raw metric values. |
| 239 | +* **`Anomaly Rate`**: Runs Metric Correlations on anomaly rates for each of your metrics. |
40 | 240 |
|
41 |
| -- **`Metrics`**: Runs Metric Correlations on raw metric values. |
42 |
| -- **`Anomaly Rate`**: Runs Metric Correlations on anomaly rates for each metric. |
| 241 | +</details> |
43 | 242 |
|
44 | 243 | ## Metric Correlations on the Agent
|
45 | 244 |
|
46 | 245 | Metric Correlations (MC) requests to Netdata Cloud are handled in two ways:
|
47 | 246 |
|
48 |
| -1. If MC is enabled on any node, the request is routed to the highest-level node (a Parent node or the node itself). |
49 |
| -2. If MC is not enabled on any node, Netdata Cloud processes the request by collecting data from nodes and computing correlations on its backend. |
| 247 | +1. **If MC is enabled** on any of your nodes, the request is routed to the highest-level node (a Parent node or the node itself). |
| 248 | +2. **If MC is not enabled** on any of your nodes, Netdata Cloud processes the request by collecting data from your nodes and computing correlations on its backend. |
| 249 | + |
| 250 | +## Interpreting Combined Results |
| 251 | + |
| 252 | +When you use Metric Correlations together with Anomaly Detection, you'll want to understand how to interpret the results: |
| 253 | + |
| 254 | +:::tip |
| 255 | + |
| 256 | +**High anomaly rates combined with significant metric changes** often indicate genuine issues rather than false positives. |
| 257 | + |
| 258 | +::: |
| 259 | + |
| 260 | +Here's how to interpret different scenarios: |
| 261 | + |
| 262 | +| Anomaly Rate | Metric Correlation | Interpretation | |
| 263 | +|--------------|--------------------|------------------------------------------------------| |
| 264 | +| High | Strong | Likely a significant issue affecting system behavior | |
| 265 | +| High | Weak | Possible edge case or intermittent issue | |
| 266 | +| Low | Strong | Normal but significant change in system behavior | |
| 267 | +| Low | Weak | Likely normal system operation | |
| 268 | + |
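
If you script your triage, the table above can be encoded as a simple lookup. This is only a sketch: the `interpret` helper is hypothetical, and deciding what counts as a "high" anomaly rate or a "strong" correlation is left to the caller, since the table gives no thresholds.

```python
# Keys are (anomaly rate is high, correlation is strong); values quote the table.
INTERPRETATION = {
    (True, True): "Likely a significant issue affecting system behavior",
    (True, False): "Possible edge case or intermittent issue",
    (False, True): "Normal but significant change in system behavior",
    (False, False): "Likely normal system operation",
}


def interpret(high_anomaly_rate: bool, strong_correlation: bool) -> str:
    """Look up the interpretation for one combination of the two signals."""
    return INTERPRETATION[(high_anomaly_rate, strong_correlation)]


print(interpret(True, False))  # Possible edge case or intermittent issue
```
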
| 269 | +:::tip |
| 270 | + |
| 271 | +By examining both the anomaly rate and the correlation strength, you can prioritize your troubleshooting efforts more effectively. |
| 272 | + |
| 273 | +::: |
50 | 274 |
|
51 | 275 | ## Usage Tips
|
52 | 276 |
|
53 |
| -- When running Metric Correlations from the [Metrics tab](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md) across multiple nodes, refine your results by grouping by node: |
54 |
| - 1. Run MC on all nodes if you're unsure which ones are relevant. |
55 |
| - 2. Group the most interesting charts by node to determine whether changes affect all nodes or just a subset. |
56 |
| - 3. If a subset of nodes stands out, filter for those nodes and rerun MC to get more precise results. |
| 277 | +:::tip |
| 278 | + |
| 279 | +When running Metric Correlations from the [Metrics tab](/docs/dashboards-and-charts/metrics-tab-and-single-node-tabs.md) across multiple nodes, refine your results by grouping by node: |
| 280 | + |
| 281 | +1. Run MC on all your nodes if you're unsure which ones are relevant. |
| 282 | +2. Group the most interesting charts by node to determine whether changes affect all your nodes or just a subset. |
| 283 | +3. If a subset of your nodes stands out, filter for those nodes and rerun MC to get more precise results. |
| 284 | + |
| 285 | +Choose the **`Volume`** algorithm for sparse metrics (e.g., request latency with few requests). Otherwise, use **`KS2`**. |
| 286 | + |
| 287 | +- **`KS2`** is ideal for detecting complex distribution changes in your metrics, such as shifts in variance. |
| 288 | +- **`Volume`** is better for detecting metrics that were inactive and then spiked (or vice versa). |
| 289 | + |
| 290 | +**Example:** |
57 | 291 |
|
58 |
| -- Choose the **`Volume`** algorithm for sparse metrics (e.g., request latency with few requests). Otherwise, use **`KS2`**. |
59 |
| - - **`KS2`** is ideal for detecting complex distribution changes, such as shifts in variance. |
60 |
| - - **`Volume`** is better for detecting metrics that were inactive and then spiked (or vice versa). |
| 292 | +- `Volume` can highlight network traffic suddenly turning on in your system. |
| 293 | +- `KS2` can detect entropy distribution changes in your data that `Volume` misses. |
61 | 294 |
|
62 |
| - **Example:** |
63 |
| - - `Volume` can highlight network traffic suddenly turning on. |
64 |
| - - `KS2` can detect entropy distribution changes missed by `Volume`. |
| 295 | +Combine **`Volume`** and **`Anomaly Rate`** to identify the most anomalous metrics within your selected timeframe. Expand the anomaly rate chart to visualize results more clearly. |
65 | 296 |
|
66 |
| -- Combine **`Volume`** and **`Anomaly Rate`** to identify the most anomalous metrics within a timeframe. Expand the anomaly rate chart to visualize results more clearly. |
| 297 | +::: |