Commit Graph

144 Commits

Author SHA1 Message Date
Shireesh Anjal 4caa3422bd
Add process metrics in `metrics-v3` (#19612)
endpoint: /minio/metrics/v3/system/process
metrics:
- locks_read_total
- locks_write_total
- cpu_total_seconds
- go_routine_total
- io_rchar_bytes
- io_read_bytes
- io_wchar_bytes
- io_write_bytes
- start_time_seconds
- uptime_seconds
- file_descriptor_limit_total
- file_descriptor_open_total
- syscall_read_total
- syscall_write_total
- resident_memory_bytes
- virtual_memory_bytes
- virtual_memory_max_bytes

Since the standard process collector implements only a subset of these
metrics, remove it and implement our own custom process collector that
captures all the process metrics we need.
2024-04-26 09:07:23 -07:00
Harshavardhana c54ffde568
add metrics ioerror counter for alerts on I/O errors (#19618) 2024-04-25 15:01:31 -07:00
Bala FA 14cdadfb56
Add cluster notification metrics in metrics-v3 (#19533)
Signed-off-by: Bala.FA <bala@minio.io>
2024-04-23 21:10:35 -07:00
Shireesh Anjal f7b665347e
Add system CPU metrics to metrics-v3 (#19560)
endpoint: /minio/metrics/v3/system/cpu

metrics:
- minio_system_cpu_avg_idle
- minio_system_cpu_avg_iowait
- minio_system_cpu_load
- minio_system_cpu_load_perc
- minio_system_cpu_nice
- minio_system_cpu_steal
- minio_system_cpu_system
- minio_system_cpu_user
2024-04-23 16:56:12 -07:00
Shireesh Anjal ca5fab8656
Add cluster audit metrics in metrics-v3 (#19514)
endpoint: /minio/metrics/v3/cluster/audit
metrics:
- failed_messages (counter)
- total_messages (counter)
- target_queue_length (gauge)
2024-04-17 02:18:02 -07:00
Shireesh Anjal 6df76ca73c
Add system memory metrics in v3 (#19486)
Following memory metrics will be added under /system/memory

- available
- buffers
- cache
- free
- shared
- total
- used
- used_perc
2024-04-16 22:10:25 -07:00
Markus Wagner 0cf3d93360
removed hardcoded datasource uid (#19477) 2024-04-15 03:03:01 -07:00
Shubhendu d3a07c29ba
Correct sample for node scrape configuration (#19491)
As node metrics should be scraped per node basis, use a sample
configuartion using all the nodes in targets.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-04-12 08:49:30 -07:00
Harshavardhana 41ec038523
remove permission denied error for being drive error (#19478) 2024-04-11 14:22:15 -07:00
Shireesh Anjal 08d3d06a06
Add drive metrics in metrics-v3 (#19452)
Add following metrics:

- used_inodes
- total_inodes
- healing
- online
- reads_per_sec
- reads_kb_per_sec
- reads_await
- writes_per_sec
- writes_kb_per_sec
- writes_await
- perc_util

To be able to calculate the `per_sec` values, we capture the IOStats-related 
data in the beginning (along with the time at which they were captured), 
and compare them against the current values subsequently. This is because 
dividing by "time since server uptime." doesn't work in k8s environments.
2024-04-11 10:46:34 -07:00
Shubhendu d96d696841
Dont use deprecated angular (#19396)
Support for Angular would be stopped with newer versions of grafana

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-04-03 19:01:53 -07:00
Shubhendu d87f91720b
Split the replication dashboard in cluster and node level (#19374)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-03-28 10:15:39 -07:00
Shubhendu d63e603040
Pre populate the server names using a query (#19367)
User doesn't need to remember and enter the server values,
rather they can select from the pre populated list.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-03-28 08:14:26 -07:00
Shubhendu 3d4fc28ec9
Render node graphs by node (#19356)
As total drives count, online vs offline are per node basis, its
corect to select node for which graphs need to be rendered.

Set prometheus scrape jobs to fetch metrics from all nodes. A sample
scrape job for node metrics could be as below

```
- job_name: minio-job-node
  bearer_token: <token>
  metrics_path: /minio/v2/metrics/node
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets: [tenant1-ss-0-0.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-1.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-2.tenant1-hl.tenant-ns.svc.cluster.local:9000,tenant1-ss-0-3.tenant1-hl.tenant-ns.svc.cluster.local:9000]
```

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-03-27 10:41:08 -07:00
Shubhendu 53a14c7301
Adding dashboard for MinIO node metrics (#19329)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-03-26 08:01:28 -07:00
Aditya Manthramurthy b2c5b75efa
feat: Add Metrics V3 API (#19068)
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.

This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`

The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:

when PATH is a sub-endpoint listed above => serves the group of
metrics under that path; or when PATH is a (non-empty) parent 
directory of the sub-endpoints listed above => serves metrics
from each child sub-endpoint of PATH. otherwise, returns a no 
resource found error

All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
2024-03-10 01:15:15 -08:00
Ravind Kumar f3e7c42425
Update metrics list.md with new metrics from RELEASE.2024-01-05 (#19161) 2024-02-29 14:53:54 -08:00
Shubhendu f46bee242c
Re-organized grafana dashboards (#19157)
Moved different dashboards to their specific directories. Also
mentioned that these dashbards are examples of how to create
graphs using MinIO provided and metrics and customers should
change / add graphs on their specific need basis.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-02-29 10:35:20 -08:00
schmittey c44f311c4f
Add missing yaml syntax highlighting in prometheus README.md (#19087) 2024-02-20 16:22:37 -08:00
Shubhendu cb7dab17cb
Graph cluster and bucket replication proxied requests (#19078)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-02-20 01:45:00 -08:00
Harshavardhana eac4e4b279
honor replaced disk properly by updating globalLocalDrives (#19038)
globalLocalDrives seem to be not updated during the
HealFormat() leads to a requirement where the server
needs to be restarted for the healing to continue.
2024-02-12 13:00:20 -08:00
Poorna 27d02ea6f7
metrics: add replication metrics on proxied requests (#18957) 2024-02-05 22:00:45 -08:00
Daniel Valdivia 403ec7cf21
fix: metrics URI path in prometheus docs (#18907) 2024-01-29 14:34:21 -08:00
Harshavardhana 944f3c1477
remove local disk metrics from cluster metrics (#18886)
local disk metrics were polluting cluster metrics
Please remove them instead of adding relevant ones.

- batch job metrics were incorrectly kept at bucket
  metrics endpoint, move it to cluster metrics.

- add tier metrics to cluster peer metrics from the node.

- fix missing set level cluster health metrics
2024-01-28 12:53:59 -08:00
Cesar N 1a91edecae
Update list.md node_cpu wording (#18878) 2024-01-26 18:57:58 -08:00
Harshavardhana e11d851aee
add new drive I/O waiting/tokens metric (#18836)
Bonus: add virtual memory used as well part of the system resource metrics.
2024-01-19 14:51:36 -08:00
Harshavardhana dd2542e96c
add codespell action (#18818)
Original work here, #18474,  refixed and updated.
2024-01-17 23:03:17 -08:00
Shubhendu 9434fff215
Added list of scanner metrics to document (#18731)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2024-01-02 10:41:33 -08:00
Mario Bros fbd8dfe60f
Adding ~ to match job when multiple jobs (#18706) 2023-12-27 15:39:20 -08:00
Shubhendu 9d7660b409
Graph cluster wide where applicable (#18705)
Graph the maximum value reported across nodes at cluster
level for applicable scenarios.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-12-22 08:14:32 -08:00
Krishnan Parthasarathi 56b7045c20
Export tier metrics (#18678)
minio_node_tier_ttlb_seconds - Distribution of time to last byte for streaming objects from warm tier
minio_node_tier_requests_success - Number of requests to download object from warm tier that were successful
minio_node_tier_requests_failure - Number of requests to download object from warm tier that failed
2023-12-20 20:13:40 -08:00
Anugrah Vijay 6acf038a84
docs: fix bucket metrics API\ path in docs (#18661) 2023-12-18 08:21:08 -08:00
Praveen raj Mani 10ca0a6936
Label the notification target metrics by their target IDs (#18633)
This patch adds the targetID to the existing notification target metrics
and deprecates the current target metrics which points to the overall
event notification subsystem
2023-12-14 09:09:26 -08:00
Shubhendu 6d4c1156d6
Changed the expression to render the value (#18627)
The metrics `minio_bucket_replication_received_bytes` and
`minio_bucket_replication_sent_bytes` are additive in nature
and rendering the value as is looks fine.

Also added sort order for few graphs for better reading of tool
tips as keeping ones with highest value at top helps.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-12-13 10:05:47 -08:00
Shireesh Anjal 7350a29fec
Capture percentage of cpu load and memory used (#18596)
By default the cpu load is the cumulative of all cores. Capture the
percentage load (load * 100 / cpu-count)

Also capture the percentage memory used (used * 100 / total)
2023-12-06 13:19:59 -08:00
Harshavardhana e98172d72d
avoid hot-tier SLA to be tied to warm-tier SLA (#18581)
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.

Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
2023-12-02 13:02:12 -08:00
Shubhendu 317b40ef90
Fixed broken docs link (#18486)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-11-20 12:04:49 -08:00
Shubhendu e938ece492
Added guidelines for setting prometheus alerts (#18479)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-11-19 10:16:08 -08:00
Shubhendu e4b619ce1a
Added graph for Erasure Set Tolerance value (#18472)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-11-17 10:38:15 -08:00
vicmunoz da95a2d13f
fix: object versions metric help (#18388) 2023-11-03 11:43:52 -07:00
Shubhendu ef67c39910
Added graphs for KMS metrics (#18321)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-10-30 03:20:53 -07:00
Shireesh Anjal 6d20ec3bea
Add support for resource metrics (#18057)
Add a new endpoint for "resource" metrics `/v2/metrics/resource`

This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.

Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.

Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
2023-09-30 13:40:20 -07:00
Harshavardhana 822cbd4b43 add couple of missing things from #18027 2023-09-13 23:26:48 -07:00
Ravind Kumar 3c19a9308d
DOCS-987: Reorganizing list.md for better RST compatibility (#18027) 2023-09-13 23:23:37 -07:00
Shubhendu e47e625f73
Added replication graphs for site replication metrics (#17951)
This dashboard graphs the metrics when site replication is enabled
across MinIO instances.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-08-31 08:31:16 -07:00
Shubhendu 0ce9e00ffa
Added node scanner and node drives graphs (#17949)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-08-30 14:01:51 -07:00
Shubhendu c778c381b5
Added new bucket replication graphs (#17947)
This PR adds new bucket replication graphs for better and granular
monitoring of bucket replication. Also arranged all replication graphs
together.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-08-30 11:57:41 -07:00
Poorna b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Shubhendu c3c8441a1d
Corrected the count of buckets and objects graphs (#17883)
In distributed setup with a load balancer, randmoly any server
would report the metrics `minio_cluster_bucket_total` and
`minio_cluster_usage_object_total` and while graphing it, we should
take max of reported values.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-08-21 09:04:38 -07:00
Harshavardhana 8a9b886011
update grafana dashboard with disk -> drive rename (#17857) 2023-08-15 16:04:20 -07:00