Updated Prometheus metrics (#11141)

* Add metrics for nodes online and offline
* Add cluster capacity metrics
* Introduce v2 metrics
2025-11-07 21:02:58 -05:00 · 2021-01-18 20:35:38 -08:00
parent 3bda8f755c
commit b4add82bb6
27 changed files with 1669 additions and 252 deletions
--- a/docs/metrics/README.md
+++ b/docs/metrics/README.md
@@ -13,8 +13,15 @@ Read more on how to use these endpoints in [MinIO healthcheck guide](https://git

 ### Prometheus Probe

-MinIO server exposes Prometheus compatible data on a single endpoint. By default, the endpoint is authenticated.
+MinIO allows reading metrics for the entire cluster from any single node. The cluster-wide metrics can be read at
+`<Address for MinIO Service>/minio/v2/metrics/cluster`.

- Prometheus data available at `/minio/prometheus/metrics`
+Additional node-specific metrics, which include Go metrics and process metrics, are exposed at
+`<Address for MinIO Node>/minio/v2/metrics/node`.

 To use this endpoint, setup Prometheus to scrape data from this endpoint. Read more on how to configure and use Prometheus to monitor MinIO server in [How to monitor MinIO server with Prometheus](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/README.md).
+
+**Deprecated metrics monitoring**
+
+- Prometheus data available at `/minio/prometheus/metrics` is deprecated.
+
--- a/docs/metrics/prometheus/README.md
+++ b/docs/metrics/prometheus/README.md
@@ -1,8 +1,13 @@
 # How to monitor MinIO server with Prometheus [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)

-[Prometheus](https://prometheus.io) is a cloud-native monitoring platform, built originally at SoundCloud. Prometheus offers a multi-dimensional data model with time series data identified by metric name and key/value pairs. The data collection happens via a pull model over HTTP/HTTPS. Targets to pull data from are discovered via service discovery or static configuration.
+[Prometheus](https://prometheus.io) is a cloud-native monitoring platform. 

-MinIO exports Prometheus compatible data by default as an authorized endpoint at `/minio/prometheus/metrics`. Users looking to monitor their MinIO instances can point Prometheus configuration to scrape data from this endpoint.
+Prometheus offers a multi-dimensional data model with time series data identified by metric name and key/value pairs. 
+The data collection happens via a pull model over HTTP/HTTPS.
+
+MinIO exports Prometheus compatible data by default as an authenticated endpoint at `/minio/v2/metrics/cluster`.
+
+Users looking to monitor their MinIO instances can point Prometheus configuration to scrape data from this endpoint.

 This document explains how to setup Prometheus and configure it to scrape data from MinIO servers.

@@ -20,7 +25,8 @@ This document explains how to setup Prometheus and configure it to scrape data f
 - [List of metrics exposed by MinIO](#list-of-metrics-exposed-by-minio)

 ## Prerequisites
-To get started with MinIO, refer [MinIO QuickStart Document](https://docs.min.io/docs/minio-quickstart-guide). Follow below steps to get started with MinIO monitoring using Prometheus.
+To get started with MinIO, refer to the [MinIO QuickStart Document](https://docs.min.io/docs/minio-quickstart-guide).
+Follow the steps below to get started with MinIO monitoring using Prometheus.

 ### 1. Download Prometheus

@@ -68,7 +74,7 @@ The command will generate the `scrape_configs` section of the prometheus.yml as
 scrape_configs:
 - job_name: minio-job
  bearer_token: <secret>
-  metrics_path: /minio/prometheus/metrics
+  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  static_configs:
  - targets: ['localhost:9000']
@@ -77,16 +83,26 @@ scrape_configs:
 #### 3.2 Public Prometheus config

 If the Prometheus endpoint authentication type is set to `public`, the following Prometheus config is sufficient to start scraping metrics data from MinIO.
-
+Cluster metrics can be collected from any one server, once per collection.
+##### Cluster
 ```yaml
 scrape_configs:
 - job_name: minio-job
-  metrics_path: /minio/prometheus/metrics
+  metrics_path: /minio/v2/metrics/cluster
+  scheme: http
+  static_configs:
+  - targets: ['localhost:9000']
+```
+##### Node
+Optionally, you can also collect per-node metrics. These must be collected from each server instance individually.
+```yaml
+scrape_configs:
+- job_name: minio-job
+  metrics_path: /minio/v2/metrics/node
  scheme: http
  static_configs:
  - targets: ['localhost:9000']
 ```
-
 ### 4. Update `scrape_configs` section in prometheus.yml

 To authorize every scrape request, copy and paste the generated `scrape_configs` section in the prometheus.yml and restart the Prometheus service.
@@ -103,172 +119,16 @@ Here `prometheus.yml` is the name of configuration file. You can now see MinIO m

 ### 6. Configure Grafana

-After Prometheus is configured, you can use Grafana to visualize MinIO metrics. Refer the [document here to setup Grafana with MinIO prometheus metrics](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/README.md).
+After Prometheus is configured, you can use Grafana to visualize MinIO metrics.
+Refer to the [document here to set up Grafana with MinIO Prometheus metrics](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/README.md).

 ## List of metrics exposed by MinIO

-MinIO server exposes the following metrics on `/minio/prometheus/metrics` endpoint. All of these can be accessed via Prometheus dashboard. The full list of exposed metrics along with their definition is available in the demo server at https://play.min.io:9000/minio/prometheus/metrics
+MinIO server exposes the following metrics on the `/minio/v2/metrics/cluster` endpoint.
+All of these can be accessed via the Prometheus dashboard.
+A sample list of exposed metrics along with their definitions is available from the demo server:
+`curl https://play.min.io:9000/minio/v2/metrics/cluster`

-These are the new set of metrics which will be in effect after `RELEASE.2019-10-16*`. Some of the key changes in this update are listed below.
-    - Metrics are bound the respective nodes and is not cluster-wide. Each and every node in a cluster will expose its own metrics.
-    - Additional metrics to cover the s3 and internode traffic statistics were added.
-    - Metrics that records the http statistics and latencies are labeled to their respective APIs (putobject,getobject etc).
-    - Disk usage metrics are distributed and labeled to the respective disk paths.
+### List of metrics reported 

-For more details, please check the `Migration guide for the new set of metrics`.
-
-The list of metrics and its definition are as follows. (NOTE: instance here is one MinIO node)
-
-> NOTES:
-    > 1. Instance here is one MinIO node.
-    > 2. `s3 requests` exclude internode requests.
-
-### Default set of information
-| name        | description                     |
-|:------------|:--------------------------------|
-| `go_`       | all standard go runtime metrics |
-| `process_`  | all process level metrics       |
-| `promhttp_` | all prometheus scrape metrics   |
-
-### MinIO node specific information
-| name                       | description                                                                    |
-|:---------------------------|:-------------------------------------------------------------------------------|
-| `minio_version_info`       | Current MinIO version with its commit-id                                       |
-| `minio_disks_offline`      | Total number of offline disks on current MinIO instance                        |
-| `minio_disks_total`        | Total number of disks on current MinIO instance                                |
-
-### Disk metrics are labeled by 'disk' which indentifies each disk
-| name                       | description                                                                    |
-|:---------------------------|:-------------------------------------------------------------------------------|
-| `disk_storage_total`       | Total size of the disk                                                         |
-| `disk_storage_used`        | Total disk space used per disk                                                 |
-| `disk_storage_available`   | Total available disk space per disk                                            |
-
-### S3 API metrics are labeled by 'api' which identifies different S3 API requests
-| name                       | description                                                                    |
-|:---------------------------|:-------------------------------------------------------------------------------|
-| `s3_requests_total`        | Total number of s3 requests in current MinIO instance                          |
-| `s3_errors_total`          | Total number of errors in s3 requests in current MinIO instance                |
-| `s3_requests_current`      | Total number of active s3 requests in current MinIO instance                   |
-| `s3_rx_bytes_total`        | Total number of s3 bytes received by current MinIO server instance             |
-| `s3_tx_bytes_total`        | Total number of s3 bytes sent by current MinIO server instance                 |
-| `s3_ttfb_seconds`          | Histogram that holds the latency information of the requests                   |
-
-#### Internode metrics only available in a distributed setup
-| name                       | description                                                                    |
-|:---------------------------|:-------------------------------------------------------------------------------|
-| `internode_rx_bytes_total` | Total number of internode bytes received by current MinIO server instance      |
-| `internode_tx_bytes_total` | Total number of bytes sent to the other nodes by current MinIO server instance |
-
-Apart from above metrics, MinIO also exposes below mode specific metrics
-
-### Bucket usage specific metrics
-All metrics are labeled by `bucket`, each metric is displayed per bucket. `buckets_objects_histogram` is additionally labeled by `object_size` string which is represented by any of the following values
-
- *LESS_THAN_1024_B*
- *BETWEEN_1024_B_AND_1_MB*
- *BETWEEN_1_MB_AND_10_MB*
- *BETWEEN_10_MB_AND_64_MB*
- *BETWEEN_64_MB_AND_128_MB*
- *BETWEEN_128_MB_AND_512_MB*
- *GREATER_THAN_512_MB*
-
-Units defintions:
- 1 MB = 1024 KB
- 1 KB = 1024 B
-
-| name                                | description                                         |
-|:------------------------------------|:----------------------------------------------------|
-| `bucket_usage_size`                 | Total size of the bucket                            |
-| `bucket_objects_count`              | Total number of objects in a bucket                 |
-| `bucket_objects_histogram`          | Total number of objects filtered by different sizes |
-| `bucket_replication_pending_size`   | Total capacity not replicated                       |
-| `bucket_replication_failed_size`    | Total capacity failed to replicate at least once    |
-| `bucket_replication_successful_size`| Total capacity successfully replicated              |
-| `bucket_replication_received_size`  | Total capacity received as replicated objects       |
-
-### Cache specific metrics
-
-MinIO Gateway instances enabled with Disk-Caching expose caching related metrics.
-
-#### Global cache metrics
-| name                 | description                                       |
-|:---------------------|:--------------------------------------------------|
-| `cache_hits_total`   | Total number of cache hits                        |
-| `cache_misses_total` | Total number of cache misses                      |
-| `cache_data_served`  | Total number of bytes served from cache           |
-
-#### Per disk cache metrics
-| name                   | description                                                                      |
-|:-----------------------|:---------------------------------------------------------------------------------|
-| `cache_usage_size`     | Total cache usage in bytes                                                       |
-| `cache_total_capacity` | Total size of cache disk                                                         |
-| `cache_usage_percent`  | Total percentage cache usage                                                     |
-| `cache_usage_state`    | Indicates cache usage is high or low, relative to current cache 'quota' settings |
-
-`cache_usage_state` holds only two states
-
- '1' indicates high disk usage
- '0' indicates low disk usage
-
-### Gateway specific metrics
-MinIO Gateway instance exposes metrics related to Gateway communication with the cloud backend (S3, Azure & GCS Gateway).
-
-`<gateway_type>` changes based on the gateway in use can be 's3', 'gcs' or 'azure'. Other metrics are labeled with `method` that identifies HTTP GET, HEAD, PUT and POST requests to the backend.
-
-| name                                    | description                                                                |
-|:----------------------------------------|:---------------------------------------------------------------------------|
-| `gateway_<gateway_type>_requests`       | Total number of requests made to the gateway backend                       |
-| `gateway_<gateway_type>_bytes_sent`     | Total number of bytes sent to cloud backend (in PUT & POST Requests)       |
-| `gateway_<gateway_type>_bytes_received` | Total number of bytes received from cloud backend (in GET & HEAD Requests) |
-
-Note that this is currently only support for Azure, S3 and GCS Gateway.
-
-### MinIO self-healing metrics - `self_heal_*`
-
-MinIO exposes self-healing related metrics for erasure-code deployments _only_. These metrics are _not_ available on Gateway or Single Node, Single Drive deployments. Note that these metrics will be exposed _only_ when there is a relevant event happening on MinIO server.
-
-| name                                 | description                                                                                                                                                                 |
-|:-------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `self_heal_time_since_last_activity` | Time elapsed since last self-healing related activity                                                                                                                       |
-| `self_heal_objects_scanned`          | Number of objects scanned by self-healing thread in its current run. This will reset when a fresh self-healing run starts. This is labeled with the object type scanned     |
-| `self_heal_objects_healed`           | Number of objects healing by self-healing thread in its current run. This will reset when a fresh self-healing run starts. This is labeled with the object type scanned     |
-| `self_heal_objects_heal_failed`      | Number of objects for which self-healing failed in its current run. This will reset when a fresh self-healing run starts. This is labeled with disk status and its endpoint |
-
-## Migration guide for the new set of metrics
-
-This migration guide applies for older releases or any releases before `RELEASE.2019-10-23*`
-
-### MinIO disk level metrics - `disk_*`
-
-The migrations include
-
- `minio_total_disks` to `minio_disks_total`
- `minio_offline_disks` to `minio_disks_offline`
-
-### MinIO disk level metrics - `disk_storage_*`
-
-These metrics have one label.
-
- `disk`: Holds the disk path
-
-The migrations include
-
- `minio_disk_storage_used_bytes` to `disk_storage_used`
- `minio_disk_storage_available_bytes` to `disk_storage_available`
- `minio_disk_storage_total_bytes` to `disk_storage_total`
-
-### MinIO network level metrics
-
-These metrics are detailed to cover the s3 and internode network statistics.
-
-The migrations include
-
- `minio_network_sent_bytes_total` to `s3_tx_bytes_total` and `internode_tx_bytes_total`
- `minio_network_received_bytes_total` to `s3_rx_bytes_total` and `internode_rx_bytes_total`
-
-Some of the additional metrics added were
-
- `s3_requests_total`
- `s3_errors_total`
- `s3_ttfb_seconds`
+[The list of metrics reported can be found here](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md)
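
The cluster and node examples in section 3.2 both use `job_name: minio-job`. If both endpoints are scraped by the same Prometheus instance, they would probably need distinct job names. A minimal sketch of a combined `scrape_configs` section, assuming the `public` authentication type, might look like the following. The job names `minio-cluster` and `minio-node` and the hosts `minio1:9000` and `minio2:9000` are hypothetical placeholders, not names introduced by this change.

```yaml
scrape_configs:
# Cluster-wide metrics: a single target is enough, because any node
# can report metrics for the entire cluster.
- job_name: minio-cluster            # hypothetical job name
  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  static_configs:
  - targets: ['minio1:9000']         # hypothetical host
# Node metrics: list every server, because each node exposes only its own.
- job_name: minio-node               # hypothetical job name
  metrics_path: /minio/v2/metrics/node
  scheme: http
  static_configs:
  - targets: ['minio1:9000', 'minio2:9000']   # hypothetical hosts
```

For an authenticated endpoint, each job would also carry the `bearer_token` shown in the generated config.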
--- a/docs/metrics/prometheus/list.md
+++ b/docs/metrics/prometheus/list.md
@@ -0,0 +1,47 @@
+# List of metrics reported cluster wide
+
+Each metric includes a label for the server that calculated the metric.
+Each metric has a label for the server that generated the metric.
+
+These metrics can be collected from any MinIO server once per collection.
+
+| Name                                           | Description                                                                                                                 |
+|:-----------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------|
+|`minio_bucket_objects_size_distribution`        |Distribution of object sizes in the bucket, includes label for the bucket name.                                              |
+|`minio_bucket_replication_failed_bytes`         |Total number of bytes failed at least once to replicate.                                                                     |
+|`minio_bucket_replication_pending_bytes`        |Total bytes pending to replicate.                                                                                            |
+|`minio_bucket_replication_received_bytes`       |Total number of bytes replicated to this bucket from another source bucket.                                                  |
+|`minio_bucket_replication_sent_bytes`           |Total number of bytes replicated to the target bucket.                                                                       |
+|`minio_bucket_usage_object_total`               |Total number of objects                                                                                                      |
+|`minio_bucket_usage_total_bytes`                |Total bucket size in bytes                                                                                                   |
+|`minio_cluster_capacity_raw_free_bytes`         |Total free capacity online in the cluster.                                                                                   |
+|`minio_cluster_capacity_raw_total_bytes`        |Total capacity online in the cluster.                                                                                        |
+|`minio_cluster_capacity_usable_free_bytes`      |Total free usable capacity online in the cluster.                                                                            |
+|`minio_cluster_capacity_usable_total_bytes`     |Total usable capacity online in the cluster.                                                                                 |
+|`minio_cluster_disk_offline_total`              |Total disks offline.                                                                                                         |
+|`minio_cluster_disk_online_total`               |Total disks online.                                                                                                          |
+|`minio_cluster_nodes_offline_total`             |Total number of MinIO nodes offline.                                                                                         |
+|`minio_cluster_nodes_online_total`              |Total number of MinIO nodes online.                                                                                          |
+|`minio_heal_objects_error_total`                |Objects for which healing failed in current self healing run                                                                 |
+|`minio_heal_objects_heal_total`                 |Objects healed in current self healing run                                                                                   |
+|`minio_heal_objects_total`                      |Objects scanned in current self healing run                                                                                  |
+|`minio_heal_time_last_activity_nano_seconds`    |Time elapsed (in nanoseconds) since last self-healing activity. This is set to -1 until initial self heal activity           |
+|`minio_inter_node_traffic_received_bytes`       |Total number of bytes received from other peer nodes.                                                                        |
+|`minio_inter_node_traffic_sent_bytes`           |Total number of bytes sent to the other peer nodes.                                                                          |
+|`minio_node_disk_free_bytes`                    |Total storage available on a disk.                                                                                           |
+|`minio_node_disk_total_bytes`                   |Total storage on a disk.                                                                                                     |
+|`minio_node_disk_used_bytes`                    |Total storage used on a disk.                                                                                                |
+|`minio_s3_requests_error_total`                 |Total number of S3 requests with errors                                                                                      |
+|`minio_s3_requests_inflight_total`              |Total number of S3 requests currently in flight.                                                                             |
+|`minio_s3_requests_total`                       |Total number of S3 requests                                                                                                  |
+|`minio_s3_time_ttbf_seconds_distribution`       |Distribution of the time to first byte across API calls.                                                                     |
+|`minio_s3_traffic_received_bytes`               |Total number of s3 bytes received.                                                                                           |
+|`minio_s3_traffic_sent_bytes`                   |Total number of s3 bytes sent                                                                                                |
+|`minio_cache_hits_total`                        |Total number of disk cache hits                                                                                              |
+|`minio_cache_missed_total`                      |Total number of disk cache misses                                                                                            |
+|`minio_cache_sent_bytes`                        |Total number of bytes served from cache                                                                                      |
+|`minio_cache_total_bytes`                       |Total size of cache disk in bytes                                                                                            |
+|`minio_cache_usage_info`                        |Total percentage cache usage, value of 1 indicates high and 0 low, label level is set as well                                |
+|`minio_cache_used_bytes`                        |Current cache usage in bytes                                                                                                 |
+|`minio_software_commit_info`                    |Git commit hash for the MinIO release.                                                                                       |
+|`minio_software_version_info`                   |MinIO Release tag for the server                                                                                             |
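
As one possible use of the cluster metrics listed above, a Prometheus alerting rule could watch for offline nodes. This is only a sketch: the group name, alert name, `5m` hold period, and `severity` label are assumptions chosen for illustration and are not part of this change.

```yaml
groups:
- name: minio-cluster-alerts         # hypothetical group name
  rules:
  - alert: MinioNodesOffline         # hypothetical alert name
    # Fires when the cluster reports at least one offline node.
    expr: minio_cluster_nodes_offline_total > 0
    for: 5m                          # assumed hold period before firing
    labels:
      severity: warning              # assumed label value
    annotations:
      summary: "One or more MinIO nodes are offline"
```

The same pattern could be applied to `minio_cluster_disk_offline_total` for offline disks.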