mirror of
https://github.com/minio/minio.git
synced 2025-11-07 21:02:58 -05:00
Updated Prometheus metrics (#11141)
* Add metrics for nodes online and offline
* Add cluster capacity metrics
* Introduce v2 metrics
@@ -13,8 +13,15 @@ Read more on how to use these endpoints in [MinIO healthcheck guide](https://git
|
||||
|
||||
### Prometheus Probe
|
||||
|
||||
MinIO server exposes Prometheus compatible data on a single endpoint. By default, the endpoint is authenticated.
|
||||
MinIO allows reading metrics for the entire cluster from any single node. The cluster wide metrics can be read at
|
||||
`<Address for MinIO Service>/minio/v2/metrics/cluster`.
|
||||
|
||||
- Prometheus data available at `/minio/prometheus/metrics`
|
||||
Additional node-specific metrics, which include Go runtime and process metrics, are exposed at
|
||||
`<Address for MinIO Node>/minio/v2/metrics/node`.
|
||||
|
||||
To use this endpoint, set up Prometheus to scrape data from it. Read more on how to configure and use Prometheus to monitor MinIO server in [How to monitor MinIO server with Prometheus](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/README.md).
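Before wiring up Prometheus, it can help to read the endpoint by hand. The following is a minimal Python sketch, not MinIO tooling: it assumes the `/minio/v2/metrics/cluster` and `/minio/v2/metrics/node` paths used in the scrape configuration of the monitoring guide, and bearer-token authentication as in the generated `scrape_configs`. The helper names, the address and the token are placeholders chosen for illustration.

```python
import urllib.request

# Paths assumed from the scrape configuration in the monitoring guide.
CLUSTER_PATH = "/minio/v2/metrics/cluster"
NODE_PATH = "/minio/v2/metrics/node"


def build_metrics_request(address, path=CLUSTER_PATH, token=None, scheme="http"):
    """Build (but do not send) a GET request for a MinIO metrics endpoint.

    `address` is host:port; `token` is the bearer token, or None when the
    endpoint authentication type is public.
    """
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{scheme}://{address}{path}", headers=headers)


def fetch_metrics(address, path=CLUSTER_PATH, token=None, scheme="http", timeout=10):
    """Send the request and return the raw Prometheus text exposition."""
    req = build_metrics_request(address, path, token, scheme)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_metrics("localhost:9000", token="<secret>")` would return the cluster metrics as text, provided a server is listening at that (hypothetical) address.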
|
||||
|
||||
**Deprecated metrics monitoring**
|
||||
|
||||
- Prometheus data available at `/minio/prometheus/metrics` is deprecated
|
||||
|
||||
|
||||
@@ -1,8 +1,13 @@
|
||||
# How to monitor MinIO server with Prometheus
|
||||
|
||||
[Prometheus](https://prometheus.io) is a cloud-native monitoring platform, built originally at SoundCloud. Prometheus offers a multi-dimensional data model with time series data identified by metric name and key/value pairs. The data collection happens via a pull model over HTTP/HTTPS. Targets to pull data from are discovered via service discovery or static configuration.
|
||||
[Prometheus](https://prometheus.io) is a cloud-native monitoring platform.
|
||||
|
||||
MinIO exports Prometheus compatible data by default as an authorized endpoint at `/minio/prometheus/metrics`. Users looking to monitor their MinIO instances can point Prometheus configuration to scrape data from this endpoint.
|
||||
Prometheus offers a multi-dimensional data model with time series data identified by metric name and key/value pairs.
|
||||
The data collection happens via a pull model over HTTP/HTTPS.
|
||||
|
||||
MinIO exports Prometheus-compatible data by default on an authenticated endpoint at `/minio/v2/metrics/cluster`.
|
||||
|
||||
Users looking to monitor their MinIO instances can point Prometheus configuration to scrape data from this endpoint.
|
||||
|
||||
This document explains how to set up Prometheus and configure it to scrape data from MinIO servers.
|
||||
|
||||
@@ -20,7 +25,8 @@ This document explains how to setup Prometheus and configure it to scrape data f
|
||||
- [List of metrics exposed by MinIO](#list-of-metrics-exposed-by-minio)
|
||||
|
||||
## Prerequisites
|
||||
To get started with MinIO, refer to the [MinIO QuickStart Document](https://docs.min.io/docs/minio-quickstart-guide). Follow the steps below to get started with MinIO monitoring using Prometheus.
|
||||
To get started with MinIO, refer to the [MinIO QuickStart Document](https://docs.min.io/docs/minio-quickstart-guide).
|
||||
Follow the steps below to get started with MinIO monitoring using Prometheus.
|
||||
|
||||
### 1. Download Prometheus
|
||||
|
||||
@@ -68,7 +74,7 @@ The command will generate the `scrape_configs` section of the prometheus.yml as
|
||||
scrape_configs:
- job_name: minio-job
  bearer_token: <secret>
  metrics_path: /minio/prometheus/metrics
  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  static_configs:
  - targets: ['localhost:9000']
|
||||
@@ -77,16 +83,26 @@ scrape_configs:
|
||||
#### 3.2 Public Prometheus config
|
||||
|
||||
If the Prometheus endpoint authentication type is set to `public`, the following Prometheus config is sufficient to start scraping metrics data from MinIO.
|
||||
|
||||
The cluster metrics can be collected from any single server, once per collection.
|
||||
##### Cluster
|
||||
```yaml
scrape_configs:
- job_name: minio-job
  metrics_path: /minio/prometheus/metrics
  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  static_configs:
  - targets: ['localhost:9000']
```
|
||||
##### Node
|
||||
Optionally, you can also collect per-node metrics. These must be scraped from each server instance individually.
|
||||
```yaml
scrape_configs:
- job_name: minio-job
  metrics_path: /minio/v2/metrics/node
  scheme: http
  static_configs:
  - targets: ['localhost:9000']
```
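The difference between the two configurations is that cluster metrics need only one target, while node metrics need every server listed. A small Python sketch of generating both jobs from a node list follows; the job names `minio-cluster` and `minio-node` and the node addresses are hypothetical, and the output is plain text in the same layout as the fragments above rather than the result of any MinIO command.

```python
def scrape_job(job_name, metrics_path, targets, scheme="http", bearer_token=None):
    """Render one scrape job in the layout used by the config fragments above."""
    lines = [f"- job_name: {job_name}"]
    if bearer_token:
        # Only needed when the endpoint is not set to public authentication.
        lines.append(f"  bearer_token: {bearer_token}")
    lines += [
        f"  metrics_path: {metrics_path}",
        f"  scheme: {scheme}",
        "  static_configs:",
        "  - targets: [" + ", ".join(f"'{t}'" for t in targets) + "]",
    ]
    return "\n".join(lines)


def scrape_configs(nodes, scheme="http", bearer_token=None):
    """Cluster metrics from the first node only; node metrics from every node."""
    jobs = [
        scrape_job("minio-cluster", "/minio/v2/metrics/cluster",
                   nodes[:1], scheme, bearer_token),
        scrape_job("minio-node", "/minio/v2/metrics/node",
                   nodes, scheme, bearer_token),
    ]
    return "scrape_configs:\n" + "\n".join(jobs) + "\n"


if __name__ == "__main__":
    print(scrape_configs(["minio1:9000", "minio2:9000"]))
```

Using a separate job name per endpoint keeps the cluster and node series distinguishable by their `job` label in Prometheus.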
|
||||
|
||||
### 4. Update `scrape_configs` section in prometheus.yml
|
||||
|
||||
To authorize every scrape request, copy and paste the generated `scrape_configs` section in the prometheus.yml and restart the Prometheus service.
|
||||
@@ -103,172 +119,16 @@ Here `prometheus.yml` is the name of configuration file. You can now see MinIO m
|
||||
|
||||
### 6. Configure Grafana
|
||||
|
||||
After Prometheus is configured, you can use Grafana to visualize MinIO metrics. Refer to the [document here to set up Grafana with MinIO Prometheus metrics](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/README.md).
|
||||
After Prometheus is configured, you can use Grafana to visualize MinIO metrics.
|
||||
Refer to the [document here to set up Grafana with MinIO Prometheus metrics](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/README.md).
|
||||
|
||||
## List of metrics exposed by MinIO
|
||||
|
||||
MinIO server exposes the following metrics on `/minio/prometheus/metrics` endpoint. All of these can be accessed via Prometheus dashboard. The full list of exposed metrics along with their definition is available in the demo server at https://play.min.io:9000/minio/prometheus/metrics
|
||||
MinIO server exposes the following metrics on the `/minio/v2/metrics/cluster` endpoint.
|
||||
All of these can be accessed via Prometheus dashboard.
|
||||
A sample list of exposed metrics, along with their definitions, is available from the demo server:

`curl https://play.min.io:9000/minio/v2/metrics/cluster`
|
||||
|
||||
These are the new set of metrics which will be in effect after `RELEASE.2019-10-16*`. Some of the key changes in this update are listed below.
|
||||
- Metrics are bound to their respective nodes and are not cluster-wide. Every node in a cluster exposes its own metrics.
- Additional metrics covering S3 and internode traffic statistics were added.
- Metrics that record HTTP statistics and latencies are labeled by their respective APIs (putobject, getobject, etc.).
- Disk usage metrics are distributed and labeled by their respective disk paths.
|
||||
### List of metrics reported
|
||||
|
||||
For more details, please check the `Migration guide for the new set of metrics`.
|
||||
|
||||
The list of metrics and their definitions is as follows. (NOTE: an instance here is one MinIO node.)
|
||||
|
||||
> NOTES:
|
||||
> 1. Instance here is one MinIO node.
|
||||
> 2. `s3 requests` exclude internode requests.
|
||||
|
||||
### Default set of information
|
||||
| name        | description                     |
|:------------|:--------------------------------|
| `go_`       | all standard go runtime metrics |
| `process_`  | all process level metrics       |
| `promhttp_` | all prometheus scrape metrics   |
|
||||
### MinIO node specific information
|
||||
| name                  | description                                             |
|:----------------------|:--------------------------------------------------------|
| `minio_version_info`  | Current MinIO version with its commit-id                |
| `minio_disks_offline` | Total number of offline disks on current MinIO instance |
| `minio_disks_total`   | Total number of disks on current MinIO instance         |
|
||||
|
||||
### Disk metrics are labeled by 'disk' which identifies each disk
|
||||
| name                     | description                         |
|:-------------------------|:------------------------------------|
| `disk_storage_total`     | Total size of the disk              |
| `disk_storage_used`      | Total disk space used per disk      |
| `disk_storage_available` | Total available disk space per disk |
|
||||
|
||||
### S3 API metrics are labeled by 'api' which identifies different S3 API requests
|
||||
| name                  | description                                                        |
|:----------------------|:-------------------------------------------------------------------|
| `s3_requests_total`   | Total number of s3 requests in current MinIO instance              |
| `s3_errors_total`     | Total number of errors in s3 requests in current MinIO instance    |
| `s3_requests_current` | Total number of active s3 requests in current MinIO instance       |
| `s3_rx_bytes_total`   | Total number of s3 bytes received by current MinIO server instance |
| `s3_tx_bytes_total`   | Total number of s3 bytes sent by current MinIO server instance     |
| `s3_ttfb_seconds`     | Histogram that holds the latency information of the requests       |
|
||||
|
||||
#### Internode metrics only available in a distributed setup
|
||||
| name                       | description                                                                    |
|:---------------------------|:-------------------------------------------------------------------------------|
| `internode_rx_bytes_total` | Total number of internode bytes received by current MinIO server instance      |
| `internode_tx_bytes_total` | Total number of bytes sent to the other nodes by current MinIO server instance |
|
||||
|
||||
Apart from the above metrics, MinIO also exposes the mode-specific metrics below.
|
||||
|
||||
### Bucket usage specific metrics
|
||||
All metrics are labeled by `bucket`; each metric is displayed per bucket. `bucket_objects_histogram` is additionally labeled by an `object_size` string, which takes one of the following values:
|
||||
|
||||
- *LESS_THAN_1024_B*
- *BETWEEN_1024_B_AND_1_MB*
- *BETWEEN_1_MB_AND_10_MB*
- *BETWEEN_10_MB_AND_64_MB*
- *BETWEEN_64_MB_AND_128_MB*
- *BETWEEN_128_MB_AND_512_MB*
- *GREATER_THAN_512_MB*

Units definitions:

- 1 MB = 1024 KB
- 1 KB = 1024 B
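To make the `object_size` labels concrete, the following Python sketch maps an object size in bytes to one of the labels listed above, using the stated unit definitions. The function name is hypothetical, and the boundary handling (each lower bound inclusive, each upper bound exclusive) is an assumption, since the text does not say which bucket an object of exactly 1 MB falls into.

```python
KB = 1024          # 1 KB = 1024 B
MB = 1024 * KB     # 1 MB = 1024 KB

# (exclusive upper bound in bytes, label); inclusivity is an assumption.
_SIZE_BUCKETS = [
    (1024, "LESS_THAN_1024_B"),
    (1 * MB, "BETWEEN_1024_B_AND_1_MB"),
    (10 * MB, "BETWEEN_1_MB_AND_10_MB"),
    (64 * MB, "BETWEEN_10_MB_AND_64_MB"),
    (128 * MB, "BETWEEN_64_MB_AND_128_MB"),
    (512 * MB, "BETWEEN_128_MB_AND_512_MB"),
]


def object_size_label(size_bytes):
    """Return the `object_size` label for an object of the given size."""
    if size_bytes < 0:
        raise ValueError("object size cannot be negative")
    for upper_bound, label in _SIZE_BUCKETS:
        if size_bytes < upper_bound:
            return label
    # Anything at or above 512 MB lands in the last bucket.
    return "GREATER_THAN_512_MB"
```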
|
||||
|
||||
| name                                 | description                                         |
|:-------------------------------------|:----------------------------------------------------|
| `bucket_usage_size`                  | Total size of the bucket                            |
| `bucket_objects_count`               | Total number of objects in a bucket                 |
| `bucket_objects_histogram`           | Total number of objects filtered by different sizes |
| `bucket_replication_pending_size`    | Total capacity not replicated                       |
| `bucket_replication_failed_size`     | Total capacity failed to replicate at least once    |
| `bucket_replication_successful_size` | Total capacity successfully replicated              |
| `bucket_replication_received_size`   | Total capacity received as replicated objects       |
|
||||
|
||||
### Cache specific metrics
|
||||
|
||||
MinIO Gateway instances enabled with Disk-Caching expose caching related metrics.
|
||||
|
||||
#### Global cache metrics
|
||||
| name                 | description                             |
|:---------------------|:----------------------------------------|
| `cache_hits_total`   | Total number of cache hits              |
| `cache_misses_total` | Total number of cache misses            |
| `cache_data_served`  | Total number of bytes served from cache |
|
||||
|
||||
#### Per disk cache metrics
|
||||
| name                   | description                                                                      |
|:-----------------------|:---------------------------------------------------------------------------------|
| `cache_usage_size`     | Total cache usage in bytes                                                       |
| `cache_total_capacity` | Total size of cache disk                                                         |
| `cache_usage_percent`  | Total percentage cache usage                                                     |
| `cache_usage_state`    | Indicates cache usage is high or low, relative to current cache 'quota' settings |
|
||||
|
||||
`cache_usage_state` holds only two states
|
||||
|
||||
- '1' indicates high disk usage
|
||||
- '0' indicates low disk usage
|
||||
|
||||
### Gateway specific metrics
|
||||
MinIO Gateway instance exposes metrics related to Gateway communication with the cloud backend (S3, Azure & GCS Gateway).
|
||||
|
||||
`<gateway_type>` depends on the gateway in use and can be 's3', 'gcs' or 'azure'. Other metrics are labeled with `method`, which identifies HTTP GET, HEAD, PUT and POST requests to the backend.
|
||||
|
||||
| name                                    | description                                                                |
|:----------------------------------------|:---------------------------------------------------------------------------|
| `gateway_<gateway_type>_requests`       | Total number of requests made to the gateway backend                       |
| `gateway_<gateway_type>_bytes_sent`     | Total number of bytes sent to cloud backend (in PUT & POST Requests)       |
| `gateway_<gateway_type>_bytes_received` | Total number of bytes received from cloud backend (in GET & HEAD Requests) |
|
||||
|
||||
Note that this is currently only supported for the Azure, S3 and GCS Gateways.
|
||||
|
||||
### MinIO self-healing metrics - `self_heal_*`
|
||||
|
||||
MinIO exposes self-healing related metrics for erasure-code deployments _only_. These metrics are _not_ available on Gateway or Single Node, Single Drive deployments. Note that these metrics will be exposed _only_ when there is a relevant event happening on MinIO server.
|
||||
|
||||
| name                                 | description |
|:-------------------------------------|:------------|
| `self_heal_time_since_last_activity` | Time elapsed since last self-healing related activity |
| `self_heal_objects_scanned`          | Number of objects scanned by self-healing thread in its current run. This will reset when a fresh self-healing run starts. This is labeled with the object type scanned |
| `self_heal_objects_healed`           | Number of objects healed by self-healing thread in its current run. This will reset when a fresh self-healing run starts. This is labeled with the object type scanned |
| `self_heal_objects_heal_failed`      | Number of objects for which self-healing failed in its current run. This will reset when a fresh self-healing run starts. This is labeled with disk status and its endpoint |
|
||||
|
||||
## Migration guide for the new set of metrics
|
||||
|
||||
This migration guide applies to any release before `RELEASE.2019-10-23*`.
|
||||
|
||||
### MinIO disk level metrics - `disk_*`
|
||||
|
||||
The migrations include
|
||||
|
||||
- `minio_total_disks` to `minio_disks_total`
|
||||
- `minio_offline_disks` to `minio_disks_offline`
|
||||
|
||||
### MinIO disk level metrics - `disk_storage_*`
|
||||
|
||||
These metrics have one label.
|
||||
|
||||
- `disk`: Holds the disk path
|
||||
|
||||
The migrations include
|
||||
|
||||
- `minio_disk_storage_used_bytes` to `disk_storage_used`
|
||||
- `minio_disk_storage_available_bytes` to `disk_storage_available`
|
||||
- `minio_disk_storage_total_bytes` to `disk_storage_total`
|
||||
|
||||
### MinIO network level metrics
|
||||
|
||||
These metrics are detailed to cover the s3 and internode network statistics.
|
||||
|
||||
The migrations include
|
||||
|
||||
- `minio_network_sent_bytes_total` to `s3_tx_bytes_total` and `internode_tx_bytes_total`
|
||||
- `minio_network_received_bytes_total` to `s3_rx_bytes_total` and `internode_rx_bytes_total`
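When updating dashboards or alert rules, the renames listed in this migration guide can be captured in a lookup table. The Python sketch below transcribes the mappings from the lists above and scans a query string for legacy metric names; the function name and the example query are hypothetical, and the sketch only reports the replacements rather than rewriting the query, because the two network metrics each split into an S3 and an internode metric.

```python
import re

# Old name -> new name(s), transcribed from the migration lists above.
METRIC_RENAMES = {
    "minio_total_disks": ["minio_disks_total"],
    "minio_offline_disks": ["minio_disks_offline"],
    "minio_disk_storage_used_bytes": ["disk_storage_used"],
    "minio_disk_storage_available_bytes": ["disk_storage_available"],
    "minio_disk_storage_total_bytes": ["disk_storage_total"],
    "minio_network_sent_bytes_total": [
        "s3_tx_bytes_total", "internode_tx_bytes_total"],
    "minio_network_received_bytes_total": [
        "s3_rx_bytes_total", "internode_rx_bytes_total"],
}

# Identifier pattern for Prometheus metric names.
_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def find_legacy_metrics(query):
    """Return {old_name: [new_names]} for every legacy metric used in `query`."""
    found = {}
    for name in _NAME_RE.findall(query):
        if name in METRIC_RENAMES:
            found[name] = METRIC_RENAMES[name]
    return found


if __name__ == "__main__":
    example = "rate(minio_network_sent_bytes_total[5m]) / minio_total_disks"
    for old, new in find_legacy_metrics(example).items():
        print(f"{old} -> {', '.join(new)}")
```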
|
||||
|
||||
Some of the additional metrics added were
|
||||
|
||||
- `s3_requests_total`
|
||||
- `s3_errors_total`
|
||||
- `s3_ttfb_seconds`
|
||||
[The list of metrics reported can be found here](https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md)
|
||||
|
||||
47
docs/metrics/prometheus/list.md
Normal file
@@ -0,0 +1,47 @@
|
||||
# List of metrics reported cluster wide
|
||||
|
||||
Each metric includes a label identifying the server that generated it.
|
||||
|
||||
These metrics can be collected from any MinIO server, once per collection.
|
||||
|
||||
| Name                                          | Description |
|:----------------------------------------------|:------------|
| `minio_bucket_objects_size_distribution`      | Distribution of object sizes in the bucket, includes label for the bucket name. |
| `minio_bucket_replication_failed_bytes`       | Total number of bytes that failed to replicate at least once. |
| `minio_bucket_replication_pending_bytes`      | Total bytes pending to replicate. |
| `minio_bucket_replication_received_bytes`     | Total number of bytes replicated to this bucket from another source bucket. |
| `minio_bucket_replication_sent_bytes`         | Total number of bytes replicated to the target bucket. |
| `minio_bucket_usage_object_total`             | Total number of objects. |
| `minio_bucket_usage_total_bytes`              | Total bucket size in bytes. |
| `minio_cluster_capacity_raw_free_bytes`       | Total free capacity online in the cluster. |
| `minio_cluster_capacity_raw_total_bytes`      | Total capacity online in the cluster. |
| `minio_cluster_capacity_usable_free_bytes`    | Total free usable capacity online in the cluster. |
| `minio_cluster_capacity_usable_total_bytes`   | Total usable capacity online in the cluster. |
| `minio_cluster_disk_offline_total`            | Total disks offline. |
| `minio_cluster_disk_online_total`             | Total disks online. |
| `minio_cluster_nodes_offline_total`           | Total number of MinIO nodes offline. |
| `minio_cluster_nodes_online_total`            | Total number of MinIO nodes online. |
| `minio_heal_objects_error_total`              | Objects for which healing failed in the current self-healing run. |
| `minio_heal_objects_heal_total`               | Objects healed in the current self-healing run. |
| `minio_heal_objects_total`                    | Objects scanned in the current self-healing run. |
| `minio_heal_time_last_activity_nano_seconds`  | Time elapsed (in nanoseconds) since the last self-healing activity. This is set to -1 until the initial self-heal activity. |
| `minio_inter_node_traffic_received_bytes`     | Total number of bytes received from other peer nodes. |
| `minio_inter_node_traffic_sent_bytes`         | Total number of bytes sent to the other peer nodes. |
| `minio_node_disk_free_bytes`                  | Total storage available on a disk. |
| `minio_node_disk_total_bytes`                 | Total storage on a disk. |
| `minio_node_disk_used_bytes`                  | Total storage used on a disk. |
| `minio_s3_requests_error_total`               | Total number of S3 requests with errors. |
| `minio_s3_requests_inflight_total`            | Total number of S3 requests currently in flight. |
| `minio_s3_requests_total`                     | Total number of S3 requests. |
| `minio_s3_time_ttbf_seconds_distribution`     | Distribution of the time to first byte across API calls. |
| `minio_s3_traffic_received_bytes`             | Total number of S3 bytes received. |
| `minio_s3_traffic_sent_bytes`                 | Total number of S3 bytes sent. |
| `minio_cache_hits_total`                      | Total number of disk cache hits. |
| `minio_cache_missed_total`                    | Total number of disk cache misses. |
| `minio_cache_sent_bytes`                      | Total number of bytes served from cache. |
| `minio_cache_total_bytes`                     | Total size of cache disk in bytes. |
| `minio_cache_usage_info`                      | Total percentage cache usage, value of 1 indicates high and 0 low, label level is set as well. |
| `minio_cache_used_bytes`                      | Current cache usage in bytes. |
| `minio_software_commit_info`                  | Git commit hash for the MinIO release. |
| `minio_software_version_info`                 | MinIO release tag for the server. |
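As an illustration of how these metrics might be consumed outside Prometheus, the Python sketch below parses a few lines of Prometheus text exposition and derives the percentage of usable capacity in use from `minio_cluster_capacity_usable_total_bytes` and `minio_cluster_capacity_usable_free_bytes`. The sample values, the `server` label value and the function names are made up for the example, and the parser is deliberately simplified: it ignores timestamps and does not handle `}` inside label values.

```python
import re

# metric name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)")
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match:
            name, label_str, value = match.groups()
            labels = dict(_LABEL_RE.findall(label_str or ""))
            samples.append((name, labels, float(value)))
    return samples


def first_value(samples, metric_name):
    """Return the value of the first series with this name, or None."""
    for name, _labels, value in samples:
        if name == metric_name:
            return value
    return None


def usable_capacity_used_percent(samples):
    """Percentage of usable cluster capacity in use, or None if unknown."""
    total = first_value(samples, "minio_cluster_capacity_usable_total_bytes")
    free = first_value(samples, "minio_cluster_capacity_usable_free_bytes")
    if total is None or free is None or total <= 0:
        return None
    return 100.0 * (total - free) / total


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE minio_cluster_capacity_usable_total_bytes gauge
minio_cluster_capacity_usable_total_bytes{server="minio1:9000"} 1000
minio_cluster_capacity_usable_free_bytes{server="minio1:9000"} 250
minio_cluster_nodes_offline_total{server="minio1:9000"} 0
"""

if __name__ == "__main__":
    print(usable_capacity_used_percent(parse_samples(SAMPLE)))
```

In a real deployment the same calculation would normally be written as a Prometheus query over the two capacity metrics; the sketch only shows how the values relate.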
|
||||