minio/docs/metrics/v3.md
Aditya Manthramurthy b2c5b75efa
feat: Add Metrics V3 API (#19068)
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.

This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`

The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:

when PATH is a sub-endpoint listed above => serves the group of
metrics under that path; or when PATH is a (non-empty) parent 
directory of the sub-endpoints listed above => serves metrics
from each child sub-endpoint of PATH. otherwise, returns a no 
resource found error

All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
2024-03-10 01:15:15 -08:00

17 KiB

Metrics Version 3

In metrics version 3, all metrics are available under the endpoint:

/minio/metrics/v3

however, a specific path under this is required.

Metrics are organized into groups at paths relative to the top-level endpoint above.

Metrics Request Handling

Each endpoint below can be queried at different intervals as needed via a scrape configuration in Prometheus or a compatible metrics collection tool.

For ease of configuration, each (non-empty) parent of the path serves all metric endpoints that are at descendant paths. For example, to query all system metrics one needs to only scrape /minio/metrics/v3/system/.

Some metrics are bucket specific. These will have a /bucket component in their path. As the number of buckets can be large, the metrics scrape operation needs to be provided with a specific list of buckets via the bucket query parameter. Only metrics for the given buckets will be returned (with the bucket label set). For example to query API metrics for buckets test1 and test2, make a scrape request to /minio/metrics/v3/api/bucket?buckets=test1,test2.

Instead of a metrics scrape, it is also possible to list the metrics that would be returned by a path. This is done by adding a ?list query parameter. The MinIO server will then list all possible metrics that could be returned. During an actual metrics scrape, only available metrics are returned - not all of them. With the list query parameter, the output format can be selected - just set the request Content-Type to application/json for JSON output, or text/plain for a simple markdown formatted table. The latter is the default.

Request, System and Cluster Metrics

At a high level metrics are grouped into three categories, listed in the following sub-sections. The path in each of the tables is relative to the top-level endpoint.

Request metrics

These are metrics about requests served by the (current) node.

Path Description
/api/requests Metrics over all requests
/api/bucket Metrics over all requests split by bucket labels

System metrics

These are metrics about the minio process and the node.

Path Description
/system/drive Metrics about drives on the system
/system/network/internode Metrics about internode requests made by the node
/system/process Standard process metrics
/system/go Standard Go lang metrics

Cluster metrics

These present metrics about the whole MinIO cluster.

Path Description
/cluster/health Cluster health metrics
/cluster/usage/objects Object statistics
/cluster/usage/buckets Object statistics by bucket
/cluster/erasure-set Erasure set metrics

Metrics Listing

Each of the following sub-sections list metrics returned by each of the endpoints.

The standard metrics groups for ProcessCollector and GoCollector are not shown below.

/api/requests

Name Type Help Labels
minio_api_requests_rejected_auth_total counter Total number of requests rejected for auth failure type,pool_index,server
minio_api_requests_rejected_header_total counter Total number of requests rejected for invalid header type,pool_index,server
minio_api_requests_rejected_timestamp_total counter Total number of requests rejected for invalid timestamp type,pool_index,server
minio_api_requests_rejected_invalid_total counter Total number of invalid requests type,pool_index,server
minio_api_requests_waiting_total gauge Total number of requests in the waiting queue type,pool_index,server
minio_api_requests_incoming_total gauge Total number of incoming requests type,pool_index,server
minio_api_requests_inflight_total gauge Total number of requests currently in flight name,type,pool_index,server
minio_api_requests_total counter Total number of requests name,type,pool_index,server
minio_api_requests_errors_total counter Total number of requests with (4xx and 5xx) errors name,type,pool_index,server
minio_api_requests_5xx_errors_total counter Total number of requests with 5xx errors name,type,pool_index,server
minio_api_requests_4xx_errors_total counter Total number of requests with 4xx errors name,type,pool_index,server
minio_api_requests_canceled_total counter Total number of requests canceled by the client name,type,pool_index,server
minio_api_requests_ttfb_seconds_distribution counter Distribution of time to first byte across API calls name,type,le,pool_index,server
minio_api_requests_traffic_sent_bytes counter Total number of bytes sent type,pool_index,server
minio_api_requests_traffic_received_bytes counter Total number of bytes received type,pool_index,server

/api/bucket

Name Type Help Labels
minio_api_bucket_traffic_received_bytes counter Total number of bytes sent for a bucket bucket,type,server,pool_index
minio_api_bucket_traffic_sent_bytes counter Total number of bytes received for a bucket bucket,type,server,pool_index
minio_api_bucket_inflight_total gauge Total number of requests currently in flight for a bucket bucket,name,type,server,pool_index
minio_api_bucket_total counter Total number of requests for a bucket bucket,name,type,server,pool_index
minio_api_bucket_canceled_total counter Total number of requests canceled by the client for a bucket bucket,name,type,server,pool_index
minio_api_bucket_4xx_errors_total counter Total number of requests with 4xx errors for a bucket bucket,name,type,server,pool_index
minio_api_bucket_5xx_errors_total counter Total number of requests with 5xx errors for a bucket bucket,name,type,server,pool_index
minio_api_bucket_ttfb_seconds_distribution counter Distribution of time to first byte across API calls for a bucket bucket,name,le,type,server,pool_index

/system/drive

Name Type Help Labels
minio_system_drive_used_bytes gauge Total storage used on a drive in bytes drive,set_index,drive_index,pool_index,server
minio_system_drive_free_bytes gauge Total storage free on a drive in bytes drive,set_index,drive_index,pool_index,server
minio_system_drive_total_bytes gauge Total storage available on a drive in bytes drive,set_index,drive_index,pool_index,server
minio_system_drive_free_inodes gauge Total free inodes on a drive drive,set_index,drive_index,pool_index,server
minio_system_drive_timeout_errors_total counter Total timeout errors on a drive drive,set_index,drive_index,pool_index,server
minio_system_drive_availability_errors_total counter Total availability errors (I/O errors, permission denied and timeouts) on a drive drive,set_index,drive_index,pool_index,server
minio_system_drive_waiting_io gauge Total waiting I/O operations on a drive drive,set_index,drive_index,pool_index,server
minio_system_drive_api_latency_micros gauge Average last minute latency in µs for drive API storage operations drive,api,set_index,drive_index,pool_index,server
minio_system_drive_offline_count gauge Count of offline drives pool_index,server
minio_system_drive_online_count gauge Count of online drives pool_index,server
minio_system_drive_count gauge Count of all drives pool_index,server

/system/network/internode

Name Type Help Labels
minio_system_network_internode_errors_total counter Total number of failed internode calls server,pool_index
minio_system_network_internode_dial_errors_total counter Total number of internode TCP dial timeouts and errors server,pool_index
minio_system_network_internode_dial_avg_time_nanos gauge Average dial time of internodes TCP calls in nanoseconds server,pool_index
minio_system_network_internode_sent_bytes_total counter Total number of bytes sent to other peer nodes server,pool_index
minio_system_network_internode_recv_bytes_total counter Total number of bytes received from other peer nodes server,pool_index

/cluster/health

Name Type Help Labels
minio_cluster_health_drives_offline_count gauge Count of offline drives in the cluster
minio_cluster_health_drives_online_count gauge Count of online drives in the cluster
minio_cluster_health_drives_count gauge Count of all drives in the cluster
minio_cluster_health_nodes_offline_count gauge Count of offline nodes in the cluster
minio_cluster_health_nodes_online_count gauge Count of online nodes in the cluster
minio_cluster_health_capacity_raw_total_bytes gauge Total cluster raw storage capacity in bytes
minio_cluster_health_capacity_raw_free_bytes gauge Total cluster raw storage free in bytes
minio_cluster_health_capacity_usable_total_bytes gauge Total cluster usable storage capacity in bytes
minio_cluster_health_capacity_usable_free_bytes gauge Total cluster usable storage free in bytes

/cluster/usage/objects

Name Type Help Labels
minio_cluster_usage_objects_since_last_update_seconds gauge Time since last update of usage metrics in seconds
minio_cluster_usage_objects_total_bytes gauge Total cluster usage in bytes
minio_cluster_usage_objects_count gauge Total cluster objects count
minio_cluster_usage_objects_versions_count gauge Total cluster object versions (including delete markers) count
minio_cluster_usage_objects_delete_markers_count gauge Total cluster delete markers count
minio_cluster_usage_objects_buckets_count gauge Total cluster buckets count
minio_cluster_usage_objects_size_distribution gauge Cluster object size distribution range
minio_cluster_usage_objects_version_count_distribution gauge Cluster object version count distribution range

/cluster/usage/buckets

Name Type Help Labels
minio_cluster_usage_buckets_since_last_update_seconds gauge Time since last update of usage metrics in seconds
minio_cluster_usage_buckets_total_bytes gauge Total bucket size in bytes bucket
minio_cluster_usage_buckets_objects_count gauge Total objects count in bucket bucket
minio_cluster_usage_buckets_versions_count gauge Total object versions (including delete markers) count in bucket bucket
minio_cluster_usage_buckets_delete_markers_count gauge Total delete markers count in bucket bucket
minio_cluster_usage_buckets_quota_total_bytes gauge Total bucket quota in bytes bucket
minio_cluster_usage_buckets_object_size_distribution gauge Bucket object size distribution range,bucket
minio_cluster_usage_buckets_object_version_count_distribution gauge Bucket object version count distribution range,bucket

/cluster/erasure-set

Name Type Help Labels
minio_cluster_erasure_set_overall_write_quorum gauge Overall write quorum across pools and sets
minio_cluster_erasure_set_overall_health gauge Overall health across pools and sets (1=healthy, 0=unhealthy)
minio_cluster_erasure_set_read_quorum gauge Read quorum for the erasure set in a pool pool_id,set_id
minio_cluster_erasure_set_write_quorum gauge Write quorum for the erasure set in a pool pool_id,set_id
minio_cluster_erasure_set_online_drives_count gauge Count of online drives in the erasure set in a pool pool_id,set_id
minio_cluster_erasure_set_healing_drives_count gauge Count of healing drives in the erasure set in a pool pool_id,set_id
minio_cluster_erasure_set_health gauge Health of the erasure set in a pool (1=healthy, 0=unhealthy) pool_id,set_id