Commit Graph

141 Commits

Author SHA1 Message Date
Krishnan Parthasarathi 9eeee92d36
Add deletemarker_total metric (#17689) 2023-07-20 07:52:32 -07:00
Harshavardhana 6426b74770
move bucket centric metrics to /minio/v2/metrics/bucket handlers (#17663)
users/customers no longer have a small number of buckets,
which is why we must avoid overpopulating cluster endpoints and instead
move the bucket monitoring to a separate endpoint.

some of this is a breaking change for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.

Bonus: Added new cluster metrics for usage, objects and histograms
2023-07-18 22:25:12 -07:00
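For context on the relocated endpoint, a minimal Go sketch of scraping it; the path comes from the commit above, while the host/port and unauthenticated access are assumptions (real deployments usually require a bearer token).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Bucket-centric metrics now live on their own handler (per the commit above);
	// host, port and the lack of authentication here are assumptions for the sketch.
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("http://localhost:9000/minio/v2/metrics/bucket")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%s", body)
}
```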
drivebyer 04c792476f
fix: provide a possible slice cap for heal failed metrics items (#17647)
Signed-off-by: Wu <yang.wu@daocloud.io>
2023-07-14 11:02:45 -07:00
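The fix is the usual Go pattern of giving the slice its final capacity up front; a small illustrative sketch (types and names are hypothetical, not the MinIO code):

```go
package main

import "fmt"

// healFailedItem and the source map are illustrative, not MinIO's actual types.
type healFailedItem struct {
	Drive string
	Count uint64
}

// collectHealFailed sizes the slice with the exact capacity it will need,
// avoiding the repeated grow-and-copy that append performs on a nil slice.
func collectHealFailed(failedByDrive map[string]uint64) []healFailedItem {
	items := make([]healFailedItem, 0, len(failedByDrive))
	for drive, count := range failedByDrive {
		items = append(items, healFailedItem{Drive: drive, Count: count})
	}
	return items
}

func main() {
	fmt.Println(collectHealFailed(map[string]uint64{"/data/drive1": 3, "/data/drive2": 0}))
}
```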
Harshavardhana 5b7c83341b
move per bucket metrics to peer location (#17627) 2023-07-11 07:46:24 -07:00
Anis Eleuch 6d0bc5ab1e
prometheus: Fix internode stats (#17594)
Internode stats calculation was done inside the S3 handlers; fix it by moving it
to the internode handlers.

Remove admin stats since it is not used.
2023-07-08 07:35:11 -07:00
Harshavardhana abb1f22057 Revert "change ttfb_distribution metrics to histogramMetric (#17115)"
This reverts commit 9112ca4e29.
2023-07-07 13:57:37 -07:00
Harshavardhana 7605d07bb2
add support for bucket level request count per API (#17468)
New metrics added to calculate API request count
per bucket, per API.  Captures errors, including
4xx, 5xx HTTP status codes separately.
2023-06-21 09:41:59 -07:00
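As a rough sketch of how a per-bucket, per-API counter with status-code labels can be modeled using prometheus/client_golang; the metric and label names below are assumptions, not MinIO's exact identifiers.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts requests per bucket, per S3 API, and per status code;
// the metric and label names are illustrative, not MinIO's exact identifiers.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_bucket_requests_total",
		Help: "Total requests per bucket and API, including 4xx/5xx responses.",
	},
	[]string{"bucket", "api", "code"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	// Simulate a handled request: GetObject on "testbucket" returning 404.
	requestsTotal.WithLabelValues("testbucket", "GetObject", "404").Inc()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```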
Aditya Manthramurthy 5a1612fe32
Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
drivebyer ad2ab6eb3e
fix: Give accurate cap to slice (#17224) 2023-05-17 15:14:09 -07:00
jiuker 9a799065b3
fix: make slice cap of right size (#17192) 2023-05-16 08:10:07 -07:00
Shireesh Anjal c326e5a34e
Add metrics for webhook endpoint stats (#17179) 2023-05-11 11:24:37 -07:00
Harshavardhana 5569acd95c
disallow EC:0 if not set during server startup (#17141) 2023-05-04 14:44:30 -07:00
Harshavardhana 9112ca4e29
change ttfb_distribution metrics to histogramMetric (#17115) 2023-05-03 07:31:00 -07:00
Anis Eleuch a42650c065
Add minio_bucket_usage_version_total metric to Prometheus (#17023) 2023-04-12 20:08:07 -07:00
Harshavardhana 3b7781835e
add lock metrics per node (#16943) 2023-04-03 21:23:24 -07:00
Klaus Post d85da9236e
Add Object Version count histogram (#16739) 2023-03-10 08:53:59 -08:00
Harshavardhana 901887e6bf
feat: add lambda transformation functions target (#16507) 2023-03-07 08:12:41 -08:00
ferhat elmas 714283fae2
cleanup ignored static analysis (#16767) 2023-03-06 08:56:10 -08:00
Aditya Manthramurthy 8cde38404d
Add metrics for custom auth plugin (#16701) 2023-02-27 09:55:18 -08:00
Andreas Auernhammer 74887c7372
kms: add support for KES API keys and switch to KES Go SDK (#16617)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
2023-02-14 07:19:20 -08:00
Anis Elleuch e05205756f
metrics: Add more logs when unable to read bucket usage (#16405) 2023-01-13 02:32:00 +05:30
Anis Elleuch 27417459fb
metrics: Show healing info for all nodes (#16315) 2022-12-26 08:35:32 -08:00
Harshavardhana 2433698372
fix: remove unnecessary logs for client conn errors (#16261) 2022-12-15 08:25:05 -08:00
Aditya Manthramurthy a30cfdd88f
Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Anis Elleuch 932d2c3c62
Add X-Amz-Request-Id to internode calls (#16146) 2022-12-06 09:27:26 -08:00
Harshavardhana 5a8df7efb3
re-implement StorageInfo to be a peer call (#16155) 2022-12-01 14:31:35 -08:00
Klaus Post 5b242f1d11
Add Audit target metrics (#16044) 2022-11-10 10:20:21 -08:00
Klaus Post bbc312fce6
Add notification queue metrics (#16026) 2022-11-08 16:36:47 -08:00
Harshavardhana 23b329b9df
remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Anis Elleuch 048a46ec2a
Add RPC tcp timeout/errs and AVG duration to prometheus (#15747) 2022-09-26 09:04:26 -07:00
Poorna 6b9fd256e1
Persist in-memory replication stats to disk (#15594)
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.

This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the 
MRF queue so that entries in `PENDING` status are healed faster.
2022-09-12 12:40:02 -07:00
ebozduman b57e7321e7
Replaces 'disk'=>'drive' visible to end user (#15464) 2022-08-04 16:10:08 -07:00
Harshavardhana 5e763b71dc
use logger.LogOnce to reduce printing disconnection logs (#15408)
fixes #15334

- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unused due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs up to leafErr to ensure that we store exactly the correct errors
2022-07-27 09:44:59 -07:00
Andreas Auernhammer f800cee4fa
metric: add KMS-related metrics (#15258)
This commit adds a minimal set of KMS-related metrics:
```
 # HELP minio_cluster_kms_online Reports whether the KMS is online (1) or offline (0)
 # TYPE minio_cluster_kms_online gauge
 minio_cluster_kms_online{server="127.0.0.1:9000"} 1
 # HELP minio_cluster_kms_request_error Number of KMS requests that failed with a well-defined error
 # TYPE minio_cluster_kms_request_error counter
 minio_cluster_kms_request_error{server="127.0.0.1:9000"} 16790
 # HELP minio_cluster_kms_request_success Number of KMS requests that succeeded
 # TYPE minio_cluster_kms_request_success counter
 minio_cluster_kms_request_success{server="127.0.0.1:9000"} 348031
```

Currently, we report whether the KMS is available and how many requests
succeeded/failed. However, KES exposes many more metrics that can be
surfaced if necessary. See: https://pkg.go.dev/github.com/minio/kes#Metric

Signed-off-by: Andreas Auernhammer <hi@aead.dev>
2022-07-11 09:17:28 -07:00
Anis Elleuch ed0cbfb31e
fix: rootdisk detection by not using cached value when GetDiskInfo() errors out (#15249)
GetDiskInfo() uses timedValue to cache the disk info for one second.

timedValue behavior was recently changed to return an old cached value
when calculating a new value returns an error.

When a mount point is empty, GetDiskInfo() will return errUnformattedDisk,
timedValue will return cached disk info with unexpected IsRootDisk value,
e.g. false if the mount point belongs to a root disk. Therefore, the mount
point will be considered a valid disk and will be formatted as well.

This commit will also add more defensive code when marking root disks:
always mark a disk offline for any GetDiskInfo() error except
errUnformattedDisk. The server will try anyway to reconnect to those
disks every 10 seconds.
2022-07-07 17:05:23 -07:00
Anis Elleuch 8d98282afd
Better reporting of total/free usable capacity of the cluster (#15230)
The current code uses an approximation based on a ratio. The approximation
can skew if we have multiple pools with different disk capacities.

Replace the algorithm with a simpler one that counts data
disks and ignores parity disks.
2022-07-06 13:29:49 -07:00
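A minimal sketch of the "count data disks, ignore parity disks" idea under assumed types; the struct and field names are illustrative only.

```go
package main

import "fmt"

// erasureSet is an illustrative stand-in for an erasure set: per-drive raw
// capacity plus the set's parity count. Field names are assumptions.
type erasureSet struct {
	driveTotalBytes []uint64
	parityDrives    int
}

// usableCapacity sums only the data drives of each set and ignores parity
// drives entirely, instead of scaling raw capacity by a data/parity ratio.
func usableCapacity(sets []erasureSet) uint64 {
	var usable uint64
	for _, set := range sets {
		dataDrives := len(set.driveTotalBytes) - set.parityDrives
		for i := 0; i < dataDrives && i < len(set.driveTotalBytes); i++ {
			usable += set.driveTotalBytes[i]
		}
	}
	return usable
}

func main() {
	sets := []erasureSet{
		{driveTotalBytes: []uint64{100, 100, 100, 100}, parityDrives: 2}, // 2 data + 2 parity
		{driveTotalBytes: []uint64{500, 500, 500, 500}, parityDrives: 1}, // pool with larger drives
	}
	fmt.Println("usable bytes:", usableCapacity(sets))
}
```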
Klaus Post ac055b09e9
Add detailed scanner metrics (#15161) 2022-07-05 14:45:49 -07:00
Harshavardhana 63ac260bd5
Simplify Prometheus metrics gather (#15210) 2022-07-01 13:18:39 -07:00
Harshavardhana bd099f5e71
fix: change timedValue to return the previously cached value (#15169)
fix: change timedvalue to return previous cached value

the caller can interpret the underlying error and decide
accordingly; in places where we do not interpret the
error from timedValue.Get(), we should simply use
the previously cached value instead of returning "empty".

Bonus: remove some unused code
2022-06-25 08:50:16 -07:00
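A rough sketch of a timedValue-style cache that hands back the previously cached value when refreshing fails, as the commit describes; the type below is illustrative, not MinIO's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// timedCache is an illustrative stand-in: it refreshes via update() at most
// once per TTL, and on update failure it returns the previously cached value
// along with the error instead of an empty result.
type timedCache struct {
	mu         sync.Mutex
	ttl        time.Duration
	lastUpdate time.Time
	value      interface{}
	update     func() (interface{}, error)
}

func (c *timedCache) Get() (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastUpdate) < c.ttl && c.value != nil {
		return c.value, nil
	}
	v, err := c.update()
	if err != nil {
		// Callers that do not inspect the error keep using the stale value.
		return c.value, err
	}
	c.value, c.lastUpdate = v, time.Now()
	return v, nil
}

func main() {
	calls := 0
	c := &timedCache{ttl: time.Second, update: func() (interface{}, error) {
		calls++
		if calls > 1 {
			return nil, errors.New("temporary failure")
		}
		return "disk-info-v1", nil
	}}
	fmt.Println(c.Get()) // fresh value
	time.Sleep(1100 * time.Millisecond)
	fmt.Println(c.Get()) // refresh fails: previous value returned with the error
}
```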
Harshavardhana 8082d1fed6
add bucket level S3 received/sent bytes (#15084)
adds bucket-level metrics for bytes received and sent on all S3 API calls.
2022-06-14 15:14:24 -07:00
Harshavardhana 7413045f0e
fix: add missing minio_s3_requests_total (#15070)
PR #15052 caused a regression, add the missing metrics back.

Bonus:

- internode information should be only for distributed setups 
- update the dashboard to include 4xx and 5xx error panels.
2022-06-11 00:50:31 -07:00
Anis Elleuch 5fb420c703
prometheus: Add S3 4xx and 5xx S3 monitoring (#15052)
Currently minio_s3_requests_errors_total covers both 4xx and
5xx S3 responses, which can be confusing when S3 applications
send a lot of HEAD requests with expected 404 responses or
when replication is enabled.

Add 
- minio_s3_requests_4xx_errors_total
- minio_s3_requests_5xx_errors_total

to help users monitor 4xx and 5xx HTTP status codes separately.
2022-06-08 11:22:34 -07:00
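The split boils down to classifying the response status before incrementing the right counter; a minimal sketch with a hypothetical helper.

```go
package main

import "fmt"

// statusClass maps an HTTP status code to the error family the commit above
// splits out; anything below 400 is not counted as an error at all.
func statusClass(status int) string {
	switch {
	case status >= 500:
		return "5xx"
	case status >= 400:
		return "4xx"
	default:
		return "ok"
	}
}

func main() {
	for _, code := range []int{200, 404, 503} {
		fmt.Println(code, "->", statusClass(code))
	}
}
```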
Harshavardhana 2420f6c000
fix: make metrics endpoint responsive by reducing the chatter (#15055)
peerOnlineCounter was making NxN calls across peers, which
can take a very long time if random servers
are going down.

Instead we should calculate online peers from the point of
view of "self" and report them as online or offline appropriately
by performing a healthcheck.
2022-06-08 02:43:13 -07:00
Harshavardhana f1abb92f0c
feat: Single drive XL implementation (#14970)
The main motivation is to move towards a common backend format
for all the different modes in MinIO, allowing for
simpler code and predictable behavior across all features.

This PR also brings features such as versioning, replication,
and transitioning to single drive setups.
2022-05-30 10:58:37 -07:00
Harshavardhana f8650a3493
fetch bucket replication stats across peers in single call (#14956)
the current implementation relied on recursively calling one bucket
at a time across all peers; this is very slow and chatty
when there are hundreds of buckets, which would mean 100*peerCount
network operations.

This PR reduces this entire operation to only `peerCount`
network calls. It also addresses a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.
2022-05-23 09:15:30 -07:00
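A sketch of the shape of this change: one concurrent request per peer returning stats for all buckets, merged locally, instead of one request per bucket per peer. The peer and stats types are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// peer is an illustrative stand-in; in the real code this would be an RPC
// client that can return replication stats for every bucket in one call.
type peer struct{ addr string }

func (p peer) allBucketStats() map[string]int64 {
	// Placeholder payload: bucket name -> replicated bytes.
	return map[string]int64{"bucket-a": 1 << 20, "bucket-b": 2 << 20}
}

// gatherReplicationStats issues exactly one request per peer, concurrently,
// and merges the per-bucket results, rather than looping bucket-by-bucket.
func gatherReplicationStats(peers []peer) map[string]int64 {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged = make(map[string]int64)
	)
	for _, p := range peers {
		wg.Add(1)
		go func(p peer) {
			defer wg.Done()
			stats := p.allBucketStats()
			mu.Lock()
			for bucket, bytes := range stats {
				merged[bucket] += bytes
			}
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return merged
}

func main() {
	fmt.Println(gatherReplicationStats([]peer{{"peer1:9000"}, {"peer2:9000"}}))
}
```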
Aditya Manthramurthy 165d60421d
Add metrics for observing IAM sync operations (#14680) 2022-04-03 13:08:59 -07:00
Poorna 9e25475475
Validate tier manager is initialized in tier Empty() check (#14646)
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
2022-03-29 10:10:06 -07:00
Krishnan Parthasarathi 80ef1ae51c
Simplify assembling of tierStats from data-usage (#14504) 2022-03-08 12:08:29 -08:00
Anis Elleuch bacf6156c1
metrics: Avoid crash when fetching tier metrics (#14493)
Data usage does not always contain tiering info even if the data usage
information is valid. Avoid a crash in that case.

(e.g. the scanner scanned the namespace, the user enables tiering,
prometheus scrapes the server before the scanner gets a chance to
update the data usage with new tiering information)
2022-03-07 10:59:32 -08:00
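The defensive pattern implied here is checking that tiering info is present in the data-usage snapshot before emitting tier metrics; a minimal sketch with illustrative types.

```go
package main

import "fmt"

// dataUsageInfo is an illustrative stand-in: TierStats may legitimately be
// empty when the scanner ran before tiering was configured.
type dataUsageInfo struct {
	ObjectsCount uint64
	TierStats    map[string]uint64 // tier name -> transitioned bytes
}

// tierMetrics emits tier metrics only when tiering info is actually present,
// instead of assuming the data-usage snapshot always carries it.
func tierMetrics(du dataUsageInfo) []string {
	if len(du.TierStats) == 0 {
		return nil // valid usage data, just no tiering info yet
	}
	out := make([]string, 0, len(du.TierStats))
	for tier, bytes := range du.TierStats {
		out = append(out, fmt.Sprintf("transitioned_bytes{tier=%q} %d", tier, bytes))
	}
	return out
}

func main() {
	fmt.Println(tierMetrics(dataUsageInfo{ObjectsCount: 10}))                                    // no tiering yet: nothing emitted, no crash
	fmt.Println(tierMetrics(dataUsageInfo{TierStats: map[string]uint64{"S3TIER-1": 136317772}})) // tiering present
}
```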
Krishnan Parthasarathi 0ee2933234
Export tier metrics via Prometheus (#13413)
e.g
```
minio_cluster_ilm_transitioned_bytes{server="minio3:9000",tier="S3TIER-1"} 1.36317772e+08
minio_cluster_ilm_transitioned_bytes{server="minio3:9000",tier="S3TIER-2"} 2892
minio_cluster_ilm_transitioned_bytes{server="minio3:9000",tier="STANDARD"}
1.3631488e+08

minio_cluster_ilm_transitioned_objects{server="minio3:9000",tier="S3TIER-1"} 1
minio_cluster_ilm_transitioned_objects{server="minio3:9000",tier="S3TIER-2"} 0
minio_cluster_ilm_transitioned_objects{server="minio3:9000",tier="STANDARD"} 1

minio_cluster_ilm_transitioned_versions{server="minio3:9000",tier="S3TIER-1"} 3
minio_cluster_ilm_transitioned_versions{server="minio3:9000",tier="S3TIER-2"} 2
minio_cluster_ilm_transitioned_versions{server="minio3:9000",tier="STANDARD"} 1
```
2022-02-08 12:45:28 -08:00
Anis Elleuch 2ee337ead5
prometheus: Add incoming requests metrics since last scrape (#14261)
Some users running MinIO report that their system became slow. One
way to investigate is to look at the Prometheus history of the number of
requests reaching the server. The existing current S3 requests metric
is not enough because it can also increase if the system really becomes slow,
due to disk issues for example.
2022-02-07 16:30:14 -08:00
Harshavardhana 01e550a9be
ignore unreadable metrics on certain closed systems (#14234)
fixes #14233
2022-02-03 09:45:12 -08:00
Harshavardhana 74faed166a
Add quota usage as part of prometheus metrics (#14222)
Bonus: pass caller context when needed to all bucket metadata handling calls.
2022-01-31 17:27:43 -08:00
Harshavardhana b5d35c7e09
ignore disk metrics for single drive mode (#14212)
fixes #14211
2022-01-31 00:44:26 -08:00
Anis Elleuch 45a99c3fd3
publish storage API latency through node metrics (#14117)
Publish storage function latency to help compare the performance
of different disks in a single deployment.

e.g.:
```
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/1",server="localhost:9001"} 226
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/2",server="localhost:9002"} 1180
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/3",server="localhost:9003"} 1183
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/4",server="localhost:9004"} 1625
```
2022-01-25 16:31:44 -08:00
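A sketch of timing a storage call in microseconds, the unit used by minio_node_disk_latency_us above; the recorder callback is a stand-in for the real metric sink.

```go
package main

import (
	"fmt"
	"time"
)

// timedStorageCall measures how long a storage operation takes and reports it
// in microseconds with the API and disk as labels; the recorder signature is
// illustrative, real code would feed a gauge or histogram.
func timedStorageCall(api, disk string, op func() error, record func(api, disk string, us int64)) error {
	start := time.Now()
	err := op()
	record(api, disk, time.Since(start).Microseconds())
	return err
}

func main() {
	err := timedStorageCall("storage.WalkDir", "/tmp/xl/1",
		func() error { time.Sleep(200 * time.Microsecond); return nil },
		func(api, disk string, us int64) {
			fmt.Printf("disk_latency_us{api=%q,disk=%q} %d\n", api, disk, us)
		})
	if err != nil {
		fmt.Println("storage call failed:", err)
	}
}
```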
Harshavardhana ba708f51f2
fix: copyMetrics to avoid map references elsewhere (#14113)
map labels might have been referenced elsewhere; this
can lead to concurrent access at lower layers.

avoid this by copying the information while
concurrently serving the metrics.
2022-01-14 16:48:19 -08:00
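The fix amounts to taking a private copy of the label map before the metrics path uses it; a minimal sketch.

```go
package main

import "fmt"

// copyLabels returns a fresh map so that the metrics path never shares the
// caller's map, which might still be read or mutated elsewhere concurrently.
func copyLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	return out
}

func main() {
	original := map[string]string{"bucket": "testbucket", "server": "127.0.0.1:9000"}
	snapshot := copyLabels(original)
	original["bucket"] = "otherbucket" // later mutation no longer affects the served metric
	fmt.Println(snapshot["bucket"])    // still "testbucket"
}
```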
Harshavardhana f527c708f2
run gofumpt cleanup across code-base (#14015) 2022-01-02 09:15:06 -08:00
Harshavardhana aa508591c1
cache only metrics served from the disks (#13940)
do not need to cache in-memory instant metrics
2021-12-17 11:40:09 -08:00
Harshavardhana 818f0201fc
re-implement prometheus metrics endpoint to be simpler (#13922)
data structures were repeatedly initialized,
which causes GC pressure; instead, re-use the
collectors.

Initialize collectors in `init()`, and also make
sure to honor the cache semantics for performance
requirements.

Avoid a global map and a global lock for metrics
lookup; instead let them all be lock-free unless
the cache is being invalidated.
2021-12-17 10:11:04 -08:00
Anis Elleuch 4caed7cc0d
metrics: Add replication latency metrics (#13515)
Add a new Prometheus metric for bucket replication latency

e.g.:
minio_bucket_replication_latency_ns{
    bucket="testbucket",
    operation="upload",
    range="LESS_THAN_1_MiB",
    server="127.0.0.1:9001",
    targetArn="arn:minio:replication::45da043c-14f5-4da4-9316-aba5f77bf730:testbucket"} 2.2015663e+07

Co-authored-by: Klaus Post <klauspost@gmail.com>
2021-11-17 12:10:57 -08:00
Harshavardhana 661b263e77
add gocritic/ruleguard checks back again, cleanup code. (#13665)
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() when needed
- remove all non-Go style comments from the codebase

Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
2021-11-16 09:28:29 -08:00
Anis Elleuch 20761e053e
replication: Fix replica stats during crawling (#13499)
Also show replica stats with an ARN in Prometheus output.
2021-10-22 19:13:50 -07:00
Klaus Post 75699a3825
Add basic scanner metrics (#13317)
Add number of objects/versions/folders scanned as well as ILM action outcomes.
2021-10-02 09:31:05 -07:00
Poorna Krishnamoorthy c4373ef290
Add support for multi site replication (#12880) 2021-09-18 13:31:35 -07:00
Krishnan Parthasarathi cf8abd8888
Add prometheus metrics for ILM tasks (#12933) 2021-08-17 10:21:19 -07:00
Harshavardhana cdeccb5510
feat: Deprecate embedded browser and import console (#12460)
This feature also changes the default port where
the browser runs; the port has now moved
to 9001 and can be configured with

```
--console-address ":9001"
```
2021-06-17 20:27:04 -07:00
Harshavardhana 1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Harshavardhana 7148c2490a
avoid metrics not meant for single drive mode (#12415)
fixes #12414
2021-06-01 12:22:42 -07:00
Poorna Krishnamoorthy 3690de0c6b
Drop Pending size and count from replication metrics (#12378)
Real-time metrics calculated in-memory rely on the initial
replication metrics saved with data usage. However, this can
lag behind the actual state of the cluster at the time of server 
restart, leading to inaccurate Pending size/counts reported to
Prometheus. Dropping the Pending metrics as this can be more 
reliably monitored by applications with replication notifications.

Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
2021-05-31 20:26:52 -07:00
Harshavardhana ebf75ef10d
fix: remove all unused code (#12360) 2021-05-24 09:28:19 -07:00
Harshavardhana d495cb68d3
fix: crash in prometheus metrics collector (#12244)
node_health metrics crash in gateway mode;
ignore node health metrics in gateway mode.

fixes #12243
2021-05-06 15:43:34 -07:00
Nitish Tiwari 776589f0da Add free inode metric for Prometheus (#12225) 2021-05-06 12:50:48 -07:00
Nitish Tiwari c8aa56ccd7
Add node cpu & memory metrics to Prometheus cluster endpoint (#12214) 2021-05-04 10:17:10 -07:00
Harshavardhana 069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Poorna Krishnamoorthy 47c09a1e6f
Various improvements in replication (#11949)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.

- add API to get replication metrics

- add MRF worker to handle spill-over replication operations

- multiple issues found with replication
- fixes an issue where a client sends a bucket
 name with `/` at the end in the SetRemoteTarget
 API call; make sure to trim the bucket name to
 avoid any extra `/`.

- hold write locks in GetObjectNInfo during replication
  to ensure that object version stack is not overwritten
  while reading the content.

- add additional protection during WriteMetadata() to
  ensure that we always write a valid FileInfo{} and avoid
  ever writing empty FileInfo{} to the lowest layers.

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-04-03 09:03:42 -07:00
Ritesh H Shukla 3ddd8b04d1
fix: handle unsupported APIs more granularly (#11674) 2021-03-30 23:19:36 -07:00
Anis Elleuch 2c296652f7
Simplify access to local node name (#11907)
The local node name is heavily used in tracing; create a new global
variable to store it. Multiple goroutines can safely access it since it won't be
changed later.
2021-03-26 11:37:58 -07:00
Klaus Post b383522743
fix "could not read /proc" error on Windows (#11868)
Bonus: Prealloc reasonable sizes for metrics.
2021-03-25 12:58:43 -07:00
Harshavardhana d7f32ad649 xl: avoid sending Delete() remote call for fully successful runs
an optimization to avoid extra syscalls in PutObject()
that add up in our PutObject response times.
2021-03-24 17:32:12 -07:00
Klaus Post 749e9c5771
metrics: Add canceled requests (#11881)
Add metric for canceled requests
2021-03-24 10:25:27 -07:00
Anis Elleuch fad7b27f15
metrics: Change type of minio_s3_requests_waiting_total to gauge (#11884) 2021-03-24 09:06:37 -07:00
Ritesh H Shukla 23b03dadb8
Add process uptime metric (#11844) 2021-03-20 21:23:27 -07:00
Ritesh H Shukla b5dcaaccb4
Introduce metrics caching for performant metrics (#11831) 2021-03-19 00:04:29 -07:00
Harshavardhana 2c198ae7b6
fix: prometheus metrics disks_online count when disks are down (#11689)
prometheus metrics were using the total disk count instead
of the online disk count when disks were down; this
PR fixes that and also adds a new metric for
total_disk_count
2021-03-03 11:18:41 -08:00
Harshavardhana c6a120df0e
fix: Prometheus metrics to re-use storage disks (#11647)
also re-use storage disks for all `mc admin server info`
calls; implement a new LocalStorageInfo() API call at the
ObjectLayer to look up local disks' storageInfo

also fixes bugs where there were double calls to StorageInfo()
2021-03-02 17:28:04 -08:00
Anis Elleuch e8d8dfa3ae
Add metric for internode RPC calls errors (#11669) 2021-03-01 12:31:33 -08:00
Anis Elleuch 98d3f94996
metrics: Add the number of requests in the waiting queue (#11580)
We can use this metric to check whether there are too many S3 clients in the
queue, which could explain why some of those S3 clients are timing out.

```
minio_s3_requests_waiting_total{server="127.0.0.1:9000"} 9981
```

If max_requests is 10000 then there is a strong possibility that clients
are timing out because of the queue deadline.
2021-02-20 00:21:55 -08:00
Ritesh H Shukla 67a8f37df0
fix: disk usage capacity metric reporting (#11435) 2021-02-04 12:26:58 -08:00
Ritesh H Shukla c4848f9b4f
Add process start time to cluster metrics. (#11405) 2021-02-01 23:02:18 -08:00
Ritesh H Shukla 7575c24037
Add open FD and FD limit to cluster metrics (#11328) 2021-01-22 18:30:16 -08:00
Ritesh H Shukla b4add82bb6
Updated Prometheus metrics (#11141)
* Add metrics for nodes online and offline
* Add cluster capacity metrics
* Introduce v2 metrics
2021-01-18 20:35:38 -08:00