Commit Graph

108 Commits

Author SHA1 Message Date
Harshavardhana e98172d72d
avoid hot-tier SLA to be tied to warm-tier SLA (#18581)
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.

Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
2023-12-02 13:02:12 -08:00
Krishnan Parthasarathi a50f26b7f5
Implement batch-expiration for objects (#17946)
Based on an initial PR from -
https://github.com/minio/minio/pull/17792

But fully completes it with newer finalized YAML spec.
2023-12-02 02:51:33 -08:00
Anis Eleuch fe63664164
prom: Add drive failure tolerance per erasure set (#18424) 2023-11-13 00:59:48 -08:00
vicmunoz da95a2d13f
fix: object versions metric help (#18388) 2023-11-03 11:43:52 -07:00
Klaus Post 128256e3ab
Add event counters (#18232)
Export metric for global events sent and skipped for the lifetime of the server.
2023-10-12 15:39:22 -07:00
Harshavardhana 6829ae5b13
completely remove drive caching layer from gateway days (#18217)
This has already been deprecated for close to a year now.
2023-10-11 21:18:17 -07:00
Matthew Toohey f731e7ea36
Fix current_send_in_progress metric always being zero (#18160) 2023-10-09 17:28:17 -07:00
Shireesh Anjal 6d20ec3bea
Add support for resource metrics (#18057)
Add a new endpoint for "resource" metrics `/v2/metrics/resource`

This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.

Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.

Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
2023-09-30 13:40:20 -07:00
Harshavardhana 9788d85ea3
remove logging for invalid metadata values (#18068) 2023-09-20 15:49:55 -07:00
Poorna 812f5a02d7
metrics: fix panic in replication stats reporting (#17979) 2023-09-05 10:26:18 -07:00
Poorna b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Harshavardhana ba4566e86d
add missing IAM node metrics to cluster and node endpoint (#17908) 2023-08-24 09:26:37 -07:00
Harshavardhana c4ca0a5a57
add two more drive metrics when metrics is available (#17854) 2023-08-15 10:55:47 -07:00
Yang Wu 23e4895dfc
Create metrics slice when necessary (#17809) 2023-08-07 02:21:22 -07:00
Harshavardhana 2fa561f22e
do not crash on invalid metric values (#17764)
```
minio[1032735]: panic: label value "\xc0.\xc0." is not valid UTF-8
minio[1032735]: goroutine 1781101 [running]:
minio[1032735]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
```

log such errors for investigation
2023-08-01 00:55:39 -07:00
Harshavardhana 114fab4c70
export cluster health as prometheus metrics (#17741) 2023-07-28 01:16:53 -07:00
drivebyer a7fb3a3853
fix: Create metrics slice when necessary in getCacheMetrics() (#17711) 2023-07-24 08:40:21 -07:00
Krishnan Parthasarathi 9eeee92d36
Add deletemarker_total metric (#17689) 2023-07-20 07:52:32 -07:00
Harshavardhana 6426b74770
move bucket centric metrics to /minio/v2/metrics/bucket handlers (#17663)
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.

some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.

Bonus: Added new cluster metrics for usage, objects and histograms
2023-07-18 22:25:12 -07:00
drivebyer 04c792476f
fix: provide a possible slice cap for heal failed metrics items (#17647)
Signed-off-by: Wu <yang.wu@daocloud.io>
2023-07-14 11:02:45 -07:00
Harshavardhana 5b7c83341b
move per bucket metrics to peer location (#17627) 2023-07-11 07:46:24 -07:00
Anis Eleuch 6d0bc5ab1e
prometheus: Fix internode stats (#17594)
Internode calculation was done inside S3 handlers, fix it by moving it
to internode handlers.

Remove admin stats since it is not used.
2023-07-08 07:35:11 -07:00
Harshavardhana abb1f22057 Revert "change ttfb_distribution metrics to histogramMetric (#17115)"
This reverts commit 9112ca4e29.
2023-07-07 13:57:37 -07:00
Harshavardhana 7605d07bb2
add support for bucket level request count per API (#17468)
New metrics added to calculate API request count
per bucket, per API.  Captures errors, including
4xx, 5xx HTTP status codes separately.
2023-06-21 09:41:59 -07:00
Aditya Manthramurthy 5a1612fe32
Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
drivebyer ad2ab6eb3e
fix: Give accurate cap to slice (#17224) 2023-05-17 15:14:09 -07:00
jiuker 9a799065b3
fix: make slice cap of right size (#17192) 2023-05-16 08:10:07 -07:00
Shireesh Anjal c326e5a34e
Add metrics for webhook endpoint stats (#17179) 2023-05-11 11:24:37 -07:00
Harshavardhana 5569acd95c
disallow EC:0 if not set during server startup (#17141) 2023-05-04 14:44:30 -07:00
Harshavardhana 9112ca4e29
change ttfb_distribution metrics to histogramMetric (#17115) 2023-05-03 07:31:00 -07:00
Anis Eleuch a42650c065
Add minio_bucket_usage_version_total metric to Prometheus (#17023) 2023-04-12 20:08:07 -07:00
Harshavardhana 3b7781835e
add lock metrics per node (#16943) 2023-04-03 21:23:24 -07:00
Klaus Post d85da9236e
Add Object Version count histogram (#16739) 2023-03-10 08:53:59 -08:00
Harshavardhana 901887e6bf
feat: add lambda transformation functions target (#16507) 2023-03-07 08:12:41 -08:00
ferhat elmas 714283fae2
cleanup ignored static analysis (#16767) 2023-03-06 08:56:10 -08:00
Aditya Manthramurthy 8cde38404d
Add metrics for custom auth plugin (#16701) 2023-02-27 09:55:18 -08:00
Andreas Auernhammer 74887c7372
kms: add support for KES API keys and switch to KES Go SDK (#16617)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
2023-02-14 07:19:20 -08:00
Anis Elleuch e05205756f
metrics: Add more logs when unable to read bucket usage (#16405) 2023-01-13 02:32:00 +05:30
Anis Elleuch 27417459fb
metrics: Show healing info for all nodes (#16315) 2022-12-26 08:35:32 -08:00
Harshavardhana 2433698372
fix: remove unnecessary logs for client conn errors (#16261) 2022-12-15 08:25:05 -08:00
Aditya Manthramurthy a30cfdd88f
Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Anis Elleuch 932d2c3c62
Add X-Amz-Request-Id to internode calls (#16146) 2022-12-06 09:27:26 -08:00
Harshavardhana 5a8df7efb3
re-implement StorageInfo to be a peer call (#16155) 2022-12-01 14:31:35 -08:00
Klaus Post 5b242f1d11
Add Audit target metrics (#16044) 2022-11-10 10:20:21 -08:00
Klaus Post bbc312fce6
Add notification queue metrics (#16026) 2022-11-08 16:36:47 -08:00
Harshavardhana 23b329b9df
remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Anis Elleuch 048a46ec2a
Add RPC tcp timeout/errs and AVG duration to prometheus (#15747) 2022-09-26 09:04:26 -07:00
Poorna 6b9fd256e1
Persist in-memory replication stats to disk (#15594)
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.

This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the 
MRF queue so that entries in `PENDING` status are healed faster.
2022-09-12 12:40:02 -07:00
ebozduman b57e7321e7
Replaces 'disk'=>'drive' visible to end user (#15464) 2022-08-04 16:10:08 -07:00
Harshavardhana 5e763b71dc
use logger.LogOnce to reduce printing disconnection logs (#15408)
fixes #15334

- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
2022-07-27 09:44:59 -07:00