minio

Commit Graph

Author	SHA1	Message	Date
Aditya Manthramurthy	b2c5b75efa	feat: Add Metrics V3 API (#19068 ) Metrics v3 is mainly a reorganization of metrics into smaller groups of metrics and the removal of internal aggregation of metrics received from peer nodes in a MinIO cluster. This change adds the endpoint `/minio/metrics/v3` as the top-level metrics endpoint and under this, various sub-endpoints are implemented. These are currently documented in `docs/metrics/v3.md`. The handler will serve metrics at any path `/minio/metrics/v3/PATH`, as follows: when PATH is a sub-endpoint listed above, it serves the group of metrics under that path; when PATH is a (non-empty) parent directory of the sub-endpoints listed above, it serves metrics from each child sub-endpoint of PATH; otherwise, it returns a "no resource found" error. All available metrics are listed in `docs/metrics/v3.md`. More will be added subsequently.	2024-03-10 01:15:15 -08:00
Harshavardhana	8a698fef71	fix: crash in ResourceMetrics RPC handling concurrent writers (#19123 ) Continuation of #19103 that had fixed the crash in peer metrics for cluster endpoint.	2024-02-25 00:51:38 -08:00
Klaus Post	e06168596f	Convert more peer <--> peer REST calls (#19004 ) * Convert more peer <--> peer REST calls * Clean up in general. * Add JSON wrapper. * Add slice wrapper. * Add option to make handler return nil error if no connection is given, `IgnoreNilConn`. Converts the following: `HandlerGetMetrics`, `HandlerGetResourceMetrics`, `HandlerGetMemInfo`, `HandlerGetProcInfo`, `HandlerGetOSInfo`, `HandlerGetPartitions`, `HandlerGetNetInfo`, `HandlerGetCPUs`, `HandlerServerInfo`, `HandlerGetSysConfig`, `HandlerGetSysServices`, `HandlerGetSysErrors`, `HandlerGetAllBucketStats`, `HandlerGetBucketStats`, `HandlerGetSRMetrics`, `HandlerGetPeerMetrics`, `HandlerGetMetacacheListing`, `HandlerUpdateMetacacheListing`, `HandlerGetPeerBucketMetrics`, `HandlerStorageInfo`, `HandlerGetLocks`, `HandlerBackgroundHealStatus`, `HandlerGetLastDayTierStats`, `HandlerSignalService`, `HandlerGetBandwidth`.	2024-02-19 14:54:46 -08:00
Harshavardhana	607cafadbc	converge clusterRead health into cluster health (#19063 )	2024-02-15 16:48:36 -08:00
Harshavardhana	0c068b15c7	add missing handler for reloading site replication config on peers (#19042 )	2024-02-13 06:55:54 -08:00
Harshavardhana	62761a23e6	remove unnecessary metrics in 'mc admin info' output (#19020 ) Reduce the amount of data transfer on large deployments	2024-02-08 19:28:46 -08:00
Frank Wessels	31743789dc	Fix some leftover issues from PR 18936 (#18946 )	2024-02-01 19:42:56 -08:00
Harshavardhana	6440d0fbf3	move a collection of peer APIs to websockets (#18936 )	2024-02-01 10:47:20 -08:00
Klaus Post	6da4a9c7bb	Improve tracing & notification scalability (#18903 ) * Perform JSON encoding on remote machines and only forward byte slices. * Migrate tracing & notification to WebSockets.	2024-01-30 12:49:02 -08:00
Harshavardhana	1d3bd02089	avoid close 'nil' panics if any (#18890 ) Brings a generic implementation that prints a stack trace when close() is called on a 'nil' channel, and otherwise closes the channel safely.	2024-01-28 10:04:17 -08:00
Harshavardhana	88837fb753	add new update v2 that updates per node, allows idempotent behavior (#18859 ) The new API ensures that the binary is correct and can be downloaded, checksummed and verified; that it is committed to the actual path; and that restart returns the relevant waiting drives.	2024-01-26 08:40:13 -08:00
Harshavardhana	f9b4a8d6e8	improve server update behavior by re-using memory properly (#18831 )	2024-01-19 18:27:58 -08:00
Harshavardhana	ac81f0248c	introduce new ServiceV2 API to handle guided restarts (#18826 ) New API now verifies any hung disks before restart/stop, provides a 'per node' break down of the restart/stop results. Provides also how many blocked syscalls are present on the drives and what users must do about them. Adds options to do pre-flight checks to provide information to the user regarding any hung disks. Provides 'force' option to forcibly attempt a restart() even with waiting syscalls on the drives.	2024-01-19 14:22:36 -08:00
Zhou Ting	31d16f6cc2	allow sha256 payload to be configurable for object perf test (#18712 ) Signed-off-by: Zhou Ting <ting.z.zhou@intel.com>	2023-12-29 23:56:50 -08:00
Anis Eleuch	8432fd5ac2	prom: Add online and healing drives metrics per erasure set (#18700 )	2023-12-21 16:56:43 -08:00
Shireesh Anjal	f6e581ce54	Capture network device info in health report (#18381 )	2023-11-02 09:49:49 -07:00
Shireesh Anjal	bf1c6edb76	Revert "Capture network device info in health report" (#18241 ) Introducing a new version of healthinfo struct for adding this info is not correct. It needs to be implemented differently without adding a new version. This reverts commit 8737025d940f80360ed4b3686b332db5156f6659.	2023-10-13 07:46:36 -07:00
Shireesh Anjal	a66a7f3e97	Capture network device info in health report (#18213 )	2023-10-12 15:33:31 -07:00
Shireesh Anjal	6d20ec3bea	Add support for resource metrics (#18057 ) Add a new endpoint for "resource" metrics `/v2/metrics/resource` This should return system metrics related to drives, network, CPU and memory. Except for drives, other metrics should have corresponding "avg" and "max" values also. Reuse the real-time feature to capture the required data, introducing CPU and memory metrics in it. Collect the data every minute and keep updating the average and max values accordingly, returning the latest values when the API is called.	2023-09-30 13:40:20 -07:00
Aditya Manthramurthy	1c99fb106c	Update to minio/pkg/v2 (#17967 )	2023-09-04 12:57:37 -07:00
Poorna	b48bbe08b2	Add additional info for replication metrics API (#17293 ) to track the replication transfer rate across different nodes, number of active workers in use and in-queue stats to get an idea of the current workload. This PR also adds replication metrics to the site replication status API. For site replication, prometheus metrics are no longer at the bucket level - but at the cluster level. Add prometheus metric to track credential errors since uptime	2023-08-30 01:00:59 -07:00
Harshavardhana	3a0125fa1f	remove unexpected logging from peer calls (#17888 ) also make sure RequestID is set for system logs	2023-08-21 14:25:24 -07:00
drivebyer	14ebd82dbd	fix: missing disk metrics when query metric api from peer (#17738 )	2023-07-27 11:44:13 -07:00
jiuker	a99cd825ab	fix: byHost realTime metrics API (#17681 )	2023-07-18 23:50:30 -07:00
Harshavardhana	6426b74770	move bucket centric metrics to /minio/v2/metrics/bucket handlers (#17663 ) Users/customers no longer have a reasonable number of buckets, so we must avoid overpopulating cluster endpoints and instead move bucket monitoring to a separate endpoint. This is a breaking change for a couple of metrics, but it is imperative that we do it to improve the responsiveness of our Prometheus cluster endpoint. Bonus: added new cluster metrics for usage, objects and histograms.	2023-07-18 22:25:12 -07:00
Poorna	5e2f8d7a42	replication: Simplify mrf requeueing and add backlog handler (#17171 ) Simplify MRF queueing and add backlog handler - Limit re-tries to 3 to avoid repeated re-queueing. Fall offs to be re-tried when the scanner revisits this object or upon access. - Change MRF to have each node process only its MRF entries. - Collect MRF backlog by the node to allow for current backlog visibility	2023-07-12 23:51:33 -07:00
Aditya Manthramurthy	5a1612fe32	Bump up madmin-go and pkg deps (#17469 )	2023-06-19 17:53:08 -07:00
Harshavardhana	2f9e2147f5	allow quota enforcement to rely on older values (#17351 ) PUT calls cannot afford large latency build-ups due to a contended usage.json, or worse, failing with some unexpected error. This can happen when the file is concurrently being updated by the scanner or healed during a disk replacement heal. Although these operations are fairly quick in theory, stressed clusters can quickly show visible latency, which can add up and lead to invalid errors returned during PUT. It is perhaps okay to relax this error return requirement and instead log that we are proceeding to accept requests while quota enforcement uses an older value. These things will reconcile themselves eventually, as the scanner overwrites usage.json. Bonus: make sure that storage-rest-client sets ExpectTimeouts to 'true', so that a DiskInfo() call with contextTimeout does not prematurely disconnect the servers, leading to a longer healthCheck back-off routine. This can easily pile up while also causing active callers to disconnect, leading to quorum loss. DiskInfo is actively used in the PUT and Multipart call paths for upgrading parity when disks are down; it in turn shouldn't cause more disks to go down.	2023-06-05 16:56:35 -07:00
Shireesh Anjal	a3d666356c	fix: error in capturing XFS error config in health report (#17176 )	2023-05-10 15:20:48 -07:00
Poorna	a6057c35cc	Avoid peer notification when peer is offline, tune retries (#16737 )	2023-03-07 08:13:28 -08:00
Klaus Post	9acf1024e4	Remove bloom filter (#16682 ) Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner. Also removes a tiny bit of CPU by each write operation.	2023-02-24 09:03:31 +05:30
Daniel Valdivia	fb17f97cf3	move audit and logger message structure to minio/pkg (#16655 ) Signed-off-by: Daniel Valdivia <18384552+dvaldivia@users.noreply.github.com>	2023-02-21 21:21:17 -08:00
jiuker	1828fb212a	fix: avoid goroutine leak after timeouts in PeerMetrics (#16569 )	2023-02-08 09:11:16 -08:00
Poorna	1b02e046c2	Fix bandwidth monitoring to be per remote target (#16360 )	2023-01-19 18:52:16 +05:30
Aditya Manthramurthy	a30cfdd88f	Bump up madmin-go to v2 (#16162 )	2022-12-06 13:46:50 -08:00
Klaus Post	a713aee3d5	Run staticcheck on CI (#16170 )	2022-12-05 11:18:50 -08:00
Harshavardhana	5a8df7efb3	re-implement StorageInfo to be a peer call (#16155 )	2022-12-01 14:31:35 -08:00
Poorna	d6bc141bd1	feat: Add support for site level resync (#15753 )	2022-11-14 07:16:40 -08:00
Klaus Post	71954faa3a	mark pubsub type safe via generics (#15961 )	2022-10-28 10:55:42 -07:00
Krishnan Parthasarathi	4523da6543	feat: introduce pool-level rebalance (#15483 )	2022-10-25 12:36:57 -07:00
Harshavardhana	2a13cc28f2	feat: implement support batch replication (#15554 )	2022-10-05 23:00:43 -07:00
Poorna	6b9fd256e1	Persist in-memory replication stats to disk (#15594 ) to avoid relying on scanner-calculated replication metrics. This will improve the accuracy of the replication stats reported. This PR also adds on to #15556 by handing replication traffic that could not be queued by available workers to the MRF queue so that entries in `PENDING` status are healed faster.	2022-09-12 12:40:02 -07:00
Anis Elleuch	b8cdf060c8	Properly replicate policy mapping for virtual users (#15558 ) Currently, replicating policy mapping for STS users does not work. Fix it by passing the user type to PolicyDBSet.	2022-08-23 11:11:45 -07:00
Anis Elleuch	5682685c80	Introduce disk io stats metrics (#15512 )	2022-08-16 07:13:49 -07:00
Cesar Celis Hernandez	8ec888d13d	feat: update binary once and push it to other servers (#15407 )	2022-07-29 08:34:30 -07:00
Anis Elleuch	e4b51235f8	upgrade: Split in two steps to ensure a stable retry (#15396 ) Currently, if one server in a distributed setup fails to upgrade for any reason, it is not possible to upgrade again unless nodes are restarted. To fix this, split the upgrade process into two steps: download the new binary on all servers; then, if successful, overwrite the old binary with the new one.	2022-07-25 17:49:47 -07:00
Harshavardhana	b4eb74f5ff	allow custom speedtest bucket (#15271 ) this allows for specifying existing buckets with - object replication enabled - object encryption enabled - object versioning enabled - object locking enabled	2022-07-12 10:12:47 -07:00
Klaus Post	ac055b09e9	Add detailed scanner metrics (#15161 )	2022-07-05 14:45:49 -07:00
Harshavardhana	c7ed6eee5e	fix: background local test also via channel (#15086 ) current implementation for `standalone` setups was blocking the `perf drive`. Bonus: remove all old unused complicated code.	2022-06-15 14:51:42 -07:00
Harshavardhana	f8650a3493	fetch bucket replication stats across peers in single call (#14956 ) current implementation relied on recursively calling one bucket at a time across all peers, this would be very slow and chatty when there are 100's of buckets which would mean 100*peerCount amount of network operations. This PR attempts to reduce this entire call into `peerCount` amount of network calls only. This functionality addresses also a concern where the Prometheus metrics would significantly slow down when one of the peers is offline.	2022-05-23 09:15:30 -07:00

170 Commits