Commit Graph

247 Commits

Author SHA1 Message Date
Poorna 27d02ea6f7
metrics: add replication metrics on proxied requests (#18957) 2024-02-05 22:00:45 -08:00
Harshavardhana 100c35c281
avoid excessive logs when peer is down (#18969) 2024-02-04 23:25:42 -08:00
Harshavardhana 1d3bd02089
avoid close 'nil' panics if any (#18890)
brings a generic implementation that
prints a stack trace for 'nil' channel
closes(), if not safely closes it.
2024-01-28 10:04:17 -08:00
Harshavardhana 88837fb753
add new update v2 that updates per node, allows idempotent behavior (#18859)
add new update v2 that updates per node, allows idempotent behavior

new API ensures that

- binary is correct and can be downloaded checksummed verified
- committed to actual path
- restart returns back the relevant waiting drives
2024-01-26 08:40:13 -08:00
Harshavardhana 961f7dea82
compress binary while sending it to all the nodes (#18837)
Also limit the amount of concurrency when sending
binary updates to peers, avoid high network over
TX that can cause disconnection events for the
node sending updates.
2024-01-22 12:16:36 -08:00
Harshavardhana f9b4a8d6e8
improve server update behavior by re-using memory properly (#18831) 2024-01-19 18:27:58 -08:00
Harshavardhana ac81f0248c
introduce new ServiceV2 API to handle guided restarts (#18826)
New API now verifies any hung disks before restart/stop,
provides a 'per node' break down of the restart/stop results.

Provides also how many blocked syscalls are present on the
drives and what users must do about them.

Adds options to do pre-flight checks to provide information
to the user regarding any hung disks. Provides 'force' option
to forcibly attempt a restart() even with waiting syscalls
on the drives.
2024-01-19 14:22:36 -08:00
Anis Eleuch 8432fd5ac2
prom: Add online and healing drives metrics per erasure set (#18700) 2023-12-21 16:56:43 -08:00
Anis Eleuch aed7a1818a
info: Populate pool/set/disk indexes for offline disks (#18613)
This can be calculated from the disk layout and some external
applications would like to know the location of the offline
disks.
2023-12-08 08:13:04 -08:00
bestgopher 95d6f43cc8
fix(cmd/notification.go): no error when retry successful (#18530) 2023-11-27 22:41:03 -08:00
Shireesh Anjal f6e581ce54
Capture network device info in health report (#18381) 2023-11-02 09:49:49 -07:00
Shireesh Anjal bf1c6edb76
Revert "Capture network device info in health report" (#18241)
Introducing a new version of healthinfo struct for adding this info is
not correct. It needs to be implemented differently without adding a new
version.

This reverts commit 8737025d940f80360ed4b3686b332db5156f6659.
2023-10-13 07:46:36 -07:00
Shireesh Anjal a66a7f3e97
Capture network device info in health report (#18213) 2023-10-12 15:33:31 -07:00
Klaus Post 7cd08594f6
Use better host names for metric errors (#18188)
Typically hosts would end up like this:

```
   "hosts": [
        ":9000",
        ":9000",
        ":9000",
...
```

Also add host name to errors.
2023-10-09 17:27:11 -07:00
Shireesh Anjal 6d20ec3bea
Add support for resource metrics (#18057)
Add a new endpoint for "resource" metrics `/v2/metrics/resource`

This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.

Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.

Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
2023-09-30 13:40:20 -07:00
Shubhendu bfddbb8b40
Embed file in ZIP with custom permissions (#17954)
This change enables embedding files in ZIP with custom permissions.
Also uses default creds for starting MinIO based on inspect data.

Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
2023-09-06 09:24:01 -07:00
Harshavardhana 5b114b43f7
refactor bandwidth throttling for replication target (#17980)
This refactor is to allow using the bandwidth throttling
for other purposes.
2023-09-05 20:21:59 -07:00
Aditya Manthramurthy 1c99fb106c
Update to minio/pkg/v2 (#17967) 2023-09-04 12:57:37 -07:00
Poorna b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Harshavardhana 1c5af7c31a
serialize queueMRFHeal(), add timeouts and avoid normal build-ups (#17886)
we expect a certain level of IOPs and latency so this is okay.

fixes other miscellaneous bugs

- such as hanging on mrfCh <- when the context is canceled
- queuing MRF heal when the context is canceled
- remove unused saveStateCh channel
2023-08-21 16:44:50 -07:00
Harshavardhana 6426b74770
move bucket centric metrics to /minio/v2/metrics/bucket handlers (#17663)
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.

some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.

Bonus: Added new cluster metrics for usage, objects and histograms
2023-07-18 22:25:12 -07:00
Poorna 5e2f8d7a42
replication: Simplify mrf requeueing and add backlog handler (#17171)
Simplify MRF queueing and add backlog handler

- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.

- Change MRF to have each node process only its MRF entries.

- Collect MRF backlog by the node to allow for current backlog visibility
2023-07-12 23:51:33 -07:00
Kaan Kabalak f64d62b01d
Fix style of logOnceIf calls w/unique identifiers (#17631) 2023-07-11 13:17:45 -07:00
Harshavardhana f41edb23e2
add variadic delays in peer notification retries (#17592)
just adds more `jitter` in our retries to avoid
burst flooding for peer calls.
2023-07-07 07:47:38 -07:00
Kaan Kabalak 21fbe88e1f
Print certain log messages once per error (#17484) 2023-06-24 20:29:13 -07:00
Harshavardhana 7605d07bb2
add support for bucket level request count per API (#17468)
New metrics added to calculate API request count
per bucket, per API.  Captures errors, including
4xx, 5xx HTTP status codes separately.
2023-06-21 09:41:59 -07:00
Praveen raj Mani 7c72b25ef0
Add an option to make bucket notifications synchronous (#17406)
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.

For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.

This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
2023-06-20 17:38:59 -07:00
Aditya Manthramurthy 5a1612fe32
Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
Praveen raj Mani 72802a5972
Use 'minio/pkg/sync/errgroup' and 'minio/pkg/workers' (#17069) 2023-04-25 22:57:40 -07:00
Anis Eleuch 6addc7a35d
server-info: Return initializing state properly (#17070) 2023-04-24 09:10:02 -07:00
Krishnan Parthasarathi 56c57e2c53
tier-stats: Avoid repeated logs (#16774) 2023-03-07 16:43:05 -08:00
Poorna a6057c35cc
Avoid peer notification when peer is offline, tune retries (#16737) 2023-03-07 08:13:28 -08:00
Klaus Post 9acf1024e4
Remove bloom filter (#16682)
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.

Also removes a tiny bit of CPU by each write operation.
2023-02-24 09:03:31 +05:30
Poorna 1b02e046c2
Fix bandwidth monitoring to be per remote target (#16360) 2023-01-19 18:52:16 +05:30
Krishnan Parthasarathi 71c95ad0d0
Signal stop-rebalance to all rebalancing pools (#16438) 2023-01-19 06:54:23 +05:30
Aditya Manthramurthy a30cfdd88f
Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Harshavardhana 5a8df7efb3
re-implement StorageInfo to be a peer call (#16155) 2022-12-01 14:31:35 -08:00
Anis Elleuch 7e73fc2870
Implement inspect data API v2 (#15474)
Co-authored-by: Klaus Post <klauspost@gmail.com>
2022-11-02 13:36:38 -07:00
Krishnan Parthasarathi 4523da6543
feat: introduce pool-level rebalance (#15483) 2022-10-25 12:36:57 -07:00
Harshavardhana 23b329b9df
remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Harshavardhana 2a13cc28f2 feat: implement support batch replication (#15554) 2022-10-05 23:00:43 -07:00
Poorna 6b9fd256e1
Persist in-memory replication stats to disk (#15594)
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.

This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the 
MRF queue so that entries in `PENDING` status are healed faster.
2022-09-12 12:40:02 -07:00
Harshavardhana 433b6fa8fe
upgrade golang-lint to the latest (#15600) 2022-08-26 12:52:29 -07:00
Aditya Manthramurthy afbb63a197
Factor out external event notification funcs (#15574)
This change moves external event notification functionality into
`event-notification.go`. This simplifies notification related code.
2022-08-24 06:42:36 -07:00
Anis Elleuch b8cdf060c8
Properly replicate policy mapping for virtual users (#15558)
Currently, replicating policy mapping for STS users does not work. Fix
it is by passing user type to PolicyDBSet.
2022-08-23 11:11:45 -07:00
Anis Elleuch 5682685c80
Introduce disk io stats metrics (#15512) 2022-08-16 07:13:49 -07:00
ebozduman b57e7321e7
Replaces 'disk'=>'drive' visible to end user (#15464) 2022-08-04 16:10:08 -07:00
Cesar Celis Hernandez 8ec888d13d
feat: update binary once and push it to other servers (#15407) 2022-07-29 08:34:30 -07:00
Harshavardhana 5e763b71dc
use logger.LogOnce to reduce printing disconnection logs (#15408)
fixes #15334

- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
2022-07-27 09:44:59 -07:00
Anis Elleuch e4b51235f8
upgrade: Split in two steps to ensure a stable retry (#15396)
Currently, if one server in a distributed setup fails to upgrade 
due to any reasons, it is not possible to upgrade again unless 
nodes are restarted.

To fix this, split the upgrade process into two steps :

- download the new binary on all servers
- If successful, overwrite the old binary with the new one
2022-07-25 17:49:47 -07:00