This change enables embedding files in ZIP with custom permissions.
Also uses default creds for starting MinIO based on inspect data.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
we expect a certain level of IOPs and latency so this is okay.
fixes other miscellaneous bugs
- such as hanging on mrfCh <- when the context is canceled
- queuing MRF heal when the context is canceled
- remove unused saveStateCh channel
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.
some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.
Bonus: Added new cluster metrics for usage, objects and histograms
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.
For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.
This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.
This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the
MRF queue so that entries in `PENDING` status are healed faster.
fixes#15334
- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
Currently, if one server in a distributed setup fails to upgrade
due to any reasons, it is not possible to upgrade again unless
nodes are restarted.
To fix this, split the upgrade process into two steps :
- download the new binary on all servers
- If successful, overwrite the old binary with the new one
Add cluster info to inspect and profiling archive.
In addition to the existing data generation for both inspect and profiling,
cluster.info file is added. This latter contains some info of the cluster.
The generation of cluster.info is is done as the last step and it can fail
if it exceed 10 seconds.
peerOnlineCounter was making NxN calls to many peers, this
can be really long and tedious if there are random servers
that are going down.
Instead we should calculate online peers from the point of
view of "self" and return those online and offline appropriately
by performing a healthcheck.
current implementation relied on recursively calling one bucket
at a time across all peers, this would be very slow and chatty
when there are 100's of buckets which would mean 100*peerCount
amount of network operations.
This PR attempts to reduce this entire call into `peerCount`
amount of network calls only. This functionality addresses also a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.