to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.
this would cause legitimately expired objects to be missed
from expiration.
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.
```
<LifecycleConfiguration>
<Rule>
<ID>ClassADocRule</ID>
<Filter>
<Prefix>classA/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>3650</Days>
<ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
</Expiration>
</Rule>
...
```
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.
Also removes a tiny bit of CPU by each write operation.
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.
This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.
For longer locks like scanner keep the 1sec interval
Queue failed/pending replication for healing during listing and GET/HEAD
API calls. This includes healing of existing objects that were never
replicated or those in the middle of a resync operation.
This PR also fixes a bug in ListObjectVersions where lifecycle filtering
should be done.
Rename Trigger -> Event to be a more appropriate
name for the audit event.
Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
If more than 1M folders (objects or prefixes) are found at the top level in a bucket allow it to be compacted.
While very suboptimal structure we should limit memory usage at some point.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively