Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- healing must not set the write xattr
because that is the job of active healing
to update. what we need to preserve is
permanent deletes.
- remove older env for drive monitoring and
enable it accordingly, as a global value.
local disk metrics were polluting cluster metrics
Please remove them instead of adding relevant ones.
- batch job metrics were incorrectly kept at bucket
metrics endpoint, move it to cluster metrics.
- add tier metrics to cluster peer metrics from the node.
- fix missing set level cluster health metrics
it is entirely possible that a rebalance process which was running
when it was asked to "stop" it failed to write its last statistics
to the disk.
After this a pool expansion can cause disruption and all S3 API
calls would fail at IsPoolRebalancing() function.
This PRs makes sure that we update rebalance.bin under such
conditions to avoid any runtime crashes.
add new update v2 that updates per node, allows idempotent behavior
new API ensures that
- binary is correct and can be downloaded checksummed verified
- committed to actual path
- restart returns back the relevant waiting drives
do not need to be defensive in our approach,
we should simply override anything everything
in import process, do not care about what
currently exists on the disk - backup is the
source of truth.
Right now the format.json is excluded if anything within `.minio.sys` is requested.
I assume the check was meant to exclude only if it was actually requesting it.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
protection was in place. However, it covered only some
areas, so we re-arranged the code to ensure we could hold
locks properly.
Along with this, remove the DataShardFix code altogether,
in deployments with many drive replacements, this can affect
and lead to quorum loss.
Also limit the amount of concurrency when sending
binary updates to peers, avoid high network over
TX that can cause disconnection events for the
node sending updates.
If site replication is enabled, we should still show the size and
version distribution histogram metrics at bucket level.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
New API now verifies any hung disks before restart/stop,
provides a 'per node' break down of the restart/stop results.
Provides also how many blocked syscalls are present on the
drives and what users must do about them.
Adds options to do pre-flight checks to provide information
to the user regarding any hung disks. Provides 'force' option
to forcibly attempt a restart() even with waiting syscalls
on the drives.
On a policy detach operation, if there are no policies remaining
attached to the user/group, remove the policy mapping file, instead of
leaving a file containing an empty list of policies.
Healing dangling buckets is conservative, and it is a typical use case to
fail to remove a dangling bucket because it contains some data because
healing danging bucket code is not allowed to remove data: only healing
the dangling object is allowed to do so.
reference format is constant for any lifetime of
a minio cluster, we do not have to ever replace
it during HealFormat() as it will never change.
additionally we should simply reject reference
formats that we do not understand early on.
GetActualSize() was heavily relying on o.Parts()
to be non-empty to figure out if the object is multipart or not,
However, we have many indicators of whether an object is multipart
or not.
Blindly assuming that o.Parts == nil is not a multipart, is an
incorrect expectation instead, multipart must be obtained via
- Stored metadata value indicating this is a multipart encrypted object.
- Rely on <meta>-actual-size metadata to get the object's actual size.
This value is preserved for additional reasons such as these.
- ETag != 32 length
support proxying of tagging requests in active-active replication
Note: even if proxying is successful, PutObjectTagging/DeleteObjectTagging
will continue to report a 404 since the object is not present locally.
New intervals:
[1024B, 64KiB)
[64KiB, 256KiB)
[256KiB, 512KiB)
[512KiB, 1MiB)
The new intervals helps us see object size distribution with higher
resolution for the interval [1024B, 1MiB).
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.