skip healing properly in scanner when drive is hotplugged
due to how the state is passed around the SkipHealing
might not be the true state() of the system always, causing
a situation where we might healing from the scanner on the
same drive which is being. Due to this competing heals get
triggered that slow each other down.
due to a historic bug in CopyObject() where
an inlined object loses its metadata, the
check causes an incorrect fallback verifying
data-dir.
CopyObject() bug was fixed in ffa91f9794 however
the occurrence of this problem is historic, so
the aforementioned check is stretching too much.
Bonus: simplify fileInfoRaw() to read xl.json as well,
also recreate buckets properly.
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
- handle errFileCorrupt properly
- micro-optimization of sending done() response quicker
to close the goroutine.
- fix logger.Event() usage in a couple of places
- handle the rest of the client to return a different error other than
lastErr() when the client is closed.
instead upon any error in renameData(), we still
preserve the existing dataDir in some form for
recoverability in strange situations such as out
of disk space type errors.
Bonus: avoid running list and heal() instead allow
versions disparity to return the actual versions,
uuid to heal. Currently limit this to 100 versions
and lesser disparate objects.
an undo now reverts back the xl.meta from xl.meta.bkp
during overwrites on such flaky setups.
Bonus: Save N depth syscalls via skipping the parents
upon overwrites and versioned updates.
Flaky setup examples are stretch clusters with regular
packet drops etc, we need to add some defensive code
around to avoid dangling objects.
since mid 2018 we do not have any deployments
without deployment-id, it is time to put this
code to rest, this PR removes this old code as
its no longer valuable.
on setups with 1000's of drives these are all
quite expensive operations.
the disk location never changes in the lifetime of a
MinIO cluster, even if it did validate this close to the
disk instead at the higher layer.
Return appropriate errors indicating an invalid drive, so
that the drive is not recognized as part of a valid
drive.
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
we were prematurely not writing 4k pages while we
could have due to the fact that most buffers would
be multiples of 4k upto some number and there shall
be some remainder.
We only need to write the remainder without O_DIRECT.
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
the same object serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
@klauspost
- Add error logs when expiryState and transition state are not
initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
data-dir not being present is okay, however we can still
rely on the `rename()` atomic call instead of relying on
write xl.meta write which may truncate the io.EOF.
the PR in #16541 was incorrect and hand wrong assumptions
about the overall setup, revert this since this expectation
to have offline servers is wrong and we can end up with a
bigger chicken and egg problem.
This reverts commit 5996c8c4d5.
Bonus:
- preserve disk in globalLocalDrives properly upon connectDisks()
- do not return 'nil' from newXLStorage(), getting it ready for
the next set of changes for 'format.json' loading.
a/prefix
a/prefix/1.txt
where `a/prefix` is an object which does not have `/` at the end,
we do not have to aggressively recursively delete all the sub-folders
as well. Instead convert the call into self contained to deleting
'xl.meta' and then subsequently attempting to Remove the parent.
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.