Healing a large object with a normal scan mode where no parts read
is involved can still fail after 30 seconds if an object has
There are too many parts when hard disks are being used mainly.
The reason is there is a general deadline that checks for all parts we
do a deadline per part.
- optimize writing part.N.meta by writing both part.N
and its meta in sequence without network component.
- remove part.N.meta, part.N which were partially success
ful, in quorum loss situations during renamePart()
- allow for strict read quorum check arbitrated via ETag
for the given part number, this makes it double safer
upon final commit.
- return an appropriate error when read quorum is missing,
instead of returning InvalidPart{}, which is non-retryable
error. This kind of situation can happen when many
nodes are going offline in rotation, an example of such
a restart() behavior is statefulset updates in k8s.
fixes#20091
deadlines per moveToTrash() allows for a more granular timeout
approach for syscalls, instead of an aggregate timeout.
This PR also enhances multipart state cleanup to be optimal by
removing 100's of multipart network rename() calls into single
network call.
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
- handle errFileCorrupt properly
- micro-optimization of sending done() response quicker
to close the goroutine.
- fix logger.Event() usage in a couple of places
- handle the rest of the client to return a different error other than
lastErr() when the client is closed.
instead upon any error in renameData(), we still
preserve the existing dataDir in some form for
recoverability in strange situations such as out
of disk space type errors.
Bonus: avoid running list and heal() instead allow
versions disparity to return the actual versions,
uuid to heal. Currently limit this to 100 versions
and lesser disparate objects.
an undo now reverts back the xl.meta from xl.meta.bkp
during overwrites on such flaky setups.
Bonus: Save N depth syscalls via skipping the parents
upon overwrites and versioned updates.
Flaky setup examples are stretch clusters with regular
packet drops etc, we need to add some defensive code
around to avoid dangling objects.
RenameData could start operating on inline data after timing out
and the call returned due to WithDeadline.
This could cause a buffer to write to the inline data being written.
Since no writes are in `RenameData` and the call is canceled,
this doesn't present a corruption issue. But a race is a race and
should be fixed.
Copy inline data to a fresh buffer.
since mid 2018 we do not have any deployments
without deployment-id, it is time to put this
code to rest, this PR removes this old code as
its no longer valuable.
on setups with 1000's of drives these are all
quite expensive operations.
the disk location never changes in the lifetime of a
MinIO cluster, even if it did validate this close to the
disk instead at the higher layer.
Return appropriate errors indicating an invalid drive, so
that the drive is not recognized as part of a valid
drive.
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
Add a new function logger.Event() to send the log to Console and
http/kafka log webhooks. This will include some internal events such as
disk healing and rebalance/decommissioning
disk tokens usage is not necessary anymore with the implementation
of deadlines for storage calls and active monitoring of the drive
for I/O timeouts.
Functionality kicking off a bad drive is still supported, it's just that
we do not have to serialize I/O in the manner tokens would do.
Each Put, List, Multipart operations heavily rely on making
GetBucketInfo() call to verify if bucket exists or not on
a regular basis. This has a large performance cost when there
are tons of servers involved.
We did optimize this part by vectorizing the bucket calls,
however its not enough, beyond 100 nodes and this becomes
fairly visible in terms of performance.
- healing must not set the write xattr
because that is the job of active healing
to update. what we need to preserve is
permanent deletes.
- remove older env for drive monitoring and
enable it accordingly, as a global value.
- Move RenameFile to websockets
- Move ReadAll that is primarily is used
for reading 'format.json' to to websockets
- Optimize DiskInfo calls, and provide a way
to make a NoOp DiskInfo call.
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.
Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.
Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.
Fixes races like:
```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
github.com/minio/minio/cmd.readFileInfo()
github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
github.com/minio/minio/cmd/erasure-object.go:828 +0x61
Previous write at 0x00c130b29d10 by goroutine 470298:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33
Previous read at 0x00c0ba6e6c00 by goroutine 94463:
github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
github.com/minio/minio/cmd.getBucketInfoLocal.func1()
github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```
Probably back from #17701
- HealFormat() was leaking healthcheck goroutines for
disks, we are only interested in enabling healthcheck
for the newly formatted disk, not for existing disks.
- When disk is a root-disk a random disk monitor was
leaking while we ignored the drive.
- When loading the disk for each erasure set, we were
leaking goroutines for the prepare-storage.go disks
which were replaced via the globalLocalDrives slice
- avoid disk monitoring utilizing health tokens that
would cause exhaustion in the tokens, prematurely
which were meant for incoming I/O. This is ensured
by avoiding writing O_DIRECT aligned buffer instead
write 2048 worth of content only as O_DSYNC, which is
sufficient.
Add a hidden configuration under the scanner sub section to configure if
the scanner should sleep between two objects scan. The configuration has
only effect when there is no drive activity related to s3 requests or
healing.
By default, the code will keep the current behavior which is doing
sleep between objects.
To forcefully enable the full scan speed in idle mode, you can do this:
`mc admin config set myminio scanner idle_speed=full`
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto`
```
A new debugging tool is available to check for any missing counters.
`(*xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`.
The wrapped reader is handed to `(*xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`,
which reads into an `ODirectPool` buffer.
`(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading,
the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused.
However, if the async `Read` call unblocks, it will write to the now recycled buffer.
Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts.
This is not a disk timeout, anyway.
Regression in https://github.com/minio/minio/pull/17745
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
// Responses from the remote server.
// Channel will be closed after an error or when the remote closes.
// All responses *must* be read by the caller until either an error is returned or the channel is closed.
// Canceling the context will cause the context cancellation error to be returned.
Responses <-chan Response
// Requests sent to the server.
// If the handler is defined with 0 incoming capacity this will be nil.
// Channel *must* be closed to signal the end of the stream.
// If the request context is canceled, the stream will no longer process requests.
Requests chan<- []byte
}
type Response struct {
Msg []byte
Err error
}
```
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
sendfile implementation to perform DMA on all platforms
Go stdlib already supports sendfile/splice implementations
for
- Linux
- Windows
- *BSD
- Solaris
Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation
The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.
There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.
With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
currently the default for all drives is 512, which is a lot
for HDDs the recent testing has revealed moving this to 32
for HDDs seems like a fair value.
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.
this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
Bonus:
- avoid calling DiskInfo() calls when missing blocks
instead heal the object using MRF operation.
- change the max_sleep to 250ms beyond that we will
not stop healing.
health checks were missing for drives replaced since
- HealFormat() would replace the drives without a health check
- disconnected drives when they reconnect via connectEndpoint()
the loop also loses health checks for local disks and merges
these into a single code.
- other than this separate cleanUp, health check variables to avoid
overloading them with similar requirements.
- also ensure that we compete via context selector for disk monitoring
such that the canceled disks don't linger around longer waiting for
the ticker to trigger.
- allow disabling active monitoring.