The main goal of this PR is to solve the situation where disks stop
responding to operations. This generally causes an FD build-up and
eventually will crash the server.
This adds detection of hung disks, where calls on disk get stuck.
We add functionality to `xlStorageDiskIDCheck` where it keeps
track of the number of concurrent requests on a given disk.
A total number of 100 operations are allowed. If this limit is reached
we will block (but not reject) new requests, but we will monitor the
state of the disk.
If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Requests that are blocked will be
unblocked and return an error as "faulty disk".
New requests will be rejected until the disk is marked OK again.
Once a disk has been marked faulty, a check will run every 5 seconds that
will attempt to write and read back a file. As long as this fails the disk will
remain faulty.
To prevent lots of long-running requests to mark the disk faulty we
implement a callback feature that allows updating the status as parts
of these operations are running.
We add a reader and writer wrapper that will update the status of each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not reach 15 seconds where
50 operations have not progressed.
Note that errors themselves are not enough to mark a disk faulty.
A nil (or io.EOF) error will mark a disk as "good".
* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker
The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was
- disconnected from network we need to validate
if the drive is "correct" and is the same drive
which belongs to this server.
- drive was replaced we have to format it - we
support hot swapping of the drives.
IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.
* return errFaultyDisk for DiskInfo() call
Co-authored-by: Harshavardhana <harsha@minio.io>
Possible future Improvements:
* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
Data usage does not always contain tiering info even if the data usage
information is valid. Avoid a crash in that case.
(e.g. the scanner scanned the namespace, the user enables tiering,
prometheus scrapes the server before the scanner gets a chance to
update the data usage with new tiering information)
Healing decisions would align with skipped folder counters. This can lead to files
never being selected for heal checks on "clean" paths.
Use different hashing methods and take objectHealProbDiv into account when
calculating the cycle.
Found by @vadmeste
This is a side-affect of the optimization done in PR #13544 which
causes a certain type of delete operations on given object versions
can cause lastVersion indication to be skipped, which leads to
an `xl.meta` where Versions[] slice is empty while the entire
file is intact by itself.
This PR tries to ensure that such files are visible and deletable
by regular means of listing as null 'delete-marker' and also
avoid the situation where this potential issue might arise.
When scanning using normal mode, HealObject() can report an
error saying that it found a corrupted part. This doesn't have
when HealObject() is called with bitrot scan flag. However, when
this happens, we can still restart HealObject() with the bitrot scan.
This is also important because this means the scanner and the
new disks healer will not be able to heal an object that doesn't
exist in a specific disk and has corruption in another disk.
Also without this PR, mc admin heal command without bitrot will report
an error.
This commit removes some duplicate code that
converts KES API errors.
This code was added since KES `0.18.0` changed
some exported API errors. However, the KES SDK
handles this error conversion itself.
Therefore, it is not necessary to duplicate this
behavior in MinIO.
See: 21555fa624/error.go (L94)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
In a distributed setup, a DiskInfo REST call to an unformatted disk
returns an error with no disk information, such as the disk endpoint
URL, which is unexpected.
metadata headers can have headers without values
as per AWS S3 spec however, we need to skip some
headers that do not have values that potentially
can have empty values set.
small setups do not return appropriate errors when speedtest
cannot run on small tiny setups, allow the tests to fail
appropriately more pro-actively.
many users bring toy setups, this PR simply returns an error
in such situations.
healing disks take active I/O it is possible
that deleted objects might stay in .trash
folder for a really long time until the drive
is fully healed.
this PR changes it such that we are making sure
we purge the active content written to these
disks as well.
- speedtest logs calls that were canceled
spuriously, in situations where it should
be ignored.
- all errors of interest are always sent back
to the client there is no need to log them
on the server console.
- PUT failures should negate the increments
such that GET is not attempted on unsuccessful
calls.
- do not attempt MRF on speedtest objects.
In the testing mode, reformatting disks will fail because the healing
code will complain if one disk is in root mode. This commit will
automatically set all disks as non-root if MINIO_CI_CD is set.
Currently, when applying any dynamic config, the system reloads and
re-applies the config of all the dynamic sub-systems.
This PR refactors the code in such a way that changing config of a given
dynamic sub-system will work on only that sub-system.
An onlineDisk means its a valid disk but it may be a
re-connected disk, this PR verifies that based on LastConn()
to only trigger MRF. Current code would again re-load the
disk 'format.json' which is not necessary and perhaps an
unnecessary call.
A potential side affect of this is closing perfectly online
disks and getting re-replaced by reloading 'format.json'.
This PR tries to avoid this situation by making sure MRF
is triggered but not reloading 'format.json' because of MRF.
Only the first `listAndHeal` would ever be able to write on errCh, blocking all others infinitely.
Instead read all errors but return the first non-nil, if any.
The intention appears to be that this should cancel on any error,
so that part is kept.
Regression from #13990
When more than one gateway reads and writes from the same mount point
and there is a load balancer pointing to those gateways. Each gateway
will try to create its own temporary append file but fails to clear it later
when not needed.
This commit creates a routine that checks all upload IDs saved in
multipart directory and remove any stale entry with the same upload id
in the memory and in the temporary background append folder as well.
Enabled with `mc admin config set alias/ api gzip_objects=on`
Standard filtering applies (1K response minimum, not compressed content
type, not range request, gzip accepted by client).
The current code considers a pool with all root disks to be as part
of a testing environment even if there are other pools with mounted
disks. This will result to illegitimate writing in root disks.
Fix this by simplifing the logic: require MINIO_CI_CD in order to skip
root disk check.
MinIO configuration is loaded after the initialization of the server
handlers, which will miss the initialization of the bucket forwarder
handler.
Though the federation is deprecated, let's fix this for the time being.
S3 spec returns x-amz-restore header in HEAD/GET object with the
following format:
```
x-amz-restore: ongoing-request="false", expiry-date="Fri, 21 Dec 2012
00:00:00 GMT"
```
This commit adds quotes as the current code does not support it. It will
also supports the old format saved in the disk (in xl.meta) for backward
compatibility.
A regression removed support of federation in the gateway mode.
Enable it again.
Federation is deprecated for a while but let's fix this for the time being.
Deleting bulk objects had an issue since the relevant versionID
is not passed through the layers to ensure that the dangling
object purge actually works cleanly.
This is a continuation of quorum related error returned by
multi-object delete API from #14248
This PR ensures that we pass down correct information as
well as extend the scope of dangling object detection.
When setting a config of a particular sub-system, validate the existing
config and notification targets of only that sub-system, so that
existing errors related to one sub-system (e.g. notification target
offline) do not result in errors for other sub-systems.
Some users running MinIO claim that their system became slow. One
way to investigate is to look at this Prometheus history of the number of
the requests reaching the server. The existing current S3 requests metric
is not enough because it can increase of the system really becomes slow,
due to disk issues for example.
startup speed-up, currently getFormatErasureInQuorum()
would spend up to 2-3secs when there are 3000+ drives
for example in a setup, simplify this implementation
to use drive counts.
DeleteMarkers do not have a default quorum, i.e it is possible that
DeleteMarkers were created with n/2+1 quorum as well to make sure
that we satisfy situations such as those we need to make sure delete
markers only expect n/2 read quorum.
Additionally we should also look at additional metadata on the
actual objects that might have been "erasure" upgraded with new
parity when disks are down.
In such a scenario do not default to the standard storage class
parity, instead use the parityBlocks present on the FileInfo to
ensure that we are dealing with the correct quorum for READs and
DELETEs.
Retry listings, when no next marker is returned and the result isn't truncated.
This can happen when an object is queued, but no info can be fetched.
Fixes#14190
The healing code repeatedly tries to heal a root disk when it is empty
the reason is that connectEndpoint() returns errUnformattedDisk even
if the disk is a root disk. Changing that to returning another error
will avoid queueing the disk to the healing code in each connect disks
iteration.
some upgraded objects might not get listed due
to different quorum ratios across objects.
make sure to list all objects that satisfy the
maximum possible quorum.
This change allows the MinIO server to lookup users in different directory
sub-trees by allowing specification of multiple search bases separated by
semicolons.
This PR removes an unnecessary state that gets
passed around for DiskIDs, which is not necessary
since each disk exactly knows which pool and which
set it belongs to on a running system.
Currently cached DiskId's won't work properly
because it always ends up skipping offline disks
and never runs healing when disks are offline, as
it expects all the cached diskIDs to be present
always. This also sort of made things in-flexible
in terms perhaps a new diskID for `format.json`.
(however this is not a big issue)
This is an unnecessary requirement that healing
via scanner needs all drives to be online, instead
healing should trigger even when partial nodes
and drives are available this ensures that we
keep the SLA in-tact on the objects when disks
are offline for a prolonged period of time.
Wrong resource is being fetched, since idx is incremented, but mapID is reused.
Regression caused by #13454 - that part didn't optimize anything anyway.
Publish storage functions latency to help compare the performance
of different disks in a single deployment.
e.g.:
```
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/1",server="localhost:9001"} 226
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/2",server="localhost:9002"} 1180
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/3",server="localhost:9003"} 1183
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/4",server="localhost:9004"} 1625
```
- create internal erasure volumes only if the disk is unformatted
- return a copy of format data in xlStorage.ReadAll
- parse env vars only once, to be re-used by xl-storage
This speed-up is intended for faster startup times
for almost all MinIO operations. Changes here are
- Drives are not re-read for 'format.json' on a regular
basis once read during init is remembered and refreshed
at 5 second intervals.
- Do not do O_DIRECT tests on drives with existing 'format.json'
only fresh setups need this check.
- Parallelize initializing erasureSets for multiple sets.
- Avoid re-reading format.json when migrating 'format.json'
from really old V1->V2->V3
- Keep a copy of local drives for any given server in memory
for a quick lookup.
this helps in caching the resolved values early on, avoids
causing further resolution for individual nodes when
object layer comes online.
this can speed up our startup time during, upgrades etc by
an order of magnitude.
additional changes in connectLoadInitFormats() and parallelize
all calls that might be potentially blocking.
- Site replication was missing replicating users,
groups when an empty site was added.
- Add site replication for groups and users when they
are disabled and enabled.
- Add support for replicating bucket quota config.