This commit will fix one rare case of a multipart object that
can be read in theory but GetObject API returned an error.
It turned out that a six years old code was marking a drive offline
when the bitrot streaming fails to read a part in a disk with any error.
This can affect reading a subsequent part, though having enough shards,
but unable to construct because one drive was marked offline earlier.
This commit will remove the drive marking offline code. It will also
close the bitrotstreaming reader before marking it as nil.
Currently, on enabling callhome (or restarting the server), the callhome
job gets scheduled. This means that one has to wait for 24hrs (the
default frequency duration) to see it in action and to figure out if it
is working as expected.
It will be a better user experience to perform the first callhome
execution immediately after enabling it (or on server start if already
enabled).
Also, generate audit event on callhome execution, setting the error
field in case the execution has failed.
* Store ModTime in the upload ID; return it when listing instead of the current time.
* Use this ModTime to expire and skip reading the file info.
* Consistent upload sorting in listing (since it now has the ModTime).
* Exclude healing disks to avoid returning an empty list.
```
==================
WARNING: DATA RACE
Read at 0x0000082be990 by goroutine 205:
github.com/minio/minio/cmd.setCommonHeaders()
Previous write at 0x0000082be990 by main goroutine:
github.com/minio/minio/cmd.lookupConfigs()
```
Recent Veeam is very picky about storage class names. Add `_MINIO_VEEAM_FORCE_SC` env var.
It will override the storage class returned by the storage backend if it is non-standard
and we detect a Veeam client by checking the User Agent.
Applies to HeadObject/GetObject/ListObject*
add deadlines that can be dynamically changed via
the drive max timeout values.
Bonus: optimize "file not found" case and hung drives/network - circuit break the check and return right
away instead of waiting.
as that is the only API where the TTFB metric is beneficial, and
capturing this for all APIs exponentially increases the response size in
large clusters.
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed size ring buffer.
This will add an output buffer for encoded shards to be written to disk - potentially via RPC.
This will remove blocking when `(*streamingBitrotWriter).Write` is called, and it writes hashes and data.
With current settings, the write looks like this:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ Parr. │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Pipe │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (unbuffered) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have
been delivered to the TCP buffer, and the next Read hits the Pipe.
Then we write the shard data. This will typically be bigger than 64KB, so it will block until two blocks
have been read from the pipe.
When we insert a ring buffer:
```
Outbound
┌───────────────────┐ ┌────────────────┐ ┌───────────────┐ ┌────────────────┐
│ │ │ │ (http body) │ │ │ │
│ Bitrot Hash │ Write │ Ring Buffer │ Read │ HTTP buffer │ Write (syscall) │ TCP Buffer │
│ Erasure Shard │ ──────────► │ (2MB) │ ────────────► │ (64K Max) │ ───────────────────► │ (4MB) │
│ │ │ │ │ (io.Copy) │ │ │
└───────────────────┘ └────────────────┘ └───────────────┘ └────────────────┘
```
The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a
memcopy. Reads can fill the 64KB buffer if there is data for it.
If the network is congested, the ring buffer will become filled, and all syscalls will be on full buffers.
Only when the ring buffer is filled will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing since we are
always writing to memory now, and the goroutine synchronization overhead probably not worth taking.
If the output were blocked in the existing, we would still wait for it to unblock in parallel write, so it would
make no difference there - except now the ring buffer smoothes out the load.
There are some micro-optimizations we could look at later. The biggest is that, in most cases,
we could encode directly to the ring buffer - if we are not at a boundary. Also, "force filling" the
Read requests (i.e., blocking until a full read can be completed) could be investigated and maybe
allow concurrent memory on read and write.
Metrics being added:
- read_tolerance: No of drive failures that can be tolerated without
disrupting read operations
- write_tolerance: No of drive failures that can be tolerated without
disrupting write operations
- read_health: Health of the erasure set in a pool for read operations
(1=healthy, 0=unhealthy)
- write_health: Health of the erasure set in a pool for write operations
(1=healthy, 0=unhealthy)
Adds regression test for #19699
Failures are a bit luck based, since it requires objects to be placed on different sets.
However this generates a failure prior to #19699
* Revert "Revert "Fix incorrect merging of slash-suffixed objects (#19699)""
This reverts commit f30417d9a8.
* Don't override when suffix doesn't match. Instead rely on quorum for each.
Instead of having "online" and "healing" as two metrics, replace with a
single metric "health" which can have following values:
0 = offline
1 = healthy
2 = healing
If two objects share everything but one object has a slash prefix, those would be merged in listings,
with secondary properties used for a tiebreak.
Example: An object with the key `prefix/obj` would be merged with an object named `prefix/obj/`.
While this violates the [no object can be a prefix of another](https://min.io/docs/minio/linux/operations/concepts/thresholds.html#conflicting-objects), let's resolve these.
If we have an object with 'name' and a directory named 'name/' discard the directory only - but allow objects
of 'name' and 'name/' (xldir) to be uniquely returned.
Regression from #15772
canceled callers might linger around longer,
can potentially overwhelm the system. Instead
provider a caller context and canceled callers
don't hold on to them.
Bonus: we have no reason to cache errors, we should
never cache errors otherwise we can potentially have
quorum errors creeping in unexpectedly. We should
let the cache when invalidating hit the actual resources
instead.
Accept multipart uploads where the combined checksum provides the expected part count.
It seems this was added by AWS to make the API more consistent, even if the
data is entirely superfluous on multiple levels.
Improves AWS S3 compatibility.
This commit adds support for MinKMS. Now, there are three KMS
implementations in `internal/kms`: Builtin, MinIO KES and MinIO KMS.
Adding another KMS integration required some cleanup. In particular:
- Various KMS APIs that haven't been and are not used have been
removed. A lot of the code was broken anyway.
- Metrics are now monitored by the `kms.KMS` itself. For basic
metrics this is simpler than collecting metrics for external
servers. In particular, each KES server returns its own metrics
and no cluster-level view.
- The builtin KMS now uses the same en/decryption implemented by
MinKMS and KES. It still supports decryption of the previous
ciphertext format. It's backwards compatible.
- Data encryption keys now include a master key version since MinKMS
supports multiple versions (~4 billion in total and 10000 concurrent)
per key name.
Signed-off-by: Andreas Auernhammer <github@aead.dev>