This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit fixes a subtle bug in the ETag
`IsEncrypted` implementation.
An encrypted ETag may contain random bytes,
i.e. some randomness used for encryption.
This random value can contain a '-' byte
simple due to being randomly generated.
Before, the `IsEncrypted` implementation
incorrectly assumed that an encrypted ETag
cannot contain a '-' since it would be a
multipart ETag. Multipart ETags have a
16 byte value followed by a '-' and the part number.
For example:
```
059ba80b807c3c776fb3bcf3f33e11ae-2
```
However, the following encrypted ETag
```
20000f00db2d90a7b40782d4cff2b41a7799fc1e7ead25972db65150118dfbe2ba76a3c002da28f85c840cd2001a28a9
```
also contains a '-' byte but is not a multipart ETag.
This commit fixes the `IsEncrypted` implementation
simply by checking whether the ETag is at least 32
bytes long. A valid multipart ETag is never 32 bytes
long since a part number must be <= 10000.
However, an encrypted ETag must be at least 32 bytes
long. It contains the encrypted ETag bytes (16 bytes)
and the authentication tag added by the AEAD cipher (again
16 bytes).
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
In riscv64, the `syscall.Uname` function will return a uint8 slice.
func main() {
var buf syscall.Utsname
fmt.Printf("Buffer Type: %T\n", buf.Release)
}
output:
Buffer Type: [65]uint8
This is tested in the Arch Linux RISC-V 64 QEMU environment.
Signed-off-by: Avimitin <avimitin@gmail.com>
Fix `panic: "POST /minio/peer/v21/signalservice?signal=2": sync: WaitGroup is reused before previous Wait has returned`
Log entries already on the channel would cause `logEntry` to increment the
waitgroup when sending messages, after Cancel has been called.
Instead of tracking every single message, just check the send goroutine. Faster
and safe, since it will not decrement until the channel is closed.
Regression from #14289
avoids creating new transport for each `isServerResolvable`
request, instead re-use the available global transport and do
not try to forcibly close connections to avoid TIME_WAIT
build upon large clusters.
Never use httpClient.CloseIdleConnections() since that can have
a drastic effect on existing connections on the transport pool.
Remove it everywhere.
PR introduced in #13819 was incorrect and was not
handling the situation where a buffer is full can
cause incessant amount of logs that would keep the
logger webhook overrun by the requests.
To avoid this only log failures to console logger
instead of all targets as it can cause self reference,
leading to an infinite loop.
This PR simply adds a warning message when it detects older kernel
versions and warn's them about potential performance issues on this
kernel.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.
The main goal of this PR is to solve the situation where disks stop
responding to operations. This generally causes an FD build-up and
eventually will crash the server.
This adds detection of hung disks, where calls on disk get stuck.
We add functionality to `xlStorageDiskIDCheck` where it keeps
track of the number of concurrent requests on a given disk.
A total number of 100 operations are allowed. If this limit is reached
we will block (but not reject) new requests, but we will monitor the
state of the disk.
If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Requests that are blocked will be
unblocked and return an error as "faulty disk".
New requests will be rejected until the disk is marked OK again.
Once a disk has been marked faulty, a check will run every 5 seconds that
will attempt to write and read back a file. As long as this fails the disk will
remain faulty.
To prevent lots of long-running requests to mark the disk faulty we
implement a callback feature that allows updating the status as parts
of these operations are running.
We add a reader and writer wrapper that will update the status of each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not reach 15 seconds where
50 operations have not progressed.
Note that errors themselves are not enough to mark a disk faulty.
A nil (or io.EOF) error will mark a disk as "good".
* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker
The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was
- disconnected from network we need to validate
if the drive is "correct" and is the same drive
which belongs to this server.
- drive was replaced we have to format it - we
support hot swapping of the drives.
IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.
* return errFaultyDisk for DiskInfo() call
Co-authored-by: Harshavardhana <harsha@minio.io>
Possible future Improvements:
* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
This commit removes some duplicate code that
converts KES API errors.
This code was added since KES `0.18.0` changed
some exported API errors. However, the KES SDK
handles this error conversion itself.
Therefore, it is not necessary to duplicate this
behavior in MinIO.
See: 21555fa624/error.go (L94)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
fixes a regression introduced in #14269 that refactored
the notification registration logic, all the amqp targets
however online will not be available for use anymore.
fixes#14451
Up until now `InitializeProvider` method of `Config` struct was
implemented on a value receiver which is why changes on `provider`
field where never reflected to method callers. In order to fix this
issue, the method was implemented on a pointer receiver.
This is a regression from #14037, distributed setups
with MQTT was not working anymore. According to MQTT
spec it is expected this is unique per server.
We shall proceed to use unix nano timestamp hex
value instead here.
Currently, when applying any dynamic config, the system reloads and
re-applies the config of all the dynamic sub-systems.
This PR refactors the code in such a way that changing config of a given
dynamic sub-system will work on only that sub-system.
The `LookupConfig` code was not using `GetWithDefault`, because of which
some of the config values were being returned as empty string, and calls
like `strconv.Atoi` and `time.ParseDuration` on these were failing.
Enabled with `mc admin config set alias/ api gzip_objects=on`
Standard filtering applies (1K response minimum, not compressed content
type, not range request, gzip accepted by client).
When setting a config of a particular sub-system, validate the existing
config and notification targets of only that sub-system, so that
existing errors related to one sub-system (e.g. notification target
offline) do not result in errors for other sub-systems.
This change allows the MinIO server to lookup users in different directory
sub-trees by allowing specification of multiple search bases separated by
semicolons.
This speed-up is intended for faster startup times
for almost all MinIO operations. Changes here are
- Drives are not re-read for 'format.json' on a regular
basis once read during init is remembered and refreshed
at 5 second intervals.
- Do not do O_DIRECT tests on drives with existing 'format.json'
only fresh setups need this check.
- Parallelize initializing erasureSets for multiple sets.
- Avoid re-reading format.json when migrating 'format.json'
from really old V1->V2->V3
- Keep a copy of local drives for any given server in memory
for a quick lookup.
repeated reads on single large objects in HPC like
workloads, need the following option to disable
O_DIRECT for a more effective usage of the kernel
page-cache.
However this optional should be used in very specific
situations only, and shouldn't be enabled on all
servers.
NVMe servers benefit always from keeping O_DIRECT on.
The code was not properly deciding if a lock needs to be removed
when it doesn't have quorum anymore. After this commit, a lock will be
forcefully unlocked if nodes reporting they are not able to find a lock
internally breaks the quorum.
Simplify the code as well.
```
λ mc admin decommission start alias/ http://minio{1...2}/data{1...4}
```
```
λ mc admin decommission status alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Active │
│ 2nd │ http://minio{3...4}/data{1...4} │ 329 GiB (used) / 421 GiB (total) │ Active │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────┘
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
Progress: ===================> [1GiB/sec] [15%] [4TiB/50TiB]
Time Remaining: 4 hours (started 3 hours ago)
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
ERROR: This pool is not scheduled for decommissioning currently.
```
```
λ mc admin decommission cancel alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬──────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining │
└─────┴─────────────────────────────────┴──────────────────────────────────┴──────────┘
```
> NOTE: Canceled decommission will not make the pool active again, since we might have
> Potentially partial duplicate content on the other pools, to avoid this scenario be
> very sure to start decommissioning as a planned activity.
```
λ mc admin decommission cancel alias/ http://minio{1...2}/data{1...4}
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────────────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining(Canceled) │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────────────────┘
```
- This allows site-replication to be configured when using OpenID or the
internal IDentity Provider.
- Internal IDP IAM users and groups will now be replicated to all members of the
set of replicated sites.
- When using OpenID as the external identity provider, STS and service accounts
are replicated.
- Currently this change dis-allows root service accounts from being
replicated (TODO: discuss security implications).
time.Format() is not necessary prematurely for JSON
marshalling, since JSON marshalling indeed defaults
to RFC3339Nano.
This also ensures the 'time' is remembered until its
logged and it is the same time when the 'caller'
invoked 'log' functions.
Also log all the missed events and logs instead of silently
swallowing the events.
Bonus: Extend the logger webhook to support mTLS
similar to audit webhook target.
Return as soon as an AND fails and whenever an OR succeeds. Faster and more flexible.
For example makes `select * from S3object where _2 != '' AND _2 > 1` able to operate on empty fields.
Followup to #13900
- Rename MaxNoncurrentVersions tag to NewerNoncurrentVersions
Note: We apply overlapping NewerNoncurrentVersions rules such that
we honor the highest among applicable limits. e.g if 2 overlapping rules
are configured with 2 and 3 noncurrent versions to be retained, we
will retain 3.
- Expire newer noncurrent versions after noncurrent days
- MinIO extension: allow noncurrent days to be zero, allowing expiry
of noncurrent version as soon as more than configured
NewerNoncurrentVersions are present.
- Allow NewerNoncurrentVersions rules on object-locked buckets
- No x-amz-expiration when NewerNoncurrentVersions configured
- ComputeAction should skip rules with NewerNoncurrentVersions > 0
- Add unit tests for lifecycle.ComputeAction
- Support lifecycle rules with MaxNoncurrentVersions
- Extend ExpectedExpiryTime to work with zero days
- Fix all-time comparisons to be relative to UTC
The earlier approach of using a license token for
communicating with SUBNET is being replaced
with a simpler mechanism of API keys. Unlike the
license which is a JWT token, these API keys will
be simple UUID tokens and don't have any embedded
information in them. SUBNET would generate the
API key on cluster registration, and then it would
be saved in this config, to be used for subsequent
communication with SUBNET.
- Allows setting a role policy parameter when configuring OIDC provider
- When role policy is set, the server prints a role ARN usable in STS API requests
- The given role policy is applied to STS API requests when the roleARN parameter is provided.
- Service accounts for role policy are also possible and work as expected.
- New sub-system has "region" and "name" fields.
- `region` subsystem is marked as deprecated, however still works, unless the
new region parameter under `site` is set - in this case, the region subsystem is
ignored. `region` subsystem is hidden from top-level help (i.e. from `mc admin
config set myminio`), but appears when specifically requested (i.e. with `mc
admin config set myminio region`).
- MINIO_REGION, MINIO_REGION_NAME are supported as legacy environment variables for server region.
- Adds MINIO_SITE_REGION as the current environment variable to configure the
server region and MINIO_SITE_NAME for the site name.
This unit allows users to limit the maximum number of noncurrent
versions of an object.
To enable this rule you need the following *ilm.json*
```
cat >> ilm.json <<EOF
{
"Rules": [
{
"ID": "test-max-noncurrent",
"Status": "Enabled",
"Filter": {
"Prefix": "user-uploads/"
},
"NoncurrentVersionExpiration": {
"MaxNoncurrentVersions": 5
}
}
]
}
EOF
mc ilm import myminio/mybucket < ilm.json
```
- Go might reset the internal http.ResponseWriter() to `nil`
after Write() failure if the go-routine has returned, do not
flush() such scenarios and avoid spurious flushes() as
returning handlers always flush.
- fix some racy tests with the console
- avoid ticker leaks in certain situations
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() use when needed
- remove all non-Go style comments from the codebase
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
If a given MinIO config is dynamic (can be changed without restart),
ensure that it can be reset also without restart.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Borrowed idea from Go's usage of this
optimization for ReadFrom() on client
side, we should re-use the 32k buffers
io.Copy() allocates for generic copy
from a reader to writer.
the performance increase for reads for
really tiny objects is at this range
after this change.
> * Fastest: +7.89% (+1.3 MiB/s) throughput, +7.89% (+1308.1) obj/s
Preemptively disable AVX512 until https://github.com/golang/go/issues/49233 has been resolved.
This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages.
Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
read/writers are not concurrent in handlers
and self contained - no need to use atomics on
them.
avoids unnecessary contentions where it's not
required.
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.
Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
various situations where the client is retrying the request
server going through shutdown might incorrectly send 403
which is a non-retriable error, this PR allows for clients
when they retry an attempt to go to another healthy pod
or server in a distributed cluster - assuming it is a properly
load-balanced setup.
also simplify readerLocks to be just like
writeLocks, DRWMutex() is never shared
and there are order guarantees that need
for such a thing to work for RLock's
3DES is enabled by default in Golang, this commit will use
tls.CipherSuites() which returns all ciphers excluding those with
security issues, such as 3DES.
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
with some broken clients allow non-strict validation
of sha256 when ContentLength > 0, it has been found in
the wild some applications that need this behavior. This
shall be only allowed if `--no-compat` is used.
LDAP TLS dialer shouldn't be strict with ServerName, there
maybe many certs talking to common DNS endpoint it is
better to allow Dialer to choose appropriate public cert.
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, ie. when r.Body
is slow , instead use a proper
binary exchange format, to marshal and unmarshal
the LockArgs datastructure in a cleaner way.
this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on same object.
```
BenchmarkLockArgs
BenchmarkLockArgs-4 6417609 185.7 ns/op 56 B/op 2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4 1187368 1015 ns/op 4096 B/op 1 allocs/op
```
This PR brings two optimizations mainly
for page-cache build-up and how to avoid
getting OOM killed in the process. Although
these memories are reclaimable Linux is not
fast enough to reclaim them as needed on a
very busy system. fadvise is a system call
implemented in Linux to advise page-cache to
avoid overload as we get significant amount
of requests on the server.
- FADV_SEQUENTIAL tells that all I/O from now
is going to be sequential, allowing for more
resposive throughput.
- FADV_NOREUSE tells kernel to start removing
things for this 'fd' from page-cache.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
- Supports object locked buckets that require
PutObject() to set content-md5 always.
- Use SSE-S3 when S3 gateway is being used instead
of SSE-KMS for auto-encryption.
json.Unmarshal expects a pointer receiver, otherwise
kms.Context unmarshal fails with lack of pointer receiver,
this becomes complicated due to type aliasing over
map[string]string - fix it properly.
Some identity providers like GitLab do not provide
information about group membership as part of the
identity token claims. They only expose it via OIDC compatible
'/oauth/userinfo' endpoint, as described in the OpenID
Connect 1.0 sepcification.
But this of course requires application to make sure to add
additional accessToken, since idToken cannot be re-used to
perform the same 'userinfo' call. This is why this is specialized
requirement. Gitlab seems to be the only OpenID vendor that requires
this support for the time being.
fixes#12367
This commit adds a new STS API for X.509 certificate
authentication.
A client can make an HTTP POST request over a TLS connection
and MinIO will verify the provided client certificate, map it to an
S3 policy and return temp. S3 credentials to the client.
So, this STS API allows clients to authenticate with X.509
certificates over TLS and obtain temp. S3 credentials.
For more details and examples refer to the docs/sts/tls.md
documentation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds the TLS 1.3 ciphers to the list of
supported ciphers. Now, clients can connect to MinIO
using TLS 1.3
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Use a single allocation for reading the file, not the growing buffer of `io.ReadAll`.
Reuse the write buffer if we can when writing metadata in RenameData.
A multi resources lock is a single lock UID with multiple associated
resources. This is created for example by multi objects delete
operation. This commit changes the behavior of Refresh() to iterate over
all locks having the same UID and refresh them.
Bonus: Fix showing top locks for multi delete objects
In the event when a lock is not refreshed in the cluster, this latter
will be automatically removed in the subsequent cleanup of non
refreshed locks routine, but it forgot to clean the local server,
hence having the same weird stale locks present.
This commit will remove the lock locally also in remote nodes, if
removing a lock from a remote node will fail, it will be anyway
removed later in the locks cleanup routine.
- remove sourceCh usage from healing
we already have tasks and resp channel
- use read locks to lookup globalHealConfig
- fix healing resolver to pick candidates quickly
that need healing, without this resolver was
unexpectedly skipping.
healObject() should be non-blocking to ensure
that scanner is not blocked for a long time,
this adversely affects performance of the scanner
and also affects the way usage is updated
subsequently.
This PR allows for a non-blocking behavior for
healing, dropping operations that cannot be queued
anymore.
When reading `TrafficMeter` values, there was a value receiver.
This means that receivers are copied unsafely when invoked.
Fixes race seen with `-race` build.