Keys are helpful to ensure the strict ordering of messages, however currently the
code uses a random request id for every log, hence using the request-id
as a Kafka key is not serve any purpose;
This commit removes the usage of the key, to also fix the audit issue from
internal subsystem that does not have a request ID.
This commit updates the minio/kes-go dependency
to v0.2.0 and updates the existing code to work
with the new KES APIs.
The `SetPolicy` handler got removed since it
may not get implemented by KES at all and could
not have been used in the past since stateless KES
is read-only w.r.t. policies and identities.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Two fields in lifecycles made GOB encoding consistently fail with `gob: type lifecycle.Prefix has no exported fields`.
This meant that in distributed systems listings would never be able to continue and would restart on every call.
Fix issues and be sure to log these errors at least once per bucket. We may see some connectivity errors here, but we shouldn't hide them.
Bonus:
- avoid calling DiskInfo() calls when missing blocks
instead heal the object using MRF operation.
- change the max_sleep to 250ms beyond that we will
not stop healing.
Send() is synchronous and can affect the latency of S3 requests when the
logger buffer is full.
Avoid checking if the HTTP target is online or not and increase the
workers anyway since the buffer is already full.
Also, avoid logs flooding when the audit target is down.
slower drives get knocked off because they are too slow via
active monitoring, we do not need to block calls arbitrarily.
Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs
Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
Now it would list details of all KMS instances with additional
attributes `endpoint` and `version`. In the case of k8s-based
deployment the list would consist of a single entry.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Also shutdown poll add jitter, to verify if the shutdown
sequence can finish before 500ms, this reduces the overall
time taken during "restart" of the service.
Provides speedup for `mc admin service restart` during
active I/O, also ensures that systemd doesn't treat the
returned 'error' as a failure, certain configurations in
systemd can cause it to 'auto-restart' the process by-itself
which can interfere with `mc admin service restart`.
It can be observed how now restarting the service is
much snappier.
DNS refresh() in-case of MinIO can safely re-use
the previous values on bare-metal setups, since
bare-metal arrangements do not change DNS in any
manner commonly.
This PR simplifies that, we only ever need DNS caching
on bare-metal setups.
- On containerized setups do not enable DNS
caching at all, as it may have adverse effects on
the overall effectiveness of k8s DNS systems.
k8s DNS systems are dynamic and expect applications
to avoid managing DNS caching themselves, instead
provide a cleaner container native caching
implementations that must be used.
- update IsDocker() detection, including podman runtime
- move to minio/dnscache fork for a simpler package
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.
```
<LifecycleConfiguration>
<Rule>
<ID>ClassADocRule</ID>
<Filter>
<Prefix>classA/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>3650</Days>
<ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
</Expiration>
</Rule>
...
```
This change adds a `Secret` property to `HelpKV` to identify secrets
like passwords and auth tokens that should not be revealed by the server
in its configuration fetching APIs. Configuration reporting APIs now do
not return secrets.
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.
For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.
This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
- lifecycle must return InvalidArgument for rule errors
- do not return `null` versionId in HTTP header
- reject mixed SSE uploads with correct error message
PUT calls cannot afford to have large latency build-ups due
to contentious usage.json, or worse letting them fail with
some unexpected error, this can happen when this file is
concurrently being updated via scanner or it is being
healed during a disk replacement heal.
However, these are fairly quick in theory, stressed clusters
can quickly show visible latency this can add up leading to
invalid errors returned during PUT.
It is perhaps okay for us to relax this error return requirement
instead, make sure that we log that we are proceeding to take in
the requests while the quota is using an older value for the quota
enforcement. These things will reconcile themselves eventually,
via scanner making sure to overwrite the usage.json.
Bonus: make sure that storage-rest-client sets ExpectTimeouts to
be 'true', such that DiskInfo() call with contextTimeout does
not prematurely disconnect the servers leading to a longer
healthCheck, back-off routine. This can easily pile up while also
causing active callers to disconnect, leading to quorum loss.
DiskInfo is actively used in the PUT, Multipart call path for
upgrading parity when disks are down, it in-turn shouldn't cause
more disks to go down.
if the certs are the same in an environment where the
cert files are symlinks (e.g Kubernetes), then we resort
to reloading certs every 15mins - we can avoid reload
of the kes client instance. Ensure that the price to pay
for contending with the lock must happen when necessary.
Ensure delete marker replication success, especially since the
recent optimizations to heal on HEAD, LIST and GET can force
replication attempts on delete marker before underlying object
version could have synced.
minio_inter_node_traffic_errors_total currently does not track
requests body write/read errors of internode REST communications.
This commit fixes this by wrapping resp.Body.
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.
This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the
MRF queue so that entries in `PENDING` status are healed faster.
A lot of warning messages are printed in CI/CD failures generated by go
test. Avoid that by requiring at least Error level for logging when
doing go test.
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.
This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.
Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.
This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.
For longer locks like scanner keep the 1sec interval
This commit adds support for automatically reloading
the MinIO client certificate for authentication to KES.
The client certificate will now be reloaded:
- when the private key / certificate file changes
- when a SIGHUP signal is received
- every 15 minutes
Fixes#14869
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
mc admin config reset <alias> notify_webhook:something was not working
properly.
The reason is that GetSubSys() was not calculating the target
name properly because it is quitting early when the number of config
inputs ('notify_webhook:something' in this case) is equal to 1.
This commit will make the code calculates always calculate the target
name if found.
fixes#15334
- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This commit adds a `context.Context` to the
the KMS `{Stat, CreateKey, GenerateKey}` API
calls.
The context will be used to terminate external calls
as soon as the client requests gets canceled.
A follow-up PR will add a `context.Context` to
the remaining `DecryptKey` API call.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Since this is a MinIO specific extension in the replication config,
default this to Disabled to allow other sdks to be used to configure
replication rules.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Add up to 256 bytes of padding for compressed+encrypted files.
This will obscure the obvious cases of extremely compressible content
and leave a similar output size for a very wide variety of inputs.
This does *not* mean the compression ratio doesn't leak information
about the content, but the outcome space is much smaller,
so often *less* information is leaked.
Rename Trigger -> Event to be a more appropriate
name for the audit event.
Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
This commit adds a minimal set of KMS-related metrics:
```
# HELP minio_cluster_kms_online Reports whether the KMS is online (1) or offline (0)
# TYPE minio_cluster_kms_online gauge
minio_cluster_kms_online{server="127.0.0.1:9000"} 1
# HELP minio_cluster_kms_request_error Number of KMS requests that failed with a well-defined error
# TYPE minio_cluster_kms_request_error counter
minio_cluster_kms_request_error{server="127.0.0.1:9000"} 16790
# HELP minio_cluster_kms_request_success Number of KMS requests that succeeded
# TYPE minio_cluster_kms_request_success counter
minio_cluster_kms_request_success{server="127.0.0.1:9000"} 348031
```
Currently, we report whether the KMS is available and how many requests
succeeded/failed. However, KES exposes much more metrics that can be
exposed if necessary. See: https://pkg.go.dev/github.com/minio/kes#Metric
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
it is not safe to pass around sync.Map
through pointers, as it may be concurrently
updated by different callers.
this PR simplifies by avoiding sync.Map
altogether, we do not need sync.Map
to keep object->erasureMap association.
This PR fixes a crash when concurrently
using this value when audit logs are
configured.
```
fatal error: concurrent map iteration and map write
goroutine 247651580 [running]:
runtime.throw({0x277a6c1?, 0xc002381400?})
runtime/panic.go:992 +0x71 fp=0xc004d29b20 sp=0xc004d29af0 pc=0x438671
runtime.mapiternext(0xc0d6e87f18?)
runtime/map.go:871 +0x4eb fp=0xc004d29b90 sp=0xc004d29b20 pc=0x41002b
```
This commit fixes the order of elliptic curves.
As documented by https://pkg.go.dev/crypto/tls#Config
```
// CurvePreferences contains the elliptic curves that will be used in
// an ECDHE handshake, in preference order. If empty, the default will
// be used. The client will use the first preference as the type for
// its key share in TLS 1.3. This may change in the future.
```
In general, we should prefer `X25519` over the NIST curves.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
`config.ResolveConfigParam` returns the value of a configuration for any
subsystem based on checking env, config store, and default value. Also returns info
about which config source returned the value.
This is useful to return info about config params overridden via env in the user
APIs. Currently implemented only for OpenID subsystem, but will be extended for
others subsequently.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
.Reset() documentation states:
For a Timer created with NewTimer, Reset should be invoked only on stopped
or expired timers with drained channels.
This change is just to comply with this requirement as there might be some
runtime dependent situation that might lead to unexpected behavior.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively
When configuring a new target, such as an audit target, the server waits
until all audit events are sent to the audit target before doing the
swap from the old to the new audit target. Therefore current S3 operations
can suffer from this since the audit swap lock will be held.
This behavior is unnecessary as the new audit target can enter in a
functional mode immediately and the old audit will just cancel itself
at its own pace.
Object tags can have special characters such as whitespace. However
the current code doesn't properly consider those characters while
evaluating the lifecycle document.
ObjectInfo.UserTags contains an url encoded form of object tags
(e.g. key+1=val)
This commit fixes the issue by using the tags package to parse object tags.
This PR simplifies few things by splitting
the locks between audit, logger targets to
avoid potential contention between them.
any failures inside audit/logger HTTP
targets must only log to console instead
of other targets to avoid cyclical dependency.
avoids unneeded atomic variables instead
uses RWLock to differentiate a more common
read phase v/s lock phase.
- This change renames the OPA integration as Access Management Plugin - there is
nothing specific to OPA in the integration, it is just a webhook.
- OPA configuration is automatically migrated to Access Management Plugin and
OPA specific configuration is marked as deprecated.
- OPA doc is updated and moved.
- do not need to restrict prefix exclusions that do not
have `/` as suffix, relax this requirement as spark may
have staging folders with other autogenerated characters
, so we are better off doing full prefix March and skip.
- multiple delete objects was incorrectly creating a
null delete marker on a versioned bucket instead of
creating a proper versioned delete marker.
- do not suspend paths on the excluded prefixes during
delete operations to avoid creating `null` delete markers,
honor suspension of versioning only at bucket level for
delete markers.
Spark/Hadoop workloads which use Hadoop MR
Committer v1/v2 algorithm upload objects to a
temporary prefix in a bucket. These objects are
'renamed' to a different prefix on Job commit.
Object storage admins are forced to configure
separate ILM policies to expire these objects
and their versions to reclaim space.
Our solution:
This can be avoided by simply marking objects
under these prefixes to be excluded from versioning,
as shown below. Consequently, these objects are
excluded from replication, and don't require ILM
policies to prune unnecessary versions.
- MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Status>Enabled</Status>
<ExcludeFolders>true</ExcludeFolders>
<ExcludedPrefixes>
<Prefix>app1-jobs/*/_temporary/</Prefix>
</ExcludedPrefixes>
<ExcludedPrefixes>
<Prefix>app2-jobs/*/__magic/</Prefix>
</ExcludedPrefixes>
<!-- .. up to 10 prefixes in all -->
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket
from versioning. This is required to prevent the parent
folders from accumulating delete markers, especially
those which are shared across spark workloads
spanning projects/teams.
- To enable version exclusion on a list of prefixes
```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
- When using multiple providers, claim-based providers are not allowed. All
providers must use role policies.
- Update markdown config to allow `details` HTML element
introduce x-minio-force-create environment variable
to force create a bucket and its metadata as required,
it is useful in some situations when bucket metadata
needs recovery.
- This change switches to a new parquet library
- SelectObjectContent now takes a single lock at the beginning and holds it
during the operation. Previously the operation took a lock every time the
parquet library performed a Seek on the underlying object stream.
- Add basic support for LogicalType annotations for timestamps.
This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit fixes a subtle bug in the ETag
`IsEncrypted` implementation.
An encrypted ETag may contain random bytes,
i.e. some randomness used for encryption.
This random value can contain a '-' byte
simple due to being randomly generated.
Before, the `IsEncrypted` implementation
incorrectly assumed that an encrypted ETag
cannot contain a '-' since it would be a
multipart ETag. Multipart ETags have a
16 byte value followed by a '-' and the part number.
For example:
```
059ba80b807c3c776fb3bcf3f33e11ae-2
```
However, the following encrypted ETag
```
20000f00db2d90a7b40782d4cff2b41a7799fc1e7ead25972db65150118dfbe2ba76a3c002da28f85c840cd2001a28a9
```
also contains a '-' byte but is not a multipart ETag.
This commit fixes the `IsEncrypted` implementation
simply by checking whether the ETag is at least 32
bytes long. A valid multipart ETag is never 32 bytes
long since a part number must be <= 10000.
However, an encrypted ETag must be at least 32 bytes
long. It contains the encrypted ETag bytes (16 bytes)
and the authentication tag added by the AEAD cipher (again
16 bytes).
Signed-off-by: Andreas Auernhammer <hi@aead.dev>