if the certs are the same in an environment where the
cert files are symlinks (e.g Kubernetes), then we resort
to reloading certs every 15mins - we can avoid reload
of the kes client instance. Ensure that the price to pay
for contending with the lock must happen when necessary.
Ensure delete marker replication success, especially since the
recent optimizations to heal on HEAD, LIST and GET can force
replication attempts on delete marker before underlying object
version could have synced.
minio_inter_node_traffic_errors_total currently does not track
requests body write/read errors of internode REST communications.
This commit fixes this by wrapping resp.Body.
to avoid relying on scanner-calculated replication metrics.
This will improve the accuracy of the replication stats reported.
This PR also adds on to #15556 by handing replication
traffic that could not be queued by available workers to the
MRF queue so that entries in `PENDING` status are healed faster.
A lot of warning messages are printed in CI/CD failures generated by go
test. Avoid that by requiring at least Error level for logging when
doing go test.
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.
This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.
Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.
This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.
For longer locks like scanner keep the 1sec interval
This commit adds support for automatically reloading
the MinIO client certificate for authentication to KES.
The client certificate will now be reloaded:
- when the private key / certificate file changes
- when a SIGHUP signal is received
- every 15 minutes
Fixes#14869
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
mc admin config reset <alias> notify_webhook:something was not working
properly.
The reason is that GetSubSys() was not calculating the target
name properly because it is quitting early when the number of config
inputs ('notify_webhook:something' in this case) is equal to 1.
This commit will make the code calculates always calculate the target
name if found.
fixes#15334
- re-use net/url parsed value for http.Request{}
- remove gosimple, structcheck and unusued due to https://github.com/golangci/golangci-lint/issues/2649
- unwrapErrs upto leafErr to ensure that we store exactly the correct errors
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This commit adds a `context.Context` to the
the KMS `{Stat, CreateKey, GenerateKey}` API
calls.
The context will be used to terminate external calls
as soon as the client requests gets canceled.
A follow-up PR will add a `context.Context` to
the remaining `DecryptKey` API call.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Since this is a MinIO specific extension in the replication config,
default this to Disabled to allow other sdks to be used to configure
replication rules.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Add up to 256 bytes of padding for compressed+encrypted files.
This will obscure the obvious cases of extremely compressible content
and leave a similar output size for a very wide variety of inputs.
This does *not* mean the compression ratio doesn't leak information
about the content, but the outcome space is much smaller,
so often *less* information is leaked.
Rename Trigger -> Event to be a more appropriate
name for the audit event.
Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
This commit adds a minimal set of KMS-related metrics:
```
# HELP minio_cluster_kms_online Reports whether the KMS is online (1) or offline (0)
# TYPE minio_cluster_kms_online gauge
minio_cluster_kms_online{server="127.0.0.1:9000"} 1
# HELP minio_cluster_kms_request_error Number of KMS requests that failed with a well-defined error
# TYPE minio_cluster_kms_request_error counter
minio_cluster_kms_request_error{server="127.0.0.1:9000"} 16790
# HELP minio_cluster_kms_request_success Number of KMS requests that succeeded
# TYPE minio_cluster_kms_request_success counter
minio_cluster_kms_request_success{server="127.0.0.1:9000"} 348031
```
Currently, we report whether the KMS is available and how many requests
succeeded/failed. However, KES exposes much more metrics that can be
exposed if necessary. See: https://pkg.go.dev/github.com/minio/kes#Metric
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
it is not safe to pass around sync.Map
through pointers, as it may be concurrently
updated by different callers.
this PR simplifies by avoiding sync.Map
altogether, we do not need sync.Map
to keep object->erasureMap association.
This PR fixes a crash when concurrently
using this value when audit logs are
configured.
```
fatal error: concurrent map iteration and map write
goroutine 247651580 [running]:
runtime.throw({0x277a6c1?, 0xc002381400?})
runtime/panic.go:992 +0x71 fp=0xc004d29b20 sp=0xc004d29af0 pc=0x438671
runtime.mapiternext(0xc0d6e87f18?)
runtime/map.go:871 +0x4eb fp=0xc004d29b90 sp=0xc004d29b20 pc=0x41002b
```
This commit fixes the order of elliptic curves.
As documented by https://pkg.go.dev/crypto/tls#Config
```
// CurvePreferences contains the elliptic curves that will be used in
// an ECDHE handshake, in preference order. If empty, the default will
// be used. The client will use the first preference as the type for
// its key share in TLS 1.3. This may change in the future.
```
In general, we should prefer `X25519` over the NIST curves.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
`config.ResolveConfigParam` returns the value of a configuration for any
subsystem based on checking env, config store, and default value. Also returns info
about which config source returned the value.
This is useful to return info about config params overridden via env in the user
APIs. Currently implemented only for OpenID subsystem, but will be extended for
others subsequently.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
.Reset() documentation states:
For a Timer created with NewTimer, Reset should be invoked only on stopped
or expired timers with drained channels.
This change is just to comply with this requirement as there might be some
runtime dependent situation that might lead to unexpected behavior.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively
When configuring a new target, such as an audit target, the server waits
until all audit events are sent to the audit target before doing the
swap from the old to the new audit target. Therefore current S3 operations
can suffer from this since the audit swap lock will be held.
This behavior is unnecessary as the new audit target can enter in a
functional mode immediately and the old audit will just cancel itself
at its own pace.
Object tags can have special characters such as whitespace. However
the current code doesn't properly consider those characters while
evaluating the lifecycle document.
ObjectInfo.UserTags contains an url encoded form of object tags
(e.g. key+1=val)
This commit fixes the issue by using the tags package to parse object tags.
This PR simplifies few things by splitting
the locks between audit, logger targets to
avoid potential contention between them.
any failures inside audit/logger HTTP
targets must only log to console instead
of other targets to avoid cyclical dependency.
avoids unneeded atomic variables instead
uses RWLock to differentiate a more common
read phase v/s lock phase.
- This change renames the OPA integration as Access Management Plugin - there is
nothing specific to OPA in the integration, it is just a webhook.
- OPA configuration is automatically migrated to Access Management Plugin and
OPA specific configuration is marked as deprecated.
- OPA doc is updated and moved.
- do not need to restrict prefix exclusions that do not
have `/` as suffix, relax this requirement as spark may
have staging folders with other autogenerated characters
, so we are better off doing full prefix March and skip.
- multiple delete objects was incorrectly creating a
null delete marker on a versioned bucket instead of
creating a proper versioned delete marker.
- do not suspend paths on the excluded prefixes during
delete operations to avoid creating `null` delete markers,
honor suspension of versioning only at bucket level for
delete markers.
Spark/Hadoop workloads which use Hadoop MR
Committer v1/v2 algorithm upload objects to a
temporary prefix in a bucket. These objects are
'renamed' to a different prefix on Job commit.
Object storage admins are forced to configure
separate ILM policies to expire these objects
and their versions to reclaim space.
Our solution:
This can be avoided by simply marking objects
under these prefixes to be excluded from versioning,
as shown below. Consequently, these objects are
excluded from replication, and don't require ILM
policies to prune unnecessary versions.
- MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Status>Enabled</Status>
<ExcludeFolders>true</ExcludeFolders>
<ExcludedPrefixes>
<Prefix>app1-jobs/*/_temporary/</Prefix>
</ExcludedPrefixes>
<ExcludedPrefixes>
<Prefix>app2-jobs/*/__magic/</Prefix>
</ExcludedPrefixes>
<!-- .. up to 10 prefixes in all -->
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket
from versioning. This is required to prevent the parent
folders from accumulating delete markers, especially
those which are shared across spark workloads
spanning projects/teams.
- To enable version exclusion on a list of prefixes
```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
- When using multiple providers, claim-based providers are not allowed. All
providers must use role policies.
- Update markdown config to allow `details` HTML element
introduce x-minio-force-create environment variable
to force create a bucket and its metadata as required,
it is useful in some situations when bucket metadata
needs recovery.
- This change switches to a new parquet library
- SelectObjectContent now takes a single lock at the beginning and holds it
during the operation. Previously the operation took a lock every time the
parquet library performed a Seek on the underlying object stream.
- Add basic support for LogicalType annotations for timestamps.
This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit fixes a subtle bug in the ETag
`IsEncrypted` implementation.
An encrypted ETag may contain random bytes,
i.e. some randomness used for encryption.
This random value can contain a '-' byte
simple due to being randomly generated.
Before, the `IsEncrypted` implementation
incorrectly assumed that an encrypted ETag
cannot contain a '-' since it would be a
multipart ETag. Multipart ETags have a
16 byte value followed by a '-' and the part number.
For example:
```
059ba80b807c3c776fb3bcf3f33e11ae-2
```
However, the following encrypted ETag
```
20000f00db2d90a7b40782d4cff2b41a7799fc1e7ead25972db65150118dfbe2ba76a3c002da28f85c840cd2001a28a9
```
also contains a '-' byte but is not a multipart ETag.
This commit fixes the `IsEncrypted` implementation
simply by checking whether the ETag is at least 32
bytes long. A valid multipart ETag is never 32 bytes
long since a part number must be <= 10000.
However, an encrypted ETag must be at least 32 bytes
long. It contains the encrypted ETag bytes (16 bytes)
and the authentication tag added by the AEAD cipher (again
16 bytes).
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
In riscv64, the `syscall.Uname` function will return a uint8 slice.
func main() {
var buf syscall.Utsname
fmt.Printf("Buffer Type: %T\n", buf.Release)
}
output:
Buffer Type: [65]uint8
This is tested in the Arch Linux RISC-V 64 QEMU environment.
Signed-off-by: Avimitin <avimitin@gmail.com>
Fix `panic: "POST /minio/peer/v21/signalservice?signal=2": sync: WaitGroup is reused before previous Wait has returned`
Log entries already on the channel would cause `logEntry` to increment the
waitgroup when sending messages, after Cancel has been called.
Instead of tracking every single message, just check the send goroutine. Faster
and safe, since it will not decrement until the channel is closed.
Regression from #14289
avoids creating new transport for each `isServerResolvable`
request, instead re-use the available global transport and do
not try to forcibly close connections to avoid TIME_WAIT
build upon large clusters.
Never use httpClient.CloseIdleConnections() since that can have
a drastic effect on existing connections on the transport pool.
Remove it everywhere.
PR introduced in #13819 was incorrect and was not
handling the situation where a buffer is full can
cause incessant amount of logs that would keep the
logger webhook overrun by the requests.
To avoid this only log failures to console logger
instead of all targets as it can cause self reference,
leading to an infinite loop.
This PR simply adds a warning message when it detects older kernel
versions and warn's them about potential performance issues on this
kernel.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.
The main goal of this PR is to solve the situation where disks stop
responding to operations. This generally causes an FD build-up and
eventually will crash the server.
This adds detection of hung disks, where calls on disk get stuck.
We add functionality to `xlStorageDiskIDCheck` where it keeps
track of the number of concurrent requests on a given disk.
A total number of 100 operations are allowed. If this limit is reached
we will block (but not reject) new requests, but we will monitor the
state of the disk.
If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Requests that are blocked will be
unblocked and return an error as "faulty disk".
New requests will be rejected until the disk is marked OK again.
Once a disk has been marked faulty, a check will run every 5 seconds that
will attempt to write and read back a file. As long as this fails the disk will
remain faulty.
To prevent lots of long-running requests to mark the disk faulty we
implement a callback feature that allows updating the status as parts
of these operations are running.
We add a reader and writer wrapper that will update the status of each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not reach 15 seconds where
50 operations have not progressed.
Note that errors themselves are not enough to mark a disk faulty.
A nil (or io.EOF) error will mark a disk as "good".
* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker
The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was
- disconnected from network we need to validate
if the drive is "correct" and is the same drive
which belongs to this server.
- drive was replaced we have to format it - we
support hot swapping of the drives.
IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.
* return errFaultyDisk for DiskInfo() call
Co-authored-by: Harshavardhana <harsha@minio.io>
Possible future Improvements:
* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
This commit removes some duplicate code that
converts KES API errors.
This code was added since KES `0.18.0` changed
some exported API errors. However, the KES SDK
handles this error conversion itself.
Therefore, it is not necessary to duplicate this
behavior in MinIO.
See: 21555fa624/error.go (L94)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
fixes a regression introduced in #14269 that refactored
the notification registration logic, all the amqp targets
however online will not be available for use anymore.
fixes#14451