Create new code paths for multiple subsystems in the code. This will
make maintaining this easier later.
Also introduce bugLogIf() for errors that should not happen in the first
place.
This commit replaces the `KMS.Stat` API call with a
`KMS.GenerateKey` call. This approach is more reliable
since data key generation also works when the KMS backend
is unavailable (temp. offline), but KES has cached the
key. Ref: KES offline caching.
With this change, it is less likely that MinIO readiness
checks fail in cases where the KMS backend is offline.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
Make sure to pass a nil pointer as a Transport to minio-go when the API config
is not initialized. This ensures that we do not pass an interface
with a known type but a nil value.
This will also fix the update of the API remote_transport_deadline
configuration without requiring the cluster restart.
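A minimal sketch of the typed-nil pitfall described above. The `apiConfig` type and its methods are hypothetical and only illustrate the Go behavior; they are not the actual MinIO code.

```go
package main

import (
	"fmt"
	"net/http"
)

// apiConfig is a hypothetical config holder whose transport may not be
// initialized yet.
type apiConfig struct {
	transport *http.Transport // nil until the config is loaded
}

// naiveRoundTripper shows the pitfall: a nil *http.Transport stored in an
// interface yields an interface value that is NOT equal to nil.
func (c *apiConfig) naiveRoundTripper() http.RoundTripper {
	return c.transport
}

// roundTripper returns an untyped nil when no transport is set, so callers
// checking "rt == nil" can fall back to their own default.
func (c *apiConfig) roundTripper() http.RoundTripper {
	if c.transport == nil {
		return nil
	}
	return c.transport
}

func main() {
	c := &apiConfig{}
	fmt.Println(c.naiveRoundTripper() == nil) // false
	fmt.Println(c.roundTripper() == nil)      // true
}
```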
Use `ODirectPoolSmall` buffers for inline data in PutObject.
Add a separate call for inline data that will fetch a buffer for the inline data before unmarshal.
This fixes a bug where STS Accounts map accumulates accounts in memory
and never removes expired accounts and the STS Policy mappings were not
being refreshed.
The STS purge routine now runs with every IAM credentials load instead
of every 4th time.
The listing of IAM files is now cached on every IAM load operation to
prevent re-listing for STS accounts purging/reload.
Additionally, this change makes each server pick a time for IAM loading
that is randomly distributed over a 10 minute interval. This is to
prevent servers from thundering while performing the IAM load.
On average, IAM loading will happen every 5-15 minutes after the
previous IAM load operation completes.
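The randomized wait can be sketched as follows. The constant and function names are assumptions made for illustration; only the 5-15 minute range comes from the description above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseWait   = 5 * time.Minute  // lower bound of the 5-15 minute range
	jitterSpan = 10 * time.Minute // width of the random interval
)

// nextLoadWait maps a random fraction in [0, 1) to a wait in [5m, 15m).
// Taking the fraction as a parameter keeps the function deterministic.
func nextLoadWait(frac float64) time.Duration {
	return baseWait + time.Duration(frac*float64(jitterSpan))
}

func main() {
	// Each server draws its own fraction, so load times spread out.
	for i := 0; i < 3; i++ {
		fmt.Println(nextLoadWait(rand.Float64()).Round(time.Second))
	}
}
```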
If site replication is enabled across sites, replicate the SSE-C
objects as well. These objects could be read from target sites
using the same client encryption keys.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Instead of relying on user input values, we use the DN value returned by
the LDAP server.
This handles cases like when a mapping is set on a DN value
`uid=svc.algorithm,OU=swengg,DC=min,DC=io` with a user input value (with
unicode variation) of `uid=svc﹒algorithm,OU=swengg,DC=min,DC=io`. The
LDAP server on lookup of this DN returns the normalized value where the
unicode dot character `SMALL FULL STOP` (in the user input), gets
replaced with regular full stop.
Bonus: remove the persistent md5sum calculation and turn off
sha256 as well. Instead, we always enable crc32c, which
is enough for payload verification, along with support for
trailing header checksums.
Fix races in IAM cache
Fixes #19344
On the top level we only grab a read lock, but we write to the cache if we manage to fetch it.
a03dac41eb/cmd/iam-store.go (L446) is also flipped to what it should be AFAICT.
Change the internal cache structure to a concurrency safe implementation.
Bonus: Also switch grid implementation.
we must attempt to convert all errors at storage-rest-client
into StorageErr() regardless of which functionality is being
called. This PR fixes this for multiple callers, including
some internally used functions.
- old version was unable to retain messages during config reload
- old version could not go from memory to disk during reload
- new version can batch disk queue entries into a single entry to reduce I/O load
- error logging has been improved, previous version would miss certain errors.
- logic for spawning/despawning additional workers has been adjusted to trigger when half capacity is reached, instead of when the log queue becomes full.
- old version would JSON marshal 2x and unmarshal 1x for every log item. Now we only marshal 1x and then we GetRaw from the store and send it without having to re-marshal.
panic seen due to premature closing of slow channel while listing is still sending or
list has already closed on the sender's side:
```
panic: close of closed channel
goroutine 13666 [running]:
github.com/minio/minio/internal/ioutil.SafeClose[...](0x101ff51e4?)
/Users/kp/code/src/github.com/minio/minio/internal/ioutil/ioutil.go:425 +0x24
github.com/minio/minio/cmd.(*erasureServerPools).Walk.func1()
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:2142 +0x170
created by github.com/minio/minio/cmd.(*erasureServerPools).Walk in goroutine 1189
/Users/kp/code/src/github.com/minio/minio/cmd/erasure-server-pool.go:1985 +0x228
```
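One common way to make a channel close idempotent is to guard it with `sync.Once`. The sketch below uses a hypothetical `onceCloser` helper and is not the `ioutil.SafeClose` implementation from the trace above.

```go
package main

import (
	"fmt"
	"sync"
)

// onceCloser is a hypothetical wrapper that closes its channel at most once.
type onceCloser[T any] struct {
	ch   chan T
	once sync.Once
}

func newOnceCloser[T any](size int) *onceCloser[T] {
	return &onceCloser[T]{ch: make(chan T, size)}
}

// Close closes the channel on the first call and is a no-op afterwards.
// It reports whether this particular call performed the close.
func (c *onceCloser[T]) Close() bool {
	closed := false
	c.once.Do(func() {
		close(c.ch)
		closed = true
	})
	return closed
}

func main() {
	c := newOnceCloser[string](1)
	fmt.Println(c.Close()) // true: the first call closes the channel
	fmt.Println(c.Close()) // false: no "close of closed channel" panic
}
```

This only protects against a double close. A sender can still panic when sending on an already closed channel, so the closing side must also be the side that owns the sends.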
Object names of directory objects qualified for ExpiredObjectAllVersions
must be encoded appropriately before calling deletePrefix on their
erasure set.
e.g., a directory object and regular objects with overlapping prefixes
could lead to the expiration of regular objects, which is not the
intention of ILM.
```
bucket/dir/ ---> directory object
bucket/dir/obj-1
```
When `bucket/dir/` qualifies for expiration, the current implementation would
remove regular objects under the prefix `bucket/dir/`, in this case,
`bucket/dir/obj-1`.
In handlers related to health diagnostics e.g. CPU, Network, Partitions,
etc, globalMinioHost was being passed as the addr, resulting in empty
value for the same in the health report.
Using globalLocalNodeName instead fixes the issue.
IAM loading is a lazy operation, so allow these
fallbacks to be in place when we cannot find
the in-memory state.
This allows us to honor the request even if we pay
a small price for the lookup and for populating the data.
Objects can have more versions than their ILM policy expects to retain
via NewerNoncurrentVersions and still not qualify for expiry because
NoncurrentDays is also configured in that rule.
In this case, the applyNewerNoncurrentVersionsLimit method was enqueuing empty
tasks, which led to a panic (panic: runtime error: index out of range [0] with
length 0) in the newerNoncurrentTask.OpHash method, which assumes the task
contains at least one version to expire.
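The guard amounts to not enqueuing a task that has nothing to expire. A minimal sketch, with a hypothetical `versionTask` type and `enqueue` helper standing in for the real task and queue:

```go
package main

import "fmt"

// versionTask is a hypothetical stand-in for a task carrying the versions
// to expire.
type versionTask struct {
	bucket   string
	versions []string
}

// key assumes at least one version, like the OpHash method described above.
func (t versionTask) key() string {
	return t.bucket + "/" + t.versions[0]
}

// enqueue appends t to the queue unless it carries no versions, so that
// key() can never index into an empty slice later.
func enqueue(queue []versionTask, t versionTask) []versionTask {
	if len(t.versions) == 0 {
		return queue
	}
	return append(queue, t)
}

func main() {
	var q []versionTask
	q = enqueue(q, versionTask{bucket: "b"}) // skipped: nothing to expire
	q = enqueue(q, versionTask{bucket: "b", versions: []string{"v1"}})
	fmt.Println(len(q), q[0].key()) // 1 b/v1
}
```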
When returning the status of a decommissioned pool, a pool with zero
time StartedTime will be considered an active pool, which is unexpected.
This commit will always ensure that a pool's canceled/failed/completed
status is returned.
we were prematurely not writing 4k pages when we
could have, since most buffers are a multiple
of 4k up to some length, followed by
a remainder.
We only need to write the remainder without O_DIRECT.
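The split is plain arithmetic. In the sketch below, `splitAligned` is a hypothetical helper and the 4096-byte page size is taken from the "4k" mentioned above:

```go
package main

import "fmt"

const pageSize = 4096 // the 4k alignment assumed for O_DIRECT writes

// splitAligned returns how many leading bytes of an n-byte buffer form whole
// 4k pages (writable with O_DIRECT) and how many trailing bytes remain.
func splitAligned(n int) (aligned, remainder int) {
	remainder = n % pageSize
	return n - remainder, remainder
}

func main() {
	aligned, rem := splitAligned(10000)
	fmt.Println(aligned, rem) // 8192 1808
}
```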
at scale customers might start with failed drives,
causing skew in the overall usage ratio per EC set.
make this configurable such that customers can turn
this off as needed depending on how comfortable they
are.
Currently, the code relies on object parity to decide whether it is a
delete marker or a regular object. In the case of a delete marker, the
return quorum is half of the disks in the erasure set. However, this
calculation must be corrected for objects with EC = 0, mainly
because EC is not a one-time fixed configuration.
Though all data are correct, the manifested symptom is a 503 with an
EC=0 object. This bug was manifested after we introduced the
fast Get Object feature that does not read all data from all disks in
case of inlined objects.
Metrics v3 is mainly a reorganization of metrics into smaller groups of
metrics and the removal of internal aggregation of metrics received from
peer nodes in a MinIO cluster.
This change adds the endpoint `/minio/metrics/v3` as the top-level metrics
endpoint and under this, various sub-endpoints are implemented. These
are currently documented in `docs/metrics/v3.md`.
The handler will serve metrics at any path
`/minio/metrics/v3/PATH`, as follows:
- when PATH is a sub-endpoint listed above, it serves the group of
  metrics under that path;
- when PATH is a (non-empty) parent directory of the sub-endpoints
  listed above, it serves metrics from each child sub-endpoint of PATH;
- otherwise, it returns a no resource found error.
All available metrics are listed in the `docs/metrics/v3.md`. More will
be added subsequently.
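The path resolution rules can be sketched as below. The `groups` list and the `resolve` function are assumptions for illustration only; the authoritative list of sub-endpoints is in `docs/metrics/v3.md`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groups lists placeholder sub-endpoints used only for this sketch.
var groups = []string{"/api/requests", "/system/drive", "/system/network"}

// resolve returns the sub-endpoints to serve for path. The boolean is false
// when nothing matches, which corresponds to the "no resource found" case.
func resolve(path string) ([]string, bool) {
	path = strings.TrimSuffix(path, "/")
	var out []string
	for _, g := range groups {
		if g == path {
			return []string{g}, true // exact sub-endpoint
		}
		if path != "" && strings.HasPrefix(g, path+"/") {
			out = append(out, g) // child of a non-empty parent directory
		}
	}
	sort.Strings(out)
	return out, len(out) > 0
}

func main() {
	fmt.Println(resolve("/system"))       // [/system/drive /system/network] true
	fmt.Println(resolve("/api/requests")) // [/api/requests] true
	fmt.Println(resolve("/nope"))         // [] false
}
```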
Merging multiple versions of the same object from different pools would not always result in correct ordering.
When merging, keep inputs separate.
```
λ mc ls --versions local/testbucket
------ before ------
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v3 PUT hosts
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
------ after ------
λ mc ls --versions local/testbucket
[2024-03-05 20:19:56 CET] 19KiB STANDARD null v3 PUT hosts
[2024-03-05 20:17:19 CET] 228B STANDARD 1f163718-9bc5-4b01-bff7-5d8cf09caf10 v2 PUT hosts
[2024-03-05 20:17:15 CET] 228B STANDARD 73c9f651-f023-4566-b012-cc537fdb7ce2 v1 PUT hosts
```
Currently, the progress of the batch job is saved inside the job
request object, which is normally not supported by MinIO. Though there
is no apparent bug, it is better to fix this now.
Batch progress is saved in .minio.sys/batch-jobs/reports/
Co-authored-by: Anis Eleuch <anis@min.io>
our PoolNumber calculation was costly,
while we already had this information per
endpoint, we needed to deduce it appropriately.
This PR addresses this by adding a PoolNumbers
field that carries all the pool numbers that
belong to a server.
properties.PoolNumber still carries a valid value
only when len(properties.PoolNumbers) == 1, otherwise
properties.PoolNumber is set to math.MaxInt (indicating
that this value is undefined) and then one must rely
on properties.PoolNumbers for server participation
in multiple pools.
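A minimal sketch of the convention described above. The `properties` struct here is a hypothetical two-field stand-in, and `newProperties` and `singlePool` are illustrative helpers, not existing functions.

```go
package main

import (
	"fmt"
	"math"
)

// properties is a hypothetical struct carrying only the two fields discussed.
type properties struct {
	PoolNumber  int   // valid only when len(PoolNumbers) == 1
	PoolNumbers []int // all pools this server participates in
}

// newProperties sets PoolNumber to math.MaxInt (undefined) unless the server
// belongs to exactly one pool.
func newProperties(pools []int) properties {
	p := properties{PoolNumber: math.MaxInt, PoolNumbers: pools}
	if len(pools) == 1 {
		p.PoolNumber = pools[0]
	}
	return p
}

// singlePool returns the pool number only when it is well defined.
func (p properties) singlePool() (int, bool) {
	if len(p.PoolNumbers) == 1 {
		return p.PoolNumber, true
	}
	return 0, false
}

func main() {
	fmt.Println(newProperties([]int{2}).singlePool())    // 2 true
	fmt.Println(newProperties([]int{1, 2}).singlePool()) // 0 false
}
```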
addresses the issue originating from #11327
there can be a sudden spike in tiny allocations,
due to too much auditing being done, also don't hang
on the
```
h.logCh <- entry
```
after initializing workers if you do not have a way to
dequeue for some reason.
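One way to avoid hanging on a channel send is a `select` with a `default` branch. This is a sketch of that general Go pattern under the assumption that dropping or diverting the entry is acceptable; `trySend` is a hypothetical helper, not the code in this change.

```go
package main

import "fmt"

// trySend enqueues entry without blocking. It returns false when the buffer
// is full and nobody is dequeuing, so the caller can drop or divert the entry.
func trySend(ch chan<- string, entry string) bool {
	select {
	case ch <- entry:
		return true
	default:
		return false
	}
}

func main() {
	logCh := make(chan string, 1)
	fmt.Println(trySend(logCh, "first"))  // true: buffer had room
	fmt.Println(trySend(logCh, "second")) // false: full, returns immediately
}
```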
This commit adds support for using the `--endpoint` arg when creating a
tier of type `azure`. This is needed to connect to Azure's Gov Cloud
instance. For example,
```
mc ilm tier add azure TARGET TIER_NAME \
--account-name ACCOUNT \
--account-key KEY \
--bucket CONTAINER \
--endpoint https://ACCOUNT.blob.core.usgovcloudapi.net \
--prefix PREFIX \
--storage-class STORAGE_CLASS
```
Prior to this, the endpoint was hardcoded to `https://ACCOUNT.blob.core.windows.net`.
The docs were even explicit about this, stating that `--endpoint` is:
"Required for `s3` or `minio` tier types. This option has no effect for any
other value of `TIER_TYPE`."
Now, if the endpoint arg is present it will be used. If not, it will
fall back to the same default behavior of `ACCOUNT.blob.core.windows.net`.
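The fallback logic is small enough to sketch directly. `azureEndpoint` is a hypothetical helper name; the default URL format is the one quoted above.

```go
package main

import "fmt"

// azureEndpoint returns the explicit endpoint when one is given and otherwise
// falls back to the default public-cloud URL for the account.
func azureEndpoint(account, endpoint string) string {
	if endpoint != "" {
		return endpoint
	}
	return "https://" + account + ".blob.core.windows.net"
}

func main() {
	fmt.Println(azureEndpoint("ACCOUNT", ""))
	fmt.Println(azureEndpoint("ACCOUNT", "https://ACCOUNT.blob.core.usgovcloudapi.net"))
}
```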
Remove api.expiration_workers config setting which was inadvertently left behind. Per review comment
https://github.com/minio/minio/pull/18926, expiration_workers can be configured via ilm.expiration_workers.
ext4, xfs support this behavior however
btrfs, nfs may not support it properly.
in case we see Nlink < 2, we know
that we need to fall back on readdir()
fixes a regression from #19100
fixes #19181
The middleware sets up tracing, throttling, gzipped responses and
collecting API stats.
Additionally, this change updates the names of handler functions in
metric labels to be the same as the name derived from Go reflection
on the handler name.
The metric api labels are now stored in memory the same as the handler
name - they will be camelcased, e.g. `GetObject` instead of `getobject`.
For compatibility, we lowercase the metric api label values when emitting the metrics.
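Deriving a handler name through reflection and lowercasing it on emit could look like the sketch below. `GetObject` here is a dummy handler, and `handlerName` and `metricLabel` are hypothetical helpers, not the functions used in MinIO.

```go
package main

import (
	"fmt"
	"net/http"
	"reflect"
	"runtime"
	"strings"
)

// GetObject is a dummy handler used only to have a named function to inspect.
func GetObject(w http.ResponseWriter, r *http.Request) {}

// handlerName returns the bare function name of f, for example "GetObject".
// It relies on runtime.FuncForPC, which reports "pkg.Name" for top-level
// functions; method values and closures produce different name shapes.
func handlerName(f http.HandlerFunc) string {
	full := runtime.FuncForPC(reflect.ValueOf(f).Pointer()).Name()
	return full[strings.LastIndex(full, ".")+1:]
}

// metricLabel lowercases the in-memory name for backward compatibility.
func metricLabel(f http.HandlerFunc) string {
	return strings.ToLower(handlerName(f))
}

func main() {
	fmt.Println(handlerName(GetObject), metricLabel(GetObject)) // GetObject getobject
}
```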
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
the same object is serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
@klauspost
- Add error logs when expiryState and transition state are not
initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
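Serializing the expiry of versions of the same object can be achieved by always routing an object to the same worker. The sketch below assumes a hash of the object name is used for this; `workerFor` and the choice of FNV-1a are illustrative, not necessarily what the change implements.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// workerFor maps an object name to a worker index in [0, workers). All tasks
// for the same object land on the same worker and therefore run in order.
func workerFor(object string, workers int) int {
	if workers <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(object)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(workers))
}

func main() {
	for _, name := range []string{"bucket/a", "bucket/b", "bucket/a"} {
		fmt.Println(name, "-> worker", workerFor(name, 4))
	}
}
```

Note that resizing the pool changes the mapping, so a real implementation has to handle tasks in flight during a resize.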
With this commit, MinIO generates root credentials automatically
and deterministically if:
- No root credentials have been set.
- A KMS (KES) is configured.
- API access for the root credentials is disabled (lockdown mode).
Previously, MinIO defaulted to `minioadmin` for both the access and
secret keys. Now, MinIO generates unique root credentials
automatically on startup using the KMS.
To do so, it uses the KMS HMAC function to generate pseudo-random
values. These values never change as long as the KMS key remains
the same, and the KMS key must continue to exist since all IAM data
is encrypted with it.
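The determinism comes from HMAC itself: the same key and the same input always give the same output. The sketch below computes the HMAC locally with a made-up key and made-up labels purely to illustrate this property; in the change described above the HMAC is computed by the KMS, and the actual inputs and encoding are not shown here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// derive returns a hex-encoded HMAC-SHA256 of label under key. The label acts
// as a domain separator so that different credentials get different values.
func derive(key []byte, label string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(label))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("stand-in for the KMS key") // hypothetical local key
	// The labels below are hypothetical.
	fmt.Println(derive(key, "root-access-key"))
	fmt.Println(derive(key, "root-secret-key"))
}
```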
Backward compatibility:
This commit should not cause existing deployments to break. It only
changes the root credentials of deployments that have a KMS configured
(KES, not a static key) but have not set any admin credentials. Such
deployments should be rare or not exist at all.
Even then, the worst case would be updating root credentials in mc
or other clients used to administer the cluster. Root credentials
are anyway not intended for regular S3 operations.
Signed-off-by: Andreas Auernhammer <github@aead.dev>
just like client-conn-read-deadline, added a new flag that does
client-conn-write-deadline as well.
Neither is configured by default, since we do not yet know
what the right value is. Allow this to be configurable if needed.
we should do this to ensure that we focus on
data healing as primary focus, fixing metadata
as part of healing must be done but making
data available is the main focus.
the main reason is metadata inconsistencies can
cause data availability issues, which must be
avoided at all cost.
will be bringing in an additional healing mechanism
that involves "metadata-only" heal, for now we do
not expect to have these checks.
continuation of #19154
Bonus: add a pro-active healthcheck to perform a connection
This change makes the label names consistent with the handler names.
This is in preparation to use reflection based API handler function
names for the api labels so they will be the same as tracing, auditing
and logging names for these API calls.
in k8s things really do come online very asynchronously,
we need to use implementation that allows this randomness.
To facilitate this move WriteAll() as part of the
websocket layer instead.
Bonus: avoid instances of dnscache usage on k8s
New disk healing code skips/expires objects that ILM is supposed to expire.
Add more visibility to the user about this activity by counting those
objects and printing the count at the end of the healing activity.
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
This change is to decouple need for root credentials to match between
site replication deployments.
Also ensure site replication config initialization is retried until
it succeeds. This dependency is critical to the STS flow in the site replication
scenario.
Currently, we read from `/proc/diskstats` which is found to be
unreliable in k8s environments. We can read from `sysfs` instead.
Also, cache the latest drive io stats to find the diff and update
the metrics.
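Caching the last sample and reporting the difference could be sketched as follows. The `ioStats` fields and the `statsCache` type are hypothetical; the real counters read from `sysfs` are not reproduced here.

```go
package main

import "fmt"

// ioStats holds cumulative counters for one drive; field names are made up.
type ioStats struct {
	ReadBytes, WriteBytes uint64
}

// statsCache remembers the latest sample per drive.
type statsCache struct {
	last map[string]ioStats
}

// delta stores cur as the latest sample for drive and returns the change
// since the previous sample. The first sample for a drive returns false.
func (c *statsCache) delta(drive string, cur ioStats) (ioStats, bool) {
	if c.last == nil {
		c.last = make(map[string]ioStats)
	}
	prev, seen := c.last[drive]
	c.last[drive] = cur
	if !seen {
		return ioStats{}, false
	}
	return ioStats{
		ReadBytes:  cur.ReadBytes - prev.ReadBytes,
		WriteBytes: cur.WriteBytes - prev.WriteBytes,
	}, true
}

func main() {
	var c statsCache
	c.delta("sda", ioStats{ReadBytes: 100, WriteBytes: 50})
	d, ok := c.delta("sda", ioStats{ReadBytes: 160, WriteBytes: 90})
	fmt.Println(d.ReadBytes, d.WriteBytes, ok) // 60 40 true
}
```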
* Remove lock for cached operations.
* Rename "Relax" to `ReturnLastGood`.
* Add `CacheError` to allow caching values even on errors.
* Add NoWait that will return current value with async fetching if within 2xTTL.
* Make benchmark somewhat representative.
```
Before: BenchmarkCache-12 16408370 63.12 ns/op 0 B/op
After: BenchmarkCache-12 428282187 2.789 ns/op 0 B/op
```
* Remove `storageRESTClient.scanning`. Nonsensical - RPC clients will not have any idea about scanning.
* Always fetch remote diskinfo metrics and cache them. Seems most calls are requesting metrics.
* Do async fetching of usage caches.
It also fixes a long-standing bug in expiring transitioned objects.
The expiration action was deleting the current version in the case
of tiered objects instead of adding a delete marker.