In situations with large number of STS credentials on disk, IAM load
time is high. To mitigate this, STS accounts will now be loaded into
memory only on demand - i.e. when the credential is used.
In each IAM cache (re)load we skip loading STS credentials and STS
policy mappings into memory. Since STS accounts only expire and cannot
be deleted, there is no risk of invalid credentials being reused,
because credential validity is checked when it is used.
Currently we have IOPs of these patterns
```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```
It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.
Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
Previously existing objects were queued to single worker and MRF re-queues
are also handled by same worker - this does not fully use the available
bandwidth in case there is no incoming workload.
Errors such as
```
returned an error (context deadline exceeded) (*fmt.wrapError)
```
```
(msgp: too few bytes left to read object) (*fmt.wrapError)
```
This change enables embedding files in ZIP with custom permissions.
Also uses default creds for starting MinIO based on inspect data.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
objects with 10,000 parts and many of them can
cause a large memory spike which can potentially
lead to OOM due to lack of GC.
with previous PR reducing the memory usage significantly
in #17963, this PR reduces this further by 80% under
repeated calls.
Scanner sub-system has no use for the slice of Parts(),
it is better left empty.
```
benchmark old ns/op new ns/op delta
BenchmarkToFileInfo/ToFileInfo-8 295658 188143 -36.36%
benchmark old allocs new allocs delta
BenchmarkToFileInfo/ToFileInfo-8 61 60 -1.64%
benchmark old bytes new bytes delta
BenchmarkToFileInfo/ToFileInfo-8 1097210 227255 -79.29%
```
- this PR avoids sending a large ChecksumInfo slice
when its not needed
- also for a file with XLV2 format there is no reason
to allocate Checksum slice while reading
Keys are helpful to ensure the strict ordering of messages, however currently the
code uses a random request id for every log, hence using the request-id
as a Kafka key is not serve any purpose;
This commit removes the usage of the key, to also fix the audit issue from
internal subsystem that does not have a request ID.
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.
This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.
Add prometheus metric to track credential errors since uptime
replicationTimestamp might differ if there were retries
in replication and the retried attempt overwrote in
quorum but enough shards with newer timestamp causing
the existing timestamps on xl.meta to be invalid, we
do not rely on this value for anything external.
this is purely a hint for debugging purposes, but there
is no real value in it considering the object itself
is in-tact we do not have to spend time healing this
situation.
we may consider healing this situation in future but
that needs to be decoupled to make sure that we do not
over calculate how much we have to heal.
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.
this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
given a versionId the mtime is always the same, it
can never be different than its original value.
versionIds also do not conflict, since they are uuid's
and unique practically forever.
we expect a certain level of IOPs and latency so this is okay.
fixes other miscellaneous bugs
- such as hanging on mrfCh <- when the context is canceled
- queuing MRF heal when the context is canceled
- remove unused saveStateCh channel
This commit updates the minio/kes-go dependency
to v0.2.0 and updates the existing code to work
with the new KES APIs.
The `SetPolicy` handler got removed since it
may not get implemented by KES at all and could
not have been used in the past since stateless KES
is read-only w.r.t. policies and identities.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus fixes include
- do not have to write final xl.meta (renameData) does this
already, saves some IOPs.
- make sure to purge the multipart directory properly using
a recursive delete, otherwise this can easily pile up and
rely on the stale uploads cleanup.
fixes#17863
This reverts commit bf3901342c.
This is to fix a regression caused when there are inconsistent
versions, but one version is in quorum. SuccessorModTime issue
must be fixed differently.
batch status can perpetually wait after completion
due to a race between the MetricsHandler() returning
the active metrics in intervals of 1sec and delete
of metrics after job completion.
this PR ensures that we keep the 'status' around
for a while, i.e upto 24hrs for all the batch jobs.
Two fields in lifecycles made GOB encoding consistently fail with `gob: type lifecycle.Prefix has no exported fields`.
This meant that in distributed systems listings would never be able to continue and would restart on every call.
Fix issues and be sure to log these errors at least once per bucket. We may see some connectivity errors here, but we shouldn't hide them.
When listing getObjectFileInfo can return `io.EOF` if file is being written.
When we wrap the error it will *not* retry upstream, since `io.EOF` is a valid return value.
Allow one retry before returning errors and canceling the listing.
* optimize deletePrefix, use direct set location via object name
instead of fanning out the calls for an object force delete
we can assume the set location and not do fan-out calls
* Apply suggestions from code review
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
---------
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
Bonus:
- avoid calling DiskInfo() calls when missing blocks
instead heal the object using MRF operation.
- change the max_sleep to 250ms beyond that we will
not stop healing.
ignoring valid objects with valid replication metadata
after the Prefix was disabled must still honor the older
metadata.
this can lead to unexpected results, allow it during
READ phase always.
// UnmarshalStrict is like Unmarshal except that any fields that are found
// in the data that do not have corresponding struct members, or mapping
// keys that are duplicates, will result in
// an error.
batch replication pull must preserve versionID regardless
of destination bucket versioning configuration.
This is similar to the issue with decommissioning and rebalancing
health checks were missing for drives replaced since
- HealFormat() would replace the drives without a health check
- disconnected drives when they reconnect via connectEndpoint()
the loop also loses health checks for local disks and merges
these into a single code.
- other than this separate cleanUp, health check variables to avoid
overloading them with similar requirements.
- also ensure that we compete via context selector for disk monitoring
such that the canceled disks don't linger around longer waiting for
the ticker to trigger.
- allow disabling active monitoring.
```
minio[1032735]: panic: label value "\xc0.\xc0." is not valid UTF-8
minio[1032735]: goroutine 1781101 [running]:
minio[1032735]: github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
```
log such errors for investigation
Limit large uploads (> 128MiB) to a max of 10 workers, intent is to avoid
larger uploads from using all replication bandwidth, giving room for smaller
uploads to sync faster.
slower drives get knocked off because they are too slow via
active monitoring, we do not need to block calls arbitrarily.
Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs
Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
Revert "don't error when asked for 0-based range on empty objects (#17708)"
This reverts commit 7e76d66184.
There is no valid way to specify offsets in a 0-byte file. Blame it on the [RFC](https://datatracker.ietf.org/doc/html/rfc7233#section-4.4)
> The 416 (Range Not Satisfiable) status code indicates that none of the ranges in the
> request's Range header field (Section 3.1) overlap the current extent of the selected resource...
A request for "bytes=0-" is a request for the first byte of a resource. If the resource is 0-length,
the range [0,0] does not overlap the resource content and the server responds with an error.
In a reverse proxying setup, a proxy in front of MinIO may attempt to
request objects in slices for enhanced cache efficiency. Since such a
a proxy cannot have prior knowledge of how large a requested resource is,
it usually sends a header of the form:
Range: 0-$slice_size
... and, depending on the size of the resource, expects either:
- an empty response, if $resource_size == 0
- a full response, if $resource_size <= $slice_size
- a partial response, if $resource_size > $slice_size
Prior to this change, MinIO would respond 416 Range Not Satisfiable if a
client tried to request a range on an empty resource. This behavior is
technically consistent with RFC9110[1] – However, it renders sliced
reverse proxying, such as implemented in Nginx, broken in the case of
empty files. Nginx itself seems to break this convention to enable
"useful" responses in these cases, and MinIO should probably do that
too.
[1]: https://www.rfc-editor.org/rfc/rfc9110#byte.ranges
sending whitespace character with CompleteMultipartUpload()
with 200 OK was an AWS S3 compatible implementation detail,
and it was expected that the client SDK must look for both
successful XML as well as error XML for 200 OK.
But this is not useful anymore on MinIO, since we do not
have any large delayed coalescing of parts anymore.
users/customers do not have a reasonable number of buckets anymore,
this is why we must avoid overpopulating cluster endpoints, instead
move the bucket monitoring to a separate endpoint.
some of it's a breaking change here for a couple of metrics, but
it is imperative that we do it to improve the responsiveness of
our Prometheus cluster endpoint.
Bonus: Added new cluster metrics for usage, objects and histograms
Using this script, post decrypt we should be able to bring up the
MinIO instance with same configuration.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Sometimes IAM fails to load certain items, which could be a user,
a service account or a policy but with not enough information for
us to debug.
This commit will create a more descriptive error to make it easier to
debug in such situations.
mc admin trace -a will be able to quickly show
401 Unauthorized header to pinpoint trivial issues
between nodes, such as wrong root
credentials and skewed time.
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.
this would cause legitimately expired objects to be missed
from expiration.
this randomness is needed to avoid scanning
the same buckets across different erasure sets,
in the same order.
allow random buckets to be scanned instead
allowing a wider spread of ILM, replication
checks.
Additionally do not loop over twice to fill
the channel, fill the channel regardless of
having bucket new or old.
A new middleware function is added for admin handlers, including options
for modifying certain behaviors. This admin middleware:
- sets the handler context via reflection in the request and sends AuditLog
- checks for object API availability (skipping it if a flag is passed)
- enables gzip compression (skipping it if a flag is passed)
- enables header tracing (adding body tracing if a flag is passed)
While the new function is a middleware, due to the flags used for
conditional behavior modification, which is used in each route registration
call.
To try to ensure that no regressions are introduced, the following
changes were done mechanically mostly with `sed` and regexp:
- Remove defer logger.AuditLog in admin handlers
- Replace newContext() calls with r.Context()
- Update admin routes registration calls
Bonus: remove unused NetSpeedtestHandler
Since the new adminMiddleware function checks for object layer presence
by default, we need to pass the `noObjLayerFlag` explicitly to admin
handlers that should work even when it is not available. The following
admin handlers do not require it:
- ServerInfoHandler
- StartProfilingHandler
- DownloadProfilingHandler
- ProfileHandler
- SiteReplicationDevNull
- SiteReplicationNetPerf
- TraceHandler
For these handlers adminMiddleware does not check for the object layer
presence (disabled by passing the `noObjLayerFlag`), and for all other
handlers, the pre-check ensures that the handler is not called when the
object layer is not available - the client would get a
ErrServerNotInitialized and can retry later.
This `noObjLayerFlag` is added based on existing behavior for these
handlers only.
Add check every 2 minutes to see if a write+read operation can complete.
If disk is unresponsive for 2 minutes or returns errFaultyDisk, take it offline.
Simplify MRF queueing and add backlog handler
- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.
- Change MRF to have each node process only its MRF entries.
- Collect MRF backlog by the node to allow for current backlog visibility
Now it would list details of all KMS instances with additional
attributes `endpoint` and `version`. In the case of k8s-based
deployment the list would consist of a single entry.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
This would better to record the correct API name so that
any verification around audit logs to figure out if required
APIs are called required no of times, would be correct.
Here in this case of policy attached, API `AttachDetachPolicyBuiltin`
would be called with `requestPath` as `/minio/admin/v3/idp/builtin/policy/attach`
and in case of detach policy the value would be `/minio/admin/v3/idp/builtin/policy/detach`
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Also shutdown poll add jitter, to verify if the shutdown
sequence can finish before 500ms, this reduces the overall
time taken during "restart" of the service.
Provides speedup for `mc admin service restart` during
active I/O, also ensures that systemd doesn't treat the
returned 'error' as a failure, certain configurations in
systemd can cause it to 'auto-restart' the process by-itself
which can interfere with `mc admin service restart`.
It can be observed how now restarting the service is
much snappier.
on unversioned buckets its possible that 0-byte objects
might lose quorum on flaky systems, allow them to be same
as DELETE markers. Since practically speak they have no
content.
Optimize DeleteObject API to avoid extra
GetObjectInfo call on the replicating side.
For receiving side, it is just a regular
DeleteObject call.
Bonus: Fix a corner case where version purged is
absent on target (either due to replication not yet
complete or target version already deleted in a
one-way replication or when replication was disabled).
In such cases, mark version purge complete.
Since `addCustomerHeaders` middleware was after the `httpTracer`
middleware, the request ID was not set in the http tracing context. By
reordering these middleware functions, the request ID header becomes
available. We also avoid setting the tracing context key again in
`newContext`.
Bonus: All middleware functions are renamed with a "Middleware" suffix
to avoid confusion with http Handler functions.
* Reduce allocations
* Add stringsHasPrefixFold which can compare string prefixes, while ignoring case and not allocating.
* Reuse all msgp.Readers
* Reuse metadata buffers when not reading data.
* Make type safe. Make buffer 4K instead of 8.
* Unslice
DNS refresh() in-case of MinIO can safely re-use
the previous values on bare-metal setups, since
bare-metal arrangements do not change DNS in any
manner commonly.
This PR simplifies that, we only ever need DNS caching
on bare-metal setups.
- On containerized setups do not enable DNS
caching at all, as it may have adverse effects on
the overall effectiveness of k8s DNS systems.
k8s DNS systems are dynamic and expect applications
to avoid managing DNS caching themselves, instead
provide a cleaner container native caching
implementations that must be used.
- update IsDocker() detection, including podman runtime
- move to minio/dnscache fork for a simpler package
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.
```
<LifecycleConfiguration>
<Rule>
<ID>ClassADocRule</ID>
<Filter>
<Prefix>classA/</Prefix>
</Filter>
<Status>Enabled</Status>
<Expiration>
<Days>3650</Days>
<ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
</Expiration>
</Rule>
...
```
- look for requested encryption while compressing
not just via HTTP Headers, but also via multipart
metadata
- look for SSE-S3 etag decryption not just via HTTP
Headers, but also via multipart metadata
fixes#17519
current decommission traces were missing for
- Skipped ILM expired versions
- Skipped single DELETE marked version
- A success or failure in decommissioning DELETE marker
- allow additional info to be shared in DecomStatus() API
there is a possibility that slow drives can actually add latency
to the overall call, leading to a large spike in latency.
this can happen if there are other parallel listObjects()
calls to the same drive, in-turn causing each other to sort
of serialize.
this potentially improves performance and makes PutObject()
also non-blocking.
This change adds a `Secret` property to `HelpKV` to identify secrets
like passwords and auth tokens that should not be revealed by the server
in its configuration fetching APIs. Configuration reporting APIs now do
not return secrets.
For policy attach/detach API to work correctly the server should hold a
lock before reading existing policy mapping and until after writing the
updated policy mapping. This is fixed in this change.
A site replication bug, where LDAP policy attach/detach were not
correctly propagated is also fixed in this change.
Bonus: Additionally, the server responds with the actual (or net)
changes performed in the attach/detach API call. For e.g. if a user
already has policy A applied, and a call to attach policies A and B is
performed, the server will respond that B was attached successfully.
A continuation of PR #17479 for rebalance behavior must
also match the decommission behavior.
Fixes bug where rebalance would ignore rebalancing object
versions after one of the version returned "ObjectNotFound"
while decommissioning it can so happen that the non-current
versions are all expired but there is a DEL marker as the
latest version.
For such objects, we should not decommission them instead
calculate the remaining versions and if the remaining versions
is one and that version is a DEL marker consider such
an object not to be scheduled for decommissioning.
With the current asynchronous behaviour in sending notification events
to the targets, we can't provide guaranteed delivery as the systems
might go for restarts.
For such event-driven use-cases, we can provide an option to enable
synchronous events where the APIs wait until the event is successfully
sent or persisted.
This commit adds 'MINIO_API_SYNC_EVENTS' env which when set to 'on'
will enable sending/persisting events to targets synchronously.
A state is updated with a delete marker, which does not have parity or
data blocks defined, which can cause the integer divide by zero panics.
This commit fixes to avoid panics.
on "unversioned" buckets there are situations
when successive concurrent I/O can lead to
an inconsistent state() with mtime while the
etag might be the same for the object on disk.
in such a scenario it is possible for us to
allow reading of the object since etag matches
and if etag matches we are guaranteed that we
have enough copies the object will be readable
and same.
This PR allows fallback in such scenarios.
This PR also returns the replication status in
proxy calls and defers replication attempt if
HEAD on object version returned a error different
from NoSuchKey
A specific node should do the decommissioning task, however routing the
start decommissioning to that node was not working properly.
Co-authored-by: Anis Elleuch <anis@min.io>
fixes an issue under bucket replication could cause
ETags for replicated SSE-S3 single part PUT objects,
to fail as we would attempt a decryption while listing,
or stat() operation.
- lifecycle must return InvalidArgument for rule errors
- do not return `null` versionId in HTTP header
- reject mixed SSE uploads with correct error message
- getObjectTagging to be allowed for anonymous policies
- return correct errors for invalid retention period
- return sorted list of tags for an object
- putObjectTagging must return 200 OK not 204 OK
- return 409 ErrObjectLockConfigurationNotAllowed for existing buckets
PUT calls cannot afford to have large latency build-ups due
to contentious usage.json, or worse letting them fail with
some unexpected error, this can happen when this file is
concurrently being updated via scanner or it is being
healed during a disk replacement heal.
However, these are fairly quick in theory, stressed clusters
can quickly show visible latency this can add up leading to
invalid errors returned during PUT.
It is perhaps okay for us to relax this error return requirement
instead, make sure that we log that we are proceeding to take in
the requests while the quota is using an older value for the quota
enforcement. These things will reconcile themselves eventually,
via scanner making sure to overwrite the usage.json.
Bonus: make sure that storage-rest-client sets ExpectTimeouts to
be 'true', such that DiskInfo() call with contextTimeout does
not prematurely disconnect the servers leading to a longer
healthCheck, back-off routine. This can easily pile up while also
causing active callers to disconnect, leading to quorum loss.
DiskInfo is actively used in the PUT, Multipart call path for
upgrading parity when disks are down, it in-turn shouldn't cause
more disks to go down.