PR #15052 caused a regression, add the missing metrics back.
Bonus:
- internode information should be only for distributed setups
- update the dashboard to include 4xx and 5xx error panels.
this allows for customers to use `mc admin service restart`
directly even when performing RPM, DEB upgrades. Upon such 'restart'
after upgrade MinIO will re-read the /etc/default/minio for any
newer environment variables.
As long as `MINIO_CONFIG_ENV_FILE=/etc/default/minio` is set, this
is honored.
Currently minio_s3_requests_errors_total covers 4xx and
5xx S3 responses which can be confusing when s3 applications
sent a lot of HEAD requests with obvious 404 responses or
when the replication is enabled.
Add
- minio_s3_requests_4xx_errors_total
- minio_s3_requests_5xx_errors_total
to help users monitor 4xx and 5xx HTTP status codes separately.
peerOnlineCounter was making NxN calls to many peers, this
can be really long and tedious if there are random servers
that are going down.
Instead we should calculate online peers from the point of
view of "self" and return those online and offline appropriately
by performing a healthcheck.
* Add periodic callhome functionality
Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.
New config keys under the `callhome` subsystem:
enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`
* Improvements based on review comments
- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`
* Use atomic package instead of lock
* Use uber atomic package
* Use `Cancel` instead of `cancel`
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
PR #15041 fixed replicating 'null' version however
due to a regression from #14994 caused the target
versions for these 'null' versioned objects to have
different 'versions', this may cause confusion with
bi-directional replication and cause double replication.
This PR fixes this properly by making sure we replicate
the correct versions on the objects.
mergeEntryChannels has the potential to perpetually
wait on the results channel, context might be closed
and we did not honor the caller context canceling.
The S3 service can be frozen indefinitely if a client or mc asks for object
perf API but quits early or has some networking issues. The reason is
that partialWrite() can block indefinitely.
This commit makes partialWrite() listens to context cancellation as well. It
also renames deadlinedCtx to healthCtx since it covers handler context
cancellation and not only not only the speedtest deadline.
In a streaming response, the client knows the size of a streamed
message but never checks the message size. Add the check to error
out if the response message is truncated.
Indexed streams would be decoded by the legacy loader if there
was an error loading it. Return an error when the stream is indexed
and it cannot be loaded.
Fixes "unknown minor metadata version" on corrupted xl.meta files and
returns an actual error.
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.
In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.
Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
readAllXL would return inlined data for outdated disks
causing "read" to return incorrect content to the client,
this PR fixes this behavior by making sure we skip such
outdated disks appropriately based on the latest ModTime
on the disk.
Main motivation is move towards a common backend format
for all different types of modes in MinIO, allowing for
a simpler code and predictable behavior across all features.
This PR also brings features such as versioning, replication,
transitioning to single drive setups.
Following code can reproduce an unending go-routine buildup,
while keeping connections established due to lack of client
not closing the connections.
https://gist.github.com/harshavardhana/2d00e6f909054d2d2524c71485ad02e1
Without this PR all MinIO deployments can be put into
denial of service attacks, causing entire service to be
unavailable.
We bring in two timeouts at this stage to control such
go-routine build ups, new change
- IdleTimeout (to kill off idle connections)
- ReadHeaderTimeout (to kill off connections that are too slow)
This new change also brings two hidden options to make any
additional relevant changes if desired in some setups.
It would seem like the PR #11623 had chewed more
than it wanted to, non-fips build shouldn't really
be forced to use slower crypto/sha256 even for
presumed "non-performance" codepaths. In MinIO
there are really no "non-performance" codepaths.
This assumption seems to have had an adverse
effect in certain areas of CPU usage.
This PR ensures that we stick to sha256-simd
on all non-FIPS builds, our most common build
to ensure we get the best out of the CPU at
any given point in time.
- Adds an STS API `AssumeRoleWithCustomToken` that can be used to
authenticate via the Id. Mgmt. Plugin.
- Adds a sample identity manager plugin implementation
- Add doc for plugin and STS API
- Add an example program using go SDK for AssumeRoleWithCustomToken
this PR also fixes a situation where incorrect
partsMetadata slice was used where fi.Data was
re-used from a single drive causing duplication
of the shards across all drives.
This happens for situations where shouldHeal()
returns true for all drives > parityBlocks.
To avoid this we should never attempt to heal on all
drives > parityBlocks, unless we are doing metadata
migration from xl.json -> xl.meta
If one or more pools reach 85% usage in a set, we will only
use pools that have more free space.
In case all pools are above 85% we allow all of them to be used
with the regular distribution.
When a server pool with a different number of sets is added they are
not compensated when choosing a destination pool for new objects.
This leads to the unbalanced placement of objects with smaller pools
getting a bigger number of objects since we only compare the destination
sets directly.
This change will compensate for differences in set sizes when choosing
the destination pool.
Different set sizes are already compensated by fewer disks.
updating metadata with CopyObject on a versioned bucket
causes the latest version to be not readable, this PR fixes
this properly by handling the inline data bug fix introduced
in PR #14780.
This bug affects only inlined data.
* Do not use inline data size in xl.meta quorum calculation
Data shards of one object can different inline/not-inline decision
in multiple disks. This happens with outdated disks when inline
decision changes. For example, enabling bucket versioning configuration
will change the small file threshold.
When the parity of an object becomes low, GET object can return 503
because it is not unable to calculate the xl.meta quorum, just because
some xl.meta has inline data and other are not.
So this commit will be disable taking the size of the inline data into
consideration when calculating the xl.meta quorum.
* Add tests for simulatenous inline/notinline object
Co-authored-by: Anis Elleuch <anis@min.io>
current implementation relied on recursively calling one bucket
at a time across all peers, this would be very slow and chatty
when there are 100's of buckets which would mean 100*peerCount
amount of network operations.
This PR attempts to reduce this entire call into `peerCount`
amount of network calls only. This functionality addresses also a
concern where the Prometheus metrics would significantly slow
down when one of the peers is offline.
Fix fallback hot loop
fd was never refreshed, leading to an infinite hot loop if a disk failed and the fallback disk fails as well.
Fix & simplify retry loop.
Fixes#14960
One usee reported having mc admin heal status output ETA increasing
by time. It turned out it is MRF that is not clearing its data due to a
bug in the code.
pendingItems is increased when an object is queued to be healed but
never decreasd when there is a healing error. This commit will decrease
pendingItems and pendingBytes even when there is an error to give
accurate reporting.
If LDAP is enabled, STS security token policy is evaluated using a
different code path and expects ldapUser claim to exist in the security
token. This means other STS temporary accounts generated by any Assume
Role function, such as AssumeRoleWithCertificate, won't be allowed to do any
operation as these accounts do not have LDAP user claim.
Since IsAllowedLDAPSTS() is similar to IsAllowedSTS(), this commit will
merge both.
Non harmful changes:
- IsAllowed for LDAP will start supporting RoleARN claim
- IsAllowed for LDAP will not check for parent claim anymore. This check doesn't
seem to be useful since all STS login compare access/secret/security-token
with the one saved in the disk.
- LDAP will support $username condition in policy documents.
Co-authored-by: Anis Elleuch <anis@min.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
.Reset() documentation states:
For a Timer created with NewTimer, Reset should be invoked only on stopped
or expired timers with drained channels.
This change is just to comply with this requirement as there might be some
runtime dependent situation that might lead to unexpected behavior.
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD
this PR fixes all the usage comprehensively
anything that is stuck on the disk today can cause latency
spikes for all incoming S3 I/O, we need to have this
de-coupled so that we can make sure that latency in loading
credentials are not reflected back to the S3 API calls.
The approach this PR takes is by checking if the calls were
updated just in case when the IAM load was in progress,
so that we can use merge instead of "replacement" to avoid
missing state.
The test expects from DeleteFile to return errDiskNotFound when the disk
is not available. It calls os.RemoveAll() to remove one disk after XL storage
initialization. However, this latter contains some goroutines which can
race with os.RemoveAll() and then the test fails sporadically with
returning random errors.
The commit will tweak the initialization routine of the XL storage to
only run deletion of temporary and metacache data in the background,
so TestXLStorageDeleteFile won't fail anymore.
currently, we allowed buckets to be listed from the
API call if and when the user has ListObject()
permission at the global level, this is okay to be
extended to GetBucketLocation() as well since
GetBucketLocation() is a "read" call and allowing "reads"
on a bucket has an implicit assumption that ListBuckets()
should be allowed.
This makes discoverability of access for read-only users
becomes easier or users with specific restrictions on their
policies.
This PR simplifies few things by splitting
the locks between audit, logger targets to
avoid potential contention between them.
any failures inside audit/logger HTTP
targets must only log to console instead
of other targets to avoid cyclical dependency.
avoids unneeded atomic variables instead
uses RWLock to differentiate a more common
read phase v/s lock phase.
- This change renames the OPA integration as Access Management Plugin - there is
nothing specific to OPA in the integration, it is just a webhook.
- OPA configuration is automatically migrated to Access Management Plugin and
OPA specific configuration is marked as deprecated.
- OPA doc is updated and moved.
In case of multi-pools setup, GetObjectNInfo returns a GetObjectReader
but it unlocks the read lock when quitting GetObjectNInfo. This should
not happen, unlock should only happen when GetObjectReader is closed.
- do not need to restrict prefix exclusions that do not
have `/` as suffix, relax this requirement as spark may
have staging folders with other autogenerated characters
, so we are better off doing full prefix March and skip.
- multiple delete objects was incorrectly creating a
null delete marker on a versioned bucket instead of
creating a proper versioned delete marker.
- do not suspend paths on the excluded prefixes during
delete operations to avoid creating `null` delete markers,
honor suspension of versioning only at bucket level for
delete markers.
PR #14828 introduced prefix-level exclusion of versioning
and replication - however our site replication implementation
since it defaults versioning on all buckets did not allow
changing versioning configuration once the bucket was created.
This PR changes this and ensures that such changes are honored
and also propagated/healed across sites appropriately.
Spark/Hadoop workloads which use Hadoop MR
Committer v1/v2 algorithm upload objects to a
temporary prefix in a bucket. These objects are
'renamed' to a different prefix on Job commit.
Object storage admins are forced to configure
separate ILM policies to expire these objects
and their versions to reclaim space.
Our solution:
This can be avoided by simply marking objects
under these prefixes to be excluded from versioning,
as shown below. Consequently, these objects are
excluded from replication, and don't require ILM
policies to prune unnecessary versions.
- MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Status>Enabled</Status>
<ExcludeFolders>true</ExcludeFolders>
<ExcludedPrefixes>
<Prefix>app1-jobs/*/_temporary/</Prefix>
</ExcludedPrefixes>
<ExcludedPrefixes>
<Prefix>app2-jobs/*/__magic/</Prefix>
</ExcludedPrefixes>
<!-- .. up to 10 prefixes in all -->
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket
from versioning. This is required to prevent the parent
folders from accumulating delete markers, especially
those which are shared across spark workloads
spanning projects/teams.
- To enable version exclusion on a list of prefixes
```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
when the site is being removed is missing replication config. This can
happen when a new deployment is brought in place of a site that
is lost/destroyed and needs to delink old deployment from site
replication.
console logging peer API was broken as it would
timeout after 15minutes, this never really worked
beyond this value and basically failed to provide
the streaming "log" functionality that was expected
from this implementation.
also fix convoluted channel handling by keeping things
simple, this is rewritten.
do not modify opts.UserDefined after object-handler
has set all the necessary values, any mutation needed
should be done on a copy of this value not directly.
As there are other pieces of code that access opts.UserDefined
concurrently this becomes challenging.
fixes#14856
When a decommission task is successfully completed, failed, or canceled,
this commit allows restarting the decommission again. Restarting is not
allowed when there is an ongoing decommission task.
this PR introduces a few changes such as
- sessionPolicyName is not reused in an extracted manner
to apply policies for incoming authenticated calls,
instead uses a different key to designate this
information for the callers.
- this differentiation is needed to ensure that service
account updates do not accidentally store JSON representation
instead of base64 equivalent on the disk.
- relax requirements for Deleting a service account, allow
deleting a service account that might be unreadable, i.e
a situation where the user might have removed session policy
which now carries a JSON representation, making it unparsable.
- introduce some constants to reuse instead of strings.
fixes#14784
If an invalid status code is generated from an error we risk panicking. Even if there
are no potential problems at the moment we should prevent this in the future.
Add safeguards against this.
Sample trace:
```
May 02 06:41:39 minio[52806]: panic: "GET /20180401230655.PDF": invalid WriteHeader code 0
May 02 06:41:39 minio[52806]: goroutine 16040430822 [running]:
May 02 06:41:39 minio[52806]: runtime/debug.Stack(0xc01fff7c20, 0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/debug/stack.go:24 +0x9f
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1(0xc022048800, 0x4f38ab0, 0xc0406e0fc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/generic-handlers.go:469 +0x85
May 02 06:41:39 minio[52806]: panic(0x25c4b00, 0xc0490e4080)
May 02 06:41:39 minio[52806]: runtime/panic.go:965 +0x1b9
May 02 06:41:39 minio[52806]: net/http.checkWriteHeaderCode(...)
May 02 06:41:39 minio[52806]: net/http/server.go:1092
May 02 06:41:39 minio[52806]: net/http.(*response).WriteHeader(0xc0406e0fc0, 0x0)
May 02 06:41:39 minio[52806]: net/http/server.go:1126 +0x718
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3ea0, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc032fa3f40, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger.(*ResponseWriter).WriteHeader(0xc002ce8000, 0x0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/internal/logger/audit.go:116 +0xb1
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeResponse(0x4f364a0, 0xc002ce8000, 0x0, 0xc0443b86c0, 0x1cb, 0x224, 0x2a9651e, 0xf)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:736 +0x18d
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.writeErrorResponse(0x4f44218, 0xc069086ae0, 0x4f364a0, 0xc002ce8000, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656afc0)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/api-response.go:798 +0x306
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd.objectAPIHandlers.getObjectHandler(0x4b73768, 0x4b73730, 0x4f44218, 0xc069086ae0, 0x4f82090, 0xc002d80620, 0xc040e03885, 0xe, 0xc040e03894, 0x61, ...)
May 02 06:41:39 minio[52806]: github.com/minio/minio/cmd/object-handlers.go:456 +0x252c
```
space characters at the beginning or at the end can lead to
confusion under various UI elements in differentiating the
actual name of "policy, user or group" - to avoid this behavior
this PR onwards we shall reject such inputs for newer entries.
existing saved entries will behave as is and are going to be
operable until they are removed/renamed to something more
meaningful.
- When using multiple providers, claim-based providers are not allowed. All
providers must use role policies.
- Update markdown config to allow `details` HTML element
this is allowed as long as order is preserved as is
on an existing setup, the new command line is updated
in `pool.bin` to facilitate future decommission's on
these pools.
introduce x-minio-force-create environment variable
to force create a bucket and its metadata as required,
it is useful in some situations when bucket metadata
needs recovery.
improvements in this PR include
- decommission objects that have __XLDIR__ suffix
- decommission objects that have `null` version on
a versioned bucket.
- make sure to look for any "decom" failures to ensure
that we do not wrong conclude decom as complete without
all files getting copied over.
- break out eagerly upon first error for objects with
multiple versions, leave the object as is for support
debugging and analysis.
heal bucket metadata and IAM entries for
sites participating in site replication from
the site with the most updated entry.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <aditya@minio.io>
The site replication status call was using a loop iteration variable sent
directly into go-routines instead of being passed as an argument. As the
variable is being updated in the loop, previously launched go routines do not
necessarily use the value at the time they were launched.
This PR fixes two issues
- The first fix is a regression from #14555, the fix itself in #14555
is correct but the interpretation of that information by the
object layer code for "replication" was not correct. This PR
tries to fix this situation by making sure the "Delete" replication
works as expected when "VersionPurgeStatus" is already set.
Without this fix, there is a DELETE marker created incorrectly on
the source where the "DELETE" was triggered.
- The second fix is perhaps an older problem started since we inlined-data
on the disk for small objects, CopyObject() incorrectly inline's
a non-inlined data. This is due to the fact that we have code where
we read the `part.1` under certain conditions where the size of the
`part.1` is less than the specific "threshold".
This eventually causes problems when we are "deleting" the data that
is only inlined, which means dataDir is ignored leaving such
dataDir on the disk, that looks like an inconsistent content on
the namespace.
fixes#14767
It is wasteful to allow parallel upgrades of MinIO server. This also generates
weird error invoked by selfupdate module when it happens such as:
'rename /opt/bin/.minio.old /opt/bin/..minio.old.old'
currently filterPefix was never used and set
that would filter out entries when needed
when `prefix` doesn't end with `/` - this
often leads to objects getting Walked(), Healed()
that were never requested by the caller.
without this wait there is a potential for some objects
that are in actively being decommissioned would cancel,
however the decommission status might wrongly conclude
this as "Complete".
To avoid this make sure to add waitgroups on the parallel
workers, allowing parallel copies to complete fully before
we return.
In previous releases, mc admin user list would return the list of users
that have policies mapped in IAM database. However, this was removed but
this commit will bring it back until we revamp this.
- This change switches to a new parquet library
- SelectObjectContent now takes a single lock at the beginning and holds it
during the operation. Previously the operation took a lock every time the
parquet library performed a Seek on the underlying object stream.
- Add basic support for LogicalType annotations for timestamps.
Execute the object, drive and net speedtests as part of the healthinfo
(if requested by the client), and include their result in the response.
The options for the speedtests have been picked from the default values
used by `mc support perf` command.
This commit improves the listing of encrypted objects:
- Use `etag.Format` and `etag.Decrypt`
- Detect SSE-S3 single-part objects in a single iteration
- Fix batch size to `250`
- Pass request context to `DecryptAll` to not waste resources
when a client cancels the operation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds two new functions to the
internal `etag` package:
- `ETag.Format`
- `Decrypt`
The `Decrypt` function decrypts an encrypted
ETag using a decryption key. It returns not
encrypted / multipart ETags unmodified.
The `Decrypt` function is mainly used when
handling SSE-S3 encrypted single-part objects.
In particular, the ETag of an SSE-S3 encrypted
single-part object needs to be decrypted since
S3 clients expect that this ETag is equal to the
content MD5.
The `ETag.Format` method also covers SSE ETag handling.
MinIO encrypts all ETags of SSE single part objects.
However, only the ETag of SSE-S3 encrypted single part
objects needs to be decrypted.
The ETag of an SSE-C or SSE-KMS single part object
does not correspond to its content MD5 and can be
a random value.
The `ETag.Format` function formats an ETag such that
it is an AWS S3 compliant ETag. In particular, it
returns non-encrypted ETags (single / multipart)
unmodified. However, for encrypted ETags it returns
the trailing 16 bytes as ETag. For encrypted ETags
the last 16 bytes will be a random value.
The main purpose of `Format` is to format ETags
such that clients accept them as well-formed AWS S3
ETags.
It differs from the `String` method since `String`
will return string representations for encrypted
ETags that are not AWS S3 compliant.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
The deployment id was being written to the health report towards the end
of the handler. Because of this, if there was a timeout in any of the
data fetching, the deployment id was not getting written at all. Upload
of such reports fails on SUBNET as deployment id is the unique
identifier for a cluster in subnet.
Fixed by writing the deployment id at the beginning of the processing.
This commit simplifies the ETag decryption and size adjustment
when listing object parts.
When listing object parts, MinIO has to decrypt the ETag of all
parts if and only if the object resp. the parts is encrypted using
SSE-S3.
In case of SSE-KMS and SSE-C, MinIO returns a pseudo-random ETag.
This is inline with AWS S3 behavior.
Further, MinIO has to adjust the size of all encrypted parts due to
the encryption overhead.
The ListObjectParts does specifically not use the KMS bulk decryption
API (4d2fc530d0) since the ETags of all
parts are encrypted using the same object encryption key. Therefore,
MinIO only has to connect to the KMS once, even if there are multiple
parts resp. ETags. It can simply reuse the same object encryption key.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit adds support for encrypted KES
client private keys.
Now, it is possible to encrypt the KES client
private key (`MINIO_KMS_KES_KEY_FILE`) with
a password.
For example, KES CLI already supports the
creation of encrypted private keys:
```
kes identity new --encrypt --key client.key --cert client.crt MinIO
```
To decrypt an encrypted private key, the password
needs to be provided:
```
MINIO_KMS_KES_KEY_PASSWORD=<password>
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
This commit optimises the ETag decryption when
listing objects.
When MinIO lists objects, it has to decrypt the
ETags of single-part SSE-S3 objects.
It does not need to decrypt ETags of
- plaintext objects => Their ETag is not encrypted
- SSE-C objects => Their ETag is not the content MD5
- SSE-KMS objects => Their ETag is not the content MD5
- multipart objects => Their ETag is not encrypted
Hence, MinIO only needs to make a call to the KMS
when it needs to decrypt a single-part SSE-S3 object.
It can resolve the ETags off all other object types
locally.
This commit implements the above semantics by
processing an object listing in batches.
If the batch contains no single-part SSE-S3 object,
then no KMS calls will be made.
If the batch contains at least one single-part
SSE-S3 object we have to make at least one KMS call.
No we first filter all single-part SSE-S3 objects
such that we only request the decryption keys for
these objects.
Once we know which objects resp. ETags require a
decryption key, MinIO either uses the KES bulk
decryption API (if supported) or decrypts each
ETag serially.
This commit is a significant improvement compared
to the previous listing code. Before, a single
non-SSE-S3 object caused MinIO to fall-back to
a serial ETag decryption.
For example, if a batch consisted of 249 SSE-S3
objects and one single SSE-KMS object, MinIO would
send 249 requests to the KMS.
Now, MinIO will send a single request for exactly
those 249 objects and skip the one SSE-KMS object
since it can handle its ETag locally.
Further, MinIO would request decryption keys
for SSE-S3 multipart objects in the past - even
though multipart ETags are not encrypted.
So, if a bucket contained only multipart SSE-S3
objects, MinIO would make totally unnecessary
requests to the KMS.
Now, MinIO simply skips these multipart objects
since it can handle the ETags locally.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
In bulk ETag decryption, do not rely on the etag to check if it is
encrypted or not to decide if we should set the actual object size in
ObjectInfo. The reason is that multipart objects ETags are not
encrypted.
Always get the actual object size in that case.
This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
ListObjects, ListObjectsV2 calls are being heavily taxed when
there are many versions on objects left over from a previous
release or ILM was never setup to clean them up. Instead
of being absolutely correct at resolving the exact latest
version of an object, we simply rely on the top most 1
version and resolve the rest.
Once we have obtained the top most "1" version for
ListObject, ListObjectsV2 call we break out.
For ListObjects and ListObjectsV2 perform lifecycle checks on
all objects before returning. This will filter out objects that are
pending lifecycle expiration.
Bonus: Cheaper server pool conflict resolution by not converting to FileInfo.
When reloading a dynamic config allow the request pool to scale both ways.
Existing requests hold on to the previous pool, so they will pop the elements from that.
currently an on-going decommission, during a server
restart might block the startup sequence for relatively
longer periods, instead start the decommission in
background lazily.
This commit fixes two bugs in the `PutObjectPartHandler`.
First, `PutObjectPart` should return SSE-KMS headers
when the object is encrypted using SSE-KMS.
Before, this was not the case.
Second, the ETag should always be a 16 byte hex string,
perhaps followed by a `-X` (where `X` is the number of parts).
However, `PutObjectPart` used to return the encrypted ETag
in case of SSE-KMS. This leaks MinIO internal etag details
through the S3 API.
The combination of both bugs causes clients that use SSE-KMS
to fail when trying to validate the ETag. Since `PutObjectPart`
did not send the SSE-KMS response headers, the response looked
like a plaintext `PutObjectPart` response. Hence, the client
tries to verify that the ETag is the content-md5 of the part.
This could never be the case, since MinIO used to return the
encrypted ETag.
Therefore, clients behaving as specified by the S3 protocol
tried to verify the ETag in a situation they should not.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
When more than 2 disks are unavailable for listing, the same disk will be used for fallback.
This makes quorum calculations incorrect since the same disk will have multiple entries.
This PR keeps track of which fallback disks have been handed out and only every returns a disk once.
avoids creating new transport for each `isServerResolvable`
request, instead re-use the available global transport and do
not try to forcibly close connections to avoid TIME_WAIT
build upon large clusters.
Never use httpClient.CloseIdleConnections() since that can have
a drastic effect on existing connections on the transport pool.
Remove it everywhere.
- GetObject() with vid should return 405
- GetObject() without vid should return 404
- ListObjects() should ignore this object if this is the "latest" version of the object
- ListObjectVersions() should list this object as "DELETE marker"
- Remove data parts before sync'ing the version pending purge
changing root credentials makes service accounts
in-operable, this PR changes the way sessionToken
is generated for service accounts.
It changes service account behavior to generate
sessionToken claims from its own secret instead
of using global root credential.
Existing credentials will be supported by
falling back to verify using root credential.
fixes#14530
```
tmp = buf[want:]
```
Would potentially crash when `buf` is truncated for some reason
and does not have the expected bytes, this is of course considered
not normal and is an odd situation. But we do not need to crash
here instead allow for errors to be returned and let callers handle
the errors.
This PR simply adds a warning message when it detects older kernel
versions and warn's them about potential performance issues on this
kernel.
The issue can be seen only with parallel I/O across all drives
on denser setups such as 90 drives or 45 drives per server configurations.
This type of code is not necessary, read's of all
metadata content at `.minio.sys/config` automatically
triggers healing when necessary in the GetObjectNInfo()
call-path.
Having this code is not useful and this also adds to
the overall startup time of MinIO when there are lots
of users and policies.
The main goal of this PR is to solve the situation where disks stop
responding to operations. This generally causes an FD build-up and
eventually will crash the server.
This adds detection of hung disks, where calls on disk get stuck.
We add functionality to `xlStorageDiskIDCheck` where it keeps
track of the number of concurrent requests on a given disk.
A total number of 100 operations are allowed. If this limit is reached
we will block (but not reject) new requests, but we will monitor the
state of the disk.
If no requests have been completed or updated within a 15-second
window, we mark the disk as offline. Requests that are blocked will be
unblocked and return an error as "faulty disk".
New requests will be rejected until the disk is marked OK again.
Once a disk has been marked faulty, a check will run every 5 seconds that
will attempt to write and read back a file. As long as this fails the disk will
remain faulty.
To prevent lots of long-running requests to mark the disk faulty we
implement a callback feature that allows updating the status as parts
of these operations are running.
We add a reader and writer wrapper that will update the status of each
successful read/write operation. This should allow fine enough granularity
that a slow, but still operational disk will not reach 15 seconds where
50 operations have not progressed.
Note that errors themselves are not enough to mark a disk faulty.
A nil (or io.EOF) error will mark a disk as "good".
* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker
The purpose of IsOnline() is to ensure that we
reconnect the drive only when the "drive" was
- disconnected from network we need to validate
if the drive is "correct" and is the same drive
which belongs to this server.
- drive was replaced we have to format it - we
support hot swapping of the drives.
IsOnline() is not meant for taking the drive offline
when it is hung, it is not useful we can let the
drive be online instead "return" errors for relevant
calls.
* return errFaultyDisk for DiskInfo() call
Co-authored-by: Harshavardhana <harsha@minio.io>
Possible future Improvements:
* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.
Data usage does not always contain tiering info even if the data usage
information is valid. Avoid a crash in that case.
(e.g. the scanner scanned the namespace, the user enables tiering,
prometheus scrapes the server before the scanner gets a chance to
update the data usage with new tiering information)
Healing decisions would align with skipped folder counters. This can lead to files
never being selected for heal checks on "clean" paths.
Use different hashing methods and take objectHealProbDiv into account when
calculating the cycle.
Found by @vadmeste
This is a side-affect of the optimization done in PR #13544 which
causes a certain type of delete operations on given object versions
can cause lastVersion indication to be skipped, which leads to
an `xl.meta` where Versions[] slice is empty while the entire
file is intact by itself.
This PR tries to ensure that such files are visible and deletable
by regular means of listing as null 'delete-marker' and also
avoid the situation where this potential issue might arise.
When scanning using normal mode, HealObject() can report an
error saying that it found a corrupted part. This doesn't have
when HealObject() is called with bitrot scan flag. However, when
this happens, we can still restart HealObject() with the bitrot scan.
This is also important because this means the scanner and the
new disks healer will not be able to heal an object that doesn't
exist in a specific disk and has corruption in another disk.
Also without this PR, mc admin heal command without bitrot will report
an error.
This commit removes some duplicate code that
converts KES API errors.
This code was added since KES `0.18.0` changed
some exported API errors. However, the KES SDK
handles this error conversion itself.
Therefore, it is not necessary to duplicate this
behavior in MinIO.
See: 21555fa624/error.go (L94)
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
In a distributed setup, a DiskInfo REST call to an unformatted disk
returns an error with no disk information, such as the disk endpoint
URL, which is unexpected.
metadata headers can have headers without values
as per AWS S3 spec however, we need to skip some
headers that do not have values that potentially
can have empty values set.
small setups do not return appropriate errors when speedtest
cannot run on small tiny setups, allow the tests to fail
appropriately more pro-actively.
many users bring toy setups, this PR simply returns an error
in such situations.
healing disks take active I/O it is possible
that deleted objects might stay in .trash
folder for a really long time until the drive
is fully healed.
this PR changes it such that we are making sure
we purge the active content written to these
disks as well.
- speedtest logs calls that were canceled
spuriously, in situations where it should
be ignored.
- all errors of interest are always sent back
to the client there is no need to log them
on the server console.
- PUT failures should negate the increments
such that GET is not attempted on unsuccessful
calls.
- do not attempt MRF on speedtest objects.
In the testing mode, reformatting disks will fail because the healing
code will complain if one disk is in root mode. This commit will
automatically set all disks as non-root if MINIO_CI_CD is set.
Currently, when applying any dynamic config, the system reloads and
re-applies the config of all the dynamic sub-systems.
This PR refactors the code in such a way that changing config of a given
dynamic sub-system will work on only that sub-system.
An onlineDisk means its a valid disk but it may be a
re-connected disk, this PR verifies that based on LastConn()
to only trigger MRF. Current code would again re-load the
disk 'format.json' which is not necessary and perhaps an
unnecessary call.
A potential side affect of this is closing perfectly online
disks and getting re-replaced by reloading 'format.json'.
This PR tries to avoid this situation by making sure MRF
is triggered but not reloading 'format.json' because of MRF.
Only the first `listAndHeal` would ever be able to write on errCh, blocking all others infinitely.
Instead read all errors but return the first non-nil, if any.
The intention appears to be that this should cancel on any error,
so that part is kept.
Regression from #13990
When more than one gateway reads and writes from the same mount point
and there is a load balancer pointing to those gateways. Each gateway
will try to create its own temporary append file but fails to clear it later
when not needed.
This commit creates a routine that checks all upload IDs saved in
multipart directory and remove any stale entry with the same upload id
in the memory and in the temporary background append folder as well.
Enabled with `mc admin config set alias/ api gzip_objects=on`
Standard filtering applies (1K response minimum, not compressed content
type, not range request, gzip accepted by client).
The current code considers a pool with all root disks to be as part
of a testing environment even if there are other pools with mounted
disks. This will result to illegitimate writing in root disks.
Fix this by simplifing the logic: require MINIO_CI_CD in order to skip
root disk check.
MinIO configuration is loaded after the initialization of the server
handlers, which will miss the initialization of the bucket forwarder
handler.
Though the federation is deprecated, let's fix this for the time being.
S3 spec returns x-amz-restore header in HEAD/GET object with the
following format:
```
x-amz-restore: ongoing-request="false", expiry-date="Fri, 21 Dec 2012
00:00:00 GMT"
```
This commit adds quotes as the current code does not support it. It will
also supports the old format saved in the disk (in xl.meta) for backward
compatibility.
A regression removed support of federation in the gateway mode.
Enable it again.
Federation is deprecated for a while but let's fix this for the time being.
Deleting bulk objects had an issue since the relevant versionID
is not passed through the layers to ensure that the dangling
object purge actually works cleanly.
This is a continuation of quorum related error returned by
multi-object delete API from #14248
This PR ensures that we pass down correct information as
well as extend the scope of dangling object detection.
When setting a config of a particular sub-system, validate the existing
config and notification targets of only that sub-system, so that
existing errors related to one sub-system (e.g. notification target
offline) do not result in errors for other sub-systems.
Some users running MinIO claim that their system became slow. One
way to investigate is to look at this Prometheus history of the number of
the requests reaching the server. The existing current S3 requests metric
is not enough because it can increase of the system really becomes slow,
due to disk issues for example.
startup speed-up, currently getFormatErasureInQuorum()
would spend up to 2-3secs when there are 3000+ drives
for example in a setup, simplify this implementation
to use drive counts.
DeleteMarkers do not have a default quorum, i.e it is possible that
DeleteMarkers were created with n/2+1 quorum as well to make sure
that we satisfy situations such as those we need to make sure delete
markers only expect n/2 read quorum.
Additionally we should also look at additional metadata on the
actual objects that might have been "erasure" upgraded with new
parity when disks are down.
In such a scenario do not default to the standard storage class
parity, instead use the parityBlocks present on the FileInfo to
ensure that we are dealing with the correct quorum for READs and
DELETEs.
Retry listings, when no next marker is returned and the result isn't truncated.
This can happen when an object is queued, but no info can be fetched.
Fixes#14190
The healing code repeatedly tries to heal a root disk when it is empty
the reason is that connectEndpoint() returns errUnformattedDisk even
if the disk is a root disk. Changing that to returning another error
will avoid queueing the disk to the healing code in each connect disks
iteration.
some upgraded objects might not get listed due
to different quorum ratios across objects.
make sure to list all objects that satisfy the
maximum possible quorum.
This change allows the MinIO server to lookup users in different directory
sub-trees by allowing specification of multiple search bases separated by
semicolons.
This PR removes an unnecessary state that gets
passed around for DiskIDs, which is not necessary
since each disk exactly knows which pool and which
set it belongs to on a running system.
Currently cached DiskId's won't work properly
because it always ends up skipping offline disks
and never runs healing when disks are offline, as
it expects all the cached diskIDs to be present
always. This also sort of made things in-flexible
in terms perhaps a new diskID for `format.json`.
(however this is not a big issue)
This is an unnecessary requirement that healing
via scanner needs all drives to be online, instead
healing should trigger even when partial nodes
and drives are available this ensures that we
keep the SLA in-tact on the objects when disks
are offline for a prolonged period of time.
Wrong resource is being fetched, since idx is incremented, but mapID is reused.
Regression caused by #13454 - that part didn't optimize anything anyway.
Publish storage functions latency to help compare the performance
of different disks in a single deployment.
e.g.:
```
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/1",server="localhost:9001"} 226
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/2",server="localhost:9002"} 1180
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/3",server="localhost:9003"} 1183
minio_node_disk_latency_us{api="storage.WalkDir",disk="/tmp/xl/4",server="localhost:9004"} 1625
```
- create internal erasure volumes only if the disk is unformatted
- return a copy of format data in xlStorage.ReadAll
- parse env vars only once, to be re-used by xl-storage
This speed-up is intended for faster startup times
for almost all MinIO operations. Changes here are
- Drives are not re-read for 'format.json' on a regular
basis once read during init is remembered and refreshed
at 5 second intervals.
- Do not do O_DIRECT tests on drives with existing 'format.json'
only fresh setups need this check.
- Parallelize initializing erasureSets for multiple sets.
- Avoid re-reading format.json when migrating 'format.json'
from really old V1->V2->V3
- Keep a copy of local drives for any given server in memory
for a quick lookup.
this helps in caching the resolved values early on, avoids
causing further resolution for individual nodes when
object layer comes online.
this can speed up our startup time during, upgrades etc by
an order of magnitude.
additional changes in connectLoadInitFormats() and parallelize
all calls that might be potentially blocking.
- Site replication was missing replicating users,
groups when an empty site was added.
- Add site replication for groups and users when they
are disabled and enabled.
- Add support for replicating bucket quota config.
When calculating signatures empty part ETags were not discarded, leading
to a different signature compared to freshly created ones.
This would mean that after a heal signature of the healed metadata would be
different. Fixing the calculation of signature will make these consistent.
Furthermore when inconsistent entries, with zero version ID, with the same
mod times but different signatures, the one with the lowest signature would
be picked for quorum check. Since this is 50/50, we fall back to a simple
quorum count on all signatures.
Each of these fixes by themselves will lead to quorum. Tests were added
for regressions and expected outcomes.
When the replication rule is based on tag matches, the replication process
should pick up targets matching the tags specified in the replication
rule.
Fixing regression due to #12880
repeated reads on single large objects in HPC like
workloads, need the following option to disable
O_DIRECT for a more effective usage of the kernel
page-cache.
However this optional should be used in very specific
situations only, and shouldn't be enabled on all
servers.
NVMe servers benefit always from keeping O_DIRECT on.
map labels might have been referenced else, this
can lead to concurrent access at lower layers.
avoid this by copying the information while
concurrently serving the metrics.
do not allow mutation to pool command line when there are
unfinished decommissions in place, disallow such scenarios
to avoid user mistakes.
also add testcases to cover all relevant scenarios.
When reading input for PutObject or PutObjectPart add a readahead buffer for big inputs.
This will make network reads+hashing separate run async with erasure coding and writes. This will reduce overall latency in distributed setups where the input is from upstream and writes go to other servers.
We will read at 2 buffers ahead, meaning one will always be ready/waiting and one is currently being read from.
This improves PutObject and PutObjectParts for these cases.
When deleting multiple versions it "gives" up with an errFileVersionNotFound if
a version cannot be found. This effectively skips deleting other versions
sent in the same request.
This can happen on inconsistent objects. We should ignore errFileVersionNotFound
and continue with others.
We already ignore these at the caller level, this PR is continuation of 54a9877
This PR simplifies few things
- Multipart parts are renamed, upon failure are unrenamed() keep this
multipart specific behavior it is needed and works fine.
- AbortMultipart should blindly delete once lock is acquired instead
of re-reading metadata and calculating quorum, abort is a delete()
operation and client has no business looking for errors on this.
- Skip Access() calls to folders that are operating on
`.minio.sys/multipart` folder as well.
Large clusters with multiple sets, or multi-pool setups at times might
fail and report unexpected "file not found" errors. This can become
a problem during startup sequence when some files need to be created
at multiple locations.
- This PR ensures that we nil the erasure writers such that they
are skipped in RenameData() call.
- RenameData() doesn't need to "Access()" calls for `.minio.sys`
folders they always exist.
- Make sure PutObject() never returns ObjectNotFound{} for any
errors, make sure it always returns "WriteQuorum" when renameData()
fails with ObjectNotFound{}. Return appropriate errors for all
other cases.
Currently tag removal leaves replication state as `PENDING`
because the `HEAD` api returns just a tag count but not the
actual tags, and this is treated as a no-op
```
λ mc admin decommission start alias/ http://minio{1...2}/data{1...4}
```
```
λ mc admin decommission status alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Active │
│ 2nd │ http://minio{3...4}/data{1...4} │ 329 GiB (used) / 421 GiB (total) │ Active │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────┘
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
Progress: ===================> [1GiB/sec] [15%] [4TiB/50TiB]
Time Remaining: 4 hours (started 3 hours ago)
```
```
λ mc admin decommission status alias/ http://minio{1...2}/data{1...4}
ERROR: This pool is not scheduled for decommissioning currently.
```
```
λ mc admin decommission cancel alias/
┌─────┬─────────────────────────────────┬──────────────────────────────────┬──────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining │
└─────┴─────────────────────────────────┴──────────────────────────────────┴──────────┘
```
> NOTE: Canceled decommission will not make the pool active again, since we might have
> Potentially partial duplicate content on the other pools, to avoid this scenario be
> very sure to start decommissioning as a planned activity.
```
λ mc admin decommission cancel alias/ http://minio{1...2}/data{1...4}
┌─────┬─────────────────────────────────┬──────────────────────────────────┬────────────────────┐
│ ID │ Pools │ Capacity │ Status │
│ 1st │ http://minio{1...2}/data{1...4} │ 439 GiB (used) / 561 GiB (total) │ Draining(Canceled) │
└─────┴─────────────────────────────────┴──────────────────────────────────┴────────────────────┘
```
In a multi-pool setup when disks are coming up, or in a single pool
setup let's say with 100's of erasure sets with a slow network.
It's possible when healing is attempted on `.minio.sys/config`
folder, it can lead to healing unexpectedly deleting some policy
files as dangling due to a mistake in understanding when `isObjectDangling`
is considered to be 'true'.
This issue happened in commit 30135eed86
when we assumed the validMeta with empty ErasureInfo is considered
to be fully dangling. This implementation issue gets exposed when
the server is starting up.
This is most easily seen with multiple-pool setups because of the
disconnected fashion pools that come up. The decision to purge the
object as dangling is taken incorrectly prior to the correct state
being achieved on each pool, when the corresponding drive let's say
returns 'errDiskNotFound', a 'delete' is triggered. At this point,
the 'drive' comes online because this is part of the startup sequence
as drives can come online lazily.
This kind of situation exists because we allow (totalDisks/2) number
of drives to be online when the server is being restarted.
Implementation made an incorrect assumption here leading to policies
getting deleted.
Added tests to capture the implementation requirements.
- This allows site-replication to be configured when using OpenID or the
internal IDentity Provider.
- Internal IDP IAM users and groups will now be replicated to all members of the
set of replicated sites.
- When using OpenID as the external identity provider, STS and service accounts
are replicated.
- Currently this change dis-allows root service accounts from being
replicated (TODO: discuss security implications).
It is possible that GetLock() call remembers a previously
failed releaseAll() when there are networking issues, now
this state can have potential side effects.
This PR tries to avoid this side affect by making sure
to initialize NewNSLock() for each GetLock() attempts
made to avoid any prior state in the memory that can
interfere with the new lock grants.
The current usage of assuming `default` parity of `4` is not correct
for all objects stored on MinIO, objects in .minio.sys have maximum
parity, healing won't trigger on these objects due to incorrect
verification of quorum.
The AddUser() API endpoint was accepting a policy field.
This API is used to update a user's secret key and account
status, and allows a regular user to update their own secret key.
The policy update is also applied though does not appear to
be used by any existing client-side functionality.
This fix changes the accepted request body type and removes
the ability to apply policy changes as that is possible via the
policy set API.
NOTE: Changing passwords can be disabled as a workaround
for this issue by adding an explicit "Deny" rule to disable the API
for users.
This PR is an attempt to make this configurable
as not all situations have same level of tolerable
delta, i.e disks are replaced days apart or even
hours.
There is also a possibility that nodes have drifted
in time, when NTP is not configured on the system.
data shards were wrong due to a healing bug
reported in #13803 mainly with unaligned object
sizes.
This PR is an attempt to automatically avoid
these shards, with available information about
the `xl.meta` and actually disk mtime.
- When using MinIO's internal IDP, STS credential permissions did not check the
groups of a user.
- Also fix bug in policy checking in AccountInfo call
Also log all the missed events and logs instead of silently
swallowing the events.
Bonus: Extend the logger webhook to support mTLS
similar to audit webhook target.
- r.ulock was not locked when r.UsageCache was being modified
Bonus:
- simplify code by removing some unnecessary clone methods - we can
do this because go arrays are values (not pointers/references) that are
automatically copied on assignment.
- remove some unnecessary map allocation calls
data-structures were repeatedly initialized
this causes GC pressure, instead re-use the
collectors.
Initialize collectors in `init()`, also make
sure to honor the cache semantics for performance
requirements.
Avoid a global map and a global lock for metrics
lookup instead let them all be lock-free unless
the cache is being invalidated.
When STS credentials are created for a user, a unique (hopefully stable) parent
user value exists for the credential, which corresponds to the user for whom the
credentials are created. The access policy is mapped to this parent-user and is
persisted. This helps ensure that all STS credentials of a user have the same
policy assignment at all times.
Before this change, for an OIDC STS credential, when the policy claim changes in
the provider (when not using RoleARNs), the change would not take effect on
existing credentials, but only on new ones.
To support existing STS credentials without parent-user policy mappings, we
lookup the policy in the policy claim value. This behavior should be deprecated
when such support is no longer required, as it can still lead to stale
policy mappings.
Additionally this change also simplifies the implementation for all non-RoleARN
STS credentials. Specifically, for AssumeRole (internal IDP) STS credentials,
policies are picked up from the parent user's policies; for
AssumeRoleWithCertificate STS credentials, policies are picked up from the
parent user mapping created when the STS credential is generated.
AssumeRoleWithLDAP already picks up policies mapped to the virtual parent user.
A corner case can occur where the delete-marker was propagated
but the metadata could not be updated on the primary. Sending
a RemoveObject call with the Delete marker version would end
up permanently deleting the version on target. Instead, perform
a Stat on the delete-marker version on target and redo replication
only if the delete-marker is missing on target.
After the introduction of Refresh logic in locks, the data scanner can
quit when the data scanner lock is not able to get refreshed. In that
case, the context of the data scanner will get canceled and
runDataScanner() will quit. Another server would pick the scanning
routine but after some time, all nodes can just have all scanning
routine aborted, as described above.
This fix will just run the data scanner in a loop.
- Allow proper SRError to be propagated to
handlers and converted appropriately.
- Make sure to enable object locking on buckets
when requested in MakeBucketHook.
- When DNSConfig is enabled attempt to delete it
first before deleting buckets locally.
- Rename MaxNoncurrentVersions tag to NewerNoncurrentVersions
Note: We apply overlapping NewerNoncurrentVersions rules such that
we honor the highest among applicable limits. e.g if 2 overlapping rules
are configured with 2 and 3 noncurrent versions to be retained, we
will retain 3.
- Expire newer noncurrent versions after noncurrent days
- MinIO extension: allow noncurrent days to be zero, allowing expiry
of noncurrent version as soon as more than configured
NewerNoncurrentVersions are present.
- Allow NewerNoncurrentVersions rules on object-locked buckets
- No x-amz-expiration when NewerNoncurrentVersions configured
- ComputeAction should skip rules with NewerNoncurrentVersions > 0
- Add unit tests for lifecycle.ComputeAction
- Support lifecycle rules with MaxNoncurrentVersions
- Extend ExpectedExpiryTime to work with zero days
- Fix all-time comparisons to be relative to UTC
- This introduces a new admin API with a query parameter (v=2) to return a
response with the timestamps
- Older API still works for compatibility/smooth transition in console
- allow any regular user to change their own password
- allow STS credentials to create users if permissions allow
Bonus: do not allow changes to sts/service account credentials (via add user API)
ListObjects() should never list a delete-marked folder
if latest is delete marker and delimiter is not provided.
ListObjectVersions() should list a delete-marked folder
even if latest is delete marker and delimiter is not
provided.
Enhance further versioning listing on the buckets
request.Form uses less memory allocation and avoids gorilla mux matching
with weird characters in parameters such as '\n'
- Remove Queries() to avoid matching
- Ensure r.ParseForm is called to populate fields
- Add a unit test for object names with '\n'
delete marked objects should not be considered
for listing when listing is delimited, this issue
as introduced in PR #13804 which was mainly to
address listing of directories in listing when
delimited.
This PR fixes this properly and adds tests to
ensure that we behave in accordance with how
an S3 API behaves for ListObjects() without
versions.
Save part.1 for writebacks in a separate folder
and move it to cache dir atomically while saving
the cache metadata. This is to avoid GC mistaking
part.1 as orphaned cache entries and purging them.
This PR also fixes object size being overwritten during
retries for write-back mode.
- deleting policies was deleting all LDAP
user mapping, this was a regression introduced
in #13567
- deleting of policies is properly sent across
all sites.
- remove unexpected errors instead embed the real
errors as part of the 500 error response.
- deleteBucket() should be called for cleanup
if client abruptly disconnects
- out of disk errors should be sent to client
properly and also cancel the calls
- limit concurrency to available MAXPROCS not
32 for auto-tuned setup, if procs are beyond
32 then continue normally. this is to handle
smaller setups.
fixes#13834
Return errors when untar fails at once.
Current error handling was quite a mess. Errors are written
to the stream, but processing continues.
Instead, return errors when they occur and transform
internal errors to bad request errors, since it is likely a
problem with the input.
Fixes#13832
Sometimes, we see an error message like "Server expects 'storage' API
version 'v41', instead found 'v41'" shows a more generic error message
with the path of the REST call.
The earlier approach of using a license token for
communicating with SUBNET is being replaced
with a simpler mechanism of API keys. Unlike the
license which is a JWT token, these API keys will
be simple UUID tokens and don't have any embedded
information in them. SUBNET would generate the
API key on cluster registration, and then it would
be saved in this config, to be used for subsequent
communication with SUBNET.
Following scenario such as objects that exist inside a
prefix say `folder/` must be included in the listObjects()
response.
```
2aa16073-387e-492c-9d59-b4b0b7b6997a v2 DEL folder/
a5b9ce68-7239-4921-90ab-20aed402c7a2 v1 PUT folder/
f2211798-0eeb-4d9e-9184-fcfeae27d069 v1 PUT folder/1.txt
```
Current master does not handle this scenario, because it
ignores the top level delete-marker on folders. This is
however unexpected. It is expected that list-objects returns
the top level prefix in this situation.
```
aws s3api list-objects --bucket harshavardhana --prefix unique/ \
--delimiter / --profile minio --endpoint-url http://localhost:9000
{
"CommonPrefixes": [
{
"Prefix": "unique/folder/"
}
]
}
```
There are applications in the wild such as Hadoop s3a connector
that exploit this behavior and expect the folder to be present
in the response.
This also makes the behavior consistent with AWS S3.
single object delete was not working properly
on a bucket when versioning was suspended,
current version 'null' object was never removed.
added unit tests to cover the behavior
fixes#13783
totalDrives reported in speedTest result were wrong
for multiple pools, this PR fixes this.
Bonus: add support for configurable storage-class, this
allows us to test REDUCED_REDUNDANCY to see further
maximum throughputs across the cluster.
- Allows setting a role policy parameter when configuring OIDC provider
- When role policy is set, the server prints a role ARN usable in STS API requests
- The given role policy is applied to STS API requests when the roleARN parameter is provided.
- Service accounts for role policy are also possible and work as expected.
- New sub-system has "region" and "name" fields.
- `region` subsystem is marked as deprecated, however still works, unless the
new region parameter under `site` is set - in this case, the region subsystem is
ignored. `region` subsystem is hidden from top-level help (i.e. from `mc admin
config set myminio`), but appears when specifically requested (i.e. with `mc
admin config set myminio region`).
- MINIO_REGION, MINIO_REGION_NAME are supported as legacy environment variables for server region.
- Adds MINIO_SITE_REGION as the current environment variable to configure the
server region and MINIO_SITE_NAME for the site name.
The index was converted directly from bytes to binary. This would fail a roundtrip through json.
This would result in `Error: invalid input: magic number mismatch` when reading back.
On non-erasure backends store index as base64.
The httpStreamResponse should not return until CloseWithError has been called.
Instead keep track of write state and skip writing/flushing if an error has occurred.
Fixes#13743
Regression from #13597 (not released)
Go's atomic.Value does not support `nil` type,
concrete type is necessary to avoid any panics with
the current implementation.
Also remove boolean to turn-off tracking of freezeCount.
an active running speedTest will reject all
new S3 requests to the server, until speedTest
is complete.
this is to ensure that speedTest results are
accurate and trusted.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Since JWT tokens remain valid for up to 15 minutes, we
don't have to regenerate tokens for every call.
Cache tokens for matching access+secret+audience
for up to 15 seconds.
```
BenchmarkAuthenticateNode/uncached-32 270567 4179 ns/op 2961 B/op 33 allocs/op
BenchmarkAuthenticateNode/cached-32 7684824 157.5 ns/op 48 B/op 1 allocs/op
```
Reduces internode call allocations a great deal.
FileInfo quorum shouldn't be passed down, instead
inferred after obtaining a maximally occurring FileInfo.
This PR also changes other functions that rely on
wrong quorum calculation.
Update tests as well to handle the proper requirement. All
these changes are needed when migrating from older deployments
where we used to set N/2 quorum for reads to EC:4 parity in
newer releases.
dataDir loosely based on maxima is incorrect and does not
work in all situations such as disks in the following order
- xl.json migration to xl.meta there may be partial xl.json's
leftover if some disks are not yet connected when the disk
is yet to come up, since xl.json mtime and xl.meta is
same the dataDir maxima doesn't work properly leading to
quorum issues.
- its also possible that XLV1 might be true among the disks
available, make sure to keep FileInfo based on common quorum
and skip unexpected disks with the older data format.
Also, this PR tests upgrade from older to a newer release if the
data is readable and matches the checksum.
NOTE: this is just initial work we can build on top of this to do further tests.
there is a corner case where the new check
doesn't work where dataDir has changed, especially
when xl.json -> xl.meta healing happens, if some
healing is partial this can make certain backend
files unreadable.
This PR fixes and updates unit-tests
This unit allows users to limit the maximum number of noncurrent
versions of an object.
To enable this rule you need the following *ilm.json*
```
cat >> ilm.json <<EOF
{
"Rules": [
{
"ID": "test-max-noncurrent",
"Status": "Enabled",
"Filter": {
"Prefix": "user-uploads/"
},
"NoncurrentVersionExpiration": {
"MaxNoncurrentVersions": 5
}
}
]
}
EOF
mc ilm import myminio/mybucket < ilm.json
```
currently getReplicationConfig() failure incorrectly
returns error on unexpected buckets upon upgrade, we
should always calculate usage as much as possible.
listing can fail and it is allowed to be retried,
instead of returning right away return an error at
the end - heal the rest of the buckets and objects,
and when we are retrying skip the buckets that
are already marked done by using the tracked buckets.
fixes#12972
- Go might reset the internal http.ResponseWriter() to `nil`
after Write() failure if the go-routine has returned, do not
flush() such scenarios and avoid spurious flushes() as
returning handlers always flush.
- fix some racy tests with the console
- avoid ticker leaks in certain situations
Existing:
```go
type xlMetaV2 struct {
Versions []xlMetaV2Version `json:"Versions" msg:"Versions"`
}
```
Serialized as regular MessagePack.
```go
//msgp:tuple xlMetaV2VersionHeader
type xlMetaV2VersionHeader struct {
VersionID [16]byte
ModTime int64
Type VersionType
Flags xlFlags
}
```
Serialize as streaming MessagePack, format:
```
int(headerVersion)
int(xlmetaVersion)
int(nVersions)
for each version {
binary blob, xlMetaV2VersionHeader, serialized
binary blob, xlMetaV2Version, serialized.
}
```
xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct
can easily be reused and does not contain pointers, so efficient as a
slice (single allocation)
This allows quickly parsing everything as slices of bytes (no copy).
Versions are always *saved* sorted by modTime, newest *first*.
No more need to sort on load.
* Allows checking if a version exists.
* Allows reading single version without unmarshal all.
* Allows reading latest version of type without unmarshal all.
* Allows reading latest version without unmarshal of all.
* Allows checking if the latest is deleteMarker by reading first entry.
* Allows adding/updating/deleting a version with only header deserialization.
* Reduces allocations on conversion to FileInfo(s).
This will help other projects like `health-analyzer` to verify that the
struct was indeed populated by the minio server, and is not
default-populated during unmarshalling of the JSON.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
legacy objects in 'xl.json' after upgrade, should have
following sequence of events - bucket should have versioning
enabled and the object should have been overwritten with
another version of an object.
this situation was not handled, which would lead to older
objects to stay perpetually with "legacy" dataDir, however
these objects were readable by all means - there weren't
converted to newer format.
This PR fixes this situation properly.
Add a new Prometheus metric for bucket replication latency
e.g.:
minio_bucket_replication_latency_ns{
bucket="testbucket",
operation="upload",
range="LESS_THAN_1_MiB",
server="127.0.0.1:9001",
targetArn="arn:minio:replication::45da043c-14f5-4da4-9316-aba5f77bf730:testbucket"} 2.2015663e+07
Co-authored-by: Klaus Post <klauspost@gmail.com>
container limits would not be properly honored in
our current implementation, mem.VirtualMemory()
function only reads /proc/meminfo which points to
the host system information inside the container.
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() use when needed
- remove all non-Go style comments from the codebase
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
- add checks such that swapped disks are detected
and ignored - never used for normal operations.
- implement `unrecognizedDisk` to be ignored with
all operations returning `errDiskNotFound`.
- also add checks such that we do not load unexpected
disks while connecting automatically.
- additionally humanize the values when printing the errors.
Bonus: fixes handling of non-quorum situations in
getLatestFileInfo(), that does not work when 2 drives
are down, currently this function would return errors
incorrectly.
creating service accounts is implicitly enabled
for all users, this PR however adds support to
reject creating service accounts, with an explicit
"Deny" policy.
On first list resume or when specifying a custom markers entries could be missed in rare cases.
Do conservative truncation of entries when forwarding.
Replaces #13619
If a given MinIO config is dynamic (can be changed without restart),
ensure that it can be reset also without restart.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
when `MINIO_CACHE_COMMIT` is set.
- `writeback` caching applies only to single
uploads. When cache commit mode is
`writeback`, default multipart caching to be
synchronous.
- Add writethrough caching for single uploads
This commit makes the MinIO server behavior more consistent
w.r.t. key usage verification.
When MinIO verifies the client certificates it also checks
that the client certificate is valid of client authentication
(or any (i.e. wildcard) usage).
However, the MinIO server used to not verify the client key usage
when client certificate verification was disabled.
Now, the MinIO server verifies the client key usage even when
client certificate verification has been disabled. This makes
the MinIO behavior more consistent from a client's perspective.
Now, a client certificate has to be valid for client authentication
in all cases.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus: if runs have PUT higher then capture it anyways
to display an unexpected result, which provides a way
to understand what might be slowing things down on the
system.
For example on a Data24 WDC setup it is clearly visible
there is a bug in the hardware.
```
./mc admin speedtest wdc/
⠧ Running speedtest (With 64 MiB object size, 32 concurrency) PUT: 31 GiB/s GET: 24 GiB/s
⠹ Running speedtest (With 64 MiB object size, 48 concurrency) PUT: 38 GiB/s GET: 24 GiB/s
MinIO 2021-11-04T06:08:33Z, 6 servers, 48 drives
PUT: 38 GiB/s, 605 objs/s
GET: 24 GiB/s, 383 objs/s
```
Reads are almost 14GiB/sec slower than Writes which
is practically not possible.
This reverts commit 091a7ae359.
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
Also adds fix for accidental canned policy removal that was present in the
reverted version of the change.
Windows users often click on the binary without
knowing MinIO is a command-line tool and should be
run from a terminal. Throw a message to guide them
on what to do.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Borrowed idea from Go's usage of this
optimization for ReadFrom() on client
side, we should re-use the 32k buffers
io.Copy() allocates for generic copy
from a reader to writer.
the performance increase for reads for
really tiny objects is at this range
after this change.
> * Fastest: +7.89% (+1.3 MiB/s) throughput, +7.89% (+1308.1) obj/s
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
- The race happens with a goroutine that refreshes IAM cache data from storage.
- It could lead to deleted users re-appearing as valid live credentials.
- This change also causes CI to run tests without a race flag (in addition to
running it with).
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
- combine similar looking functionalities into single
handlers, and remove unnecessary proxying of the
requests at handler layer.
- remove bucket forwarding handler as part of default setup
add it only if bucket federation is enabled.
Improvements observed for 1kiB object reads.
```
-------------------
Operation: GET
Operations: 4538555 -> 4595804
* Average: +1.26% (+0.2 MiB/s) throughput, +1.26% (+190.2) obj/s
* Fastest: +4.67% (+0.7 MiB/s) throughput, +4.67% (+739.8) obj/s
* 50% Median: +1.15% (+0.2 MiB/s) throughput, +1.15% (+173.9) obj/s
```
Removes RLock/RUnlock for updating metadata,
since we already take a write lock to update
metadata, this change removes reading of xl.meta
as well as an additional lock, the performance gain
should increase 3x theoretically for
- PutObjectRetention
- PutObjectLegalHold
This optimization is mainly for Veeam like
workloads that require a certain level of iops
from these API calls, we were losing iops.
read/writers are not concurrent in handlers
and self contained - no need to use atomics on
them.
avoids unnecessary contentions where it's not
required.
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.
Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
As we use etcd's watch interface, we do not need the
network notifications as they are no-ops anyway.
Bonus: Remove globalEtcdClient global usage in IAM
IAMSys is a higher-level object, that should not be called by the lower-level
storage API interface for IAM. This is to prepare for further improvements in
IAM code.
etcd operations, get/put/delete, should be logged when failed
with errors other than not found error. It will make it easier to
see connections issues from MinIO to etcd.
Refresh was doing a linear scan of all locked resources. This was adding
up to significant delays in locking on high load systems with long
running requests.
Add a secondary index for O(log(n)) UID -> resource lookups.
Multiple resources are stored in consecutive strings.
Bonus fixes:
* On multiple Unlock entries unlock the write locks we can.
* Fix `expireOldLocks` skipping checks on entry after expiring one.
* Return fast on canTakeUnlock/canTakeLock.
* Prealloc some places.
We do not reliably know the length of compressed data, including headers.
Request until the end-of-stream. Results will still be properly truncated.
Fixes#13441
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
offset+length should match the Size() of the individual parts
return 'errFileCorrupt' otherwise, to trigger healing of the individual
parts do not error out prematurely when healing such bitrot's upon
successful parts being written to the client.
another issue this PR fixes is to not return and error to
the client if we have just triggered a heal on a specific
part of the object, instead continue to read all the content
and let the heal happen asynchronously later.
Looks like policy restriction was not working properly
for normal users when they are not svc or STS accounts.
- svc accounts are now properly fixed to get
right permissions when its inherited, so
we do not have to set 'owner = true'
- sts accounts have always been using right
permissions, do not need an explicit lookup
- regular users always have proper policy mapping
- avoids relying in listQuorum from the underlying listObjects()
and potentially missing entries if any.
- avoid the entire merging logic etc, listing raw set by set
and loading whatever is found is cleaner when dealing with
a large cluster for IAM metadata.
* fix: disallow invalid x-amz-security-token for root credentials
fixes#13335
This was a regression added in #12947 when this part of the
code was refactored to avoid privilege issues with service
accounts with session policy.
Bonus:
- fix: AssumeRoleWithCertificate policy mapping and reload
AssumeRoleWithCertificate was not mapping to correct
policies even after successfully generating keys, since
the claims associated with this API were never looked up
properly. Ensure that policies are set appropriately.
- GetUser() API was not loading policies correctly based
on AccessKey based mapping which is true with OpenID
and AssumeRoleWithCertificate API.
with some broken clients allow non-strict validation
of sha256 when ContentLength > 0, it has been found in
the wild some applications that need this behavior. This
shall be only allowed if `--no-compat` is used.
change credentials handling such that
prefer MINIO_* envs first if they work,
if not fallback to AWS credentials. If
they fail we fail to start anyways.
This change allows a set of MinIO sites (clusters) to be configured
for mutual replication of all buckets (including bucket policies, tags,
object-lock configuration and bucket encryption), IAM policies,
LDAP service accounts and LDAP STS accounts.
In (erasureServerPools).MakeBucketWithLocation deletes the created
buckets if any set returns an error.
Add `NoRecreate` option, which will not recreate the bucket
in `DeleteBucket`, if the operation fails.
Additionally use context.Background() for operations we always want to be performed.
bucket deletes should purge entire bucket metadata
appropriately, use rename() to move the metadata files
to trash folder,
for dangling buckets instead of doing recursive deletes,
rename such buckets to trash folder as well.
Bonus: reduce retry duration for listing to 200ms
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
DeleteMarkers were unreadable if they had quorum based
guarantees, this PR tries to fix this behavior appropriately.
DeleteMarkers with sufficient should be allowed and the
return error should be accordingly with or without version-id.
This also allows for overwrites which may not be possible
in a multi-pool setup.
fixes#12787
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, ie. when r.Body
is slow , instead use a proper
binary exchange format, to marshal and unmarshal
the LockArgs datastructure in a cleaner way.
this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on same object.
```
BenchmarkLockArgs
BenchmarkLockArgs-4 6417609 185.7 ns/op 56 B/op 2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4 1187368 1015 ns/op 4096 B/op 1 allocs/op
```
This PR brings two optimizations mainly
for page-cache build-up and how to avoid
getting OOM killed in the process. Although
these memories are reclaimable Linux is not
fast enough to reclaim them as needed on a
very busy system. fadvise is a system call
implemented in Linux to advise page-cache to
avoid overload as we get significant amount
of requests on the server.
- FADV_SEQUENTIAL tells that all I/O from now
is going to be sequential, allowing for more
resposive throughput.
- FADV_NOREUSE tells kernel to start removing
things for this 'fd' from page-cache.
An endpoint can be empty when a disk is offline or something
wrong with it. Avoid it by filling erasureSets.endpointStrings
with values from arguments.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
This reverts commit 91567ba916.
Revert because the error was incorrectly converted, there are
callers that rely on errConfigNotFound and it also took away
the migration code.
Instead the correct fix is PutBucketTaggingHandler() which
is already added.
also remove HealObjects() code from dataScanner running another
listing from the data-scanner is super in-efficient and in-fact
this code is redundant since we already attempt to heal all
dangling objects anyways.
- Supports object locked buckets that require
PutObject() to set content-md5 always.
- Use SSE-S3 when S3 gateway is being used instead
of SSE-KMS for auto-encryption.
This PR however also proceeds to simplify the loading
of various subsystems such as
- globalNotificationSys
- globalTargetSys
converge them directly into single bucket metadata sys
loader, once that is loaded automatically every other
target should be loaded and configured properly.
fixes#13252
DeleteObject() on existing objects before `xl.json` to
`xl.meta` change were not working, not sure when this
regression was added. This PR fixes this properly.
Also this PR ensures that we perform rename of xl.json
to xl.meta only during "write" phase of the call i.e
either during Healing or PutObject() overwrites.
Also handles few other scenarios during migration where
`backendEncryptedFile` was missing deleteConfig() will
fail with `configNotFound` this case was not ignored,
which can lead to failure during upgrades.
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.
The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
This commit fixes an issue in the `AssumeRoleWithCertificate`
handler.
Before clients received an error when they send
a chain of X.509 certificates (their client certificate as
well as intermediate / root CAs).
Now, client can send a certificate chain and the server
will only consider non-CA / leaf certificates as possible
client certificate candidates. However, the client still
can only send one certificate.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
When unable to load existing metadata new versions
would not be written. This would leave objects in a
permanently unrecoverable state
Instead, start with clean metadata and write the incoming data.
Some identity providers like GitLab do not provide
information about group membership as part of the
identity token claims. They only expose it via OIDC compatible
'/oauth/userinfo' endpoint, as described in the OpenID
Connect 1.0 sepcification.
But this of course requires application to make sure to add
additional accessToken, since idToken cannot be re-used to
perform the same 'userinfo' call. This is why this is specialized
requirement. Gitlab seems to be the only OpenID vendor that requires
this support for the time being.
fixes#12367
Don't perform an independent evaluation of inlining, but mirror the decision made when uploading the object.
Leads to some objects being inlined or not based on new metrics. Instead respect previous decision.
Replication was not working properly for encrypted
objects in single PUT object for preserving etag,
We need to make sure to preserve etag such that replication
works properly and not gets into infinite loops of copying
due to ETag mismatches.
This will allow objects to relinquish read lock held during
replication earlier if the target is known to be down
without waiting for connection timeout when replication
is attempted.
Stop async listing if we have not heard back from the client for 3 minutes.
This will stop spending resources on async listings when they are unlikely to get used.
If the client returns a new listing will be started on the second request.
Stop saving cache metadata to disk. It is cleared on restarts anyway. Removes all
load/save functionality
This commit adds a new STS API for X.509 certificate
authentication.
A client can make an HTTP POST request over a TLS connection
and MinIO will verify the provided client certificate, map it to an
S3 policy and return temp. S3 credentials to the client.
So, this STS API allows clients to authenticate with X.509
certificates over TLS and obtain temp. S3 credentials.
For more details and examples refer to the docs/sts/tls.md
documentation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Azure storage SDK uses http.Request feature which panics when the
request contains r.Form popuplated.
Azure gateway code creates a new request, however it modifies the
transport to add our metrics code which sets Request.Form during
shouldMeterRequest() call.
This commit simplifies shouldMeterRequest() to avoid setting
request.Form and avoid the crash.
console service should be shutdown last once all shutdown
sequences are complete, this is to ensure that we do not
prematurely kill the server before it cleans up the
`.minio.sys/tmp/uuid` folder.
NOTE: this only applies to NAS gateway setup.
Use a single allocation for reading the file, not the growing buffer of `io.ReadAll`.
Reuse the write buffer if we can when writing metadata in RenameData.
* reduce extra getObjectInfo() calls during ILM transition
This PR also changes expiration logic to be non-blocking,
scanner is now free from additional costs incurred due
to slower object layer calls and hitting the drives.
* move verifying expiration inside locks
A multi resources lock is a single lock UID with multiple associated
resources. This is created for example by multi objects delete
operation. This commit changes the behavior of Refresh() to iterate over
all locks having the same UID and refresh them.
Bonus: Fix showing top locks for multi delete objects
#11878 added "keepHTTPResponseAlive" to CreateFile requests.
The problem is that it will begin writing to the response before the
body is read after 10 seconds. This will abort the writes on the
client-side, since it assumes the server has received what it wants.
The proposed solution here is to monitor the completion of the body
before beginning to send keepalive pings.
Fixes observed high number of goroutines stuck in `io.Copy` in
`github.com/minio/minio/cmd.(*xlStorage).CreateFile` and
`(*storageRESTClient).CreateFile` stuck in `http.DrainBody`.
In the event when a lock is not refreshed in the cluster, this latter
will be automatically removed in the subsequent cleanup of non
refreshed locks routine, but it forgot to clean the local server,
hence having the same weird stale locks present.
This commit will remove the lock locally also in remote nodes, if
removing a lock from a remote node will fail, it will be anyway
removed later in the locks cleanup routine.
Currently in master this can cause existing
parent users to stop working and lead to
credentials getting overwritten.
```
~ mc admin user add alias/ minio123 minio123456
```
```
~ mc admin user svcacct add alias/ minio123 \
--access-key minio123 --secret-key minio123456
```
This PR rejects all such scenarios.
Faster healing as well as making healing more
responsive for faster scanner times.
also fixes a bug introduced in #13079, newly replaced
disks were not healing automatically.
- remove sourceCh usage from healing
we already have tasks and resp channel
- use read locks to lookup globalHealConfig
- fix healing resolver to pick candidates quickly
that need healing, without this resolver was
unexpectedly skipping.
healObject() should be non-blocking to ensure
that scanner is not blocked for a long time,
this adversely affects performance of the scanner
and also affects the way usage is updated
subsequently.
This PR allows for a non-blocking behavior for
healing, dropping operations that cannot be queued
anymore.
Synchronize bucket cycles so it is much more
likely that the same prefixes will be picked up
for scanning.
Use the global bloom filter cycle for that.
Bump bloom filter versions to clear those.
The intention is to list values of sys config that can potentially
impact the performance of minio.
At present, it will return max value configured for rlimit
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
proceed to heal the cluster when all the
drives in a set have failed, this is extremely
rare occurrence but even if it happens we allow
the cluster to be functional.
A recent regression caused new disks not being re-formatted. In the old
code, a disk needed be 'online' to be chosen to be formatted but the
disk has to be already formatted for XL storage IsOnline() function to
return true.
It is enough to check if XL storage is nil or not if we want to avoid
formatting root disks.
Co-authored-by: Anis Elleuch <anis@min.io>
markRootDisksAsDown() relies on disk info even if the
disk is unformatted. Therefore, we should always return
DiskInfo data even when DiskInfo storage API returns
errUnformattedDisk
`mc admin heal` command will show servers/disks tolerance, for that
purpose, you need to know the number of parity disks for each storage
class.
Parity is always the same in all pools.
prefixes at top level create such as
```
~ mc mb alias/bucket/prefix
```
The prefix/ incorrect appears as prefix__XL_DIR__/
in the accountInfo output, make sure to trim '__XL_DIR__'
Objects uploaded in this format for example
```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```
Won't list the object, handle this scenario.
Ensure that one call will succeed and others will serialize
Example failure without code in place:
```
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:135: want 1 ok, got 0
```
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.
We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.
Single list operations should remain largely unaffected.
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1
handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.
added extra tests as well to cover all
these scenarios.
- deletes should always Sweep() for tiering at the
end and does not need an extra getObjectInfo() call
- puts, copy and multipart writes should conditionally
do getObjectInfo() when tiering targets are configured
- introduce 'TransitionedObject' struct for ease of usage
and understanding.
- multiple-pools optimization deletes don't need to hold
read locks verifying objects across namespace and pools.
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.
also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
The previous code removes SVC/STS accounts for ldap users that do not
exist anymore in LDAP server. This commit will actually re-evaluate
filter as well if it is changed and remove all local SVC/STS accounts
beloning to the ldap user if the latter is not eligible for the
search filter anymore.
For example: the filter selects enabled users among other criteras in
the LDAP database, if one ldap user changes his status to disabled
later, then associated SVC/STS accounts will be removed because that user
does not meet the filter search anymore.
Traffic metering was not protected against concurrent updates.
```
WARNING: DATA RACE
Read at 0x00c02b0dace8 by goroutine 235:
github.com/minio/minio/cmd.setHTTPStatsHandler.func1()
d:/minio/minio/cmd/generic-handlers.go:360 +0x27d
net/http.HandlerFunc.ServeHTTP()
...
Previous write at 0x00c02b0dace8 by goroutine 994:
github.com/minio/minio/internal/http/stats.(*IncomingTrafficMeter).Read()
d:/minio/minio/internal/http/stats/http-traffic-recorder.go:34 +0xd2
```
The intention is to provide status of any sys services that can
potentially impact the performance of minio.
At present, it will return information about the `selinux` service
(not-installed/disabled/permissive/enforcing)
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
This happens because of a change added where any sub-credential
with parentUser == rootCredential i.e (MINIO_ROOT_USER) will
always be an owner, you cannot generate credentials with lower
session policy to restrict their access.
This doesn't affect user service accounts created with regular
users, LDAP or OpenID
Use `readMetadata` when reading version
information without data requested.
Reduces IO on inlined data.
Bonus: Inline compressed data as well when
compression is enabled.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
removes unexpected features from regular putObject() such as
- increasing parity when disks are down, avoids
a lot of DiskInfo() calls.
- triggering MRF for metacache objects
if disks are offline
- avoiding renames from temporary location
to actual namespace, not needed since
metacache files are unique.
Disregard interfaces that are down when selecting bind addresses
Windows often has a number of disabled NICs used for VPN and other services.
This often causes minio to select an address for contacting the console that is on a disabled (virtual) NIC.
This checks if the interface is up before adding it to the pool on Windows.
Before, the gateway will complain that it found KMS configured in the
environment but the gateway mode does not support encryption. This
commit will allow starting of the gateway but ensure that S3 operations
with encryption headers will fail when the gateway doesn't support
encryption. That way, the user can use etcd + KMS and have IAM data
encrypted in the etcd store.
Co-authored-by: Anis Elleuch <anis@min.io>
improvements include
- skip IPv6 correctly
- do not set default value for
MINIO_SERVER_URL, let it be
configured if not use local IPs
Bonus:
- In healing return error from listPathRaw()
- update console to v0.8.3
Remote caches were not returned correctly, so they would not get updated on save.
Furthermore make some tweaks for more reliable updates.
Invalidate bloom filter to ensure rescan.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
its possible that some multipart uploads would have
uploaded only single parts so relying on `len(o.Parts)`
alone is not sufficient, we need to look for ETag
pattern to be absolutely sure.
Some incorrect setups might have multiple audiences
where they are trying to use a single authentication
endpoint for multiple services.
Nevertheless OpenID spec allows it to make it
even more confusin for no good reason.
> It MUST contain the OAuth 2.0 client_id of the
> Relying Party as an audience value. It MAY also
> contain identifiers for other audiences. In the
> general case, the aud value is an array of case
> sensitive strings. In the common special case
> when there is one audience, the aud value MAY
> be a single case sensitive string.
fixes#12809
this healing optimization caused multiple
regressions in healing
- delete-markers incorrectly missing
heal and returning incorrect healing
results to client.
- missing individual 'parts' such
as for restored object or simply
for all objects just missing few parts.
This optimization is not necessary, we
should proceed to verify all cases possible
not just when metadata is inconsistent.
destination path and old path will be similar
when healing occurs, this can lead to healed
parts being again purged leading to always an
inconsistent state on an object which might
further cause reduction in quorum eventually.
delete-markers missing on drives were
not healed due to few things
disksWithAllParts() does not know-how
to deal with delete markers, add support
for that.
fixes#12787
- delete-markers are incorrectly reported
as corrupt with wrong data sent to client
'mc admin heal -r' on objects with delete
marker will report as 'grey' incorrectly.
- do not heal delete-markers during HeadObject()
this can lead to inconsistent order of heals
on the object, although this is not an issue
in terms of order of versions it is rather
simpler to keep the same order on all drives.
- defaultHealResult() should handle 'err == nil'
case such that valid cases should be handled
as 'drive' status OK.
- remove use of getOnlineDisks() instead rely on fallbackDisks()
when disk return errors like diskNotFound, unformattedDisk
use other fallback disks to list from, instead of paying the
price for checking getOnlineDisks()
- optimize getDiskID() further to avoid large write locks when
looking formatLastCheck time window
This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
When configured in Lookup Bind mode, the server now periodically queries the
LDAP IDP service to find changes to a user's group memberships, and saves this
info to update the access policies for all temporary and service account
credentials belonging to LDAP users.
Add a new goroutine file which has another printing format. We need it
to see how much time each goroutine was blocked. Easier to detect stops.
Co-authored-by: Anis Elleuch <anis@min.io>
when TLS is configured using IPs directly
might interfere and not work properly when
the server is configured with TLS certs but
the certs only have domain certs.
Also additionally allow users to specify
a public accessible URL for console to talk
to MinIO i.e `MINIO_SERVER_URL` this would
allow them to use an external ingress domain
to talk to MinIO. This internally fixes few
problems such as presigned URL generation on
the console UI etc.
This needs to be done additionally for any
MinIO deployments that might have a much more
stricter requirement when running in standalone
mode such as FS or standalone erasure code.
This method is used to add expected expiration and transition time
for an object in GET/HEAD Object response headers.
Also fixed bugs in lifecycle.PredictTransitionTime and
getLifecycleTransitionTier in handling current and
non-current versions.
This allows remote bucket admin to identify the origin of transitioned
objects by simply inspecting the object prefixes.
e.g let's take a remote tier TIER-1 pointing to a remote bucket (prefix)
testbucket/testprefix-1. The remote bucket admin can list all transitioned objects
from a MinIO deployment identified by '2e78e906-1c5d-4f94-8689-9df44cafde39' and
source bucket 'mybucket' like so,
```
$ ./mc ls -r minio-tier-target/testbucket/testprefix-1/2e78e906-1c5d-4f94-8689-9df44cafde39/mybucket/
[2021-07-12 17:15:50 PDT] 160B 48/fb/48fbc0e6-3a73-458b-9337-8e722c619ca4
[2021-07-12 16:58:46 PDT] 160B 7d/1c/7d1c96bd-031a-48d4-99ea-b1304e870830
```