This commit adds support for bulk ETag
decryption for SSE-S3 encrypted objects.
If KES supports a bulk decryption API, then
MinIO will check whether its policy grants
access to this API. If so, MinIO will use
a bulk API call instead of sending encrypted
ETags serially to KES.
Note that MinIO will not use the KES bulk API
if its client certificate is an admin identity.
MinIO will process object listings in batches.
A batch has a configurable size that can be set
via `MINIO_KMS_KES_BULK_API_BATCH_SIZE=N`.
It defaults to `500`.
This env. variable is experimental and may be
renamed / removed in the future.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
- Updating KES dependency to v.0.18.0
- Fixing incompatibility issue when checking for errors during KES key creation
Signed-off-by: Lenin Alevski <alevsk.8772@gmail.com>
this helps in caching the resolved values early on, avoids
causing further resolution for individual nodes when
object layer comes online.
this can speed up our startup time during, upgrades etc by
an order of magnitude.
additional changes in connectLoadInitFormats() and parallelize
all calls that might be potentially blocking.
- Site replication was missing replicating users,
groups when an empty site was added.
- Add site replication for groups and users when they
are disabled and enabled.
- Add support for replicating bucket quota config.
When reading input for PutObject or PutObjectPart add a readahead buffer for big inputs.
This will make network reads+hashing separate run async with erasure coding and writes. This will reduce overall latency in distributed setups where the input is from upstream and writes go to other servers.
We will read at 2 buffers ahead, meaning one will always be ready/waiting and one is currently being read from.
This improves PutObject and PutObjectParts for these cases.
To avoid error message like:
```
go: warning: github.com/gomodule/redigo@v2.0.0+incompatible: retracted by module author: Old development version not maintained or published.
go: to switch to the latest unretracted version, run:
go get github.com/gomodule/redigo@latest
```
- This allows site-replication to be configured when using OpenID or the
internal IDentity Provider.
- Internal IDP IAM users and groups will now be replicated to all members of the
set of replicated sites.
- When using OpenID as the external identity provider, STS and service accounts
are replicated.
- Currently this change dis-allows root service accounts from being
replicated (TODO: discuss security implications).
The AddUser() API endpoint was accepting a policy field.
This API is used to update a user's secret key and account
status, and allows a regular user to update their own secret key.
The policy update is also applied though does not appear to
be used by any existing client-side functionality.
This fix changes the accepted request body type and removes
the ability to apply policy changes as that is possible via the
policy set API.
NOTE: Changing passwords can be disabled as a workaround
for this issue by adding an explicit "Deny" rule to disable the API
for users.
minisign v0.10.0 tool broke compatibility, that leads
to our library failing to parse the newer signatures.
This PR
fixes - https://github.com/minio/operator/issues/913
fixes - https://github.com/minio/minio/issues/13824
A workaround for users facing this problem is to unset
```
MINIO_UPDATE_MINISIGN_PUBKEY
```
or set it to `empty` string then signature verification
is skipped automatically.
totalDrives reported in speedTest result were wrong
for multiple pools, this PR fixes this.
Bonus: add support for configurable storage-class, this
allows us to test REDUCED_REDUNDANCY to see further
maximum throughputs across the cluster.
- Allows setting a role policy parameter when configuring OIDC provider
- When role policy is set, the server prints a role ARN usable in STS API requests
- The given role policy is applied to STS API requests when the roleARN parameter is provided.
- Service accounts for role policy are also possible and work as expected.
an active running speedTest will reject all
new S3 requests to the server, until speedTest
is complete.
this is to ensure that speedTest results are
accurate and trusted.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Existing:
```go
type xlMetaV2 struct {
Versions []xlMetaV2Version `json:"Versions" msg:"Versions"`
}
```
Serialized as regular MessagePack.
```go
//msgp:tuple xlMetaV2VersionHeader
type xlMetaV2VersionHeader struct {
VersionID [16]byte
ModTime int64
Type VersionType
Flags xlFlags
}
```
Serialize as streaming MessagePack, format:
```
int(headerVersion)
int(xlmetaVersion)
int(nVersions)
for each version {
binary blob, xlMetaV2VersionHeader, serialized
binary blob, xlMetaV2Version, serialized.
}
```
xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct
can easily be reused and does not contain pointers, so efficient as a
slice (single allocation)
This allows quickly parsing everything as slices of bytes (no copy).
Versions are always *saved* sorted by modTime, newest *first*.
No more need to sort on load.
* Allows checking if a version exists.
* Allows reading single version without unmarshal all.
* Allows reading latest version of type without unmarshal all.
* Allows reading latest version without unmarshal of all.
* Allows checking if the latest is deleteMarker by reading first entry.
* Allows adding/updating/deleting a version with only header deserialization.
* Reduces allocations on conversion to FileInfo(s).
This will help other projects like `health-analyzer` to verify that the
struct was indeed populated by the minio server, and is not
default-populated during unmarshalling of the JSON.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
This reverts commit 091a7ae359.
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
Also adds fix for accidental canned policy removal that was present in the
reverted version of the change.
Windows users often click on the binary without
knowing MinIO is a command-line tool and should be
run from a terminal. Throw a message to guide them
on what to do.
Co-authored-by: Klaus Post <klauspost@gmail.com>
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
We do not reliably know the length of compressed data, including headers.
Request until the end-of-stream. Results will still be properly truncated.
Fixes#13441
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
This change allows a set of MinIO sites (clusters) to be configured
for mutual replication of all buckets (including bucket policies, tags,
object-lock configuration and bucket encryption), IAM policies,
LDAP service accounts and LDAP STS accounts.
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
This will allow objects to relinquish read lock held during
replication earlier if the target is known to be down
without waiting for connection timeout when replication
is attempted.
The intention is to list values of sys config that can potentially
impact the performance of minio.
At present, it will return max value configured for rlimit
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
`mc admin heal` command will show servers/disks tolerance, for that
purpose, you need to know the number of parity disks for each storage
class.
Parity is always the same in all pools.
The intention is to provide status of any sys services that can
potentially impact the performance of minio.
At present, it will return information about the `selinux` service
(not-installed/disabled/permissive/enforcing)
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
improvements include
- skip IPv6 correctly
- do not set default value for
MINIO_SERVER_URL, let it be
configured if not use local IPs
Bonus:
- In healing return error from listPathRaw()
- update console to v0.8.3
Some incorrect setups might have multiple audiences
where they are trying to use a single authentication
endpoint for multiple services.
Nevertheless OpenID spec allows it to make it
even more confusin for no good reason.
> It MUST contain the OAuth 2.0 client_id of the
> Relying Party as an audience value. It MAY also
> contain identifiers for other audiences. In the
> general case, the aud value is an array of case
> sensitive strings. In the common special case
> when there is one audience, the aud value MAY
> be a single case sensitive string.
fixes#12809
This commit gathers MRF metrics from
all nodes in a cluster and return it to the caller. This will show information about the
number of objects in the MRF queues
waiting to be healed.
Ensure that hostnames / ip addresses are not printed in the subnet
health report. Anonymize them by replacing them with `servern` where `n`
represents the position of the server in the pool.
This is done by building a `host anonymizer` map that maps every
possible value containing the host e.g. host, host:port,
http://host:port, etc to the corresponding anonymized name and using
this map to replace the values at the time of health report generation.
A different logic is used to anonymize host names in the `procinfo`
data, as the host names are part of an ellipses pattern in the process
start command. Here we just replace the prefix/suffix of the ellipses
pattern with their hashes.
Gzip responses if appropriate, except GetObject requests.
List reponses has an almost 10:1 compression ratio with no
measurable slowdown (in fact it seems a bit faster).
Download files from *any* bucket/path as an encrypted zip file.
The key is included in the response but can be separated so zip
and the key doesn't have to be sent on the same channel.
Requires https://github.com/minio/pkg/pull/6
if object was uploaded with multipart. This is to ensure that
GetObject calls with partNumber in URI request parameters
have same behavior on source and replication target.
This feature also changes the default port where
the browser is running, now the port has moved
to 9001 and it can be configured with
```
--console-address ":9001"
```
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
This commit replaces the custom KES client implementation
with the KES SDK from https://github.com/minio/kes
The SDK supports multi-server client load-balancing and
requests retry out of the box. Therefore, this change reduces
the overall complexity within the MinIO server and there
is no need to maintain two separate client implementations.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
- Check ES server version by querying its API
- Minimum required version of ES is 5.x
- Add deprecation warnings for ES versions < 7.x
- Still works with 5.x and 6.x, but support to be removed at a later date.
Signed-off-by: Aditya Manthramurthy <aditya@minio.io>
https://github.com/minio/console takes over the functionality for the
future object browser development
Signed-off-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
This commit changes the config/IAM encryption
process. Instead of encrypting config data
(users, policies etc.) with the root credentials
MinIO now encrypts this data with a KMS - if configured.
Therefore, this PR moves the MinIO-KMS configuration (via
env. variables) to a "top-level" configuration.
The KMS configuration cannot be stored in the config file
since it is used to decrypt the config file in the first
place.
As a consequence, this commit also removes support for
Hashicorp Vault - which has been deprecated anyway.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
some SDKs might incorrectly send duplicate
entries for keys such as "conditions", Go
stdlib unmarshal for JSON does not support
duplicate keys - instead skips the first
duplicate and only preserves the last entry.
This can lead to issues where a policy JSON
while being valid might not properly apply
the required conditions, allowing situations
where POST policy JSON would end up allowing
uploads to unauthorized buckets and paths.
This PR fixes this properly.
```
mc admin info --json
```
provides these details, for now, we shall eventually
expose this at Prometheus level eventually.
Co-authored-by: Harshavardhana <harsha@minio.io>
Creating notification events for replica creation
is not particularly useful to send as the notification
event generated at source already includes replication
completion events.
For applications using replica cluster as failover, avoiding
duplicate notifications for replica event will allow seamless
failover.
store the cache in-memory instead of disks to avoid large
write amplifications for list heavy workloads, store in
memory instead and let it auto expire.
since we have changed our default envs to MINIO_ROOT_USER,
MINIO_ROOT_PASSWORD this was not supported by minio-go
credentials package, update minio-go to v7.0.10 for this
support. This also addresses few bugs related to users
had to specify AWS_ACCESS_KEY_ID as well to authenticate
with their S3 backend if they only used MINIO_ROOT_USER.
continuation of PR#11491 for multiple server pools and
bi-directional replication.
Moving proxying for GET/HEAD to handler level rather than
server pool layer as this was also causing incorrect proxying
of HEAD.
Also fixing metadata update on CopyObject - minio-go was not passing
source version ID in X-Amz-Copy-Source header
In PR #11165 due to incorrect proxying for 2
way replication even when the object was not
yet replicated
Additionally, fix metadata comparisons when
deciding to do full replication vs metadata copy.
fixes#11340
Synchronous replication can be enabled by setting the --sync
flag while adding a remote replication target.
This PR also adds proxying on GET/HEAD to another node in a
active-active replication setup in the event of a 404 on the current node.
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.
After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.
Benefits are seen from almost all objects from
1KiB - 32KiB
Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.
All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
additionally also configure http2 healthcheck
values to quickly detect unstable connections
and let them timeout.
also use single transport for proxying requests
Due to botched upstream renames of project repositories
and incomplete migration to go.mod support, our current
dependency version of `go.mod` had bugs i.e it was
using commits from master branch which didn't have
the required fixes present in release-3.4 branches
which leads to some rare bugs
https://github.com/etcd-io/etcd/pull/11477 provides
a workaround for now and we should migrate to this.
release-3.5 eventually claims to fix all of this
properly until then we cannot use /v3 import right now
Instead of having less/more fields inside a structure depending on the
platform (non-linux/linux), it would be better to have the same standard
definition in all platforms, and certain fields of the structure to be
populated or left unpopulated depending on the platform.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
`decryptObjectInfo` is a significant bottleneck when listing objects.
Reduce the allocations for a significant speedup.
https://github.com/minio/sio/pull/40
```
λ benchcmp before.txt after.txt
benchmark old ns/op new ns/op delta
Benchmark_decryptObjectInfo-32 24260928 808656 -96.67%
benchmark old MB/s new MB/s speedup
Benchmark_decryptObjectInfo-32 0.04 1.24 31.00x
benchmark old allocs new allocs delta
Benchmark_decryptObjectInfo-32 75112 48996 -34.77%
benchmark old bytes new bytes delta
Benchmark_decryptObjectInfo-32 287694772 4228076 -98.53%
```
* add NVMe drive info [model num, serial num, drive temp. etc.]
* Ignore fuse partitions
* Add the nvme logic only for linux
* Move smart/nvme structs to a separate file
Co-authored-by: wlan0 <sidharthamn@gmail.com>
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
Always check if the auto-generated code is still compatible with the
existing written code to avoid a possible forgetting or sometimes a non
intentional change.
As the bulk/recursive delete will require multiple connections to open at an instance,
The default open connections limit will be reached which results in the following error
```FATAL: sorry, too many clients already```
By setting the open connections to a reasonable value - `2`, We ensure that the max open connections
will not be exhausted and lie under bounds.
The queries are simple inserts/updates/deletes which is operational and sufficient with the
the maximum open connection limit is 2.
Fixes#10553
Allow user configuration for MaxOpenConnections
- Add methods to set/remove replication rules (poornas)
- fix: only SSE-C headers should be applied to destination (Harshavardhana)
- fix: avoid data race by copying the buffer (Harshavardhana)
- remove deprecated build badges (Harshavardhana)
- fix: handle readFull bug with certain readers (Harshavardhana)
- fix a typo in README.md (Julien K)
- lifecycle: Fix marshaling expiration date/days (Anis Elleuch)
- add replication-status, expiration headers (Harshavardhana)
- Return object's version id in StatObject(Anis Elleuch)
- display appropriate funcName with nested callers (Harshavardhana)
- allow KMS tests to be run in the CI/CD (Harshavardhana)
- fix: removing lifecycle properly (Harshavardhana)
- feat: Add ListenNotification API to listen for all events (Harshavardhana)
with the merge of https://github.com/etcd-io/etcd/pull/11823
etcd v3.5.0 will now have a properly imported versioned path
this fixes our pending migration to newer repo
- additionally upgrade to msgp@v1.1.2
- change StatModTime,StatSize fields as
simple Size/ModTime
- reduce 50000 entries per List batch to 10000
as client needs to wait too long to see the
first batch some times which is not desired
and it is worth we write the data as soon
as we have it.
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
At a customer setup with lots of concurrent calls
it can be observed that in newRetryTimer there
were lots of tiny alloations which are not
relinquished upon retries, in this codepath
we were only interested in re-using the timer
and use it wisely for each locker.
```
(pprof) top
Showing nodes accounting for 8.68TB, 97.02% of 8.95TB total
Dropped 1198 nodes (cum <= 0.04TB)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
5.95TB 66.50% 66.50% 5.95TB 66.50% time.NewTimer
1.16TB 13.02% 79.51% 1.16TB 13.02% github.com/ncw/directio.AlignedBlock
0.67TB 7.53% 87.04% 0.70TB 7.78% github.com/minio/minio/cmd.xlObjects.putObject
0.21TB 2.36% 89.40% 0.21TB 2.36% github.com/minio/minio/cmd.(*posix).Walk
0.19TB 2.08% 91.49% 0.27TB 2.99% os.statNolog
0.14TB 1.59% 93.08% 0.14TB 1.60% os.(*File).readdirnames
0.10TB 1.09% 94.17% 0.11TB 1.25% github.com/minio/minio/cmd.readDirN
0.10TB 1.07% 95.23% 0.10TB 1.07% syscall.ByteSliceFromString
0.09TB 1.03% 96.27% 0.09TB 1.03% strings.(*Builder).grow
0.07TB 0.75% 97.02% 0.07TB 0.75% path.(*lazybuf).append
```
This PR adds a new configuration parameter which allows readiness
check to respond within 10secs, this can be reduced to a lower value
if necessary using
```
mc admin config set api ready_deadline=5s
```
or
```
export MINIO_API_READY_DEADLINE=5s
```
S3 is now natively supported by B2 cloud storage provider
there is no reason to use specialized gateway for B2 anymore,
our current S3 gateway with caching would work with B2.
Resolves#8584
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented on either a
bucket+(1 prefix level) with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed cycle bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within cycle runs. Until cycle bloom filters have been
collected all paths are considered dirty.