Existing:
```go
type xlMetaV2 struct {
Versions []xlMetaV2Version `json:"Versions" msg:"Versions"`
}
```
Serialized as regular MessagePack.
```go
//msgp:tuple xlMetaV2VersionHeader
type xlMetaV2VersionHeader struct {
VersionID [16]byte
ModTime int64
Type VersionType
Flags xlFlags
}
```
Serialize as streaming MessagePack, format:
```
int(headerVersion)
int(xlmetaVersion)
int(nVersions)
for each version {
binary blob, xlMetaV2VersionHeader, serialized
binary blob, xlMetaV2Version, serialized.
}
```
xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct
can easily be reused and does not contain pointers, so efficient as a
slice (single allocation)
This allows quickly parsing everything as slices of bytes (no copy).
Versions are always *saved* sorted by modTime, newest *first*.
No more need to sort on load.
* Allows checking if a version exists.
* Allows reading single version without unmarshal all.
* Allows reading latest version of type without unmarshal all.
* Allows reading latest version without unmarshal of all.
* Allows checking if the latest is deleteMarker by reading first entry.
* Allows adding/updating/deleting a version with only header deserialization.
* Reduces allocations on conversion to FileInfo(s).
This will help other projects like `health-analyzer` to verify that the
struct was indeed populated by the minio server, and is not
default-populated during unmarshalling of the JSON.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
legacy objects in 'xl.json' after upgrade, should have
following sequence of events - bucket should have versioning
enabled and the object should have been overwritten with
another version of an object.
this situation was not handled, which would lead to older
objects to stay perpetually with "legacy" dataDir, however
these objects were readable by all means - there weren't
converted to newer format.
This PR fixes this situation properly.
Add a new Prometheus metric for bucket replication latency
e.g.:
minio_bucket_replication_latency_ns{
bucket="testbucket",
operation="upload",
range="LESS_THAN_1_MiB",
server="127.0.0.1:9001",
targetArn="arn:minio:replication::45da043c-14f5-4da4-9316-aba5f77bf730:testbucket"} 2.2015663e+07
Co-authored-by: Klaus Post <klauspost@gmail.com>
container limits would not be properly honored in
our current implementation, mem.VirtualMemory()
function only reads /proc/meminfo which points to
the host system information inside the container.
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() use when needed
- remove all non-Go style comments from the codebase
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
- add checks such that swapped disks are detected
and ignored - never used for normal operations.
- implement `unrecognizedDisk` to be ignored with
all operations returning `errDiskNotFound`.
- also add checks such that we do not load unexpected
disks while connecting automatically.
- additionally humanize the values when printing the errors.
Bonus: fixes handling of non-quorum situations in
getLatestFileInfo(), that does not work when 2 drives
are down, currently this function would return errors
incorrectly.
creating service accounts is implicitly enabled
for all users, this PR however adds support to
reject creating service accounts, with an explicit
"Deny" policy.
On first list resume or when specifying a custom markers entries could be missed in rare cases.
Do conservative truncation of entries when forwarding.
Replaces #13619
If a given MinIO config is dynamic (can be changed without restart),
ensure that it can be reset also without restart.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
when `MINIO_CACHE_COMMIT` is set.
- `writeback` caching applies only to single
uploads. When cache commit mode is
`writeback`, default multipart caching to be
synchronous.
- Add writethrough caching for single uploads
This commit makes the MinIO server behavior more consistent
w.r.t. key usage verification.
When MinIO verifies the client certificates it also checks
that the client certificate is valid of client authentication
(or any (i.e. wildcard) usage).
However, the MinIO server used to not verify the client key usage
when client certificate verification was disabled.
Now, the MinIO server verifies the client key usage even when
client certificate verification has been disabled. This makes
the MinIO behavior more consistent from a client's perspective.
Now, a client certificate has to be valid for client authentication
in all cases.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus: if runs have PUT higher then capture it anyways
to display an unexpected result, which provides a way
to understand what might be slowing things down on the
system.
For example on a Data24 WDC setup it is clearly visible
there is a bug in the hardware.
```
./mc admin speedtest wdc/
⠧ Running speedtest (With 64 MiB object size, 32 concurrency) PUT: 31 GiB/s GET: 24 GiB/s
⠹ Running speedtest (With 64 MiB object size, 48 concurrency) PUT: 38 GiB/s GET: 24 GiB/s
MinIO 2021-11-04T06:08:33Z, 6 servers, 48 drives
PUT: 38 GiB/s, 605 objs/s
GET: 24 GiB/s, 383 objs/s
```
Reads are almost 14GiB/sec slower than Writes which
is practically not possible.
This reverts commit 091a7ae359.
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
Also adds fix for accidental canned policy removal that was present in the
reverted version of the change.
Windows users often click on the binary without
knowing MinIO is a command-line tool and should be
run from a terminal. Throw a message to guide them
on what to do.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Borrowed idea from Go's usage of this
optimization for ReadFrom() on client
side, we should re-use the 32k buffers
io.Copy() allocates for generic copy
from a reader to writer.
the performance increase for reads for
really tiny objects is at this range
after this change.
> * Fastest: +7.89% (+1.3 MiB/s) throughput, +7.89% (+1308.1) obj/s
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
- The race happens with a goroutine that refreshes IAM cache data from storage.
- It could lead to deleted users re-appearing as valid live credentials.
- This change also causes CI to run tests without a race flag (in addition to
running it with).
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
- combine similar looking functionalities into single
handlers, and remove unnecessary proxying of the
requests at handler layer.
- remove bucket forwarding handler as part of default setup
add it only if bucket federation is enabled.
Improvements observed for 1kiB object reads.
```
-------------------
Operation: GET
Operations: 4538555 -> 4595804
* Average: +1.26% (+0.2 MiB/s) throughput, +1.26% (+190.2) obj/s
* Fastest: +4.67% (+0.7 MiB/s) throughput, +4.67% (+739.8) obj/s
* 50% Median: +1.15% (+0.2 MiB/s) throughput, +1.15% (+173.9) obj/s
```
Removes RLock/RUnlock for updating metadata,
since we already take a write lock to update
metadata, this change removes reading of xl.meta
as well as an additional lock, the performance gain
should increase 3x theoretically for
- PutObjectRetention
- PutObjectLegalHold
This optimization is mainly for Veeam like
workloads that require a certain level of iops
from these API calls, we were losing iops.
read/writers are not concurrent in handlers
and self contained - no need to use atomics on
them.
avoids unnecessary contentions where it's not
required.
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.
Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
As we use etcd's watch interface, we do not need the
network notifications as they are no-ops anyway.
Bonus: Remove globalEtcdClient global usage in IAM
IAMSys is a higher-level object, that should not be called by the lower-level
storage API interface for IAM. This is to prepare for further improvements in
IAM code.
etcd operations, get/put/delete, should be logged when failed
with errors other than not found error. It will make it easier to
see connections issues from MinIO to etcd.
Refresh was doing a linear scan of all locked resources. This was adding
up to significant delays in locking on high load systems with long
running requests.
Add a secondary index for O(log(n)) UID -> resource lookups.
Multiple resources are stored in consecutive strings.
Bonus fixes:
* On multiple Unlock entries unlock the write locks we can.
* Fix `expireOldLocks` skipping checks on entry after expiring one.
* Return fast on canTakeUnlock/canTakeLock.
* Prealloc some places.
We do not reliably know the length of compressed data, including headers.
Request until the end-of-stream. Results will still be properly truncated.
Fixes#13441
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
offset+length should match the Size() of the individual parts
return 'errFileCorrupt' otherwise, to trigger healing of the individual
parts do not error out prematurely when healing such bitrot's upon
successful parts being written to the client.
another issue this PR fixes is to not return and error to
the client if we have just triggered a heal on a specific
part of the object, instead continue to read all the content
and let the heal happen asynchronously later.
Looks like policy restriction was not working properly
for normal users when they are not svc or STS accounts.
- svc accounts are now properly fixed to get
right permissions when its inherited, so
we do not have to set 'owner = true'
- sts accounts have always been using right
permissions, do not need an explicit lookup
- regular users always have proper policy mapping
- avoids relying in listQuorum from the underlying listObjects()
and potentially missing entries if any.
- avoid the entire merging logic etc, listing raw set by set
and loading whatever is found is cleaner when dealing with
a large cluster for IAM metadata.
* fix: disallow invalid x-amz-security-token for root credentials
fixes#13335
This was a regression added in #12947 when this part of the
code was refactored to avoid privilege issues with service
accounts with session policy.
Bonus:
- fix: AssumeRoleWithCertificate policy mapping and reload
AssumeRoleWithCertificate was not mapping to correct
policies even after successfully generating keys, since
the claims associated with this API were never looked up
properly. Ensure that policies are set appropriately.
- GetUser() API was not loading policies correctly based
on AccessKey based mapping which is true with OpenID
and AssumeRoleWithCertificate API.
with some broken clients allow non-strict validation
of sha256 when ContentLength > 0, it has been found in
the wild some applications that need this behavior. This
shall be only allowed if `--no-compat` is used.
change credentials handling such that
prefer MINIO_* envs first if they work,
if not fallback to AWS credentials. If
they fail we fail to start anyways.
This change allows a set of MinIO sites (clusters) to be configured
for mutual replication of all buckets (including bucket policies, tags,
object-lock configuration and bucket encryption), IAM policies,
LDAP service accounts and LDAP STS accounts.
In (erasureServerPools).MakeBucketWithLocation deletes the created
buckets if any set returns an error.
Add `NoRecreate` option, which will not recreate the bucket
in `DeleteBucket`, if the operation fails.
Additionally use context.Background() for operations we always want to be performed.
bucket deletes should purge entire bucket metadata
appropriately, use rename() to move the metadata files
to trash folder,
for dangling buckets instead of doing recursive deletes,
rename such buckets to trash folder as well.
Bonus: reduce retry duration for listing to 200ms
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
DeleteMarkers were unreadable if they had quorum based
guarantees, this PR tries to fix this behavior appropriately.
DeleteMarkers with sufficient should be allowed and the
return error should be accordingly with or without version-id.
This also allows for overwrites which may not be possible
in a multi-pool setup.
fixes#12787
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, ie. when r.Body
is slow , instead use a proper
binary exchange format, to marshal and unmarshal
the LockArgs datastructure in a cleaner way.
this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on same object.
```
BenchmarkLockArgs
BenchmarkLockArgs-4 6417609 185.7 ns/op 56 B/op 2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4 1187368 1015 ns/op 4096 B/op 1 allocs/op
```
This PR brings two optimizations mainly
for page-cache build-up and how to avoid
getting OOM killed in the process. Although
these memories are reclaimable Linux is not
fast enough to reclaim them as needed on a
very busy system. fadvise is a system call
implemented in Linux to advise page-cache to
avoid overload as we get significant amount
of requests on the server.
- FADV_SEQUENTIAL tells that all I/O from now
is going to be sequential, allowing for more
resposive throughput.
- FADV_NOREUSE tells kernel to start removing
things for this 'fd' from page-cache.
An endpoint can be empty when a disk is offline or something
wrong with it. Avoid it by filling erasureSets.endpointStrings
with values from arguments.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
This reverts commit 91567ba916.
Revert because the error was incorrectly converted, there are
callers that rely on errConfigNotFound and it also took away
the migration code.
Instead the correct fix is PutBucketTaggingHandler() which
is already added.
also remove HealObjects() code from dataScanner running another
listing from the data-scanner is super in-efficient and in-fact
this code is redundant since we already attempt to heal all
dangling objects anyways.
- Supports object locked buckets that require
PutObject() to set content-md5 always.
- Use SSE-S3 when S3 gateway is being used instead
of SSE-KMS for auto-encryption.
This PR however also proceeds to simplify the loading
of various subsystems such as
- globalNotificationSys
- globalTargetSys
converge them directly into single bucket metadata sys
loader, once that is loaded automatically every other
target should be loaded and configured properly.
fixes#13252
DeleteObject() on existing objects before `xl.json` to
`xl.meta` change were not working, not sure when this
regression was added. This PR fixes this properly.
Also this PR ensures that we perform rename of xl.json
to xl.meta only during "write" phase of the call i.e
either during Healing or PutObject() overwrites.
Also handles few other scenarios during migration where
`backendEncryptedFile` was missing deleteConfig() will
fail with `configNotFound` this case was not ignored,
which can lead to failure during upgrades.
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.
The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
This commit fixes an issue in the `AssumeRoleWithCertificate`
handler.
Before clients received an error when they send
a chain of X.509 certificates (their client certificate as
well as intermediate / root CAs).
Now, client can send a certificate chain and the server
will only consider non-CA / leaf certificates as possible
client certificate candidates. However, the client still
can only send one certificate.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>