Current code was relying on globalEndpoints as
the source of secondary truth to obtain
the missing endpoints list when the disk
is offline, this is problematic
- there is no way to know if the getDisks()
returned endpoints total is same as the
ones list of globalEndpoints and it
belongs to a particular set.
- there is no order guarantee as getDisks()
is ordered as per format.json, globalEndpoints
may not be, so potentially end up including
incorrect endpoints.
To fix this bring getEndpoints() just like getDisks()
to ensure that consistently ordered endpoints are
always available for us to ensure that returned values
are consistent with what each erasure set would observe.
Uploading files with names that could not be written to disk
would result in "reduce your request" errors returned.
Instead check explicitly for disallowed characters and reject
files with `Object name contains unsupported characters.`
At a customer setup with lots of concurrent calls
it can be observed that in newRetryTimer there
were lots of tiny alloations which are not
relinquished upon retries, in this codepath
we were only interested in re-using the timer
and use it wisely for each locker.
```
(pprof) top
Showing nodes accounting for 8.68TB, 97.02% of 8.95TB total
Dropped 1198 nodes (cum <= 0.04TB)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
5.95TB 66.50% 66.50% 5.95TB 66.50% time.NewTimer
1.16TB 13.02% 79.51% 1.16TB 13.02% github.com/ncw/directio.AlignedBlock
0.67TB 7.53% 87.04% 0.70TB 7.78% github.com/minio/minio/cmd.xlObjects.putObject
0.21TB 2.36% 89.40% 0.21TB 2.36% github.com/minio/minio/cmd.(*posix).Walk
0.19TB 2.08% 91.49% 0.27TB 2.99% os.statNolog
0.14TB 1.59% 93.08% 0.14TB 1.60% os.(*File).readdirnames
0.10TB 1.09% 94.17% 0.11TB 1.25% github.com/minio/minio/cmd.readDirN
0.10TB 1.07% 95.23% 0.10TB 1.07% syscall.ByteSliceFromString
0.09TB 1.03% 96.27% 0.09TB 1.03% strings.(*Builder).grow
0.07TB 0.75% 97.02% 0.07TB 0.75% path.(*lazybuf).append
```
For example `{1...17}/{1...52}` symmetrical
distribution of drives cannot be obtained
- Because 17 is a prime number
- Is not divisible by any pre-defined setCounts i.e
from 1 to 16
Manual healing (as background healing) creates a heal task with a
possiblity to override healing options, such as deep or normal mode.
Use a pointer type in heal opts so nil would mean use the default
healing options.
aws cli fails to set a bucket encryption configuration to MinIO server.
The reason is that aws cli does not send MD5-Content header. It seems
that MD5-Content is not required anymore.
This commit also returns Not Implemented header early to help mint tests
to ignore testing this API in gateway modes.
CopyObject was not correctly figuring out the correct
destination object location and would end up creating
duplicate objects on two different zones, reproduced
by doing encryption based key rotation.
Advantages avoids 100's of stats which are needed for each
upload operation in FS/NAS gateway mode when uploading a large
multipart object, dramatically increases performance for
multipart uploads by avoiding recursive calls.
For other gateway's simplifies the approach since
azure, gcs, hdfs gateway's don't capture any specific
metadata during upload which needs handler validation
for encryption/compression.
Erasure coding was already optimized, additionally
just avoids small allocations of large data structure.
Fixes#7206
GetDiskID() in storage rest client does not really issue a REST request
to the remote disk, but returns an in-memory value instead.
However, GetDiskID() should return an error when format.json is not
found or for other similar issues (unmounted disks, etc..)
GetDiskID() is only called when formatting disks and getting storage
informatio, hence this commit should not have a performance degradation.
Additionally also fix STS logs to filter out LDAP
password to be sent out in audit logs.
Bonus fix handle the reload of users properly by
making sure to preserve the newer users during the
reload to be not invalidated.
Fixes#9707Fixes#9644Fixes#9651
Bonus fixes in quota enforcement to use the
new datastructure and use timedValue to cache
a value/reload automatically avoids one less
global variable.
If the requested server is part of the set this will always read
from the local disk, even if the disk contains a parity shard.
In default setup there is a 50% chance that at least
one shard that otherwise would have been fetched remotely
will be read locally instead.
It basically trades RPC call overhead for reed-solomon.
On distributed localhost this seems to be fairly break-even,
with a very small gain in throughput and latency.
However on networked servers this should be a bigger
1MB objects, before:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 76257:
* Avg: 25ms 50%: 24ms 90%: 32ms 99%: 42ms Fastest: 7ms Slowest: 67ms
* First Byte: Average: 23ms, Median: 22ms, Best: 5ms, Worst: 65ms
Throughput:
* Average: 1213.68 MiB/s, 1272.63 obj/s (59.948s, starting 14:45:44 CEST)
```
After:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 78845:
* Avg: 24ms 50%: 24ms 90%: 31ms 99%: 39ms Fastest: 8ms Slowest: 62ms
* First Byte: Average: 22ms, Median: 21ms, Best: 6ms, Worst: 57ms
Throughput:
* Average: 1255.11 MiB/s, 1316.08 obj/s (59.938s, starting 14:43:58 CEST)
```
Bonus fix: Only ask for heal once on an object.
This value is requested on every upload when there are multiple zones.
Since this will result in an RPC call to every remote disk this scales
quite badly in a distributed setup. Load every 1second interval.
2 servers, localhost only. In large distributed setups much bigger
gains can be expected.
```
Operations: 21743 -> 22454
* Average: +3.28% (+0.0 MiB/s) throughput, +3.28% (+11.9) obj/s
* Fastest: +3.37% (+0.0 MiB/s) throughput, +3.37% (+13.0) obj/s
* 50% Median: +3.03% (+0.0 MiB/s) throughput, +3.03% (+11.2) obj/s
* Slowest: +8.03% (+0.0 MiB/s) throughput, +8.03% (+22.8) obj/s
```
For easy management of this a generic helper has been added.
some clients such as veeam expect the x-amz-meta to
be sent in lower cased form, while this does indeed
defeats the HTTP protocol contract it is harder to
change these applications, while these applications
get fixed appropriately in future.
x-amz-meta is usually sent in lowercased form
by AWS S3 and some applications like veeam
incorrectly end up relying on the case sensitivity
of the HTTP headers.
Bonus fixes
- Fix the iso8601 time format to keep it same as
AWS S3 response
- Increase maxObjectList to 50,000 and use
maxDeleteList as 10,000 whenever multi-object
deletes are needed.
No one really uses FS for large scale accounting
usage, neither we crawl in NAS gateway mode. It is
worthwhile to simply disable this feature as its
not useful for anyone.
Bonus disable bucket quota ops as well in, FS
and gateway mode
size calculation in crawler was using the real size
of the object instead of its actual size i.e either
a decrypted or uncompressed size.
this is needed to make sure all other accounting
such as bucket quota and mcs UI to display the
correct values.
This PR adds a new configuration parameter which allows readiness
check to respond within 10secs, this can be reduced to a lower value
if necessary using
```
mc admin config set api ready_deadline=5s
```
or
```
export MINIO_API_READY_DEADLINE=5s
```
net/http exposes ErrorLog but it is log.Logger
instance not an interface which can be overridden,
because of this reason the logging is interleaved
sometimes with TLS with messages like this on the
server
```
http: TLS handshake error from 139.178.70.188:63760: EOF
```
This is bit problematic for us as we need to have
consistent logging view for allow --json or --quiet
flags.
With this PR we ensure that this format is adhered to.
Groups information shall be now stored as part of the
credential data structure, this is a more idiomatic
way to support large LDAP groups.
Avoids the complication of setups where LDAP groups
can be in the range of 150+ which may lead to excess
HTTP header size > 8KiB, to reduce such an occurrence
we shall save the group information on the server as
part of the credential data structure.
Bonus change support multiple mapped policies, across
all types of users.
This PR is a continuation from #9586, now the
entire parsing logic is fully merged into
bucket metadata sub-system, simplify the
quota API further by reducing the remove
quota handler implementation.
Shuffling arguments that we pass to MinIO server are supported. However,
when that happens, Prometheus returns wrong information about disks usage
and online/offline status.
The commit fixes the issue by avoiding relying on xl.endpoints since
it is not ordered.
this is a major overhaul by migrating off all
bucket metadata related configs into a single
object '.metadata.bin' this allows us for faster
bootups across 1000's of buckets and as well
as keeps the code simple enough for future
work and additions.
Additionally also fixes#9396, #9394
To avoid this issue with refCounter refactor the code
such that
- locker() always increases refCount upon success
- unlocker() always decrements refCount upon success
(as a special case removes the resource if the
refCount is zero)
By these two assumptions we are able to see that we
are never granted two write lockers in any situation.
Thanks to @vcabbage for writing a nice reproducer.
enable linter using golangci-lint across
codebase to run a bunch of linters together,
we shall enable new linters as we fix more
things the codebase.
This PR fixes the first stage of this
cleanup.
There is a disparency of behavior under Linux & Windows about
the returned error when trying to rename a non existant path.
err := os.Rename("/path/does/not/exist", "/tmp/copy")
Linux:
isSysErrNotDir(err) = false
os.IsNotExist(err) = true
Windows:
isSysErrNotDir(err) = true
os.IsNotExist(err) = true
ENOTDIR in Linux is returned when the destination path
of the rename call contains a file in one of the middle
segments of the path (e.g. /tmp/file/dst, where /tmp/file
is an actual file not a directory)
However, as shown above, Windows has more scenarios when
it returns ENOTDIR. For example, when the source path contains
an inexistant directory in its path.
In that case, we want errFileNotFound returned and not
errFileAccessDenied, so this commit will add a further check to close
the disparency between Windows & Linux.
The `ioutil.NopCloser(reader)` was hiding nested hash readers.
We make it an `io.Closer` so it can be attached without wrapping
and allows for nesting, by merging the requests.
The `keepHTTPResponseAlive` would cause errors to be
returned with status OK.
- Add '32' as a filler byte until a response is ready
- '0' to indicate the response is ready to be consumed
- '1' to indicate response has an error which needs
to be returned to the caller
Clear out 'file not found' errors from dir walker, since it may be
in a folder that has been deleted since it was scanned.
This PR is to ensure that we call the relevant object
layer APIs for necessary S3 API level functionalities
allowing gateway implementations to return proper
errors as NotImplemented{}
This allows for all our tests in mint to behave
appropriately and can be handled appropriately as
well.
S3 is now natively supported by B2 cloud storage provider
there is no reason to use specialized gateway for B2 anymore,
our current S3 gateway with caching would work with B2.
Resolves#8584
requests in federated setups for STS type calls which are
performed at '/' resource should be routed by the muxer,
the assumption is simply such that requests without a bucket
in a federated setup cannot be proxied, so serve them at
current server.
This commit makes the KES client use HTTP/2
when establishing a connection to the KES server.
This is necessary since the next KES server release
will require HTTP/2.
We should allow quorum errors to be send upwards
such that caller can retry while reading bucket
encryption/policy configs when server is starting
up, this allows distributed setups to load the
configuration properly.
Current code didn't facilitate this and would have
never loaded the actual configs during rolling,
server restarts.
In large setups this avoids unnecessary data transfer
across nodes and potential locks.
This PR also optimizes heal result channel, which should
be avoided for each queueHealTask as its expensive
to create/close channels for large number of objects.
This PR allows setting a "hard" or "fifo" quota
restriction at the bucket level. Buckets that
have reached the FIFO quota configured, will
automatically be cleaned up in FIFO manner until
bucket usage drops to configured quota.
If a bucket is configured with a "hard" quota
ceiling, all further writes are disallowed.
ResponseWriter & RecordAPIStats has similar role, merge them.
This commit will also fix wrong auditing for STS and Web and others
since they are using ResponseWriter instead of the RecordAPIStats.
A user can incorrectly mounts a newly fresh disk. MinIO will detect
that it is writing with a rootfs disk and will mark it down. However,
it is hard for the user to understand what's going on.
This commit will just print a notice so it will be easy to spot
such use case.
- elasticsearch client should rely on the SDK helpers
instead of pure HTTP calls.
- webhook shouldn't need to check for IsActive() for
all notifications, failure should be delayed.
- Remove DialHTTP as its never used properly
Fixes#9460
allow generating service accounts for temporary credentials
which have a designated parent, currently OpenID is not yet
supported.
added checks to ensure that service account cannot generate
further service accounts for itself, service accounts can
never be a parent to any credential.
Audit was not working properly when enabled from the environment
caused by a typo in the code.
This commit fixes that but also consider the following variables:
`MINIO_LOGGER_WEBHOOK_ENABLE_*` and
`MINIO_AUDIT_WEBHOOK_ENABLE_*` so the user can use
this latter to temporarily disable a logger or audit configuration.
data usage tracker and crawler seem to be logging
non-actionable information on console, which is not
useful and is fixed on its own in almost all deployments,
lets keep this logging to minimal.
it is possible in many screnarios that even
if the divisible value is optimal, we may
end up with uneven distribution due to number
of nodes present in the configuration.
added code allow for affinity towards various
ellipses to figure out optimal value across
ellipses such that we can always reach a
symmetric value automatically.
Fixes#9416
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented on either a
bucket+(1 prefix level) with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed cycle bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within cycle runs. Until cycle bloom filters have been
collected all paths are considered dirty.
this commit avoids lots of tiny allocations, repeated
channel creates which are performed when filtering
the incoming events, unescaping a key just for matching.
also remove deprecated code which is not needed
anymore, avoids unexpected data structure transformations
from the map to slice.
we have policy available for sub-admin users to set/get/delete
config, but we incorrectly decrypt the content using admin secret
key which in-fact should be the credential authenticating the
request.
global WORM mode is a complex piece for which
the time has passed, with the advent of S3 compatible
object locking and retention implementation global
WORM is sort of deprecated, this has been mentioned
in our documentation for some time, now the time
has come for this to go.