Optimizations include
- do not write the metacache block if the size of the
block is '0' and it is the first block - where listing
is attempted for a transient prefix, this helps to
avoid creating lots of empty metacache entries for
`minioMetaBucket`
- avoid the entire initialization sequence of cacheCh
, metacacheBlockWriter if we are simply going to skip
them when discardResults is set to true.
- No need to hold write locks while writing metacache
blocks - each block is unique, per bucket, per prefix
and also is written by a single node.
current implementation was incorrect, it in-fact
assumed only read quorum number of disks. in-fact
that value is only meant for read quorum good entries
from all online disks.
This PR fixes this behavior properly.
This commit refactors the code in `cmd/crypto`
and separates SSE-S3, SSE-C and SSE-KMS.
This commit should not cause any behavior change
except for:
- `IsRequested(http.Header)`
which now returns the requested type {SSE-C, SSE-S3,
SSE-KMS} and does not consider SSE-C copy headers.
However, SSE-C copy headers alone are anyway not valid.
Additional cases handled
- fix address situations where healing is not
triggered on failed writes and deletes.
- consider object exists during listing when
metadata can be successfully decoded.
always find the right set of online peers for remote listing,
this may have an effect on listing if the server is down - we
should do this to avoid always performing transient operations
on bucket->peerClient that is permanently or down for a long
period.
additionally also configure http2 healthcheck
values to quickly detect unstable connections
and let them timeout.
also use single transport for proxying requests
with missing nextMarker with delimiter based listing,
top level prefixes beyond 4500 or max-keys value
wouldn't be sent back for client to ask for the next
batch.
reproduced at a customer deployment, create prefixes
as shown below
```
for year in $(seq 2017 2020)
do
for month in {01..12}
do for day in {01..31}
do
mc -q cp file myminio/testbucket/dir/day_id=$year-$month-$day/;
done
done
done
```
Then perform
```
aws s3api --profile minio --endpoint-url http://localhost:9000 list-objects \
--bucket testbucket --prefix dir/ --delimiter / --max-keys 1000
```
You shall see missing NextMarker, this would disallow listing beyond max-keys
requested and also disallow beyond 4500 (maxKeyObjectList) prefixes being listed
because client wouldn't know the NextMarker available.
This PR addresses this situation properly by making the implementation
more spec compatible. i.e NextMarker in-fact can be either an object, a prefix
with delimiter depending on the input operation.
This issue was introduced after the list caching changes and has been present
for a while.
issue was introduced in #11106 the following
pattern
<-t.C // timer fired
if !t.Stop() {
<-t.C // timer hangs
}
Seems to hang at the last `t.C` line, this
issue happens because a fired timer cannot be
Stopped() anymore and t.Stop() returns `false`
leading to confusing state of usage.
Refactor the code such that use timers appropriately
with exact requirements in place.
Both Azure & S3 gateways call for object information before returning
the stream of the object, however, the object content/length could be
modified meanwhile, which means it can return a corrupted object.
Use ETag to ensure that the object was not modified during the GET call
crawler should only ListBuckets once not for each serverPool,
buckets are same across all pools, across sets and ListBuckets
always returns an unified view, once list buckets returns
sort it by create time to scan the latest buckets earlier
with the assumption that latest buckets would have lesser
content than older buckets allowing them to be scanned faster
and also to be able to provide more closer to latest view.
When searching the caches don't copy the ids, instead inline the loop.
```
Benchmark_bucketMetacache_findCache-32 19200 63490 ns/op 8303 B/op 5 allocs/op
Benchmark_bucketMetacache_findCache-32 20338 58609 ns/op 111 B/op 4 allocs/op
```
Add a reasonable, but still the simplistic benchmark.
Bonus - make nicer zero alloc logging
With new refactor of bucket healing, healing bucket happens
automatically including its metadata, there is no need to
redundant heal buckets also in ListBucketsHeal remove
it.
optimization mainly to avoid listing the entire
`.minio.sys/buckets/.minio.sys` directory, this
can get really huge and comes in the way of startup
routines, contents inside `.minio.sys/buckets/.minio.sys`
are rather transient and not necessary to be healed.
Tests environments (go test or manual testing) should always consider
the passed disks are root disks and should not rely on disk.IsRootDisk()
function. The reason is that this latter can return a false negative
when called in a busy system. However, returning a false negative will
only occur in a testing environment and not in a production, so we can
accept this trade-off for now.
Perform cleanup operations on copied data. Avoids read locking
data while determining which caches to keep.
Also, reduce the log(N*N) operation to log(N*M) where M caches
with the same root or below when checking potential replacements.
This refactor is done for few reasons below
- to avoid deadlocks in scenarios when number
of nodes are smaller < actual erasure stripe
count where in N participating local lockers
can lead to deadlocks across systems.
- avoids expiry routines to run 1000 of separate
network operations and routes per disk where
as each of them are still accessing one single
local entity.
- it is ideal to have since globalLockServer
per instance.
- In a 32node deployment however, each server
group is still concentrated towards the
same set of lockers that partipicate during
the write/read phase, unlike previous minio/dsync
implementation - this potentially avoids send
32 requests instead we will still send at max
requests of unique nodes participating in a
write/read phase.
- reduces overall chattiness on smaller setups.
The bucket forwarder handler considers MakeBucket to be always local but
it mistakenly thinks that PUT bucket lifecycle to be a MakeBucket call.
Fix the check of the MakeBucket call by ensuring that the query is empty
in the PUT url.
till now we used to match the inode number of the root
drive and the drive path minio would use, if they match
we knew that its a root disk.
this may not be true in all situations such as running
inside a container environment where the container might
be mounted from a different partition altogether, root
disk detection might fail.
partNumber was miscalculting the start and end of parts when partNumber
query is specified in the GET request. This commit fixes it and also
fixes the ContentRange header in that case.
PR 038bcd9079 introduced
version '3', we need to make sure that we do not
print an unexpected error instead log a message to
indicate we will auto update the version.
Due to botched upstream renames of project repositories
and incomplete migration to go.mod support, our current
dependency version of `go.mod` had bugs i.e it was
using commits from master branch which didn't have
the required fixes present in release-3.4 branches
which leads to some rare bugs
https://github.com/etcd-io/etcd/pull/11477 provides
a workaround for now and we should migrate to this.
release-3.5 eventually claims to fix all of this
properly until then we cannot use /v3 import right now
supports `mc admin config set <alias> heal sleep=100ms` to
enable more aggressive healing under certain times.
also optimize some areas that were doing extra checks than
necessary when bitrotscan was enabled, avoid double sleeps
make healing more predictable.
fixes#10497
- accountInfo API that returns information about
user, access to buckets and the size per bucket
- addUser - user is allowed to change their secretKey
- getUserInfo - returns user info if the incoming
is the same user requesting their information
In some cases a writer could be left behind unclosed, leaking compression blocks.
Always close and set compression concurrency to 2 which should be fine to keep up.
Due to https://github.com/philhofer/fwd/issues/20 when skipping a metadata entry that is >2048 bytes and the buffer is full (2048 bytes) the skip will fail with `io.ErrNoProgress`.
Enlarge the buffer so we temporarily make this much more unlikely.
If it still happens we will have to rewrite the skips to reads.
Fixes#10959
dangling object when deleted means object doesn't exist
anymore, so we should return appropriate errors, this
allows crawler heal to ensure that it removes the tracker
for dangling objects.
AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are used in
azure CLI to specify the azure blob storage access & secret keys. With this commit,
it is possible to set them if you want the gateway's own credentials to be
different from the Azure blob credentials.
Co-authored-by: Harshavardhana <harsha@minio.io>
dangling objects when removed `mc admin heal -r` or crawler
auto heal would incorrectly return error - this can interfere
with usage calculation as the entry size for this would be
returned as `0`, instead upon success use the resultant
object size to calculate the final size for the object
and avoid reporting this in the log messages
Also do not set ObjectSize in healResultItem to be '-1'
this has an effect on crawler metrics calculating 1 byte
less for objects which seem to be missing their `xl.meta`
X-Minio-Replication-Delete-Status header shows the
status of the replication of a permanent delete of a version.
All GETs are disallowed and return 405 on this object version.
In the case of replicating delete markers.
X-Minio-Replication-DeleteMarker-Status shows the status
of replication, and would similarly return 405.
Additionally, this PR adds reporting of delete marker event completion
and updates documentation
Alternative to #10927
Instead of having an upstream fix, do unwrap when checking network errors.
'As' will also work when destination is an interface as checked by the tests.
This PR adds transition support for ILM
to transition data to another MinIO target
represented by a storage class ARN. Subsequent
GET or HEAD for that object will be streamed from
the transition tier. If PostRestoreObject API is
invoked, the transitioned object can be restored for
duration specified to the source cluster.
allow directories to be replicated as well, along with
their delete markers in replication.
Bonus fix to fix bloom filter updates for directories
to be preserved.
fixes a regression introduced in #10859, due
to the error returned by rest.Client being typed
i.e *rest.NetworkError - IsNetworkHostDown function
didn't work as expected to detect network issues.
This in-turn aggravated the situations when nodes
are disconnected leading to performance loss.
Do listings with prefix filter when bloom filter is dirty.
This will forward the prefix filter to the lister which will make it
only scan the folders/objects with the specified prefix.
If we have a clean bloom filter we try to build a more generally
useful cache so in that case, we will list all objects/folders.
Add shortcut for `APN/1.0 Veeam/1.0 Backup/10.0`
It requests unique blocks with a specific prefix. We skip
scanning the parent directory for more objects matching the prefix.
Allow each crawler operation to sleep up to 10 seconds on very heavily loaded systems.
This will of course make minimum crawler speed less, but should be more effective at stopping.
Delete marker replication is implemented for V2
configuration specified in AWS spec (though AWS
allows it only in the V1 configuration).
This PR also brings in a MinIO only extension of
replicating permanent deletes, i.e. deletes specifying
version id are replicated to target cluster.
This will make the health check clients 'silent'.
Use `IsNetworkOrHostDown` determine if network is ok so it mimics the functionality in the actual client.
this is needed such that we make sure to heal the
users, policies and bucket metadata right away as
we do listing based on list cache which only lists
'3' sufficiently good drives, to avoid possibly
losing access to these users upon upgrade make
sure to heal them.
If a scanning server shuts down unexpectedly we may have "successful" caches that are incomplete on a set.
In this case mark the cache with an error so it will no longer be handed out.
Add `MINIO_API_EXTEND_LIST_CACHE_LIFE` that will extend
the life of generated caches for a while.
This changes caches to remain valid until no updates have been
received for the specified time plus a fixed margin.
This also changes the caches from being invalidated when the *first*
set finishes until the *last* set has finished plus the specified time
has passed.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
A new field called AccessKey is added to the ReqInfo struct and populated.
Because ReqInfo is added to the context, this allows the AccessKey to be
accessed from 3rd-party code, such as a custom ObjectLayer.
Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Kaloyan Raev <kaloyan@storj.io>
On extremely long running listings keep the transient list 15 minutes after last update instead of using start time.
Also don't do overlap checks on transient lists.
Add trashcan that keeps recently updated lists after bucket deletion.
All caches were deleted once a bucket was deleted, so caches still running would report errors. Now they are canceled.
Fix `.minio.sys` not being transient.
Bonus fixes, remove package retry it is harder to get it
right, also manage context remove it such that we don't have
to rely on it anymore instead use a simple Jitter retry.
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers
used by `io.CopyBuffer` even if they are pooled.
Since all writers appear to write byte buffers, just send those
instead and write directly. The files are opened through the `os`
package so they have no special properties anyway.
This removes the alloc and copy for each operation.
REST sends content length so a precise alloc can be made.
this reduces allocations in order of magnitude
Also, revert "erasure: delete dangling objects automatically (#10765)"
affects list caching should be investigated.
Add store and a forward option for a single part
uploads when an async mode is enabled with env
MINIO_CACHE_COMMIT=writeback
It defaults to `writethrough` if unspecified.
Bonus fixes, we do not need reload format anymore
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.
We preserve the UUID of the original order, this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.
fixes#10791
when server is booting up there is a possibility
that users might see '503' because object layer
when not initialized, then the request is proxied
to neighboring peers first one which is online.
* Fix caches having EOF marked as a failure.
* Simplify cache updates.
* Provide context for checkMetacacheState failures.
* Log 499 when the client disconnects.
`decryptObjectInfo` is a significant bottleneck when listing objects.
Reduce the allocations for a significant speedup.
https://github.com/minio/sio/pull/40
```
λ benchcmp before.txt after.txt
benchmark old ns/op new ns/op delta
Benchmark_decryptObjectInfo-32 24260928 808656 -96.67%
benchmark old MB/s new MB/s speedup
Benchmark_decryptObjectInfo-32 0.04 1.24 31.00x
benchmark old allocs new allocs delta
Benchmark_decryptObjectInfo-32 75112 48996 -34.77%
benchmark old bytes new bytes delta
Benchmark_decryptObjectInfo-32 287694772 4228076 -98.53%
```
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
only newly replaced drives get the new `format.json`,
this avoids disks reloading their in-memory reference
format, ensures that drives are online without
reloading the in-memory reference format.
keeping reference format in-tact means UUIDs
never change once they are formatted.
lockers currently might leave stale lockers,
in unknown ways waiting for downed lockers.
locker check interval is high enough to safely
cleanup stale locks.
reference format should be source of truth
for inconsistent drives which reconnect,
add them back to their original position
remove automatic fix for existing offline
disk uuids
Bonus fixes
- logging improvements to ensure that we don't use
`go logger.LogIf` to avoid runtime.Caller missing
the function name. log where necessary.
- remove unused code at erasure sets
Test TestDialContextWithDNSCacheRand was failing sometimes because it depends
on a random selection of addresses when testing random DNS resolution from cache.
Lower addr selection exception to 10%
Allow requests to come in for users as soon as object
layer and config are initialized, this allows users
to be authenticated sooner and would succeed automatically
on servers which are yet to fully initialize.
Go stdlib resolver doesn't support caching DNS
resolutions, since we compile with CGO disabled
we are more probe to DNS flooding for all network
calls to resolve for DNS from the DNS server.
Under various containerized environments such as
VMWare this becomes a problem because there are
no DNS caches available and we may end up overloading
the kube-dns resolver under concurrent I/O.
To circumvent this issue implement a DNSCache resolver
which resolves DNS and caches them for around 10secs
with every 3sec invalidation attempted.
connect disks pre-emptively upon startup, to ensure we have
enough disks are connected at startup rather than wait
for them.
we need to do this to avoid long wait times for server to
be online when we have servers come up in rolling upgrade
fashion
Only use dynamic delays for the crawler. Even though the max wait was 1 second the number
of waits could severely impact crawler speed.
Instead of relying on a global metric, we use the stateless local delays to keep the crawler
running at a speed more adjusted to current conditions.
The only case we keep it is before bitrot checks when enabled.
This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing following changes
- allowing lower connections in time_wait allows faster socket open's
- lower idle connection timeout to ensure that we let kernel
reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
syn backlogs.
fixes#10413
This change tracks bandwidth for a bucket and object
- [x] Add Admin API
- [x] Add Peer API
- [x] Add BW throttling
- [x] Admin APIs to set replication limit
- [x] Admin APIs for fetch bandwidth
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.
allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.
In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
This is to allow remote targets to be generalized
for replication/ILM transition
Also adding a field in BucketTarget to identify
a remote target with a label.
This commit fixes a misuse of the `http.ResponseWriter.WriteHeader`.
A caller should **either** call `WriteHeader` exactly once **or**
write to the response writer and causing an implicit 200 OK.
Writing the response headers more than once causes a `http: superfluous
response.WriteHeader call` log message. This commit fixes this
by preventing a 2nd `WriteHeader` call being forwarded to the underlying
`ResponseWriter`.
Updates #10587
* add NVMe drive info [model num, serial num, drive temp. etc.]
* Ignore fuse partitions
* Add the nvme logic only for linux
* Move smart/nvme structs to a separate file
Co-authored-by: wlan0 <sidharthamn@gmail.com>
throw proper error when port is not accessible
for the regular user, this is possibly a regression.
```
ERROR Unable to start the server: Insufficient permissions to use specified port
> Please ensure MinIO binary has 'cap_net_bind_service=+ep' permissions
HINT:
Use 'sudo setcap cap_net_bind_service=+ep /path/to/minio' to provide sufficient permissions
```
After #10594 let's invalidate the bloom filters to force the next cycles to go through all data.
There is a small chance that the linked PR could have caused missing bloom filter data.
This will invalidate the current bloom filters and make the crawler go through everything.
Routing using on source IP if found. This should distribute
the listing load for V1 and versioning on multiple nodes
evenly between different clients.
If source IP is not found from the http request header, then falls back
to bucket name instead.
Disallow versioning suspension on a bucket with
pre-existing replication configuration
If versioning is suspended on the target,replication
should fail.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.
Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
- Add owner information for expiry, locking, unlocking a resource
- TopLocks returns now locks in quorum by default, provides
a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid from storage
class, because there were challenges to make it consistent
across all situations.
- And other tiny simplifications to reset locks.
context canceled errors bubbling up from the network
layer has the potential to be misconstrued as network
errors, taking prematurely a server offline and triggering
a health check routine avoid this potential occurrence.
isEnded() was incorrectly calculating if the current healing sequence is
ended or not. h.currentStatus.Items could be empty if healing is very
slow and mc admin heal consumed all items.
As the bulk/recursive delete will require multiple connections to open at an instance,
The default open connections limit will be reached which results in the following error
```FATAL: sorry, too many clients already```
By setting the open connections to a reasonable value - `2`, We ensure that the max open connections
will not be exhausted and lie under bounds.
The queries are simple inserts/updates/deletes which is operational and sufficient with the
the maximum open connection limit is 2.
Fixes#10553
Allow user configuration for MaxOpenConnections