This refactor is done for few reasons below
- to avoid deadlocks in scenarios when number
of nodes are smaller < actual erasure stripe
count where in N participating local lockers
can lead to deadlocks across systems.
- avoids expiry routines to run 1000 of separate
network operations and routes per disk where
as each of them are still accessing one single
local entity.
- it is ideal to have since globalLockServer
per instance.
- In a 32node deployment however, each server
group is still concentrated towards the
same set of lockers that partipicate during
the write/read phase, unlike previous minio/dsync
implementation - this potentially avoids send
32 requests instead we will still send at max
requests of unique nodes participating in a
write/read phase.
- reduces overall chattiness on smaller setups.
Add trashcan that keeps recently updated lists after bucket deletion.
All caches were deleted once a bucket was deleted, so caches still running would report errors. Now they are canceled.
Fix `.minio.sys` not being transient.
Bonus fixes, we do not need reload format anymore
as the replaced drive is healed locally we only need
to ensure that drive heal reloads the drive properly.
We preserve the UUID of the original order, this means
that the replacement in `format.json` doesn't mean that
the drive needs to be reloaded into memory anymore.
fixes#10791
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
Allow requests to come in for users as soon as object
layer and config are initialized, this allows users
to be authenticated sooner and would succeed automatically
on servers which are yet to fully initialize.
This change tracks bandwidth for a bucket and object
- [x] Add Admin API
- [x] Add Peer API
- [x] Add BW throttling
- [x] Admin APIs to set replication limit
- [x] Admin APIs for fetch bandwidth
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.
allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.
In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
- Add changes to ensure remote disks are not
incorrectly taken online if their order has
changed or are incorrect disks.
- Bring changes to peer to detect disconnection
with separate Health handler, to avoid a
rather expensive call GetLocakDiskIDs()
- Follow up on the same changes for Lockers
as well
When updating all servers following the constructions of mc update,
only the endpoint server will be updated successfully.
All the other peer servers' updating failed due to the error below:
--------------------------------------------------------------------------
parsing time "2006-01-02T15:04:05Z07:00" as "<release version>": cannot parse "-01-02T15:04:05Z07:00" as "0-"
--------------------------------------------------------------------------
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
This PR adds a new configuration parameter which allows readiness
check to respond within 10secs, this can be reduced to a lower value
if necessary using
```
mc admin config set api ready_deadline=5s
```
or
```
export MINIO_API_READY_DEADLINE=5s
```
This PR is a continuation from #9586, now the
entire parsing logic is fully merged into
bucket metadata sub-system, simplify the
quota API further by reducing the remove
quota handler implementation.
this is a major overhaul by migrating off all
bucket metadata related configs into a single
object '.metadata.bin' this allows us for faster
bootups across 1000's of buckets and as well
as keeps the code simple enough for future
work and additions.
Additionally also fixes#9396, #9394
enable linter using golangci-lint across
codebase to run a bunch of linters together,
we shall enable new linters as we fix more
things the codebase.
This PR fixes the first stage of this
cleanup.
The `keepHTTPResponseAlive` would cause errors to be
returned with status OK.
- Add '32' as a filler byte until a response is ready
- '0' to indicate the response is ready to be consumed
- '1' to indicate response has an error which needs
to be returned to the caller
Clear out 'file not found' errors from dir walker, since it may be
in a folder that has been deleted since it was scanned.
This PR is to ensure that we call the relevant object
layer APIs for necessary S3 API level functionalities
allowing gateway implementations to return proper
errors as NotImplemented{}
This allows for all our tests in mint to behave
appropriately and can be handled appropriately as
well.
We should allow quorum errors to be send upwards
such that caller can retry while reading bucket
encryption/policy configs when server is starting
up, this allows distributed setups to load the
configuration properly.
Current code didn't facilitate this and would have
never loaded the actual configs during rolling,
server restarts.
This PR allows setting a "hard" or "fifo" quota
restriction at the bucket level. Buckets that
have reached the FIFO quota configured, will
automatically be cleaned up in FIFO manner until
bucket usage drops to configured quota.
If a bucket is configured with a "hard" quota
ceiling, all further writes are disallowed.
By monitoring PUT/DELETE and heal operations it is possible
to track changed paths and keep a bloom filter for this data.
This can help prioritize paths to scan. The bloom filter can identify
paths that have not changed, and the few collisions will only result
in a marginal extra workload. This can be implemented on either a
bucket+(1 prefix level) with reasonable performance.
The bloom filter is set to have a false positive rate at 1% at 1M
entries. A bloom table of this size is about ~2500 bytes when serialized.
To not force a full scan of all paths that have changed cycle bloom
filters would need to be kept, so we guarantee that dirty paths have
been scanned within cycle runs. Until cycle bloom filters have been
collected all paths are considered dirty.
this commit avoids lots of tiny allocations, repeated
channel creates which are performed when filtering
the incoming events, unescaping a key just for matching.
also remove deprecated code which is not needed
anymore, avoids unexpected data structure transformations
from the map to slice.
- keep long running obd network tests alive
- fix error - wrong number of parents in process OBD info
- ensure that osinfo does not error out when inside containers
- remove limit on max number of connections per client transport
The generic client transport uses a default limit of 64 conns per transport.
This could end up limiting and throttling usage, and artificially slowing
down the performance of MinIO even on hardware capable of doing better.
- Removes PerfInfo admin API as its not OBDInfo
- Keep the drive path without the metaBucket in OBD
global latency map.
- Remove all the unused code related to PerfInfo API
- Do not redefined global mib,gib constants use
humanize.MiByte and humanize.GiByte instead always
- Implement a graph algorithm to test network bandwidth from every
node to every other node
- Saturate any network bandwidth adaptively, accounting for slow
and fast network capacity
- Implement parallel drive OBD tests
- Implement a paging mechanism for OBD test to provide periodic updates to client
- Implement Sys, Process, Host, Mem OBD Infos
- acquire since leader lock for all background operations
- healing, crawling and applying lifecycle policies.
- simplify lifecyle to avoid network calls, which was a
bug in implementation - we should hold a leader and
do everything from there, we have access to entire
name space.
- make listing, walking not interfere by slowing itself
down like the crawler.
- effectively use global context everywhere to ensure
proper shutdown, in cache, lifecycle, healing
- don't read `format.json` for prometheus metrics in
StorageInfo() call.
- pkg/bucket/encryption provides support for handling bucket
encryption configuration
- changes under cmd/ provide support for AES256 algorithm only
Co-Authored-By: Poorna <poornas@users.noreply.github.com>
Co-authored-by: Harshavardhana <harsha@minio.io>