Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing following changes
- allowing lower connections in time_wait allows faster socket open's
- lower idle connection timeout to ensure that we let kernel
reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
syn backlogs.
fixes#10413
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
MaxConnsPerHost can potentially hang a call without any
way to timeout, we do not need this setting for our proxy
and gateway implementations instead IdleConn settings are
good enough.
Also ensure to use NewRequestWithContext and make sure to
take the disks offline only for network errors.
Fixes#10304
inconsistent drive healing when one of the drive is offline
while a new drive was replaced, this change is to ensure
that we can add the offline drive back into the mix by
healing it again.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
When manual healing is triggered, one node in a cluster will
become the authority to heal. mc regularly sends new requests
to fetch the status of the ongoing healing process, but a load
balancer could land the healing request to a node that is not
doing the healing request.
This PR will redirect a request to the node based on the node
index found described as part of the client token. A similar
technique is also used to proxy ListObjectsV2 requests
by encoding this information in continuation-token
Readiness as no reasoning to be cluster scope
because that is not how the k8s networking works
for pods, all the pods to a deployment are not
sharing the network in a singleton. Instead they
are run as local scopes to themselves, with
readiness failures the pod is potentially taken
out of the network to be resolvable - this
affects the distributed setup in myriad of
different ways.
Instead readiness should behave like liveness
with local scope alone, and should be a dummy
implementation.
This PR all the startup times and overal k8s
startup time dramatically improves.
Added another handler called as `/minio/health/cluster`
to understand the cluster scope health.
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
- total number of S3 API calls per server
- maximum wait duration for any S3 API call
This implementation is primarily meant for situations
where HDDs are not capable enough to handle the incoming
workload and there is no way to throttle the client.
This feature allows MinIO server to throttle itself
such that we do not overwhelm the HDDs.
When formatting a set validate if a host failure will likely lead to data loss.
While we don't know what config will be set in the future
evaluate to our best knowledge, assuming default settings.
Use reference format to initialize lockers
during startup, also handle `nil` for NetLocker
in dsync and remove *errorLocker* implementation
Add further tuning parameters such as
- DialTimeout is now 15 seconds from 30 seconds
- KeepAliveTimeout is not 20 seconds, 5 seconds
more than default 15 seconds
- ResponseHeaderTimeout to 10 seconds
- ExpectContinueTimeout is reduced to 3 seconds
- DualStack is enabled by default remove setting
it to `true`
- Reduce IdleConnTimeout to 30 seconds from
1 minute to avoid idleConn build up
Fixes#8773
Fixes scenario where zones are appropriately
handled, along with supporting overriding set
count. The new fix also ensures that we handle
the various setup types properly.
Update documentation to properly indicate the
behavior.
Fixes#8750
Co-authored-by: Nitish Tiwari <nitish@minio.io>
Changes in IP underneath are dynamic in replica sets
with multiple tenants, so deploying in that fashion
will not work until we wait for atleast one participatory
server to be local.
This PR also ensures that multi-tenant zone expansion also
works in replica set k8s deployments.
Introduces a new ENV `KUBERNETES_REPLICA_SET` check to call
appropriate code paths.
Continuation from #8629 which basically broke
zone deployments on k8s statefulset environment
due to incorrect assumptions which made it work
on replicated set.
Fix this properly such that this container works
for both replicated set and stateful set deployment
In replica sets, hosts resolve to localhost
IP automatically until the deployment fully
comes up. To avoid this issue we need to
wait for such resolution.
Fixes an issue reported by @klauspost and @vadmeste
This PR also allows users to expand their clusters
from single node XL deployment to distributed mode.
This PR implements locking from a global entity into
a more localized set level entity, allowing for locks
to be held only on the resources which are writing
to a collection of disks rather than a global level.
In this process this PR also removes the top-level
limit of 32 nodes to an unlimited number of nodes. This
is a precursor change before bring in bucket expansion.
- This PR allows config KVS to be validated properly
without being affected by ENV overrides, rejects
invalid values during set operation
- Expands unit tests and refactors the error handling
for notification targets, returns error instead of
ignoring targets for invalid KVS
- Does all the prep-work for implementing safe-mode
style operation for MinIO server, introduces a new
global variable to toggle safe mode based operations
NOTE: this PR itself doesn't provide safe mode operations
specific errors, `application` errors or `all` by default.
console logging on server by default lists all logs -
enhance admin console API to accept `type` as query parameter to
subscribe to application/minio logs.
On Kubernetes/Docker setups DNS resolves inappropriately
sometimes where there are situations same endpoints with
multiple disks come online indicating either one of them
is local and some of them are not local. This situation
can never happen and its only a possibility in orchestrated
deployments with dynamic DNS. Following code ensures that we
treat if one of the endpoint says its local for a given host
it is true for all endpoints for the same host. Following code
ensures that this assumption is true and it works in all
scenarios and it is safe to assume for a given host.
This PR also adds validation such that we do not crash the
server if there are bugs in the endpoints list in dsync
initialization.
Thanks to Daniel Valdivia <hola@danielvaldivia.com> for
reproducing this, this fix is needed as part of the
https://github.com/minio/m3 project.
This change is related to larger config migration PR
change, this is a first stage change to move our
configs to `cmd/config/` - divided into its subsystems
endpoint.IsLocal will not have .Host entries so
using them to skip double entries will never work.
change the code such that we look for endpoint.Host
outside of endpoint.IsLocal logic to skip double
hosts appropriately.
Move these functions to their appropriate file.
When checking if federation is necessary, the code compares
the SRV record stored in etcd against the list of endpoints
that the MinIO server is exposing. If there is an intersection
in this list the request is forwarded.
The SRV record includes both the host and the port, but the
intersection check previously only looked at the IP address. This
would prevent federation from working in situations where the endpoint
IP is the same for multiple MinIO servers. Some examples of where this
can occur are:
- running mulitiple copies of MinIO on the same host
- using multiple MinIO servers behind a NAT with port-forwarding
Allow server to start if one of the local nodes in docker/kubernetes setup is successfully resolved
- The rule is that we need atleast one local node to work. We dont need to resolve the
rest at that point.
- In a non-orchestrational setup, we fail if we do not have atleast one local node up
and running.
- In an orchestrational setup (docker-swarm and kubernetes), We retry with a sleep of 5
seconds until any one local node shows up.
Fixes#6995
Collect historic cpu and mem stats. Also, use actual values
instead of formatted strings while returning to the client. The string
formatting prevents values from being processed by the server or
by the client without parsing it.
This change will allow the values to be processed (eg.
compute rolling-average over the lifetime of the minio server)
and offloads the formatting to the client.
This is part of implementation for mc admin health command. The
ServerDrivesPerfInfo() admin API returns read and write speed
information for all the drives (local and remote) in a given Minio
server deployment.
Part of minio/mc#2606