This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing following changes
- allowing lower connections in time_wait allows faster socket open's
- lower idle connection timeout to ensure that we let kernel
reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
syn backlogs.
fixes#10413
This change tracks bandwidth for a bucket and object
- [x] Add Admin API
- [x] Add Peer API
- [x] Add BW throttling
- [x] Admin APIs to set replication limit
- [x] Admin APIs for fetch bandwidth
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently, safe-mode is not useful
anymore and do not serve its original
intended purpose.
allow server to be fully functional
even with config partially configured,
this is to cater for availability of actual
I/O v/s manually fixing the server.
In k8s like environments it will never make
sense to take pod into safe-mode state,
because there is no real access to perform
any remote operation on them.
- select lockers which are non-local and online to have
affinity towards remote servers for lock contention
- optimize lock retry interval to avoid sending too many
messages during lock contention, reduces average CPU
usage as well
- if bucket is not set, when deleteObject fails make sure
setPutObjHeaders() honors lifecycle only if bucket name
is set.
- fix top locks to list out always the oldest lockers always,
avoid getting bogged down into map's unordered nature.
This is to allow remote targets to be generalized
for replication/ILM transition
Also adding a field in BucketTarget to identify
a remote target with a label.
This commit fixes a misuse of the `http.ResponseWriter.WriteHeader`.
A caller should **either** call `WriteHeader` exactly once **or**
write to the response writer and causing an implicit 200 OK.
Writing the response headers more than once causes a `http: superfluous
response.WriteHeader call` log message. This commit fixes this
by preventing a 2nd `WriteHeader` call being forwarded to the underlying
`ResponseWriter`.
Updates #10587
* add NVMe drive info [model num, serial num, drive temp. etc.]
* Ignore fuse partitions
* Add the nvme logic only for linux
* Move smart/nvme structs to a separate file
Co-authored-by: wlan0 <sidharthamn@gmail.com>
throw proper error when port is not accessible
for the regular user, this is possibly a regression.
```
ERROR Unable to start the server: Insufficient permissions to use specified port
> Please ensure MinIO binary has 'cap_net_bind_service=+ep' permissions
HINT:
Use 'sudo setcap cap_net_bind_service=+ep /path/to/minio' to provide sufficient permissions
```
After #10594 let's invalidate the bloom filters to force the next cycles to go through all data.
There is a small chance that the linked PR could have caused missing bloom filter data.
This will invalidate the current bloom filters and make the crawler go through everything.
Routing using on source IP if found. This should distribute
the listing load for V1 and versioning on multiple nodes
evenly between different clients.
If source IP is not found from the http request header, then falls back
to bucket name instead.
Disallow versioning suspension on a bucket with
pre-existing replication configuration
If versioning is suspended on the target,replication
should fail.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.
Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
- Add owner information for expiry, locking, unlocking a resource
- TopLocks returns now locks in quorum by default, provides
a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid from storage
class, because there were challenges to make it consistent
across all situations.
- And other tiny simplifications to reset locks.
context canceled errors bubbling up from the network
layer has the potential to be misconstrued as network
errors, taking prematurely a server offline and triggering
a health check routine avoid this potential occurrence.
isEnded() was incorrectly calculating if the current healing sequence is
ended or not. h.currentStatus.Items could be empty if healing is very
slow and mc admin heal consumed all items.
As the bulk/recursive delete will require multiple connections to open at an instance,
The default open connections limit will be reached which results in the following error
```FATAL: sorry, too many clients already```
By setting the open connections to a reasonable value - `2`, We ensure that the max open connections
will not be exhausted and lie under bounds.
The queries are simple inserts/updates/deletes which is operational and sufficient with the
the maximum open connection limit is 2.
Fixes#10553
Allow user configuration for MaxOpenConnections
It is possible the heal drives are not reported from
the maintenance check because the background heal
state simply relied on the `format.json` for capturing
unformatted drives. It is possible that drives might
be still healing - make sure that applications which
rely on cluster health check respond back this detail.
Also, revamp the way ListBuckets work make few portions
of the healing logic parallel
- walk objects for healing disks in parallel
- collect the list of buckets in parallel across drives
- provide consistent view for listBuckets()
Healing was not working correctly in the distributed mode because
errFileVersionNotFound was not properly converted in storage rest
client.
Besides, fixing the healing delete marker is not working as expected.
performance improves by around 100x or more
```
go test -v -run NONE -bench BenchmarkGetPartFile
goos: linux
goarch: amd64
pkg: github.com/minio/minio/cmd
BenchmarkGetPartFileWithTrie
BenchmarkGetPartFileWithTrie-4 1000000000 0.140 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/minio/minio/cmd 1.737s
```
fixes#10520
* Fix cases where minimum timeout > default timeout.
* Add defensive code for too small/negative timeouts.
* Never set timeout below the maximum value of a request.
* Protect against (unlikely) int64 wraps.
* Decrease timeout slower.
* Don't re-lock before copying.
This bug was introduced in 14f0047295
almost 3yrs ago, as a side affect of removing stale `fs.json`
but we in-fact end up removing existing good `fs.json` for an
existing object, leading to some form of a data loss.
fixes#10496
from 20s for 10000 parts to less than 1sec
Without the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m20.394s
user 0m0.589s
sys 0m0.174s
```
With the patch
```
~ time aws --endpoint-url=http://localhost:9000 --profile minio s3api \
list-parts --bucket testbucket --key test \
--upload-id c1cd1f50-ea9a-4824-881c-63b5de95315a
real 0m0.891s
user 0m0.624s
sys 0m0.182s
```
fixes#10503
It was observed in VMware vsphere environment during a
pod replacement, `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.
This PR fixes this behavior properly.
If the ILM document requires removing noncurrent versions, the
the server should be able to remove 'null' versions as well.
'null' versions are created when versioning is not enabled
or suspended.
The entire encryption layer is dependent on the fact that
KMS should be configured for S3 encryption to work properly
and we only support passing the headers as is to the backend
for encryption only if KMS is configured.
Make sure that this predictability is maintained, currently
the code was allowing encryption to go through and fail
at later to indicate that KMS was not configured. We should
simply reply "NotImplemented" if KMS is not configured, this
allows clients to simply proceed with their tests.
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
configurable remote transport timeouts for some special cases
where this value needs to be bumped to a higher value when
transferring large data between federated instances.
In `(*cacheObjects).GetObjectNInfo` copy the metadata before spawning a goroutine.
Clean up a few map[string]string copies as well, reducing allocs and simplifying the code.
Fixes#10426
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```