This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
configurable remote transport timeouts for some special cases
where this value needs to be bumped to a higher value when
transferring large data between federated instances.
In `(*cacheObjects).GetObjectNInfo` copy the metadata before spawning a goroutine.
Clean up a few map[string]string copies as well, reducing allocs and simplifying the code.
Fixes#10426
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```
This PR adds a DNS target that ensures to update an entry
into Kubernetes operator when a bucket is created or deleted.
See minio/operator#264 for details.
Co-authored-by: Harshavardhana <harsha@minio.io>
MaxConnsPerHost can potentially hang a call without any
way to timeout, we do not need this setting for our proxy
and gateway implementations instead IdleConn settings are
good enough.
Also ensure to use NewRequestWithContext and make sure to
take the disks offline only for network errors.
Fixes#10304
inconsistent drive healing when one of the drive is offline
while a new drive was replaced, this change is to ensure
that we can add the offline drive back into the mix by
healing it again.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
This commit refactors the certificate management implementation
in the `certs` package such that multiple certificates can be
specified at the same time. Therefore, the following layout of
the `certs/` directory is expected:
```
certs/
│
├─ public.crt
├─ private.key
├─ CAs/ // CAs directory is ignored
│ │
│ ...
│
├─ example.com/
│ │
│ ├─ public.crt
│ └─ private.key
└─ foobar.org/
│
├─ public.crt
└─ private.key
...
```
However, directory names like `example.com` are just for human
readability/organization and don't have any meaning w.r.t whether
a particular certificate is served or not. This decision is made based
on the SNI sent by the client and the SAN of the certificate.
***
The `Manager` will pick a certificate based on the client trying
to establish a TLS connection. In particular, it looks at the client
hello (i.e. SNI) to determine which host the client tries to access.
If the manager can find a certificate that matches the SNI it
returns this certificate to the client.
However, the client may choose to not send an SNI or tries to access
a server directly via IP (`https://<ip>:<port>`). In this case, we
cannot use the SNI to determine which certificate to serve. However,
we also should not pick "the first" certificate that would be accepted
by the client (based on crypto. parameters - like a signature algorithm)
because it may be an internal certificate that contains internal hostnames.
We would disclose internal infrastructure details doing so.
Therefore, the `Manager` returns the "default" certificate when the
client does not specify an SNI. The default certificate the top-level
`public.crt` - i.e. `certs/public.crt`.
This approach has some consequences:
- It's the operator's responsibility to ensure that the top-level
`public.crt` does not disclose any information (i.e. hostnames)
that are not publicly visible. However, this was the case in the
past already.
- Any other `public.crt` - except for the top-level one - must not
contain any IP SAN. The reason for this restriction is that the
Manager cannot match a SNI to an IP b/c the SNI is the server host
name. The entire purpose of SNI is to indicate which host the client
tries to connect to when multiple hosts run on the same IP. So, a
client will not set the SNI to an IP.
If we would allow IP SANs in a lower-level `public.crt` a user would
expect that it is possible to connect to MinIO directly via IP address
and that the MinIO server would pick "the right" certificate. However,
the MinIO server cannot determine which certificate to serve, and
therefore always picks the "default" one. This may lead to all sorts
of confusing errors like:
"It works if I use `https:instance.minio.local` but not when I use
`https://10.0.2.1`.
These consequences/limitations should be pointed out / explained in our
docs in an appropriate way. However, the support for multiple
certificates should not have any impact on how deployment with a single
certificate function today.
Co-authored-by: Harshavardhana <harsha@minio.io>
- do not fail the healthcheck if heal status
was not obtained from one of the nodes,
if many nodes fail then report this as a
catastrophic error.
- add "x-minio-write-quorum" value to match
the write tolerance supported by server.
- admin info now states if a drive is healing
where madmin.Disk.Healing is set to true
and madmin.Disk.State is "ok"
Currently, cache purges are triggered as soon as the low watermark is exceeded.
To reduce IO this should only be done when reaching the high watermark.
This simplifies checks and reduces all calls for a GC to go through
`dcache.diskSpaceAvailable(size)`. While a comment claims that
`dcache.triggerGC <- struct{}{}` was non-blocking I don't see how
that was possible. Instead, we add a 1 size to the queue channel
and use channel semantics to avoid blocking when a GC has
already been requested.
`bytesToClear` now takes the high watermark into account to it will
not request any bytes to be cleared until that is reached.
This commit reduces the retry delay when retrying a request
to a KES server by:
- reducing the max. jitter delay from 3s to 1.5s
- skipping the random delay when there are more KES endpoints
available.
If there are more KES endpoints we can directly retry to the request
by sending it to the next endpoint - as pointed out by @krishnasrinivas
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
When checking parts we already do a stat for each part.
Since we have the on disk size check if it is at least what we expect.
When checking metadata check if metadata is 0 bytes.
The `getNonLoopBackIP` may grab an IP from an interface that
doesn't allow binding (on Windows), so this test consistently fails.
We exclude that specific error.
* readDirN: Check if file is directory
`syscall.FindNextFile` crashes if the handle is a file.
`errFileNotFound` matches 'unix' functionality: d19b434ffc/cmd/os-readdir_unix.go (L106)Fixes#10384
This commit addresses a maintenance / automation problem when MinIO-KES
is deployed on bare-metal. In orchestrated env. the orchestrator (K8S)
will make sure that `n` KES servers (IPs) are available via the same DNS
name. There it is sufficient to provide just one endpoint.
ListObjectsV1 requests are actually redirected to a specific node,
depending on the bucket name. The purpose of this behavior was
to optimize listing.
However, the current code sends a Bad Gateway error if the
target node is offline, which is a bad behavior because it means
that the list request will fail, although this is unnecessary since
we can still use the current node to list as well (the default behavior
without using proxying optimization)
Currently, you can see mint fails when there is one offline node, after
this PR, mint will always succeed.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
If DiskInfo calls failed the information returned was used anyway
resulting in no endpoint being set.
This would make the drive be attributed to the local system since
`disk.Endpoint == disk.DrivePath` in that case.
Instead, if the call fails record the endpoint and the error only.
bonus make sure to ignore objectNotFound, and versionNotFound
errors properly at all layers, since HealObjects() returns
objectNotFound error if the bucket or prefix is empty.
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
We should only enforce quotas if no error has been returned.
firstErr is safe to access since all goroutines have exited at this point.
If `firstErr` hasn't been set by something else return the context error if cancelled.
Keep dataUpdateTracker while goroutine is starting.
This will ensure the object is updated one `start` returns
Tested with
```
λ go test -cpu=1,2,4,8 -test.run TestDataUpdateTracker -count=1000
PASS
ok github.com/minio/minio/cmd 8.913s
```
Fixes#10295
fresh drive setups when one of the drive is
a root drive, we should ignore such a root
drive and not proceed to format.
This PR handles this properly by marking
the disks which are root disk and they are
taken offline.
newDynamicTimeout should be allocated once, in-case
of temporary locks in config and IAM we should
have allocated timeout once before the `for loop`
This PR doesn't fix any issue as such, but provides
enough dynamism for the timeout as per expectation.
Based on our previous conversations I assume we should send the version
id when healing an object.
Maybe we should even list object versions and heal all?
In a non recursive mode, issuing a list request where prefix
is an existing object with a slash and delimiter is a slash will
return entries in the object directory (data dir IDs)
```
$ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \
list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/'
{
"CommonPrefixes": [
{
"Prefix":
"code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/"
}
]
}
```
This commit adds a fast exit track in Walk() in this specific case.
use `/etc/hosts` instead of `/` to check for common
device id, if the device is same for `/etc/hosts`
and the --bind mount to detect root disks.
Bonus enhance healthcheck logging by adding maintenance
tags, for all messages.
adds a feature where we can fetch the MinIO
command-line remotely, this
is primarily meant to add some stateless
nature to the MinIO deployment in k8s
environments, MinIO operator would run a
webhook service endpoint
which can be used to fetch any environment
value in a generalized approach.
It is possible in situations when server was deployed
in asymmetric configuration in the past such as
```
minio server ~/fs{1...4}/disk{1...5}
```
Results in setDriveCount of 10 in older releases
but with fairly recent releases we have moved to
having server affinity which means that a set drive
count ascertained from above config will be now '4'
While the object layer make sure that we honor
`format.json` the storageClass configuration however
was by mistake was using the global value obtained
by heuristics. Which leads to prematurely using
lower parity without being requested by the an
administrator.
This PR fixes this behavior.
Bonus also fix a bug where we did not purge relevant
service accounts generated by rotating credentials
appropriately, service accounts should become invalid
as soon as its corresponding parent user becomes invalid.
Since service account themselves carry parent claim always
we would never reach this problem, as the access get
rejected at IAM policy layer.
when source and destination are same and versioning is enabled
on the destination bucket - we do not need to re-create the entire
object once again to optimize on space utilization.
Cases this PR is not supporting
- any pre-existing legacy object will not
be preserved in this manner, meaning a new
dataDir will be created.
- key-rotation and storage class changes
of course will never re-use the dataDir
conflicting files can exist on FS at
`.minio.sys/buckets/testbucket/policy.json/`, this is an
expected valid scenario for FS mode allow it to work,
i.e ignore and move forward
With reduced parity our write quorum should be same
as read quorum, but code was still assuming
```
readQuorum+1
```
In all situations which is not necessary.
Generalize replication target management so
that remote targets for a bucket can be
managed with ARNs. `mc admin bucket remote`
command will be used to manage targets.
Context timeout might race on each other when timeouts are lower
i.e when two lock attempts happened very quickly on the same resource
and the servers were yet trying to establish quorum.
This situation can lead to locks held which wouldn't be unlocked
and subsequent lock attempts would fail.
This would require a complete server restart. A potential of this
issue happening is when server is booting up and we are trying
to hold a 'transaction.lock' in quick bursts of timeout.
replace dummy buffer with nullReader{} instead,
to avoid large memory allocations in memory
constrainted environments. allows running
obd tests in such environments.
Currently, listing directories on HDFS incurs a per-entry remote Stat() call
penalty, the cost of which can really blow up on directories with many
entries (+1,000) especially when considered in addition to peripheral
calls (such as validation) and the fact that minio is an intermediary to the
client (whereas other clients listed below can query HDFS directly).
Because listing directories this way is expensive, the Golang HDFS library
provides the [`Client.Open()`] function which creates a [`FileReader`] that is
able to batch multiple calls together through the [`Readdir()`] function.
This is substantially more efficient for very large directories.
In one case we were witnessing about +20 seconds to list a directory with 1,500
entries, admittedly large, but the Java hdfs ls utility as well as the HDFS
library sample ls utility were much faster.
Hadoop HDFS DFS (4.02s):
λ ~/code/minio → use-readdir
» time hdfs dfs -ls /directory/with/1500/entries/
…
hdfs dfs -ls 5.81s user 0.49s system 156% cpu 4.020 total
Golang HDFS library (0.47s):
λ ~/code/hdfs → master
» time ./hdfs ls -lh /directory/with/1500/entries/
…
./hdfs ls -lh 0.13s user 0.14s system 56% cpu 0.478 total
mc and minio **without** optimization (16.96s):
λ ~/code/minio → master
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.22s user 0.29s system 3% cpu 16.968 total
mc and minio **with** optimization (0.40s):
λ ~/code/minio → use-readdir
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.13s user 0.28s system 102% cpu 0.403 total
[`Client.Open()`]: https://godoc.org/github.com/colinmarc/hdfs#Client.Open
[`FileReader`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader
[`Readdir()`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader.Readdir
If there are many listeners to bucket notifications or to the trace
subsystem, healing fails to work properly since it suspends itself when
the number of concurrent connections is above a certain threshold.
These connections are also continuous and not costly (*no disk access*),
it is okay to just ignore them in waitForLowHTTPReq().
this is to detect situations of corruption disk
format etc errors quickly and keep the disk online
in such scenarios for requests to fail appropriately.
Fixes two different types of problems
- continuation of the problem seen in FS #9992
as not fixed for erasure coded deployments,
reproduced this issue with spark and its fixed now
- another issue was leaking walk go-routines which
would lead to high memory usage and crash the system
this is simply because all the walks which were purged
at the top limit had leaking end walkers which would
consume memory endlessly.
closes#9966closes#10088
not all claims need to be present for
the JWT claim, let the policies not
exist and only apply which are present
when generating the credentials
once credentials are generated then
those policies should exist, otherwise
the request will fail.
- copyObject in-place decryption failed
due to incorrect verification of headers
- do not decode ETag when object is encrypted
with SSE-C, so that pre-conditions don't fail
prematurely.
This PR adds support for healing older
content i.e from 2yrs, 1yr. Also handles
other situations where our config was
not encrypted yet.
This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
In federated NAS gateway setups, multiple hosts in srvRecords
was picked at random which could mean that if one of the
host was down the request can indeed fail and if client
retries it would succeed. Instead allow server to figure
out the current online host quickly such that we can
exclude the host which is down.
At the max the attempt to look for a downed node is to
300 millisecond, if the node is taking longer to respond
than this value we simply ignore and move to the node,
total attempts are equal to number of srvRecords if no
server is online we simply fallback to last dialed host.
healing was not working properly when drives were
replaced, due to the error check in root disk
calculation this PR fixes this behavior
This PR also adds additional fix for missing
metadata entries from .minio.sys as part of
disk healing as well.
Added code to ignore and print more context
sensitive errors for better debugging.
This PR is continuation of fix in 7b14e9b660
Enforce bucket quotas when crawling has finished.
This ensures that we will not do quota enforcement on old data.
Additionally, delete less if we are closer to quota than we thought.
for users who don't have access to HDFS rootPath '/'
can optionally specify `minio gateway hdfs hdfs://namenode:8200/path`
for which they have access to, allowing all writes to be
performed at `/path`.
NOTE: once configured in this manner you need to make
sure command line is correctly specified, otherwise
your data might not be visible
closes#10011
this PR to allow legacy support for big-data
applications which run older Java versions
which do not support the secure ciphers
currently defaulted in MinIO. This option
allows optionally to turn them off such
that client and server can negotiate the
best ciphers themselves.
This env is purposefully not documented,
meant as a last resort when client
application cannot be changed easily.
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
while we handle all situations for writes and reads
on older format, what we didn't cater for properly
yet was delete where we only ended up deleting
just `xl.meta` - instead we should allow all the
deletes to go through for older format without
versioning enabled buckets.
CORS is notorious requires specific headers to be
handled appropriately in request and response,
using cors package as part of handlerFunc() for
options method lacks the necessary control this
package needs to add headers.
Without instantiating a new rest client we can
have a recursive error which can lead to
healthcheck returning always offline, this can
prematurely take the servers offline.
gorilla/mux broke their recent release 1.7.4 which we
upgraded to, we need the current workaround to ensure
that our regex matches appropriately.
An upstream PR is sent, we should remove the
workaround once we have a new release.
- reduce locker timeout for early transaction lock
for more eagerness to timeout
- reduce leader lock timeout to range from 30sec to 1minute
- add additional log message during bootstrap phase
Main issue is that `t.pool[params]` should be `t.pool[oldest]`.
We add a bit more safety features for the code.
* Make writes to the endTimerCh non-blocking in all cases
so multiple releases cannot lock up.
* Double check expectations.
* Shift down deletes with copy instead of truncating slice.
* Actually delete the oldest if we are above total limit.
* Actually delete the oldest found and not the current.
* Unexport the mutex so nobody from the outside can meddle with it.
This commit adds a new admin API for creating master keys.
An admin client can send a POST request to:
```
/minio/admin/v3/kms/key/create?key-id=<keyID>
```
The name / ID of the new key is specified as request
query parameter `key-id=<ID>`.
Creating new master keys requires KES - it does not work with
the native Vault KMS (deprecated) nor with a static master key
(deprecated).
Further, this commit removes the `UpdateKey` method from the `KMS`
interface. This method is not needed and not used anymore.
with the merge of https://github.com/etcd-io/etcd/pull/11823
etcd v3.5.0 will now have a properly imported versioned path
this fixes our pending migration to newer repo
Use a separate client for these calls that can take a long time.
Add request context to these so they are canceled when the client
disconnects instead except for ListObject which doesn't have any equivalent.
Healing an object which has multiple versions was not working because
the healing code forgot to consider errFileVersionNotFound error as a
use case that needs healing
Currently, lifecycle expiry is deleting all object versions which is not
correct, unless noncurrent versions field is specified.
Also, only delete the delete marker if it is the only version of the
given object.
- additionally upgrade to msgp@v1.1.2
- change StatModTime,StatSize fields as
simple Size/ModTime
- reduce 50000 entries per List batch to 10000
as client needs to wait too long to see the
first batch some times which is not desired
and it is worth we write the data as soon
as we have it.
object KMS is configured with auto-encryption,
there were issues when using docker registry -
this has been left unnoticed for a while.
This PR fixes an issue with compatibility.
Additionally also fix the continuation-token implementation
infinite loop issue which was missed as part of #9939
Also fix the heal token to be generated as a client
facing value instead of what is remembered by the
server, this allows for the server to be stateless
regarding the token's behavior.
When manual healing is triggered, one node in a cluster will
become the authority to heal. mc regularly sends new requests
to fetch the status of the ongoing healing process, but a load
balancer could land the healing request to a node that is not
doing the healing request.
This PR will redirect a request to the node based on the node
index found described as part of the client token. A similar
technique is also used to proxy ListObjectsV2 requests
by encoding this information in continuation-token
The S3 specification says that versions are ordered in the response of
list object versions.
mc snapshot needs this to know which version comes first especially when
two versions have the same exact last-modified field.
Readiness as no reasoning to be cluster scope
because that is not how the k8s networking works
for pods, all the pods to a deployment are not
sharing the network in a singleton. Instead they
are run as local scopes to themselves, with
readiness failures the pod is potentially taken
out of the network to be resolvable - this
affects the distributed setup in myriad of
different ways.
Instead readiness should behave like liveness
with local scope alone, and should be a dummy
implementation.
This PR all the startup times and overal k8s
startup time dramatically improves.
Added another handler called as `/minio/health/cluster`
to understand the cluster scope health.
Walk() functionality was missing on gateway
implementations leading to missing functionality
for the browser UI such as remove multiple objects,
download as zip file etc.
This PR brings a generic implementation across
all gateway's, it is not required to repeat the
same code in all gateway's
The default behavior is to cache each range requested
to cache drive. Add an environment variable
`MINIO_RANGE_CACHE` - when set to off, it disables
range caching and instead downloads entire object
in the background.
Fixes#9870
Bonus fix during versioning merge one of the PR was missing
the offline/online disk count fix from #9801 port it correctly
over to the master branch from release.
Additionally, add versionID support for MRF
Fixes#9910Fixes#9931
This PR has the following changes
- Removing duplicate lookupConfigs() calls.
- Deprecate admin config APIs for NAS gateways. This will avoid repeated reloads of the config from the disk.
- WatchConfigNASDisk will be removed
- Migration guide for NAS gateways users to migrate to ENV settings.
NOTE: THIS PR HAS A BREAKING CHANGE
Fixes#9875
Co-authored-by: Harshavardhana <harsha@minio.io>
Looking into full disk errors on zoned setup. We don't take the
5% space requirement into account when selecting a zone.
The interesting part is that even considering this we don't
know the size of the object the user wants to upload when
they do multipart uploads.
It seems quite defensive to always upload multiparts to
the zone where there is the most space since all load will
be directed to a part of the cluster.
In these cases we make sure it can at least hold a 1GiB file
and we disadvantage fuller zones more by subtracting the
expected size before weighing.
- x-amz-storage-class specified CopyObject
should proceed regardless, its not a precondition
- sourceVersionID is specified CopyObject should
proceed regardless, its not a precondition
This PR fixes all the below scenarios
and handles them correctly.
- existing data/bucket is replaced with
new content, no versioning enabled old
structure vanishes.
- existing data/bucket - enable versioning
before uploading any data, once versioning
enabled upload new content, old content
is preserved.
- suspend versioning on the bucket again, now
upload content again the old content is purged
since that is the default "null" version.
Additionally sync data after xl.json -> xl.meta
rename(), to avoid any surprises if there is a
crash during this rename operation.
- Add changes to ensure remote disks are not
incorrectly taken online if their order has
changed or are incorrect disks.
- Bring changes to peer to detect disconnection
with separate Health handler, to avoid a
rather expensive call GetLocakDiskIDs()
- Follow up on the same changes for Lockers
as well
Just like GET/DELETE APIs it is possible to preserve
client supplied versionId's, of course the versionIds
have to be uuid, if an existing versionId is found
it is overwritten if no object locking policies
are found.
- PUT /bucketname/objectname?versionId=<id>
- POST /bucketname/objectname?uploads=&versionId=<id>
- PUT /bucketname/objectname?verisonId=<id> (with x-amz-copy-source)
Fixes potentially infinite allocations, especially in FS mode,
since lookups live up to 30 minutes. Limit walk pool sizes to 50
max parameter entries and 4 concurrent operations with the same
parameters.
Fixes#9835
PutObject on multiple-zone with versioning would not
overwrite the correct location of the object if the
object has delete marker, leading to duplicate objects
on two zones.
This PR fixes by adding affinity towards delete marker
when GetObjectInfo() returns error, use the zone index
which has the delete marker.
Bonus change to use channel to serialize triggers,
instead of using atomic variables. More efficient
mechanism for synchronization.
Co-authored-by: Nitish Tiwari <nitish@minio.io>
In the Current bug we were re-using the context
from previously granted lockers, this would
lead to lock timeouts for existing valid
read or write locks, leading to premature
timeout of locks.
This bug affects only local lockers in FS
or standalone erasure coded mode. This issue
is rather historical as well and was present
in lsync for some time but we were lucky to
not see it.
Similar changes are done in dsync as well
to keep the code more familiar
Fixes#9827
When updating all servers following the constructions of mc update,
only the endpoint server will be updated successfully.
All the other peer servers' updating failed due to the error below:
--------------------------------------------------------------------------
parsing time "2006-01-02T15:04:05Z07:00" as "<release version>": cannot parse "-01-02T15:04:05Z07:00" as "0-"
--------------------------------------------------------------------------
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
Historically due to lack of support for middlewares
we ended up writing wrapped handlers for all
middlewares on top of the gorilla/mux, this causes
multiple issues when we want to let's say
- Overload r.Body with some custom implementation
to track the incoming Reads()
- Add other sort of top level checks to avoid
DDOSing the server with large incoming HTTP
bodies.
Since 1.7.x release gorilla/mux provides proper
use of middlewares, which are honored by the muxer
directly. This makes sure that Go can honor its
own internal ServeHTTP(w, r) implementation where
Go net/http can wrap into its own customer readers.
This PR as a side-affect fixes rare issues of client
hangs which were reported in the wild but never really
understood or fixed in our codebase.
Fixes#9759Fixes#7266Fixes#6540Fixes#5455Fixes#5150
Refer https://github.com/boto/botocore/pull/1328 for
one variation of the same issue in #9759
PR #9801 while it is correct, the loop isEndpointConnected()
was changed to rely on endpoint.String() which has the host
information as well, which is not correct value as input to
detect if the disk is down or up, if endpoint is local use
its local path value instead.
This commit changes the data key generation such that
if a MinIO server/nodes tries to generate a new DEK
but the particular master key does not exist - then
MinIO asks KES to create a new master key and then
requests the DEK again.
From now on, a SSE-S3 master key must not be created
explicitly via: `kes key create <key-name>`.
Instead, it is sufficient to just set the env. var.
```
export MINIO_KMS_KES_KEY_NAME=<key-name>
```
However, the MinIO identity (mTLS client certificate)
must have the permission to access the `/v1/key/create/`
API. Therefore, KES policy for MinIO must look similar to:
```
[
/v1/key/create/<key-name-pattern>
/v1/key/generate/<key-name-pattern>
/v1/key/decrypt/<key-name-pattern>
]
```
However, in our guides we already suggest that.
See e.g.: https://github.com/minio/kes/wiki/MinIO-Object-Storage#kes-server-setup
***
The ability to create master keys on request may also be
necessary / useful in case of SSE-KMS.
Current code was relying on globalEndpoints as
the source of secondary truth to obtain
the missing endpoints list when the disk
is offline, this is problematic
- there is no way to know if the getDisks()
returned endpoints total is same as the
ones list of globalEndpoints and it
belongs to a particular set.
- there is no order guarantee as getDisks()
is ordered as per format.json, globalEndpoints
may not be, so potentially end up including
incorrect endpoints.
To fix this bring getEndpoints() just like getDisks()
to ensure that consistently ordered endpoints are
always available for us to ensure that returned values
are consistent with what each erasure set would observe.
Uploading files with names that could not be written to disk
would result in "reduce your request" errors returned.
Instead check explicitly for disallowed characters and reject
files with `Object name contains unsupported characters.`
At a customer setup with lots of concurrent calls
it can be observed that in newRetryTimer there
were lots of tiny alloations which are not
relinquished upon retries, in this codepath
we were only interested in re-using the timer
and use it wisely for each locker.
```
(pprof) top
Showing nodes accounting for 8.68TB, 97.02% of 8.95TB total
Dropped 1198 nodes (cum <= 0.04TB)
Showing top 10 nodes out of 79
flat flat% sum% cum cum%
5.95TB 66.50% 66.50% 5.95TB 66.50% time.NewTimer
1.16TB 13.02% 79.51% 1.16TB 13.02% github.com/ncw/directio.AlignedBlock
0.67TB 7.53% 87.04% 0.70TB 7.78% github.com/minio/minio/cmd.xlObjects.putObject
0.21TB 2.36% 89.40% 0.21TB 2.36% github.com/minio/minio/cmd.(*posix).Walk
0.19TB 2.08% 91.49% 0.27TB 2.99% os.statNolog
0.14TB 1.59% 93.08% 0.14TB 1.60% os.(*File).readdirnames
0.10TB 1.09% 94.17% 0.11TB 1.25% github.com/minio/minio/cmd.readDirN
0.10TB 1.07% 95.23% 0.10TB 1.07% syscall.ByteSliceFromString
0.09TB 1.03% 96.27% 0.09TB 1.03% strings.(*Builder).grow
0.07TB 0.75% 97.02% 0.07TB 0.75% path.(*lazybuf).append
```
For example `{1...17}/{1...52}` symmetrical
distribution of drives cannot be obtained
- Because 17 is a prime number
- Is not divisible by any pre-defined setCounts i.e
from 1 to 16
Manual healing (as background healing) creates a heal task with a
possiblity to override healing options, such as deep or normal mode.
Use a pointer type in heal opts so nil would mean use the default
healing options.
aws cli fails to set a bucket encryption configuration to MinIO server.
The reason is that aws cli does not send MD5-Content header. It seems
that MD5-Content is not required anymore.
This commit also returns Not Implemented header early to help mint tests
to ignore testing this API in gateway modes.
CopyObject was not correctly figuring out the correct
destination object location and would end up creating
duplicate objects on two different zones, reproduced
by doing encryption based key rotation.
Advantages avoids 100's of stats which are needed for each
upload operation in FS/NAS gateway mode when uploading a large
multipart object, dramatically increases performance for
multipart uploads by avoiding recursive calls.
For other gateway's simplifies the approach since
azure, gcs, hdfs gateway's don't capture any specific
metadata during upload which needs handler validation
for encryption/compression.
Erasure coding was already optimized, additionally
just avoids small allocations of large data structure.
Fixes#7206
GetDiskID() in storage rest client does not really issue a REST request
to the remote disk, but returns an in-memory value instead.
However, GetDiskID() should return an error when format.json is not
found or for other similar issues (unmounted disks, etc..)
GetDiskID() is only called when formatting disks and getting storage
informatio, hence this commit should not have a performance degradation.
Additionally also fix STS logs to filter out LDAP
password to be sent out in audit logs.
Bonus fix handle the reload of users properly by
making sure to preserve the newer users during the
reload to be not invalidated.
Fixes#9707Fixes#9644Fixes#9651
Bonus fixes in quota enforcement to use the
new datastructure and use timedValue to cache
a value/reload automatically avoids one less
global variable.
If the requested server is part of the set this will always read
from the local disk, even if the disk contains a parity shard.
In default setup there is a 50% chance that at least
one shard that otherwise would have been fetched remotely
will be read locally instead.
It basically trades RPC call overhead for reed-solomon.
On distributed localhost this seems to be fairly break-even,
with a very small gain in throughput and latency.
However on networked servers this should be a bigger
1MB objects, before:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 76257:
* Avg: 25ms 50%: 24ms 90%: 32ms 99%: 42ms Fastest: 7ms Slowest: 67ms
* First Byte: Average: 23ms, Median: 22ms, Best: 5ms, Worst: 65ms
Throughput:
* Average: 1213.68 MiB/s, 1272.63 obj/s (59.948s, starting 14:45:44 CEST)
```
After:
```
Operation: GET. Concurrency: 32. Hosts: 4.
Requests considered: 78845:
* Avg: 24ms 50%: 24ms 90%: 31ms 99%: 39ms Fastest: 8ms Slowest: 62ms
* First Byte: Average: 22ms, Median: 21ms, Best: 6ms, Worst: 57ms
Throughput:
* Average: 1255.11 MiB/s, 1316.08 obj/s (59.938s, starting 14:43:58 CEST)
```
Bonus fix: Only ask for heal once on an object.
This value is requested on every upload when there are multiple zones.
Since this will result in an RPC call to every remote disk this scales
quite badly in a distributed setup. Load every 1second interval.
2 servers, localhost only. In large distributed setups much bigger
gains can be expected.
```
Operations: 21743 -> 22454
* Average: +3.28% (+0.0 MiB/s) throughput, +3.28% (+11.9) obj/s
* Fastest: +3.37% (+0.0 MiB/s) throughput, +3.37% (+13.0) obj/s
* 50% Median: +3.03% (+0.0 MiB/s) throughput, +3.03% (+11.2) obj/s
* Slowest: +8.03% (+0.0 MiB/s) throughput, +8.03% (+22.8) obj/s
```
For easy management of this a generic helper has been added.
some clients such as veeam expect the x-amz-meta to
be sent in lower cased form, while this does indeed
defeats the HTTP protocol contract it is harder to
change these applications, while these applications
get fixed appropriately in future.
x-amz-meta is usually sent in lowercased form
by AWS S3 and some applications like veeam
incorrectly end up relying on the case sensitivity
of the HTTP headers.
Bonus fixes
- Fix the iso8601 time format to keep it same as
AWS S3 response
- Increase maxObjectList to 50,000 and use
maxDeleteList as 10,000 whenever multi-object
deletes are needed.
No one really uses FS for large scale accounting
usage, neither we crawl in NAS gateway mode. It is
worthwhile to simply disable this feature as its
not useful for anyone.
Bonus disable bucket quota ops as well in, FS
and gateway mode