It was observed in VMware vsphere environment during a
pod replacement, `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.
This PR fixes this behavior properly.
If the ILM document requires removing noncurrent versions, the
the server should be able to remove 'null' versions as well.
'null' versions are created when versioning is not enabled
or suspended.
The entire encryption layer is dependent on the fact that
KMS should be configured for S3 encryption to work properly
and we only support passing the headers as is to the backend
for encryption only if KMS is configured.
Make sure that this predictability is maintained, currently
the code was allowing encryption to go through and fail
at later to indicate that KMS was not configured. We should
simply reply "NotImplemented" if KMS is not configured, this
allows clients to simply proceed with their tests.
This is to ensure that Go contexts work properly, after some
interesting experiments I found that Go net/http doesn't
cancel the context when Body is non-zero and hasn't been
read till EOF.
The following gist explains this, this can lead to pile up
of go-routines on the server which will never be canceled
and will die at a really later point in time, which can
simply overwhelm the server.
https://gist.github.com/harshavardhana/c51dcfd055780eaeb71db54f9c589150
To avoid this refactor the locking such that we take locks after we
have started reading from the body and only take locks when needed.
Also, remove contextReader as it's not useful, doesn't work as expected
context is not canceled until the body reaches EOF so there is no point
in wrapping it with context and putting a `select {` on it which
can unnecessarily increase the CPU overhead.
We will still use the context to cancel the lockers etc.
Additional simplification in the locker code to avoid timers
as re-using them is a complicated ordeal avoid them in
the hot path, since locking is very common this may avoid
lots of allocations.
configurable remote transport timeouts for some special cases
where this value needs to be bumped to a higher value when
transferring large data between federated instances.
In `(*cacheObjects).GetObjectNInfo` copy the metadata before spawning a goroutine.
Clean up a few map[string]string copies as well, reducing allocs and simplifying the code.
Fixes#10426
From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions
```
When specifying the number of days in the NoncurrentVersionTransition
and NoncurrentVersionExpiration actions in a Lifecycle configuration,
note the following:
It is the number of days from when the version of the object becomes
noncurrent (that is, when the object is overwritten or deleted), that
Amazon S3 will perform the action on the specified object or objects.
Amazon S3 calculates the time by adding the number of days specified in
the rule to the time when the new successor version of the object is
created and rounding the resulting time to the next day midnight UTC.
For example, in your bucket, suppose that you have a current version of
an object that was created at 1/1/2014 10:30 AM UTC. If the new version
of the object that replaces the current version is created at 1/15/2014
10:30 AM UTC, and you specify 3 days in a transition rule, the
transition date of the object is calculated as 1/19/2014 00:00 UTC.
```
This PR adds a DNS target that ensures to update an entry
into Kubernetes operator when a bucket is created or deleted.
See minio/operator#264 for details.
Co-authored-by: Harshavardhana <harsha@minio.io>
MaxConnsPerHost can potentially hang a call without any
way to timeout, we do not need this setting for our proxy
and gateway implementations instead IdleConn settings are
good enough.
Also ensure to use NewRequestWithContext and make sure to
take the disks offline only for network errors.
Fixes#10304
inconsistent drive healing when one of the drive is offline
while a new drive was replaced, this change is to ensure
that we can add the offline drive back into the mix by
healing it again.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
This commit refactors the certificate management implementation
in the `certs` package such that multiple certificates can be
specified at the same time. Therefore, the following layout of
the `certs/` directory is expected:
```
certs/
│
├─ public.crt
├─ private.key
├─ CAs/ // CAs directory is ignored
│ │
│ ...
│
├─ example.com/
│ │
│ ├─ public.crt
│ └─ private.key
└─ foobar.org/
│
├─ public.crt
└─ private.key
...
```
However, directory names like `example.com` are just for human
readability/organization and don't have any meaning w.r.t whether
a particular certificate is served or not. This decision is made based
on the SNI sent by the client and the SAN of the certificate.
***
The `Manager` will pick a certificate based on the client trying
to establish a TLS connection. In particular, it looks at the client
hello (i.e. SNI) to determine which host the client tries to access.
If the manager can find a certificate that matches the SNI it
returns this certificate to the client.
However, the client may choose to not send an SNI or tries to access
a server directly via IP (`https://<ip>:<port>`). In this case, we
cannot use the SNI to determine which certificate to serve. However,
we also should not pick "the first" certificate that would be accepted
by the client (based on crypto. parameters - like a signature algorithm)
because it may be an internal certificate that contains internal hostnames.
We would disclose internal infrastructure details doing so.
Therefore, the `Manager` returns the "default" certificate when the
client does not specify an SNI. The default certificate the top-level
`public.crt` - i.e. `certs/public.crt`.
This approach has some consequences:
- It's the operator's responsibility to ensure that the top-level
`public.crt` does not disclose any information (i.e. hostnames)
that are not publicly visible. However, this was the case in the
past already.
- Any other `public.crt` - except for the top-level one - must not
contain any IP SAN. The reason for this restriction is that the
Manager cannot match a SNI to an IP b/c the SNI is the server host
name. The entire purpose of SNI is to indicate which host the client
tries to connect to when multiple hosts run on the same IP. So, a
client will not set the SNI to an IP.
If we would allow IP SANs in a lower-level `public.crt` a user would
expect that it is possible to connect to MinIO directly via IP address
and that the MinIO server would pick "the right" certificate. However,
the MinIO server cannot determine which certificate to serve, and
therefore always picks the "default" one. This may lead to all sorts
of confusing errors like:
"It works if I use `https:instance.minio.local` but not when I use
`https://10.0.2.1`.
These consequences/limitations should be pointed out / explained in our
docs in an appropriate way. However, the support for multiple
certificates should not have any impact on how deployment with a single
certificate function today.
Co-authored-by: Harshavardhana <harsha@minio.io>
- do not fail the healthcheck if heal status
was not obtained from one of the nodes,
if many nodes fail then report this as a
catastrophic error.
- add "x-minio-write-quorum" value to match
the write tolerance supported by server.
- admin info now states if a drive is healing
where madmin.Disk.Healing is set to true
and madmin.Disk.State is "ok"
Currently, cache purges are triggered as soon as the low watermark is exceeded.
To reduce IO this should only be done when reaching the high watermark.
This simplifies checks and reduces all calls for a GC to go through
`dcache.diskSpaceAvailable(size)`. While a comment claims that
`dcache.triggerGC <- struct{}{}` was non-blocking I don't see how
that was possible. Instead, we add a 1 size to the queue channel
and use channel semantics to avoid blocking when a GC has
already been requested.
`bytesToClear` now takes the high watermark into account to it will
not request any bytes to be cleared until that is reached.
This commit reduces the retry delay when retrying a request
to a KES server by:
- reducing the max. jitter delay from 3s to 1.5s
- skipping the random delay when there are more KES endpoints
available.
If there are more KES endpoints we can directly retry to the request
by sending it to the next endpoint - as pointed out by @krishnasrinivas
- delete-marker should be created on a suspended bucket as `null`
- delete-marker should delete any pre-existing `null` versioned
object and create an entry `null`
When checking parts we already do a stat for each part.
Since we have the on disk size check if it is at least what we expect.
When checking metadata check if metadata is 0 bytes.
The `getNonLoopBackIP` may grab an IP from an interface that
doesn't allow binding (on Windows), so this test consistently fails.
We exclude that specific error.
* readDirN: Check if file is directory
`syscall.FindNextFile` crashes if the handle is a file.
`errFileNotFound` matches 'unix' functionality: d19b434ffc/cmd/os-readdir_unix.go (L106)Fixes#10384
This commit addresses a maintenance / automation problem when MinIO-KES
is deployed on bare-metal. In orchestrated env. the orchestrator (K8S)
will make sure that `n` KES servers (IPs) are available via the same DNS
name. There it is sufficient to provide just one endpoint.
ListObjectsV1 requests are actually redirected to a specific node,
depending on the bucket name. The purpose of this behavior was
to optimize listing.
However, the current code sends a Bad Gateway error if the
target node is offline, which is a bad behavior because it means
that the list request will fail, although this is unnecessary since
we can still use the current node to list as well (the default behavior
without using proxying optimization)
Currently, you can see mint fails when there is one offline node, after
this PR, mint will always succeed.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
If DiskInfo calls failed the information returned was used anyway
resulting in no endpoint being set.
This would make the drive be attributed to the local system since
`disk.Endpoint == disk.DrivePath` in that case.
Instead, if the call fails record the endpoint and the error only.
bonus make sure to ignore objectNotFound, and versionNotFound
errors properly at all layers, since HealObjects() returns
objectNotFound error if the bucket or prefix is empty.
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
We should only enforce quotas if no error has been returned.
firstErr is safe to access since all goroutines have exited at this point.
If `firstErr` hasn't been set by something else return the context error if cancelled.
Keep dataUpdateTracker while goroutine is starting.
This will ensure the object is updated one `start` returns
Tested with
```
λ go test -cpu=1,2,4,8 -test.run TestDataUpdateTracker -count=1000
PASS
ok github.com/minio/minio/cmd 8.913s
```
Fixes#10295
fresh drive setups when one of the drive is
a root drive, we should ignore such a root
drive and not proceed to format.
This PR handles this properly by marking
the disks which are root disk and they are
taken offline.
newDynamicTimeout should be allocated once, in-case
of temporary locks in config and IAM we should
have allocated timeout once before the `for loop`
This PR doesn't fix any issue as such, but provides
enough dynamism for the timeout as per expectation.
Based on our previous conversations I assume we should send the version
id when healing an object.
Maybe we should even list object versions and heal all?
In a non recursive mode, issuing a list request where prefix
is an existing object with a slash and delimiter is a slash will
return entries in the object directory (data dir IDs)
```
$ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \
list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/'
{
"CommonPrefixes": [
{
"Prefix":
"code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/"
}
]
}
```
This commit adds a fast exit track in Walk() in this specific case.
use `/etc/hosts` instead of `/` to check for common
device id, if the device is same for `/etc/hosts`
and the --bind mount to detect root disks.
Bonus enhance healthcheck logging by adding maintenance
tags, for all messages.
adds a feature where we can fetch the MinIO
command-line remotely, this
is primarily meant to add some stateless
nature to the MinIO deployment in k8s
environments, MinIO operator would run a
webhook service endpoint
which can be used to fetch any environment
value in a generalized approach.
It is possible in situations when server was deployed
in asymmetric configuration in the past such as
```
minio server ~/fs{1...4}/disk{1...5}
```
Results in setDriveCount of 10 in older releases
but with fairly recent releases we have moved to
having server affinity which means that a set drive
count ascertained from above config will be now '4'
While the object layer make sure that we honor
`format.json` the storageClass configuration however
was by mistake was using the global value obtained
by heuristics. Which leads to prematurely using
lower parity without being requested by the an
administrator.
This PR fixes this behavior.
Bonus also fix a bug where we did not purge relevant
service accounts generated by rotating credentials
appropriately, service accounts should become invalid
as soon as its corresponding parent user becomes invalid.
Since service account themselves carry parent claim always
we would never reach this problem, as the access get
rejected at IAM policy layer.
when source and destination are same and versioning is enabled
on the destination bucket - we do not need to re-create the entire
object once again to optimize on space utilization.
Cases this PR is not supporting
- any pre-existing legacy object will not
be preserved in this manner, meaning a new
dataDir will be created.
- key-rotation and storage class changes
of course will never re-use the dataDir
conflicting files can exist on FS at
`.minio.sys/buckets/testbucket/policy.json/`, this is an
expected valid scenario for FS mode allow it to work,
i.e ignore and move forward
With reduced parity our write quorum should be same
as read quorum, but code was still assuming
```
readQuorum+1
```
In all situations which is not necessary.
Generalize replication target management so
that remote targets for a bucket can be
managed with ARNs. `mc admin bucket remote`
command will be used to manage targets.
Context timeout might race on each other when timeouts are lower
i.e when two lock attempts happened very quickly on the same resource
and the servers were yet trying to establish quorum.
This situation can lead to locks held which wouldn't be unlocked
and subsequent lock attempts would fail.
This would require a complete server restart. A potential of this
issue happening is when server is booting up and we are trying
to hold a 'transaction.lock' in quick bursts of timeout.
replace dummy buffer with nullReader{} instead,
to avoid large memory allocations in memory
constrainted environments. allows running
obd tests in such environments.
Currently, listing directories on HDFS incurs a per-entry remote Stat() call
penalty, the cost of which can really blow up on directories with many
entries (+1,000) especially when considered in addition to peripheral
calls (such as validation) and the fact that minio is an intermediary to the
client (whereas other clients listed below can query HDFS directly).
Because listing directories this way is expensive, the Golang HDFS library
provides the [`Client.Open()`] function which creates a [`FileReader`] that is
able to batch multiple calls together through the [`Readdir()`] function.
This is substantially more efficient for very large directories.
In one case we were witnessing about +20 seconds to list a directory with 1,500
entries, admittedly large, but the Java hdfs ls utility as well as the HDFS
library sample ls utility were much faster.
Hadoop HDFS DFS (4.02s):
λ ~/code/minio → use-readdir
» time hdfs dfs -ls /directory/with/1500/entries/
…
hdfs dfs -ls 5.81s user 0.49s system 156% cpu 4.020 total
Golang HDFS library (0.47s):
λ ~/code/hdfs → master
» time ./hdfs ls -lh /directory/with/1500/entries/
…
./hdfs ls -lh 0.13s user 0.14s system 56% cpu 0.478 total
mc and minio **without** optimization (16.96s):
λ ~/code/minio → master
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.22s user 0.29s system 3% cpu 16.968 total
mc and minio **with** optimization (0.40s):
λ ~/code/minio → use-readdir
» time mc ls myhdfs/directory/with/1500/entries/
…
./mc ls 0.13s user 0.28s system 102% cpu 0.403 total
[`Client.Open()`]: https://godoc.org/github.com/colinmarc/hdfs#Client.Open
[`FileReader`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader
[`Readdir()`]: https://godoc.org/github.com/colinmarc/hdfs#FileReader.Readdir
If there are many listeners to bucket notifications or to the trace
subsystem, healing fails to work properly since it suspends itself when
the number of concurrent connections is above a certain threshold.
These connections are also continuous and not costly (*no disk access*),
it is okay to just ignore them in waitForLowHTTPReq().
this is to detect situations of corruption disk
format etc errors quickly and keep the disk online
in such scenarios for requests to fail appropriately.
Fixes two different types of problems
- continuation of the problem seen in FS #9992
as not fixed for erasure coded deployments,
reproduced this issue with spark and its fixed now
- another issue was leaking walk go-routines which
would lead to high memory usage and crash the system
this is simply because all the walks which were purged
at the top limit had leaking end walkers which would
consume memory endlessly.
closes#9966closes#10088
not all claims need to be present for
the JWT claim, let the policies not
exist and only apply which are present
when generating the credentials
once credentials are generated then
those policies should exist, otherwise
the request will fail.
- copyObject in-place decryption failed
due to incorrect verification of headers
- do not decode ETag when object is encrypted
with SSE-C, so that pre-conditions don't fail
prematurely.
This PR adds support for healing older
content i.e from 2yrs, 1yr. Also handles
other situations where our config was
not encrypted yet.
This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
In federated NAS gateway setups, multiple hosts in srvRecords
was picked at random which could mean that if one of the
host was down the request can indeed fail and if client
retries it would succeed. Instead allow server to figure
out the current online host quickly such that we can
exclude the host which is down.
At the max the attempt to look for a downed node is to
300 millisecond, if the node is taking longer to respond
than this value we simply ignore and move to the node,
total attempts are equal to number of srvRecords if no
server is online we simply fallback to last dialed host.
healing was not working properly when drives were
replaced, due to the error check in root disk
calculation this PR fixes this behavior
This PR also adds additional fix for missing
metadata entries from .minio.sys as part of
disk healing as well.
Added code to ignore and print more context
sensitive errors for better debugging.
This PR is continuation of fix in 7b14e9b660
Enforce bucket quotas when crawling has finished.
This ensures that we will not do quota enforcement on old data.
Additionally, delete less if we are closer to quota than we thought.
for users who don't have access to HDFS rootPath '/'
can optionally specify `minio gateway hdfs hdfs://namenode:8200/path`
for which they have access to, allowing all writes to be
performed at `/path`.
NOTE: once configured in this manner you need to make
sure command line is correctly specified, otherwise
your data might not be visible
closes#10011
this PR to allow legacy support for big-data
applications which run older Java versions
which do not support the secure ciphers
currently defaulted in MinIO. This option
allows optionally to turn them off such
that client and server can negotiate the
best ciphers themselves.
This env is purposefully not documented,
meant as a last resort when client
application cannot be changed easily.
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
while we handle all situations for writes and reads
on older format, what we didn't cater for properly
yet was delete where we only ended up deleting
just `xl.meta` - instead we should allow all the
deletes to go through for older format without
versioning enabled buckets.
CORS is notorious requires specific headers to be
handled appropriately in request and response,
using cors package as part of handlerFunc() for
options method lacks the necessary control this
package needs to add headers.
Without instantiating a new rest client we can
have a recursive error which can lead to
healthcheck returning always offline, this can
prematurely take the servers offline.
gorilla/mux broke their recent release 1.7.4 which we
upgraded to, we need the current workaround to ensure
that our regex matches appropriately.
An upstream PR is sent, we should remove the
workaround once we have a new release.
- reduce locker timeout for early transaction lock
for more eagerness to timeout
- reduce leader lock timeout to range from 30sec to 1minute
- add additional log message during bootstrap phase
Main issue is that `t.pool[params]` should be `t.pool[oldest]`.
We add a bit more safety features for the code.
* Make writes to the endTimerCh non-blocking in all cases
so multiple releases cannot lock up.
* Double check expectations.
* Shift down deletes with copy instead of truncating slice.
* Actually delete the oldest if we are above total limit.
* Actually delete the oldest found and not the current.
* Unexport the mutex so nobody from the outside can meddle with it.
This commit adds a new admin API for creating master keys.
An admin client can send a POST request to:
```
/minio/admin/v3/kms/key/create?key-id=<keyID>
```
The name / ID of the new key is specified as request
query parameter `key-id=<ID>`.
Creating new master keys requires KES - it does not work with
the native Vault KMS (deprecated) nor with a static master key
(deprecated).
Further, this commit removes the `UpdateKey` method from the `KMS`
interface. This method is not needed and not used anymore.
with the merge of https://github.com/etcd-io/etcd/pull/11823
etcd v3.5.0 will now have a properly imported versioned path
this fixes our pending migration to newer repo
Use a separate client for these calls that can take a long time.
Add request context to these so they are canceled when the client
disconnects instead except for ListObject which doesn't have any equivalent.