Use a single allocation for reading the file, not the growing buffer of `io.ReadAll`.
Reuse the write buffer if we can when writing metadata in RenameData.
* reduce extra getObjectInfo() calls during ILM transition
This PR also changes expiration logic to be non-blocking,
scanner is now free from additional costs incurred due
to slower object layer calls and hitting the drives.
* move verifying expiration inside locks
A multi resources lock is a single lock UID with multiple associated
resources. This is created for example by multi objects delete
operation. This commit changes the behavior of Refresh() to iterate over
all locks having the same UID and refresh them.
Bonus: Fix showing top locks for multi delete objects
#11878 added "keepHTTPResponseAlive" to CreateFile requests.
The problem is that it will begin writing to the response before the
body is read after 10 seconds. This will abort the writes on the
client-side, since it assumes the server has received what it wants.
The proposed solution here is to monitor the completion of the body
before beginning to send keepalive pings.
Fixes observed high number of goroutines stuck in `io.Copy` in
`github.com/minio/minio/cmd.(*xlStorage).CreateFile` and
`(*storageRESTClient).CreateFile` stuck in `http.DrainBody`.
In the event when a lock is not refreshed in the cluster, this latter
will be automatically removed in the subsequent cleanup of non
refreshed locks routine, but it forgot to clean the local server,
hence having the same weird stale locks present.
This commit will remove the lock locally also in remote nodes, if
removing a lock from a remote node will fail, it will be anyway
removed later in the locks cleanup routine.
Currently in master this can cause existing
parent users to stop working and lead to
credentials getting overwritten.
```
~ mc admin user add alias/ minio123 minio123456
```
```
~ mc admin user svcacct add alias/ minio123 \
--access-key minio123 --secret-key minio123456
```
This PR rejects all such scenarios.
Faster healing as well as making healing more
responsive for faster scanner times.
also fixes a bug introduced in #13079, newly replaced
disks were not healing automatically.
- remove sourceCh usage from healing
we already have tasks and resp channel
- use read locks to lookup globalHealConfig
- fix healing resolver to pick candidates quickly
that need healing, without this resolver was
unexpectedly skipping.
healObject() should be non-blocking to ensure
that scanner is not blocked for a long time,
this adversely affects performance of the scanner
and also affects the way usage is updated
subsequently.
This PR allows for a non-blocking behavior for
healing, dropping operations that cannot be queued
anymore.
Synchronize bucket cycles so it is much more
likely that the same prefixes will be picked up
for scanning.
Use the global bloom filter cycle for that.
Bump bloom filter versions to clear those.
The intention is to list values of sys config that can potentially
impact the performance of minio.
At present, it will return max value configured for rlimit
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
proceed to heal the cluster when all the
drives in a set have failed, this is extremely
rare occurrence but even if it happens we allow
the cluster to be functional.
A recent regression caused new disks not being re-formatted. In the old
code, a disk needed be 'online' to be chosen to be formatted but the
disk has to be already formatted for XL storage IsOnline() function to
return true.
It is enough to check if XL storage is nil or not if we want to avoid
formatting root disks.
Co-authored-by: Anis Elleuch <anis@min.io>
markRootDisksAsDown() relies on disk info even if the
disk is unformatted. Therefore, we should always return
DiskInfo data even when DiskInfo storage API returns
errUnformattedDisk
`mc admin heal` command will show servers/disks tolerance, for that
purpose, you need to know the number of parity disks for each storage
class.
Parity is always the same in all pools.
prefixes at top level create such as
```
~ mc mb alias/bucket/prefix
```
The prefix/ incorrect appears as prefix__XL_DIR__/
in the accountInfo output, make sure to trim '__XL_DIR__'
Objects uploaded in this format for example
```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```
Won't list the object, handle this scenario.
Ensure that one call will succeed and others will serialize
Example failure without code in place:
```
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:135: want 1 ok, got 0
```
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.
We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.
Single list operations should remain largely unaffected.
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1
handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.
added extra tests as well to cover all
these scenarios.
- deletes should always Sweep() for tiering at the
end and does not need an extra getObjectInfo() call
- puts, copy and multipart writes should conditionally
do getObjectInfo() when tiering targets are configured
- introduce 'TransitionedObject' struct for ease of usage
and understanding.
- multiple-pools optimization deletes don't need to hold
read locks verifying objects across namespace and pools.
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.
also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
The previous code removes SVC/STS accounts for ldap users that do not
exist anymore in LDAP server. This commit will actually re-evaluate
filter as well if it is changed and remove all local SVC/STS accounts
beloning to the ldap user if the latter is not eligible for the
search filter anymore.
For example: the filter selects enabled users among other criteras in
the LDAP database, if one ldap user changes his status to disabled
later, then associated SVC/STS accounts will be removed because that user
does not meet the filter search anymore.
Traffic metering was not protected against concurrent updates.
```
WARNING: DATA RACE
Read at 0x00c02b0dace8 by goroutine 235:
github.com/minio/minio/cmd.setHTTPStatsHandler.func1()
d:/minio/minio/cmd/generic-handlers.go:360 +0x27d
net/http.HandlerFunc.ServeHTTP()
...
Previous write at 0x00c02b0dace8 by goroutine 994:
github.com/minio/minio/internal/http/stats.(*IncomingTrafficMeter).Read()
d:/minio/minio/internal/http/stats/http-traffic-recorder.go:34 +0xd2
```
The intention is to provide status of any sys services that can
potentially impact the performance of minio.
At present, it will return information about the `selinux` service
(not-installed/disabled/permissive/enforcing)
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
This happens because of a change added where any sub-credential
with parentUser == rootCredential i.e (MINIO_ROOT_USER) will
always be an owner, you cannot generate credentials with lower
session policy to restrict their access.
This doesn't affect user service accounts created with regular
users, LDAP or OpenID
Use `readMetadata` when reading version
information without data requested.
Reduces IO on inlined data.
Bonus: Inline compressed data as well when
compression is enabled.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
removes unexpected features from regular putObject() such as
- increasing parity when disks are down, avoids
a lot of DiskInfo() calls.
- triggering MRF for metacache objects
if disks are offline
- avoiding renames from temporary location
to actual namespace, not needed since
metacache files are unique.
Disregard interfaces that are down when selecting bind addresses
Windows often has a number of disabled NICs used for VPN and other services.
This often causes minio to select an address for contacting the console that is on a disabled (virtual) NIC.
This checks if the interface is up before adding it to the pool on Windows.
Before, the gateway will complain that it found KMS configured in the
environment but the gateway mode does not support encryption. This
commit will allow starting of the gateway but ensure that S3 operations
with encryption headers will fail when the gateway doesn't support
encryption. That way, the user can use etcd + KMS and have IAM data
encrypted in the etcd store.
Co-authored-by: Anis Elleuch <anis@min.io>
improvements include
- skip IPv6 correctly
- do not set default value for
MINIO_SERVER_URL, let it be
configured if not use local IPs
Bonus:
- In healing return error from listPathRaw()
- update console to v0.8.3
Remote caches were not returned correctly, so they would not get updated on save.
Furthermore make some tweaks for more reliable updates.
Invalidate bloom filter to ensure rescan.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
its possible that some multipart uploads would have
uploaded only single parts so relying on `len(o.Parts)`
alone is not sufficient, we need to look for ETag
pattern to be absolutely sure.
Some incorrect setups might have multiple audiences
where they are trying to use a single authentication
endpoint for multiple services.
Nevertheless OpenID spec allows it to make it
even more confusin for no good reason.
> It MUST contain the OAuth 2.0 client_id of the
> Relying Party as an audience value. It MAY also
> contain identifiers for other audiences. In the
> general case, the aud value is an array of case
> sensitive strings. In the common special case
> when there is one audience, the aud value MAY
> be a single case sensitive string.
fixes#12809
this healing optimization caused multiple
regressions in healing
- delete-markers incorrectly missing
heal and returning incorrect healing
results to client.
- missing individual 'parts' such
as for restored object or simply
for all objects just missing few parts.
This optimization is not necessary, we
should proceed to verify all cases possible
not just when metadata is inconsistent.
destination path and old path will be similar
when healing occurs, this can lead to healed
parts being again purged leading to always an
inconsistent state on an object which might
further cause reduction in quorum eventually.
delete-markers missing on drives were
not healed due to few things
disksWithAllParts() does not know-how
to deal with delete markers, add support
for that.
fixes#12787
- delete-markers are incorrectly reported
as corrupt with wrong data sent to client
'mc admin heal -r' on objects with delete
marker will report as 'grey' incorrectly.
- do not heal delete-markers during HeadObject()
this can lead to inconsistent order of heals
on the object, although this is not an issue
in terms of order of versions it is rather
simpler to keep the same order on all drives.
- defaultHealResult() should handle 'err == nil'
case such that valid cases should be handled
as 'drive' status OK.
- remove use of getOnlineDisks() instead rely on fallbackDisks()
when disk return errors like diskNotFound, unformattedDisk
use other fallback disks to list from, instead of paying the
price for checking getOnlineDisks()
- optimize getDiskID() further to avoid large write locks when
looking formatLastCheck time window
This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
When configured in Lookup Bind mode, the server now periodically queries the
LDAP IDP service to find changes to a user's group memberships, and saves this
info to update the access policies for all temporary and service account
credentials belonging to LDAP users.
Add a new goroutine file which has another printing format. We need it
to see how much time each goroutine was blocked. Easier to detect stops.
Co-authored-by: Anis Elleuch <anis@min.io>
when TLS is configured using IPs directly
might interfere and not work properly when
the server is configured with TLS certs but
the certs only have domain certs.
Also additionally allow users to specify
a public accessible URL for console to talk
to MinIO i.e `MINIO_SERVER_URL` this would
allow them to use an external ingress domain
to talk to MinIO. This internally fixes few
problems such as presigned URL generation on
the console UI etc.
This needs to be done additionally for any
MinIO deployments that might have a much more
stricter requirement when running in standalone
mode such as FS or standalone erasure code.
This method is used to add expected expiration and transition time
for an object in GET/HEAD Object response headers.
Also fixed bugs in lifecycle.PredictTransitionTime and
getLifecycleTransitionTier in handling current and
non-current versions.
This allows remote bucket admin to identify the origin of transitioned
objects by simply inspecting the object prefixes.
e.g let's take a remote tier TIER-1 pointing to a remote bucket (prefix)
testbucket/testprefix-1. The remote bucket admin can list all transitioned objects
from a MinIO deployment identified by '2e78e906-1c5d-4f94-8689-9df44cafde39' and
source bucket 'mybucket' like so,
```
$ ./mc ls -r minio-tier-target/testbucket/testprefix-1/2e78e906-1c5d-4f94-8689-9df44cafde39/mybucket/
[2021-07-12 17:15:50 PDT] 160B 48/fb/48fbc0e6-3a73-458b-9337-8e722c619ca4
[2021-07-12 16:58:46 PDT] 160B 7d/1c/7d1c96bd-031a-48d4-99ea-b1304e870830
```
In case of non-distributed setup, if the server start command contains a
`--console-address` flag and its value contains a hostname, it is not
getting anonymized.
Fixed by replacing the console host also with `server1`
This commit gathers MRF metrics from
all nodes in a cluster and return it to the caller. This will show information about the
number of objects in the MRF queues
waiting to be healed.
In case of replication healing, we always store completed status in the
object metadata, which is wrong because replication could fail in the
further retries.
Ensure that hostnames / ip addresses are not printed in the subnet
health report. Anonymize them by replacing them with `servern` where `n`
represents the position of the server in the pool.
This is done by building a `host anonymizer` map that maps every
possible value containing the host e.g. host, host:port,
http://host:port, etc to the corresponding anonymized name and using
this map to replace the values at the time of health report generation.
A different logic is used to anonymize host names in the `procinfo`
data, as the host names are part of an ellipses pattern in the process
start command. Here we just replace the prefix/suffix of the ellipses
pattern with their hashes.
Gzip responses if appropriate, except GetObject requests.
List reponses has an almost 10:1 compression ratio with no
measurable slowdown (in fact it seems a bit faster).
- ParentUser for OIDC auth changed to `openid:`
instead of `jwt:` to avoid clashes with variable
substitution
- Do not pass in random parents into IsAllowed()
policy evaluation as it can change the behavior
of looking for correct policies underneath.
fixes#12676fixes#12680
with console addition users cannot login with
root credentials without etcd persistent layer,
allow a dummy store such that such functionalities
can be supported when running as non-persistent
manner, this enables all calls and operations.
MinIO might be running inside proxies, and
console while being on another port might not be
reachable on a specific port behind such proxies.
For such scenarios customize the redirect URL
such that console can be redirected to correct
proxy endpoint instead.
fixes#12661
Download files from *any* bucket/path as an encrypted zip file.
The key is included in the response but can be separated so zip
and the key doesn't have to be sent on the same channel.
Requires https://github.com/minio/pkg/pull/6
Additional support for vendor-specific admin API
integrations for OpenID, to ensure validity of
credentials on MinIO.
Every 5minutes check for validity of credentials
on MinIO with vendor specific IDP.
DiskInfo() calls can stagger and wait if run
serially timing out 10secs per drive, to avoid
this lets check DiskInfo in parallel to avoid
delays when nodes get disconnected.
Fixes brought forward from https://github.com/minio/minio/pull/12605
Fixes resolution when an object is in prefix of another and one zone returns the directory and another the object.
Fixes resolution on single entries that arrive first, so resolution doesn't depend on order.
auditLog should be attempted right before the
return of the function and not multiple times
per function, this ensures that we only trigger
it once per function call.
if object was uploaded with multipart. This is to ensure that
GetObject calls with partNumber in URI request parameters
have same behavior on source and replication target.
also do not incorrectly double count
objExists unless its selected and it
matches with previous entry.
Bonus: change listQuorum to match with
AskDisks to ensure that we atleast by
default choose all the "drives" that
we asked is consistent.
Bonus: remove kms_kes as sub-system, since its ENV only.
- also fixes a crash with etcd cluster without KMS
configured and also if KMS decryption is missing.
backend-encrypted doesn't need to be explicitly healed anymore
since this file is deleted upon upgrade and migration to the
KMS based encrypted config/IAM credentials.
goroutine 1 [running]:
runtime/internal/atomic.panicUnaligned()
/usr/local/go/src/runtime/internal/atomic/unaligned.go:8 +0x24
golang doc:
// BUG(rsc): On x86-32, the 64-bit functions use instructions unavailable before the Pentium MMX.
//
// On non-Linux ARM, the 64-bit functions use instructions unavailable before the ARMv6k core.
//
// On ARM, x86-32, and 32-bit MIPS,
// it is the caller's responsibility to arrange for 64-bit
// alignment of 64-bit words accessed atomically. The first word in a
// variable or in an allocated struct, array, or slice can be relied upon to be
// 64-bit aligned.
Create write lock on PutObject and CopyObject when on multi-pool setup.
Use the same lock as NewMultipartUpload so all creation calls share the same lock.
This feature also changes the default port where
the browser is running, now the port has moved
to 9001 and it can be configured with
```
--console-address ":9001"
```
When no results are sent `result.end` is never sent, so the list becomes hot until the list is full.
Break immediately when channel is closed.
Fixes#12518
previous PR incorrectly changed rename() from
tmp to -> tmp/.trash/uuid, since it is self
referential - to clear this up make sure its
renamed to a separate folder and deleted
in background - just like before.
Each multipart upload is holding a read lock for the entire upload
duration of each part.
This makes it impossible for other parts to complete until all currently
uploading parts have released their locks.
It will also make it impossible for new parts to start as long as the
write lock is still being requested, essentially deadlocking uploads
until all that may have been granted a read lock has been completed.
Refactor to only hold the upload id lock while reading and writing
the metadata, but hold a part id lock while the part is being uploaded.
This commit adds an admin API for fetching
the KMS status information (default key ID, endpoints, ...).
With this commit the server exposes REST endpoint:
```
GET <admin-api>/kms/status
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Web Handlers can generate STS tokens but forgot to create a parent user
and save it along with the temporary access account. This commit fixes
this.
fixes#12381
its possible that, version might exist on second pool such that
upon deleteBucket() might have deleted the bucket on pool1 successfully
since it doesn't have any objects, undo such operations properly in
all any error scenario.
Also delete bucket metadata from pool layer rather than sets layer.
objectErasureMap in the audit holds information about the objects
involved in the current S3 operation such as pool index, set an index,
and disk endpoints. One user saw a crash due to a concurrent update of
objectErasureMap information. Use sync.Map to prevent a crash.
Always use `GetActualSize` to get the part size, not just when encrypted.
Fixes mint test io.minio.MinioClient.uploadPartCopy,
error "Range specified is not valid for source object".
healing code was using incorrect buffers to heal older
objects with 10MiB erasure blockSize, incorrect calculation
of such buffers can lead to incorrect premature closure of
io.Pipe() during healing.
fixes#12410
- it is possible that during I/O failures we might
leave partially written directories, make sure
we purge them after.
- rename current data-dir (null) versionId only after
the newer xl.meta has been written fully.
- attempt removal once for minioMetaTmpBucket/uuid/
as this folder is empty if all previous operations
were successful, this allows avoiding recursive os.Remove()
- for single pool setups usage is not checked.
- for pools, only check the "set" in which it would be placed.
- keep a minimum number of inodes (when we know it).
- ignore for `.minio.sys`.