listing can fail and it is allowed to be retried,
instead of returning right away return an error at
the end - heal the rest of the buckets and objects,
and when we are retrying skip the buckets that
are already marked done by using the tracked buckets.
fixes#12972
- Go might reset the internal http.ResponseWriter() to `nil`
after Write() failure if the go-routine has returned, do not
flush() such scenarios and avoid spurious flushes() as
returning handlers always flush.
- fix some racy tests with the console
- avoid ticker leaks in certain situations
Existing:
```go
type xlMetaV2 struct {
Versions []xlMetaV2Version `json:"Versions" msg:"Versions"`
}
```
Serialized as regular MessagePack.
```go
//msgp:tuple xlMetaV2VersionHeader
type xlMetaV2VersionHeader struct {
VersionID [16]byte
ModTime int64
Type VersionType
Flags xlFlags
}
```
Serialize as streaming MessagePack, format:
```
int(headerVersion)
int(xlmetaVersion)
int(nVersions)
for each version {
binary blob, xlMetaV2VersionHeader, serialized
binary blob, xlMetaV2Version, serialized.
}
```
xlMetaV2VersionHeader is <= 30 bytes serialized. Deserialized struct
can easily be reused and does not contain pointers, so efficient as a
slice (single allocation)
This allows quickly parsing everything as slices of bytes (no copy).
Versions are always *saved* sorted by modTime, newest *first*.
No more need to sort on load.
* Allows checking if a version exists.
* Allows reading single version without unmarshal all.
* Allows reading latest version of type without unmarshal all.
* Allows reading latest version without unmarshal of all.
* Allows checking if the latest is deleteMarker by reading first entry.
* Allows adding/updating/deleting a version with only header deserialization.
* Reduces allocations on conversion to FileInfo(s).
This will help other projects like `health-analyzer` to verify that the
struct was indeed populated by the minio server, and is not
default-populated during unmarshalling of the JSON.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
legacy objects in 'xl.json' after upgrade, should have
following sequence of events - bucket should have versioning
enabled and the object should have been overwritten with
another version of an object.
this situation was not handled, which would lead to older
objects to stay perpetually with "legacy" dataDir, however
these objects were readable by all means - there weren't
converted to newer format.
This PR fixes this situation properly.
Add a new Prometheus metric for bucket replication latency
e.g.:
minio_bucket_replication_latency_ns{
bucket="testbucket",
operation="upload",
range="LESS_THAN_1_MiB",
server="127.0.0.1:9001",
targetArn="arn:minio:replication::45da043c-14f5-4da4-9316-aba5f77bf730:testbucket"} 2.2015663e+07
Co-authored-by: Klaus Post <klauspost@gmail.com>
container limits would not be properly honored in
our current implementation, mem.VirtualMemory()
function only reads /proc/meminfo which points to
the host system information inside the container.
This feature is useful in situations when console is exposed
over multiple intranent or internet entities when users are
connecting over local IP v/s going through load balancer.
Related console work was merged here
373bfbfe3f
- remove some duplicated code
- reported a bug, separately fixed in #13664
- using strings.ReplaceAll() when needed
- using filepath.ToSlash() use when needed
- remove all non-Go style comments from the codebase
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
- add checks such that swapped disks are detected
and ignored - never used for normal operations.
- implement `unrecognizedDisk` to be ignored with
all operations returning `errDiskNotFound`.
- also add checks such that we do not load unexpected
disks while connecting automatically.
- additionally humanize the values when printing the errors.
Bonus: fixes handling of non-quorum situations in
getLatestFileInfo(), that does not work when 2 drives
are down, currently this function would return errors
incorrectly.
creating service accounts is implicitly enabled
for all users, this PR however adds support to
reject creating service accounts, with an explicit
"Deny" policy.
On first list resume or when specifying a custom markers entries could be missed in rare cases.
Do conservative truncation of entries when forwarding.
Replaces #13619
If a given MinIO config is dynamic (can be changed without restart),
ensure that it can be reset also without restart.
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
when `MINIO_CACHE_COMMIT` is set.
- `writeback` caching applies only to single
uploads. When cache commit mode is
`writeback`, default multipart caching to be
synchronous.
- Add writethrough caching for single uploads
This commit makes the MinIO server behavior more consistent
w.r.t. key usage verification.
When MinIO verifies the client certificates it also checks
that the client certificate is valid of client authentication
(or any (i.e. wildcard) usage).
However, the MinIO server used to not verify the client key usage
when client certificate verification was disabled.
Now, the MinIO server verifies the client key usage even when
client certificate verification has been disabled. This makes
the MinIO behavior more consistent from a client's perspective.
Now, a client certificate has to be valid for client authentication
in all cases.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Bonus: if runs have PUT higher then capture it anyways
to display an unexpected result, which provides a way
to understand what might be slowing things down on the
system.
For example on a Data24 WDC setup it is clearly visible
there is a bug in the hardware.
```
./mc admin speedtest wdc/
⠧ Running speedtest (With 64 MiB object size, 32 concurrency) PUT: 31 GiB/s GET: 24 GiB/s
⠹ Running speedtest (With 64 MiB object size, 48 concurrency) PUT: 38 GiB/s GET: 24 GiB/s
MinIO 2021-11-04T06:08:33Z, 6 servers, 48 drives
PUT: 38 GiB/s, 605 objs/s
GET: 24 GiB/s, 383 objs/s
```
Reads are almost 14GiB/sec slower than Writes which
is practically not possible.
This reverts commit 091a7ae359.
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
Also adds fix for accidental canned policy removal that was present in the
reverted version of the change.
Windows users often click on the binary without
knowing MinIO is a command-line tool and should be
run from a terminal. Throw a message to guide them
on what to do.
Co-authored-by: Klaus Post <klauspost@gmail.com>
Borrowed idea from Go's usage of this
optimization for ReadFrom() on client
side, we should re-use the 32k buffers
io.Copy() allocates for generic copy
from a reader to writer.
the performance increase for reads for
really tiny objects is at this range
after this change.
> * Fastest: +7.89% (+1.3 MiB/s) throughput, +7.89% (+1308.1) obj/s
- Ensure all actions accessing storage lock properly.
- Behavior change: policies can be deleted only when they
are not associated with any active credentials.
- The race happens with a goroutine that refreshes IAM cache data from storage.
- It could lead to deleted users re-appearing as valid live credentials.
- This change also causes CI to run tests without a race flag (in addition to
running it with).
deleting collection of versions belonging
to same object, we can avoid re-reading
the xl.meta from the disk instead purge
all the requested versions in-memory,
the tradeoff is to allocate a map to de-dup
the versions, allow disks to be read only
once per object.
additionally reduce the data transfer between
nodes by shortening msgp data values.
- combine similar looking functionalities into single
handlers, and remove unnecessary proxying of the
requests at handler layer.
- remove bucket forwarding handler as part of default setup
add it only if bucket federation is enabled.
Improvements observed for 1kiB object reads.
```
-------------------
Operation: GET
Operations: 4538555 -> 4595804
* Average: +1.26% (+0.2 MiB/s) throughput, +1.26% (+190.2) obj/s
* Fastest: +4.67% (+0.7 MiB/s) throughput, +4.67% (+739.8) obj/s
* 50% Median: +1.15% (+0.2 MiB/s) throughput, +1.15% (+173.9) obj/s
```
Removes RLock/RUnlock for updating metadata,
since we already take a write lock to update
metadata, this change removes reading of xl.meta
as well as an additional lock, the performance gain
should increase 3x theoretically for
- PutObjectRetention
- PutObjectLegalHold
This optimization is mainly for Veeam like
workloads that require a certain level of iops
from these API calls, we were losing iops.
read/writers are not concurrent in handlers
and self contained - no need to use atomics on
them.
avoids unnecessary contentions where it's not
required.
Logger targets were not race protected against concurrent updates from for example `HTTPConsoleLoggerSys`.
Restrict direct access to targets and make slices immutable so a returned slice can be processed safely without locks.
As we use etcd's watch interface, we do not need the
network notifications as they are no-ops anyway.
Bonus: Remove globalEtcdClient global usage in IAM
IAMSys is a higher-level object, that should not be called by the lower-level
storage API interface for IAM. This is to prepare for further improvements in
IAM code.
etcd operations, get/put/delete, should be logged when failed
with errors other than not found error. It will make it easier to
see connections issues from MinIO to etcd.
Refresh was doing a linear scan of all locked resources. This was adding
up to significant delays in locking on high load systems with long
running requests.
Add a secondary index for O(log(n)) UID -> resource lookups.
Multiple resources are stored in consecutive strings.
Bonus fixes:
* On multiple Unlock entries unlock the write locks we can.
* Fix `expireOldLocks` skipping checks on entry after expiring one.
* Return fast on canTakeUnlock/canTakeLock.
* Prealloc some places.
We do not reliably know the length of compressed data, including headers.
Request until the end-of-stream. Results will still be properly truncated.
Fixes#13441
Testing with `mc sql --compression BZIP2 --csv-input "rd=\n,fh=USE,fd=;" --query="select COUNT(*) from S3Object" local2/testbucket/nyc-taxi-data-10M.csv.bz2`
Before 96.98s, after 10.79s. Uses about 70% CPU while running.
offset+length should match the Size() of the individual parts
return 'errFileCorrupt' otherwise, to trigger healing of the individual
parts do not error out prematurely when healing such bitrot's upon
successful parts being written to the client.
another issue this PR fixes is to not return and error to
the client if we have just triggered a heal on a specific
part of the object, instead continue to read all the content
and let the heal happen asynchronously later.
Looks like policy restriction was not working properly
for normal users when they are not svc or STS accounts.
- svc accounts are now properly fixed to get
right permissions when its inherited, so
we do not have to set 'owner = true'
- sts accounts have always been using right
permissions, do not need an explicit lookup
- regular users always have proper policy mapping
- avoids relying in listQuorum from the underlying listObjects()
and potentially missing entries if any.
- avoid the entire merging logic etc, listing raw set by set
and loading whatever is found is cleaner when dealing with
a large cluster for IAM metadata.
* fix: disallow invalid x-amz-security-token for root credentials
fixes#13335
This was a regression added in #12947 when this part of the
code was refactored to avoid privilege issues with service
accounts with session policy.
Bonus:
- fix: AssumeRoleWithCertificate policy mapping and reload
AssumeRoleWithCertificate was not mapping to correct
policies even after successfully generating keys, since
the claims associated with this API were never looked up
properly. Ensure that policies are set appropriately.
- GetUser() API was not loading policies correctly based
on AccessKey based mapping which is true with OpenID
and AssumeRoleWithCertificate API.
with some broken clients allow non-strict validation
of sha256 when ContentLength > 0, it has been found in
the wild some applications that need this behavior. This
shall be only allowed if `--no-compat` is used.
change credentials handling such that
prefer MINIO_* envs first if they work,
if not fallback to AWS credentials. If
they fail we fail to start anyways.
This change allows a set of MinIO sites (clusters) to be configured
for mutual replication of all buckets (including bucket policies, tags,
object-lock configuration and bucket encryption), IAM policies,
LDAP service accounts and LDAP STS accounts.
In (erasureServerPools).MakeBucketWithLocation deletes the created
buckets if any set returns an error.
Add `NoRecreate` option, which will not recreate the bucket
in `DeleteBucket`, if the operation fails.
Additionally use context.Background() for operations we always want to be performed.
bucket deletes should purge entire bucket metadata
appropriately, use rename() to move the metadata files
to trash folder,
for dangling buckets instead of doing recursive deletes,
rename such buckets to trash folder as well.
Bonus: reduce retry duration for listing to 200ms
additionally optimize for IP only setups, avoid doing
unnecessary lookups if the Dial addr is an IP.
allow support for multiple listeners on same socket,
this is mainly meant for future purposes.
DeleteMarkers were unreadable if they had quorum based
guarantees, this PR tries to fix this behavior appropriately.
DeleteMarkers with sufficient should be allowed and the
return error should be accordingly with or without version-id.
This also allows for overwrites which may not be possible
in a multi-pool setup.
fixes#12787
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, ie. when r.Body
is slow , instead use a proper
binary exchange format, to marshal and unmarshal
the LockArgs datastructure in a cleaner way.
this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on same object.
```
BenchmarkLockArgs
BenchmarkLockArgs-4 6417609 185.7 ns/op 56 B/op 2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4 1187368 1015 ns/op 4096 B/op 1 allocs/op
```
This PR brings two optimizations mainly
for page-cache build-up and how to avoid
getting OOM killed in the process. Although
these memories are reclaimable Linux is not
fast enough to reclaim them as needed on a
very busy system. fadvise is a system call
implemented in Linux to advise page-cache to
avoid overload as we get significant amount
of requests on the server.
- FADV_SEQUENTIAL tells that all I/O from now
is going to be sequential, allowing for more
resposive throughput.
- FADV_NOREUSE tells kernel to start removing
things for this 'fd' from page-cache.
An endpoint can be empty when a disk is offline or something
wrong with it. Avoid it by filling erasureSets.endpointStrings
with values from arguments.
This was a regression introduced in '14bb969782'
this has the potential to cause corruption when
there are concurrent overwrites attempting to update
the content on the namespace.
This PR adds a situation where PutObject(), CopyObject()
compete properly for the same locks with NewMultipartUpload()
however it ends up turning off competing locks for the actual
object with GetObject() and DeleteObject() - since they do not
compete due to concurrent I/O on a versioned bucket it can lead
to loss of versions.
This PR fixes this bug with multi-pool setup with replication
that causes corruption of inlined data due to lack of competing
locks in a multi-pool setup.
Instead CompleteMultipartUpload holds the necessary
locks when finishing the transaction, knowing the exact
location of an object to schedule the multipart upload
doesn't need to compete in this manner, a pool id location
for existing object.
This reverts commit 91567ba916.
Revert because the error was incorrectly converted, there are
callers that rely on errConfigNotFound and it also took away
the migration code.
Instead the correct fix is PutBucketTaggingHandler() which
is already added.
also remove HealObjects() code from dataScanner running another
listing from the data-scanner is super in-efficient and in-fact
this code is redundant since we already attempt to heal all
dangling objects anyways.
- Supports object locked buckets that require
PutObject() to set content-md5 always.
- Use SSE-S3 when S3 gateway is being used instead
of SSE-KMS for auto-encryption.
This PR however also proceeds to simplify the loading
of various subsystems such as
- globalNotificationSys
- globalTargetSys
converge them directly into single bucket metadata sys
loader, once that is loaded automatically every other
target should be loaded and configured properly.
fixes#13252
DeleteObject() on existing objects before `xl.json` to
`xl.meta` change were not working, not sure when this
regression was added. This PR fixes this properly.
Also this PR ensures that we perform rename of xl.json
to xl.meta only during "write" phase of the call i.e
either during Healing or PutObject() overwrites.
Also handles few other scenarios during migration where
`backendEncryptedFile` was missing deleteConfig() will
fail with `configNotFound` this case was not ignored,
which can lead to failure during upgrades.
once we have competed for locks, verify if the
context is still valid - this is to ensure that
we do not start readdir() or read() calls on the
drives on canceled connections.
This commit brings two locks instead of single lock for
WalkDir() calls on top of c25816eabc.
The main reason is to avoid contention between readMetadata()
and ListDir() calls, ListDir() can take time on prefixes that
are huge for readdir() but this shouldn't end up blocking
all readMetadata() operations, this allows for more room for
I/O while not overly penalizing all listing operations.
This commit fixes an issue in the `AssumeRoleWithCertificate`
handler.
Before clients received an error when they send
a chain of X.509 certificates (their client certificate as
well as intermediate / root CAs).
Now, client can send a certificate chain and the server
will only consider non-CA / leaf certificates as possible
client certificate candidates. However, the client still
can only send one certificate.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
When unable to load existing metadata new versions
would not be written. This would leave objects in a
permanently unrecoverable state
Instead, start with clean metadata and write the incoming data.
Some identity providers like GitLab do not provide
information about group membership as part of the
identity token claims. They only expose it via OIDC compatible
'/oauth/userinfo' endpoint, as described in the OpenID
Connect 1.0 sepcification.
But this of course requires application to make sure to add
additional accessToken, since idToken cannot be re-used to
perform the same 'userinfo' call. This is why this is specialized
requirement. Gitlab seems to be the only OpenID vendor that requires
this support for the time being.
fixes#12367
Don't perform an independent evaluation of inlining, but mirror the decision made when uploading the object.
Leads to some objects being inlined or not based on new metrics. Instead respect previous decision.
Replication was not working properly for encrypted
objects in single PUT object for preserving etag,
We need to make sure to preserve etag such that replication
works properly and not gets into infinite loops of copying
due to ETag mismatches.
This will allow objects to relinquish read lock held during
replication earlier if the target is known to be down
without waiting for connection timeout when replication
is attempted.
Stop async listing if we have not heard back from the client for 3 minutes.
This will stop spending resources on async listings when they are unlikely to get used.
If the client returns a new listing will be started on the second request.
Stop saving cache metadata to disk. It is cleared on restarts anyway. Removes all
load/save functionality
This commit adds a new STS API for X.509 certificate
authentication.
A client can make an HTTP POST request over a TLS connection
and MinIO will verify the provided client certificate, map it to an
S3 policy and return temp. S3 credentials to the client.
So, this STS API allows clients to authenticate with X.509
certificates over TLS and obtain temp. S3 credentials.
For more details and examples refer to the docs/sts/tls.md
documentation.
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Azure storage SDK uses http.Request feature which panics when the
request contains r.Form popuplated.
Azure gateway code creates a new request, however it modifies the
transport to add our metrics code which sets Request.Form during
shouldMeterRequest() call.
This commit simplifies shouldMeterRequest() to avoid setting
request.Form and avoid the crash.
console service should be shutdown last once all shutdown
sequences are complete, this is to ensure that we do not
prematurely kill the server before it cleans up the
`.minio.sys/tmp/uuid` folder.
NOTE: this only applies to NAS gateway setup.
Use a single allocation for reading the file, not the growing buffer of `io.ReadAll`.
Reuse the write buffer if we can when writing metadata in RenameData.
* reduce extra getObjectInfo() calls during ILM transition
This PR also changes expiration logic to be non-blocking,
scanner is now free from additional costs incurred due
to slower object layer calls and hitting the drives.
* move verifying expiration inside locks
A multi resources lock is a single lock UID with multiple associated
resources. This is created for example by multi objects delete
operation. This commit changes the behavior of Refresh() to iterate over
all locks having the same UID and refresh them.
Bonus: Fix showing top locks for multi delete objects
#11878 added "keepHTTPResponseAlive" to CreateFile requests.
The problem is that it will begin writing to the response before the
body is read after 10 seconds. This will abort the writes on the
client-side, since it assumes the server has received what it wants.
The proposed solution here is to monitor the completion of the body
before beginning to send keepalive pings.
Fixes observed high number of goroutines stuck in `io.Copy` in
`github.com/minio/minio/cmd.(*xlStorage).CreateFile` and
`(*storageRESTClient).CreateFile` stuck in `http.DrainBody`.
In the event when a lock is not refreshed in the cluster, this latter
will be automatically removed in the subsequent cleanup of non
refreshed locks routine, but it forgot to clean the local server,
hence having the same weird stale locks present.
This commit will remove the lock locally also in remote nodes, if
removing a lock from a remote node will fail, it will be anyway
removed later in the locks cleanup routine.
Currently in master this can cause existing
parent users to stop working and lead to
credentials getting overwritten.
```
~ mc admin user add alias/ minio123 minio123456
```
```
~ mc admin user svcacct add alias/ minio123 \
--access-key minio123 --secret-key minio123456
```
This PR rejects all such scenarios.
Faster healing as well as making healing more
responsive for faster scanner times.
also fixes a bug introduced in #13079, newly replaced
disks were not healing automatically.
- remove sourceCh usage from healing
we already have tasks and resp channel
- use read locks to lookup globalHealConfig
- fix healing resolver to pick candidates quickly
that need healing, without this resolver was
unexpectedly skipping.
healObject() should be non-blocking to ensure
that scanner is not blocked for a long time,
this adversely affects performance of the scanner
and also affects the way usage is updated
subsequently.
This PR allows for a non-blocking behavior for
healing, dropping operations that cannot be queued
anymore.
Synchronize bucket cycles so it is much more
likely that the same prefixes will be picked up
for scanning.
Use the global bloom filter cycle for that.
Bump bloom filter versions to clear those.
The intention is to list values of sys config that can potentially
impact the performance of minio.
At present, it will return max value configured for rlimit
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
proceed to heal the cluster when all the
drives in a set have failed, this is extremely
rare occurrence but even if it happens we allow
the cluster to be functional.
A recent regression caused new disks not being re-formatted. In the old
code, a disk needed be 'online' to be chosen to be formatted but the
disk has to be already formatted for XL storage IsOnline() function to
return true.
It is enough to check if XL storage is nil or not if we want to avoid
formatting root disks.
Co-authored-by: Anis Elleuch <anis@min.io>
markRootDisksAsDown() relies on disk info even if the
disk is unformatted. Therefore, we should always return
DiskInfo data even when DiskInfo storage API returns
errUnformattedDisk
`mc admin heal` command will show servers/disks tolerance, for that
purpose, you need to know the number of parity disks for each storage
class.
Parity is always the same in all pools.
prefixes at top level create such as
```
~ mc mb alias/bucket/prefix
```
The prefix/ incorrect appears as prefix__XL_DIR__/
in the accountInfo output, make sure to trim '__XL_DIR__'
Objects uploaded in this format for example
```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```
Won't list the object, handle this scenario.
Ensure that one call will succeed and others will serialize
Example failure without code in place:
```
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:120: unexpected error: cmd.InsufficientWriteQuorum: Storage resources are insufficient for the write operation doz2wjqaovp5kvlrv11fyacowgcvoziszmkmzzz9nk9au946qwhci4zkane5-1/
bucket-policy-handlers_test.go:135: want 1 ok, got 0
```
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.
We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.
Single list operations should remain largely unaffected.
Some applications albeit poorly written rather than using headObject
rely on listObjects to check for existence of object, this unusual
request always has prefix=(to actual object) and max-keys=1
handle this situation specially such that we can avoid readdir()
on the top level parent to avoid sorting and skipping, ensuring
that such type of listObjects() always behaves similar to a
headObject() call.
this addresses a regression from #12984
which only addresses flat key from single
level deep at bucket level.
added extra tests as well to cover all
these scenarios.
- deletes should always Sweep() for tiering at the
end and does not need an extra getObjectInfo() call
- puts, copy and multipart writes should conditionally
do getObjectInfo() when tiering targets are configured
- introduce 'TransitionedObject' struct for ease of usage
and understanding.
- multiple-pools optimization deletes don't need to hold
read locks verifying objects across namespace and pools.
baseDir is empty if the top level prefix does not
end with `/` this causes large recursive listings
without any filtering, to fix this filtering make
sure to set the filter prefix appropriately.
also do not navigate folders at top level that do
not match the filter prefix, entries don't need
to match prefix since they are never prefixed
with the prefix anyways.
The previous code removes SVC/STS accounts for ldap users that do not
exist anymore in LDAP server. This commit will actually re-evaluate
filter as well if it is changed and remove all local SVC/STS accounts
beloning to the ldap user if the latter is not eligible for the
search filter anymore.
For example: the filter selects enabled users among other criteras in
the LDAP database, if one ldap user changes his status to disabled
later, then associated SVC/STS accounts will be removed because that user
does not meet the filter search anymore.
Traffic metering was not protected against concurrent updates.
```
WARNING: DATA RACE
Read at 0x00c02b0dace8 by goroutine 235:
github.com/minio/minio/cmd.setHTTPStatsHandler.func1()
d:/minio/minio/cmd/generic-handlers.go:360 +0x27d
net/http.HandlerFunc.ServeHTTP()
...
Previous write at 0x00c02b0dace8 by goroutine 994:
github.com/minio/minio/internal/http/stats.(*IncomingTrafficMeter).Read()
d:/minio/minio/internal/http/stats/http-traffic-recorder.go:34 +0xd2
```
The intention is to provide status of any sys services that can
potentially impact the performance of minio.
At present, it will return information about the `selinux` service
(not-installed/disabled/permissive/enforcing)
Signed-off-by: Shireesh Anjal <shireesh@minio.io>
This happens because of a change added where any sub-credential
with parentUser == rootCredential i.e (MINIO_ROOT_USER) will
always be an owner, you cannot generate credentials with lower
session policy to restrict their access.
This doesn't affect user service accounts created with regular
users, LDAP or OpenID
Use `readMetadata` when reading version
information without data requested.
Reduces IO on inlined data.
Bonus: Inline compressed data as well when
compression is enabled.
- avoid extra lookup for 'xl.meta' since we are
definitely sure that it doesn't exist.
- use this in newMultipartUpload() as well
- also additionally do not write with O_DSYNC
to avoid loading the drives, instead create
'xl.meta' for listing operations without
O_DSYNC since these are ephemeral objects.
- do the same with newMultipartUpload() since
it gets synced when the PutObjectPart() is
attempted, we do not need to tax newMultipartUpload()
instead.
removes unexpected features from regular putObject() such as
- increasing parity when disks are down, avoids
a lot of DiskInfo() calls.
- triggering MRF for metacache objects
if disks are offline
- avoiding renames from temporary location
to actual namespace, not needed since
metacache files are unique.
Disregard interfaces that are down when selecting bind addresses
Windows often has a number of disabled NICs used for VPN and other services.
This often causes minio to select an address for contacting the console that is on a disabled (virtual) NIC.
This checks if the interface is up before adding it to the pool on Windows.
Before, the gateway will complain that it found KMS configured in the
environment but the gateway mode does not support encryption. This
commit will allow starting of the gateway but ensure that S3 operations
with encryption headers will fail when the gateway doesn't support
encryption. That way, the user can use etcd + KMS and have IAM data
encrypted in the etcd store.
Co-authored-by: Anis Elleuch <anis@min.io>
improvements include
- skip IPv6 correctly
- do not set default value for
MINIO_SERVER_URL, let it be
configured if not use local IPs
Bonus:
- In healing return error from listPathRaw()
- update console to v0.8.3
Remote caches were not returned correctly, so they would not get updated on save.
Furthermore make some tweaks for more reliable updates.
Invalidate bloom filter to ensure rescan.
we will allow situations such as
```
a/b/1.txt
a/b
```
and
```
a/b
a/b/1.txt
```
we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.
rest of the situations where the prefixes get
created across sets are supported as is.
its possible that some multipart uploads would have
uploaded only single parts so relying on `len(o.Parts)`
alone is not sufficient, we need to look for ETag
pattern to be absolutely sure.
Some incorrect setups might have multiple audiences
where they are trying to use a single authentication
endpoint for multiple services.
Nevertheless OpenID spec allows it to make it
even more confusin for no good reason.
> It MUST contain the OAuth 2.0 client_id of the
> Relying Party as an audience value. It MAY also
> contain identifiers for other audiences. In the
> general case, the aud value is an array of case
> sensitive strings. In the common special case
> when there is one audience, the aud value MAY
> be a single case sensitive string.
fixes#12809
this healing optimization caused multiple
regressions in healing
- delete-markers incorrectly missing
heal and returning incorrect healing
results to client.
- missing individual 'parts' such
as for restored object or simply
for all objects just missing few parts.
This optimization is not necessary, we
should proceed to verify all cases possible
not just when metadata is inconsistent.
destination path and old path will be similar
when healing occurs, this can lead to healed
parts being again purged leading to always an
inconsistent state on an object which might
further cause reduction in quorum eventually.
delete-markers missing on drives were
not healed due to few things
disksWithAllParts() does not know-how
to deal with delete markers, add support
for that.
fixes#12787
- delete-markers are incorrectly reported
as corrupt with wrong data sent to client
'mc admin heal -r' on objects with delete
marker will report as 'grey' incorrectly.
- do not heal delete-markers during HeadObject()
this can lead to inconsistent order of heals
on the object, although this is not an issue
in terms of order of versions it is rather
simpler to keep the same order on all drives.
- defaultHealResult() should handle 'err == nil'
case such that valid cases should be handled
as 'drive' status OK.
- remove use of getOnlineDisks() instead rely on fallbackDisks()
when disk return errors like diskNotFound, unformattedDisk
use other fallback disks to list from, instead of paying the
price for checking getOnlineDisks()
- optimize getDiskID() further to avoid large write locks when
looking formatLastCheck time window
This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
When configured in Lookup Bind mode, the server now periodically queries the
LDAP IDP service to find changes to a user's group memberships, and saves this
info to update the access policies for all temporary and service account
credentials belonging to LDAP users.
Add a new goroutine file which has another printing format. We need it
to see how much time each goroutine was blocked. Easier to detect stops.
Co-authored-by: Anis Elleuch <anis@min.io>
when TLS is configured using IPs directly
might interfere and not work properly when
the server is configured with TLS certs but
the certs only have domain certs.
Also additionally allow users to specify
a public accessible URL for console to talk
to MinIO i.e `MINIO_SERVER_URL` this would
allow them to use an external ingress domain
to talk to MinIO. This internally fixes few
problems such as presigned URL generation on
the console UI etc.
This needs to be done additionally for any
MinIO deployments that might have a much more
stricter requirement when running in standalone
mode such as FS or standalone erasure code.
This method is used to add expected expiration and transition time
for an object in GET/HEAD Object response headers.
Also fixed bugs in lifecycle.PredictTransitionTime and
getLifecycleTransitionTier in handling current and
non-current versions.
This allows remote bucket admin to identify the origin of transitioned
objects by simply inspecting the object prefixes.
e.g let's take a remote tier TIER-1 pointing to a remote bucket (prefix)
testbucket/testprefix-1. The remote bucket admin can list all transitioned objects
from a MinIO deployment identified by '2e78e906-1c5d-4f94-8689-9df44cafde39' and
source bucket 'mybucket' like so,
```
$ ./mc ls -r minio-tier-target/testbucket/testprefix-1/2e78e906-1c5d-4f94-8689-9df44cafde39/mybucket/
[2021-07-12 17:15:50 PDT] 160B 48/fb/48fbc0e6-3a73-458b-9337-8e722c619ca4
[2021-07-12 16:58:46 PDT] 160B 7d/1c/7d1c96bd-031a-48d4-99ea-b1304e870830
```
In case of non-distributed setup, if the server start command contains a
`--console-address` flag and its value contains a hostname, it is not
getting anonymized.
Fixed by replacing the console host also with `server1`
This commit gathers MRF metrics from
all nodes in a cluster and return it to the caller. This will show information about the
number of objects in the MRF queues
waiting to be healed.
In case of replication healing, we always store completed status in the
object metadata, which is wrong because replication could fail in the
further retries.
Ensure that hostnames / ip addresses are not printed in the subnet
health report. Anonymize them by replacing them with `servern` where `n`
represents the position of the server in the pool.
This is done by building a `host anonymizer` map that maps every
possible value containing the host e.g. host, host:port,
http://host:port, etc to the corresponding anonymized name and using
this map to replace the values at the time of health report generation.
A different logic is used to anonymize host names in the `procinfo`
data, as the host names are part of an ellipses pattern in the process
start command. Here we just replace the prefix/suffix of the ellipses
pattern with their hashes.
Gzip responses if appropriate, except GetObject requests.
List reponses has an almost 10:1 compression ratio with no
measurable slowdown (in fact it seems a bit faster).
- ParentUser for OIDC auth changed to `openid:`
instead of `jwt:` to avoid clashes with variable
substitution
- Do not pass in random parents into IsAllowed()
policy evaluation as it can change the behavior
of looking for correct policies underneath.
fixes#12676fixes#12680
with console addition users cannot login with
root credentials without etcd persistent layer,
allow a dummy store such that such functionalities
can be supported when running as non-persistent
manner, this enables all calls and operations.
MinIO might be running inside proxies, and
console while being on another port might not be
reachable on a specific port behind such proxies.
For such scenarios customize the redirect URL
such that console can be redirected to correct
proxy endpoint instead.
fixes#12661
Download files from *any* bucket/path as an encrypted zip file.
The key is included in the response but can be separated so zip
and the key doesn't have to be sent on the same channel.
Requires https://github.com/minio/pkg/pull/6
Additional support for vendor-specific admin API
integrations for OpenID, to ensure validity of
credentials on MinIO.
Every 5minutes check for validity of credentials
on MinIO with vendor specific IDP.
DiskInfo() calls can stagger and wait if run
serially timing out 10secs per drive, to avoid
this lets check DiskInfo in parallel to avoid
delays when nodes get disconnected.
Fixes brought forward from https://github.com/minio/minio/pull/12605
Fixes resolution when an object is in prefix of another and one zone returns the directory and another the object.
Fixes resolution on single entries that arrive first, so resolution doesn't depend on order.
auditLog should be attempted right before the
return of the function and not multiple times
per function, this ensures that we only trigger
it once per function call.
if object was uploaded with multipart. This is to ensure that
GetObject calls with partNumber in URI request parameters
have same behavior on source and replication target.
also do not incorrectly double count
objExists unless its selected and it
matches with previous entry.
Bonus: change listQuorum to match with
AskDisks to ensure that we atleast by
default choose all the "drives" that
we asked is consistent.
Bonus: remove kms_kes as sub-system, since its ENV only.
- also fixes a crash with etcd cluster without KMS
configured and also if KMS decryption is missing.
backend-encrypted doesn't need to be explicitly healed anymore
since this file is deleted upon upgrade and migration to the
KMS based encrypted config/IAM credentials.
goroutine 1 [running]:
runtime/internal/atomic.panicUnaligned()
/usr/local/go/src/runtime/internal/atomic/unaligned.go:8 +0x24
golang doc:
// BUG(rsc): On x86-32, the 64-bit functions use instructions unavailable before the Pentium MMX.
//
// On non-Linux ARM, the 64-bit functions use instructions unavailable before the ARMv6k core.
//
// On ARM, x86-32, and 32-bit MIPS,
// it is the caller's responsibility to arrange for 64-bit
// alignment of 64-bit words accessed atomically. The first word in a
// variable or in an allocated struct, array, or slice can be relied upon to be
// 64-bit aligned.
Create write lock on PutObject and CopyObject when on multi-pool setup.
Use the same lock as NewMultipartUpload so all creation calls share the same lock.
This feature also changes the default port where
the browser is running, now the port has moved
to 9001 and it can be configured with
```
--console-address ":9001"
```
When no results are sent `result.end` is never sent, so the list becomes hot until the list is full.
Break immediately when channel is closed.
Fixes#12518
previous PR incorrectly changed rename() from
tmp to -> tmp/.trash/uuid, since it is self
referential - to clear this up make sure its
renamed to a separate folder and deleted
in background - just like before.
Each multipart upload is holding a read lock for the entire upload
duration of each part.
This makes it impossible for other parts to complete until all currently
uploading parts have released their locks.
It will also make it impossible for new parts to start as long as the
write lock is still being requested, essentially deadlocking uploads
until all that may have been granted a read lock has been completed.
Refactor to only hold the upload id lock while reading and writing
the metadata, but hold a part id lock while the part is being uploaded.
This commit adds an admin API for fetching
the KMS status information (default key ID, endpoints, ...).
With this commit the server exposes REST endpoint:
```
GET <admin-api>/kms/status
```
Signed-off-by: Andreas Auernhammer <hi@aead.dev>
Web Handlers can generate STS tokens but forgot to create a parent user
and save it along with the temporary access account. This commit fixes
this.
fixes#12381
its possible that, version might exist on second pool such that
upon deleteBucket() might have deleted the bucket on pool1 successfully
since it doesn't have any objects, undo such operations properly in
all any error scenario.
Also delete bucket metadata from pool layer rather than sets layer.
objectErasureMap in the audit holds information about the objects
involved in the current S3 operation such as pool index, set an index,
and disk endpoints. One user saw a crash due to a concurrent update of
objectErasureMap information. Use sync.Map to prevent a crash.
Always use `GetActualSize` to get the part size, not just when encrypted.
Fixes mint test io.minio.MinioClient.uploadPartCopy,
error "Range specified is not valid for source object".
healing code was using incorrect buffers to heal older
objects with 10MiB erasure blockSize, incorrect calculation
of such buffers can lead to incorrect premature closure of
io.Pipe() during healing.
fixes#12410
- it is possible that during I/O failures we might
leave partially written directories, make sure
we purge them after.
- rename current data-dir (null) versionId only after
the newer xl.meta has been written fully.
- attempt removal once for minioMetaTmpBucket/uuid/
as this folder is empty if all previous operations
were successful, this allows avoiding recursive os.Remove()
- for single pool setups usage is not checked.
- for pools, only check the "set" in which it would be placed.
- keep a minimum number of inodes (when we know it).
- ignore for `.minio.sys`.
It makes sense that a node that has multiple disks starts when one
disk fails, returning an i/o error for example. This commit will make this
faulty tolerance available in this specific use case.
Due to incorrect KMS context constructed, we need to add
additional fallbacks and also fix the original root cause
to fix already migrated deployments.
Bonus remove double migration is avoided in gateway mode
for etcd, instead do it once in iam.Init(), also simplify
the migration by not migrating STS users instead let the
clients regenerate them.
- Adds versioning support for S3 based remote tiers that have versioning
enabled. This ensures that when reading or deleting we specify the specific
version ID of the object. In case of deletion, this is important to ensure that
the object version is actually deleted instead of simply being marked for
deletion.
- Stores the remote object's version id in the tier-journal. Tier-journal file
version is not bumped up as serializing the new struct version is
compatible with old journals without the remote object version id.
- `storageRESTVersion` is bumped up as FileInfo struct now includes a
`TransitionRemoteVersionID` member.
- Azure and GCS support for this feature will be added subsequently.
Co-authored-by: Krishnan Parthasarathi <krisis@users.noreply.github.com>
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
IAM not initialized doesn't mean we can't still
read the content from the disk, we should just
allow the request to go-through if object layer
is initialized.
Real-time metrics calculated in-memory rely on the initial
replication metrics saved with data usage. However, this can
lag behind the actual state of the cluster at the time of server
restart leading to inaccurate Pending size/counts reported to
Prometheus. Dropping the Pending metrics as this can be more
reliably monitored by applications with replication notifications.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
LDAPusername is the simpler form of LDAPUser (userDN),
using a simpler version is convenient from policy
conditions point of view, since these are unique id's
used for LDAP login.
In cases where a cluster is degraded, we do not uphold our consistency
guarantee and we will write fewer erasure codes and rely on healing
to recreate the missing shards.
In some cases replacing known bad disks in practice take days.
We want to change the behavior of a known degraded system to keep
the erasure code promise of the storage class for each object.
This will create the objects with the same confidence as a fully
functional cluster. The tradeoff will be that objects created
during a partial outage will take up slightly more space.
This means that when the storage class is EC:4, there should
always be written 4 parity shards, even if some disks are unavailable.
When an object is created on a set, the disks are immediately
checked. If any disks are unavailable additional parity shards
will be made for each offline disk, up to 50% of the number of disks.
We add an internal metadata field with the actual and intended
erasure code level, this can optionally be picked up later by
the scanner if we decide that data like this should be re-sharded.
Bonus change LDAP settings such as user, group mappings
are now listed as part of `mc admin user list` and
`mc admin group list`
Additionally this PR also deprecates the `/v2` API
that is no longer in use.
A configured audit logger or HTTP logger is validated during MinIO
server startup. Relax the timeout to 10 seconds in that case, otherwise,
both loggers won't be used.
1 second could be too low for a busy HTTP endpoint.
This commit fixes a bug causing the MinIO server to compute
the ETag of a single-part object as MD5 of the compressed
content - not as MD5 of the actual content.
This usually does not affect clients since the MinIO appended
a `-1` to indicate that the ETag belongs to a multipart object.
However, this behavior was problematic since:
- A S3 client being very strict should reject such an ETag since
the client uploaded the object via single-part API but got
a multipart ETag that is not the content MD5.
- The MinIO server leaks (via the ETag) that it compressed the
object.
This commit addresses both cases. Now, the MinIO server returns
an ETag equal to the content MD5 for single-part objects that got
compressed.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
A lot of healing is likely to be on non-existing objects and
locks are very expensive and will slow down scanning
significantly.
In cases where all are valid or, all are broken allow
rejection without locking.
Keep the existing behavior, but move the check for
dangling objects to after the lock has been acquired.
```
_, err = getLatestFileInfo(ctx, partsMetadata, errs)
if err != nil {
return er.purgeObjectDangling(ctx, bucket, object, versionID, partsMetadata, errs, []error{}, opts)
}
```
Revert "heal: Hold lock when reading xl.meta from disks (#12362)"
This reverts commit abd32065aa
This PR fixes two bugs
- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
a given version.
Lock is hold in healObject() after reading xl.meta from disks the first
time. This commit will held the lock since the beginning of HealObject()
Co-authored-by: Anis Elleuch <anis@min.io>
Fixes `testSSES3EncryptedGetObjectReadSeekFunctional` mint test.
```
{
"args": {
"bucketName": "minio-go-test-w53hbpat649nhvws",
"objectName": "6mdswladz4vfpp2oit1pkn3qd11te5"
},
"duration": 7537,
"error": "We encountered an internal error, please try again.: cause(The requested range \"bytes 251717932 -> -116384170 of 135333762\" is not satisfiable.)",
"function": "GetObject(bucketName, objectName)",
"message": "CopyN failed",
"name": "minio-go: testSSES3EncryptedGetObjectReadSeekFunctional",
"status": "FAIL"
}
```
Compressed files always start at the beginning of a part so no additional offset should be added.
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
in setups with lots of drives the server
startup is slow, initialize all local drives
in parallel before registering with muxer.
this speeds up when there are multiple pools
and large collection of drives.
This commit adds the `X-Amz-Server-Side-Encryption-Aws-Kms-Key-Id`
response header to the GET, HEAD, PUT and Download API.
Based on AWS documentation [1] AWS S3 returns the KMS key ID as part
of the response headers.
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/specifying-kms-encryption.html
Signed-off-by: Andreas Auernhammer <aead@mail.de>
multi-disk clusters initialize buffer pools
per disk, this is perhaps expensive and perhaps
not useful, for a running server instance. As this
may disallow re-use of buffers across sets,
this change ensures that buffers across sets
can be re-used at drive level, this can reduce
quite a lot of memory on large drive setups.
In lieu of new changes coming for server command line, this
change is to deprecate strict requirement for distributed setups
to provide root credentials.
Bonus: remove MINIO_WORM warning from April 2020, it is time to
remove this warning.
However, this slice is also used for closing the writers, so close is never called on these.
Furthermore when an error is returned from a write it is now reported to the reader.
bonus: remove unused heal param from `newBitrotWriter`.
* Remove copy, now that we don't mutate.
At some places bloom filter tracker was getting
updated for `.minio.sys/tmp` bucket, there is no
reason to update bloom filters for those.
And add a missing bloom filter update for MakeBucket()
Bonus: purge unused function deleteEmptyDir()
gracefully start the server, if there are other drives
available - print enough information for administrator
to notice the errors in console.
Bonus: for really large streams use larger buffer for
writes.
- GetObject() should always use a common dataDir to
read from when it starts reading, this allows the
code in erasure decoding to have sane expectations.
- Healing should always heal on the common dataDir, this
allows the code in dangling object detection to purge
dangling content.
These both situations can happen under certain types of
retries during PUT when server is restarting etc, some
namespace entries might be left over.
attempt a delete on remote DNS store first before
attempting locally, because removing at DNS store
is cheaper than deleting locally, in case of
errors locally we can cheaply recreate the
bucket on dnsStore instead of.
This commit adds support for SSE-KMS bucket configurations.
Before, the MinIO server did not support SSE-KMS, and therefore,
it was not possible to specify an SSE-KMS bucket config.
Now, this is possible. For example:
```
mc encrypt set sse-kms some-key <alias>/my-bucket
```
Further, this commit fixes an issue caused by not supporting
SSE-KMS bucket configuration and switching to SSE-KMS as default
SSE method.
Before, the server just checked whether an SSE bucket config was
present (not which type of SSE config) and applied the default
SSE method (which was switched from SSE-S3 to SSE-KMS).
This caused objects to get encrypted with SSE-KMS even though a
SSE-S3 bucket config was present.
This issue is fixed as a side-effect of this commit.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
when bidirectional replication is set up.
If ReplicaModifications is enabled in the replication
configuration, sync metadata updates to source if
replication rules are met. By default, if this
configuration is unset, MinIO automatically sync's
metadata updates on replica back to the source.
This commit adds a check to the MinIO server setup that verifies
that MinIO can reach KES, if configured, and that the default key
exists. If the default key does not exist it will create it
automatically.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
A cache structure will be kept with a tree of usages.
The cache is a tree structure where each keeps track
of its children.
An uncompacted branch contains a count of the files
only directly at the branch level, and contains link to
children branches or leaves.
The leaves are "compacted" based on a number of properties.
A compacted leaf contains the totals of all files beneath it.
A leaf is only scanned once every dataUsageUpdateDirCycles,
rarer if the bloom filter for the path is clean and no lifecycles
are applied. Skipped leaves have their totals transferred from
the previous cycle.
A clean leaf will be included once every healFolderIncludeProb
for partial heal scans. When selected there is a one in
healObjectSelectProb that any object will be chosen for heal scan.
Compaction happens when either:
- The folder (and subfolders) contains less than dataScannerCompactLeastObject objects.
- The folder itself contains more than dataScannerCompactAtFolders folders.
- The folder only contains objects and no subfolders.
- A bucket root will never be compacted.
Furthermore, if a has more than dataScannerCompactAtChildren recursive
children (uncompacted folders) the tree will be recursively scanned and the
branches with the least number of objects will be compacted until the limit
is reached.
This ensures that any branch will never contain an unreasonable amount
of other branches, and also that small branches with few objects don't
take up unreasonable amounts of space.
Whenever a branch is scanned, it is assumed that it will be un-compacted
before it hits any of the above limits. This will make the branch rebalance
itself when scanned if the distribution of objects has changed.
TLDR; With current values: No bucket will ever have more than 10000
child nodes recursively. No single folder will have more than 2500 child
nodes by itself. All subfolders are compacted if they have less than 500
objects in them recursively.
We accumulate the (non-deletemarker) version count for paths as well,
since we are changing the structure anyway.
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Klaus Post <klauspost@gmail.com>
wait groups are necessary with io.Pipes() to avoid
races when a blocking function may not be expected
and a Write() -> Close() before Read() races on each
other. We should avoid such situations..
Co-authored-by: Klaus Post <klauspost@gmail.com>
This commit replaces the custom KES client implementation
with the KES SDK from https://github.com/minio/kes
The SDK supports multi-server client load-balancing and
requests retry out of the box. Therefore, this change reduces
the overall complexity within the MinIO server and there
is no need to maintain two separate client implementations.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
This commit enforces the usage of AES-256
for config and IAM data en/decryption in FIPS
mode.
Further, it improves the implementation of
`fips.Enabled` by making it a compile time
constant. Now, the compiler is able to evaluate
the any `if fips.Enabled { ... }` at compile time
and eliminate unused code.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
p.writers is a verbatim value of bitrotWriter
backed by a pipe() that should never be nil'ed,
instead use the captured errors to skip the writes.
additionally detect also short writes, and reject
them as errors.
currently GetUser() returns 403 when IAM is not initialized
this can lead to applications crashing, instead return 503
so that the applications can retry and backoff.
fixes#12078
as there is no automatic way to detect if there
is a root disk mounted on / or /var for the container
environments due to how the root disk information
is masked inside overlay root inside container.
this PR brings an environment variable to set
root disk size threshold manually to detect the
root disks in such situations.
This commit fixes a bug in the single-part object decryption
that is triggered in case of SSE-KMS. Before, it was assumed
that the encryption is either SSE-C or SSE-S3. In case of SSE-KMS
the SSE-C branch was executed. This lead to an invalid SSE-C
algorithm error.
This commit fixes this by inverting the `if-else` logic.
Now, the SSE-C branch only gets executed when SSE-C headers
are present.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
This commit fixes a bug introduced by af0c65b.
When there is no / an empty client-provided SSE-KMS
context the `ParseMetadata` may return a nil map
(`kms.Context`).
When unsealing the object key we must check that
the context is nil before assigning a key-value pair.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
UpdateServiceAccount ignores updating fields when not passed from upper
layer, such as empty policy, empty account status, and empty secret key.
This PR will check for a secret key only if it is empty and add more
check on the value of the account status.
Signed-off-by: Anis Elleuch <anis@min.io>
This commit adds basic SSE-KMS support.
Now, a client can specify the SSE-KMS headers
(algorithm, optional key-id, optional context)
such that the object gets encrypted using the
SSE-KMS method. Further, auto-encryption now
defaults to SSE-KMS.
This commit does not try to do any refactoring
and instead tries to implement SSE-KMS as a minimal
change to the code base. However, refactoring the entire
crypto-related code is planned - but needs a separate
effort.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
It is possible in some scenarios that in multiple pools,
two concurrent calls for the same object as a multipart operation
can lead to duplicate entries on two different pools.
This PR fixes this
- hold locks to serialize multiple callers so that we don't race.
- make sure to look for existing objects on the namespace as well
not just for existing uploadIDs
When running MinIO server without LDAP/OpenID, we should error out when
the code tries to create a service account for a non existant regular
user.
Bonus: refactor the check code to be show all cases more clearly
Signed-off-by: Anis Elleuch <anis@min.io>
Co-authored-by: Anis Elleuch <anis@min.io>
This commit adds basic SSE-KMS support.
Now, a client can specify the SSE-KMS headers
(algorithm, optional key-id, optional context)
such that the object gets encrypted using the
SSE-KMS method. Further, auto-encryption now
defaults to SSE-KMS.
This commit does not try to do any refactoring
and instead tries to implement SSE-KMS as a minimal
change to the code base. However, refactoring the entire
crypto-related code is planned - but needs a separate
effort.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
Co-authored-by: Klaus Post <klauspost@gmail.com>
avoid time_wait build up with getObject requests if there are
pending callers and they timeout, can lead to time_wait states
Bonus share the same buffer pool with erasure healing logic,
additionally also fixes a race where parallel readers were
never cleanup during Encode() phase, because pipe.Reader end
was never closed().
Added closer right away upon an error during Encode to make
sure to avoid racy Close() while stream was still being
Read().
This commit fixes a bug when parsing the env. variable
`MINIO_KMS_SECRET_KEY`. Before, the env. variable
name - instead of its value - was parsed. This (obviously)
did not work properly.
This commit fixes this.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
cleanup functions should never be cleaned before the reader is
instantiated, this type of design leads to situations where order
of lockers and places for them to use becomes confusing.
Allow WithCleanupFuncs() if the caller wishes to add cleanupFns
to be run upon close() or an error during initialization of the
reader.
Also make sure streams are closed before we unlock the resources,
this allows for ordered cleanup of resources.
upon errors to acquire lock context would still leak,
since the cancel would never be called. since the lock
is never acquired - proactively clear it before returning.
failed queue should be used for retried requests to
avoid cascading the failures into incoming queue, this
would allow for a more fair retry for failed replicas.
Additionally also avoid taking context in queue task
to avoid confusion, simplifies its usage.
There can be situations where replication completed but the
`X-Amz-Replication-Status` metadata update failed such as
when the server returns 503 under high load. This object version will
continue to be picked up by the scanner and replicateObject would perform
no action since the versions match between source and target.
The metadata would never reflect that replication was successful
without this fix, leading to repeated re-queuing.
This commit enhances the docs about IAM encryption.
It adds a quick-start section that explains how to
get started quickly with `MINIO_KMS_SECRET_KEY`
instead of setting up KES.
It also removes the startup message that gets printed
when the server migrates IAM data to plaintext.
We will point this out in the release notes.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
OpenID connect generated service accounts do not work
properly after console logout, since the parentUser state
is lost - instead use sub+iss claims for parentUser, according
to OIDC spec both the claims provide the necessary stability
across logins etc.
Currently, only credentials could be updated with
`mc admin bucket remote edit`.
Allow updating synchronous replication flag, path,
bandwidth and healthcheck duration on buckets, and
a flag to disable proxying in active-active replication.
* lock: Always cancel the returned Get(R)Lock context
There is a leak with cancel created inside the locking mechanism. The
cancel purpose was to cancel operations such erasure get/put that are
holding non-refreshable locks.
This PR will ensure the created context.Cancel is passed to the unlock
API so it will cleanup and avoid leaks.
* locks: Avoid returning nil cancel in local lockers
Since there is no Refresh mechanism in the local locking mechanism, we
do not generate a new context or cancel. Currently, a nil cancel
function is returned but this can cause a crash. Return a dummy function
instead.
https://github.com/minio/console takes over the functionality for the
future object browser development
Signed-off-by: Harshavardhana <harsha@minio.io>
LDAP DN should be used when allowing setting service accounts
for LDAP users instead of just simple user,
Bonus root owner should be allowed full access
to all service account APIs.
Signed-off-by: Harshavardhana <harsha@minio.io>
Part ETags are not available after multipart finalizes, removing this
check as not useful.
Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
This commit reverts a change that added support for
parsing base64-encoded keys set via `MINIO_KMS_MASTER_KEY`.
The env. variable `MINIO_KMS_MASTER_KEY` is deprecated and
should ONLY support parsing existing keys - not the new format.
Any new deployment should use `MINIO_KMS_SECRET_KEY`. The legacy
env. variable `MINIO_KMS_MASTER_KEY` will be removed at some point
in time.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
avoid re-read of xl.meta instead just use
the success criteria from PutObjectPart()
and check the ETag matches per Part, if
they match then the parts have been
successfully restored as is.
Signed-off-by: Harshavardhana <harsha@minio.io>
Bonus fix fallback to decrypt previously
encrypted content as well using older master
key ciphertext format.
Signed-off-by: Harshavardhana <harsha@minio.io>
just like replication workers, allow failed replication
workers to be configurable in situations like DR failures
etc to catch up on replication sooner when DR is back
online.
Signed-off-by: Harshavardhana <harsha@minio.io>
peer nodes would not update if policy is unset on
a user, until policies reload every 5minutes. Make
sure to reload the policies properly, if no policy
is found make sure to delete such users and groups
fixes#12074
Signed-off-by: Harshavardhana <harsha@minio.io>
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.
Some new additions include:
- Admin APIs remote tier configuration management
- Simple journal to track remote objects to be 'collected'
This is used by object API handlers which 'mutate' object versions by
overwriting/replacing content (Put/CopyObject) or removing the version
itself (e.g DeleteObjectVersion).
- Rework of previous ILM transition to fit the new model
In the new model, a storage class (a.k.a remote tier) is defined by the
'remote' object storage type (one of s3, azure, GCS), bucket name and a
prefix.
* Fixed bugs, review comments, and more unit-tests
- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
instead use expect continue timeout, and have
higher response header timeout, the new higher
timeout satisfies worse case scenarios for total
response time on a CreateFile operation.
Also set the "expect" continue header to satisfy
expect continue timeout behavior.
Some clients seem to cause CreateFile body to be
truncated, leading to no errors which instead
fails with ObjectNotFound on a PUT operation,
this change avoids such failures appropriately.
Signed-off-by: Harshavardhana <harsha@minio.io>
allow restrictions on who can access Prometheus
endpoint, additionally add prometheus as part of
diagnostics canned policy.
Signed-off-by: Harshavardhana <harsha@minio.io>
This commit changes the config/IAM encryption
process. Instead of encrypting config data
(users, policies etc.) with the root credentials
MinIO now encrypts this data with a KMS - if configured.
Therefore, this PR moves the MinIO-KMS configuration (via
env. variables) to a "top-level" configuration.
The KMS configuration cannot be stored in the config file
since it is used to decrypt the config file in the first
place.
As a consequence, this commit also removes support for
Hashicorp Vault - which has been deprecated anyway.
Signed-off-by: Andreas Auernhammer <aead@mail.de>
* fix: pick valid FileInfo additionally based on dataDir
historically we have always relied on modTime
to be consistent and same, we can now add additional
reference to look for the same dataDir value.
A dataDir is the same for an object at a given point in
time for a given version, let's say a `null` version
is overwritten in quorum we do not by mistake pick
up the fileInfo's incorrectly.
* make sure to not preserve fi.Data
Signed-off-by: Harshavardhana <harsha@minio.io>
InfoServiceAccount admin API does not correctly calculate the policy for
a given service account in case if the policy is implied. Fix it.
Signed-off-by: Anis Elleuch <anis@min.io>
avoid potential for duplicates under multi-pool
setup, additionally also make sure CompleteMultipart
is using a more optimal API for uploadID lookup
and never delete the object there is a potential
to create a delete marker during complete multipart.
Signed-off-by: Harshavardhana <harsha@minio.io>
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
When an error is reported it is ignored and zipping continues with the next object.
However, if there is an error it will write a response to `writeWebErrorResponse(w, err)`, but responses are still being built.
Fixes#12082
Bonus: Exclude common compressed image types.
Thanks to @Alevsk for noticing this nuanced behavior
change between releases from 03-04 to 03-20, make sure
that we handle the legacy path removal as well.
also make sure to close the channel on the producer
side, not in a separate go-routine, this can lead
to races between a writer and a closer.
fixes#12073
This is an optimization to save IOPS. The replication
failures will be re-queued once more to re-attempt
replication. If it still does not succeed, the replication
status is set as `FAILED` and will be caught up on
scanner cycle.
For InfoServiceAccount API, calculating the policy before showing it to
the user was not correctly done (only UX issue, not a security issue)
This commit fixes it.
policy might have an associated mapping with an expired
user key, do not return an error during DeletePolicy
for such situations - proceed normally as its an
expected situation.
This commit introduces a new package `pkg/kms`.
It contains basic types and functions to interact
with various KMS implementations.
This commit also moves KMS-related code from `cmd/crypto`
to `pkg/kms`. Now, it is possible to implement a KMS-based
config data encryption in the `pkg/config` package.
This commit introduces a new package `pkg/fips`
that bundles functionality to handle and configure
cryptographic protocols in case of FIPS 140.
If it is compiled with `--tags=fips` it assumes
that a FIPS 140-2 cryptographic module is used
to implement all FIPS compliant cryptographic
primitives - like AES, SHA-256, ...
In "FIPS mode" it excludes all non-FIPS compliant
cryptographic primitives from the protocol parameters.
- Add 32-bit checksum (32 LSB part of xxhash64) of the serialized metadata.
This will ensure that we always reject corrupted metadata.
- Add automatic repair of inline data, so the data structure can be used.
If data was corrupted, we remove all unreadable entries to ensure that operations
can succeed on the object. Since higher layers add bitrot checks this is not a big problem.
Cannot downgrade to v1.1 metadata, but since that isn't released, no need for a major bump.
only in case of S3 gateway we have a case where we
need to allow for SSE-S3 headers as passthrough,
If SSE-C headers are passed then they are rejected
if KMS is not configured.
This code is necessary for `mc admin update` command
to work with fips compiled binaries, with fips tags
the releaseInfo will automatically point to fips
specific binaries.
This commit fixes a bug in the put-part
implementation. The SSE headers should be
set as specified by AWS - See:
https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html
Now, the MinIO server should set SSE-C headers,
like `x-amz-server-side-encryption-customer-algorithm`.
Fixes#11991
locks can get relinquished when Read() sees io.EOF
leading to prematurely closing of the readers
concurrent writes on the same object can have
undesired consequences here when these locks
are relinquished.
EOF may be sent along with data so queue it up and
return it when the buffer is empty.
Also, when reading data without direct io don't add a buffer
that only results in extra memcopy.
Multiple disks from the same set would be writing concurrently.
```
WARNING: DATA RACE
Write at 0x00c002100ce0 by goroutine 166:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Previous write at 0x00c002100ce0 by goroutine 129:
github.com/minio/minio/cmd.(*erasureSets).connectDisks.func1()
d:/minio/minio/cmd/erasure-sets.go:254 +0x82f
Goroutine 166 (running) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
Goroutine 129 (finished) created at:
github.com/minio/minio/cmd.(*erasureSets).connectDisks()
d:/minio/minio/cmd/erasure-sets.go:210 +0x324
github.com/minio/minio/cmd.(*erasureSets).monitorAndConnectEndpoints()
d:/minio/minio/cmd/erasure-sets.go:288 +0x244
```
This commit adds a self-test for all bitrot algorithms:
- SHA-256
- BLAKE2b
- HighwayHash
The self-test computes an incremental checksum of pseudo-random
messages. If a bitrot algorithm implementation stops working on
some CPU architecture or with a certain Go version this self-test
will prevent the server from starting and silently corrupting data.
For additional context see: minio/highwayhash#19
Metrics calculation was accumulating inital usage across all nodes
rather than using initial usage only once.
Also fixing:
- bug where all peer traffic was going to the same node.
- reset counters when replication status changes from
PENDING -> FAILED
This PR fixes
- close leaking bandwidth report channel leakage
- remove the closer requirement for bandwidth monitor
instead if Read() fails remember the error and return
error for all subsequent reads.
- use locking for usage-cache.bin updates, with inline
data we cannot afford to have concurrent writes to
usage-cache.bin corrupting xl.meta
implementation in #11949 only catered from single
node, but we need cluster metrics by capturing
from all peers. introduce bucket stats API that
will be used for capturing in-line bucket usage
as well eventually
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.
This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.
This helps reduce the chattiness for following calls..
- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.
- add API to get replication metrics
- add MRF worker to handle spill-over replication operations
- multiple issues found with replication
- fixes an issue when client sends a bucket
name with `/` at the end from SetRemoteTarget
API call make sure to trim the bucket name to
avoid any extra `/`.
- hold write locks in GetObjectNInfo during replication
to ensure that object version stack is not overwritten
while reading the content.
- add additional protection during WriteMetadata() to
ensure that we always write a valid FileInfo{} and avoid
ever writing empty FileInfo{} to the lowest layers.
Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.
for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.
this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
upgrading from 2yr old releases is expected to work,
the issue was we were missing checksum info to be
passed down to newBitrotReader() for whole bitrot
calculation
Ensure that we don't use potentially broken algorithms for critical functions, whether it be a runtime problem or implementation problem for a specific platform.
It is inefficient to decide to heal an object before checking its
lifecycle for expiration or transition. This commit will just reverse
the order of action: evaluate lifecycle and heal only if asked and
lifecycle resulted a NoneAction.
replication didn't work as expected when deletion of
delete markers was requested in DeleteMultipleObjects
API, this is due to incorrect lookup elements being
used to look for delete markers.
This allows us to speed up or slow down sleeps
between multiple scanner cycles, helps in testing
as well as some deployments might want to run
scanner more frequently.
This change is also dynamic can be applied on
a running cluster, subsequent cycles pickup
the newly set value.
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
This feature brings in support for auto extraction
of objects onto MinIO's namespace from an incoming
tar gzipped stream, the only expected metadata sent
by the client is to set `snowball-auto-extract`.
All the contents from the tar stream are saved as
folders and objects on the namespace.
fixes#8715
service accounts were not inheriting parent policies
anymore due to refactors in the PolicyDBGet() from
the latest release, fix this behavior properly.
The local node name is heavily used in tracing, create a new global
variable to store it. Multiple goroutines can access it since it won't be
changed later.
In #11888 we observe a lot of running, WalkDir calls.
There doesn't appear to be any listerners for these calls, so they should be aborted.
Ensure that WalkDir aborts when upstream cancels the request.
Fixes#11888
The background healing can return NoSuchUpload error, the reason is that
healing code can return errFileNotFound with three parameters. Simplify
the code by returning exact errUploadNotFound error in multipart code.
Also ensure that a typed error is always returned whatever the number of
parameters because it is better than showing internal error.
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
baseDirFromPrefix(prefix) for object names without
parent directory incorrectly uses empty path, leading
to long listing at various paths that are not useful
for healing - avoid this listing completely if "baseDir"
returns empty simple use the "prefix" as is.
this improves startup performance significantly
For large objects taking more than '3 minutes' response
times in a single PUT operation can timeout prematurely
as 'ResponseHeader' timeout hits for 3 minutes. Avoid
this by keeping the connection active during CreateFile
phase.
some SDKs might incorrectly send duplicate
entries for keys such as "conditions", Go
stdlib unmarshal for JSON does not support
duplicate keys - instead skips the first
duplicate and only preserves the last entry.
This can lead to issues where a policy JSON
while being valid might not properly apply
the required conditions, allowing situations
where POST policy JSON would end up allowing
uploads to unauthorized buckets and paths.
This PR fixes this properly.
This commit adds a `MarshalText` implementation
to the `crypto.Context` type.
The `MarshalText` implementation replaces the
`WriteTo` and `AppendTo` implementation.
It is slightly slower than the `AppendTo` implementation
```
goos: darwin
goarch: arm64
pkg: github.com/minio/minio/cmd/crypto
BenchmarkContext_AppendTo/0-elems-8 381475698 2.892 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/1-elems-8 17945088 67.54 ns/op 0 B/op 0 allocs/op
BenchmarkContext_AppendTo/3-elems-8 5431770 221.2 ns/op 72 B/op 2 allocs/op
BenchmarkContext_AppendTo/4-elems-8 3430684 346.7 ns/op 88 B/op 2 allocs/op
```
vs.
```
BenchmarkContext/0-elems-8 135819834 8.658 ns/op 2 B/op 1 allocs/op
BenchmarkContext/1-elems-8 13326243 89.20 ns/op 128 B/op 1 allocs/op
BenchmarkContext/3-elems-8 4935301 243.1 ns/op 200 B/op 3 allocs/op
BenchmarkContext/4-elems-8 2792142 428.2 ns/op 504 B/op 4 allocs/op
goos: darwin
```
However, the `AppendTo` benchmark used a pre-allocated buffer. While
this improves its performance it does not match the actual usage of
`crypto.Context` which is passed to a `KMS` and always encoded into
a newly allocated buffer.
Therefore, this change seems acceptable since it should not impact the
actual performance but reduces the overall code for Context marshaling.
When an object is removed, its parent directory is inspected to check if
it is empty to remove if that is the case.
However, we can use os.Remove() directly since it is only able to remove
a file or an empty directory.
RenameData renames xl.meta and data dir and removes the parent directory
if empty, however, there is a duplicate check for empty dir, since the
parent dir of xl.meta is always the same as the data-dir.
on freshReads if drive returns errInvalidArgument, we
should simply turn-off DirectIO and read normally, there
are situations in k8s like environments where the drives
behave sporadically in a single deployment and may not
have been implemented properly to handle O_DIRECT for
reads.
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
MRF was starting to heal when it receives a disk connection event, which
is not good when a node having multiple disks reconnects to the cluster.
Besides, MRF needs Remove healing option to remove stale files.