fixes #18724
A regression was introduced in #18547, which attempted
to fix adding a missing `null` marker. However, we
should not skip returning based on versionID; instead,
it must be based on whether we are being asked to create
a DEL marker or not.
The PR also has a side effect for replicating `null`
marker permanent delete, as it may end up adding a
`null` marker while removing one.
This PR should address both scenarios.
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups.
To enable this feature, please set the `_MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap.
Once this has been enabled, you need to also set `list_quorum`.
```
~ mc admin config set alias/ api list_quorum=auto
```
A new debugging tool is available to check for any missing counters.
If a policy with the following condition is present
```
"Condition": {
    "IpAddress": {
        "aws:SourceIp": [
            "54.240.143.0/24",
            "2001:DB8:1234:5678::/64"
        ]
    }
}
and a client makes a request to MinIO via IPv6, the server
can potentially crash.
The workaround is to turn off IPv6 and use only IPv4.
This PR also increases per node bpool memory from 1024 entries
to 2048 entries; along with that, it also moves the byte pool
centrally instead of being per pool.
minio_node_tier_ttlb_seconds - Distribution of time to last byte for streaming objects from warm tier
minio_node_tier_requests_success - Number of requests to download object from warm tier that were successful
minio_node_tier_requests_failure - Number of requests to download object from warm tier that failed
SUBNET now has a v2 of license that is returned in the new key
`license_v2`. mc will start reading and storing the same. (The old key
`license` is deprecated but is still available in SUBNET response to
ensure that the current released version of minio doesn't break)
`(*xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`.
The wrapped reader is handed to `(*xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`,
which reads into an `ODirectPool` buffer.
`(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading,
the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused.
However, if the async `Read` call unblocks, it will write to the now recycled buffer.
Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts.
This is not a disk timeout, anyway.
Regression in https://github.com/minio/minio/pull/17745
This patch adds the targetID to the existing notification target metrics
and deprecates the current target metrics, which point to the overall
event notification subsystem.
historically, we have always kept storage-rest-server
and a local storage API separate without much trouble,
since they both can independently operate due to no
special state() between them.
however, over some time, we have added state()
such as
- drive monitoring threads now there will be "2" of
them per drive instead of just 1.
- concurrent tokens available per drive are now twice
instead of just single shared, allowing unexpectedly
high amount of I/O to go through.
- applying serialization by using walkMutexes can now
be adequately honored for both remote callers and local
callers.
Regression from #18285. CopyObject options were inheriting source MTime
for metadata timestamps if unspecified, removing this prevented metadata
updates from being applied on target.
By default the cpu load is the cumulative of all cores. Capture the
percentage load (load * 100 / cpu-count)
Also capture the percentage memory used (used * 100 / total)
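The two formulas above can be sketched as follows. This is only an illustration of the arithmetic; the function names are hypothetical and are not taken from the MinIO code base.

```go
package main

import "fmt"

// cpuLoadPercent converts a cumulative load figure (summed across all
// cores) into a percentage of total capacity: load * 100 / cpu-count.
// The name is hypothetical, for illustration only.
func cpuLoadPercent(load float64, cpuCount int) float64 {
	if cpuCount <= 0 {
		return 0
	}
	return load * 100 / float64(cpuCount)
}

// memUsedPercent computes used * 100 / total.
func memUsedPercent(used, total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(used) * 100 / float64(total)
}

func main() {
	// A cumulative load of 4 on a 16-core machine.
	fmt.Printf("cpu: %.1f%%\n", cpuLoadPercent(4, 16))
	// 8 GiB used out of 32 GiB.
	fmt.Printf("mem: %.1f%%\n", memUsedPercent(8<<30, 32<<30))
}
```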
use memory for async events when necessary and dequeue them as
needed, for all synchronous events customers must enable
```
MINIO_API_SYNC_EVENTS=on
```
Async events can be lost, but it is up to the admin to
decide what they want; we will not create a runaway number
of goroutines per event, instead we will queue them properly.
Currently the max async workers is set to runtime.GOMAXPROCS(0)
which is more than sufficient in general, but it can be made
configurable in future but may not be needed.
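A minimal sketch of the queueing idea described above, assuming a buffered channel as the in-memory queue and `runtime.GOMAXPROCS(0)` workers draining it. The `eventQueue` type, its methods, and the drop-when-full policy are assumptions made for illustration; they are not the actual MinIO implementation.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// eventQueue is a hypothetical in-memory queue for async events.
type eventQueue struct {
	ch      chan string
	wg      sync.WaitGroup
	handled atomic.Int64
	dropped atomic.Int64
}

// newEventQueue starts a fixed pool of workers instead of one
// goroutine per event.
func newEventQueue(size int, handle func(string)) *eventQueue {
	q := &eventQueue{ch: make(chan string, size)}
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for ev := range q.ch {
				handle(ev)
				q.handled.Add(1)
			}
		}()
	}
	return q
}

// enqueue never blocks the caller. Assumption: when the queue is full
// the event is dropped and counted, since async events may be lost.
func (q *eventQueue) enqueue(ev string) bool {
	select {
	case q.ch <- ev:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

// closeAndWait stops accepting events and waits for the workers to drain.
func (q *eventQueue) closeAndWait() {
	close(q.ch)
	q.wg.Wait()
}

func main() {
	q := newEventQueue(1024, func(string) {})
	for i := 0; i < 100; i++ {
		q.enqueue(fmt.Sprintf("event-%d", i))
	}
	q.closeAndWait()
	fmt.Println("handled:", q.handled.Load(), "dropped:", q.dropped.Load())
}
```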
there is potential for danglingWrites when quorum failed, where
only some drives took a successful write, generally this is left
to the healing routine to pick it up. However it is better that
we delete it right away to avoid potential for quorum issues on
version signature when there are many versions of an object.
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.
Bonus: add a metrics counter for these missed tasks, so we will
know for sure if one of the nodes is lagging behind or is
losing too many tasks during transitioning.
A disk that is not able to initialize when an instance is started
will never have a handler registered, which means a user will
need to restart the node after fixing the disk;
This will also prevent showing the wrong 'upgrade is needed.'
error message in that case.
When the disk is still failing, print an error every 30 minutes;
Disk reconnection will be retried every 30 seconds.
Co-authored-by: Anis Elleuch <anis@min.io>
`OpMuxConnectError` was not handled correctly.
Remove local checks for single request handlers so they can
run before being registered locally.
Bonus: Only log IAM bootstrap on startup.
While healing the latest changes of expiry rules across sites,
if the target had pre-existing transition rules, they were getting
overwritten because the cloned latest expiry rules from the remote site were
getting written as is. Fixed the same and added test cases as
well.
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
moveToTrash() function moves a folder to .trash, for example, when
doing some object deletions: a data dir that has many parts will be
renamed to the trash folder. However, ENOSPC is a valid error from
rename(), and it can cripple a user trying to free some space in a
full disk situation.
Therefore, this commit will try to do a recursive delete in that case.
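The fallback can be sketched roughly as below. The helper `moveToTrashOrDelete` and the injected `rename` function are hypothetical and exist only so that the ENOSPC case can be simulated; the real moveToTrash() may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// renameFunc lets the demo inject a failing rename (hypothetical).
type renameFunc func(oldpath, newpath string) error

// moveToTrashOrDelete tries to rename src into trashDir. If the rename
// fails with ENOSPC, it falls back to a recursive delete of src.
func moveToTrashOrDelete(src, trashDir string, rename renameFunc) error {
	dst := filepath.Join(trashDir, filepath.Base(src))
	err := rename(src, dst)
	if errors.Is(err, syscall.ENOSPC) {
		return os.RemoveAll(src)
	}
	return err
}

// demoFullDisk simulates a full disk and reports whether src was removed.
func demoFullDisk() (bool, error) {
	base, err := os.MkdirTemp("", "trash-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(base)

	src := filepath.Join(base, "datadir")
	if err := os.MkdirAll(filepath.Join(src, "part.1"), 0o755); err != nil {
		return false, err
	}
	// Simulated rename that always reports "no space left on device".
	fullDisk := func(oldpath, newpath string) error {
		return &os.LinkError{Op: "rename", Old: oldpath, New: newpath, Err: syscall.ENOSPC}
	}
	if err := moveToTrashOrDelete(src, filepath.Join(base, ".trash"), fullDisk); err != nil {
		return false, err
	}
	_, statErr := os.Stat(src)
	return errors.Is(statErr, os.ErrNotExist), nil
}

func main() {
	removed, err := demoFullDisk()
	fmt.Println("removed:", removed, "err:", err)
}
```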
This allows batch replication to skip
copying objects that do not have read quorum.
This PR also allows walk() to provide custom
values for quorum under batch replication, and
key rotation.
this PR allows following policy
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Deny a presigned URL request if the signature is more than 10 min old",
            "Effect": "Deny",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*",
            "Condition": {
                "NumericGreaterThan": {
                    "s3:signatureAge": 600000
                }
            }
        }
    ]
}
```
This is to basically disable all pre-signed URLs that are older than 10 minutes.
AWS S3 closes keep-alive connections frequently
leading to frivolous logs filling up the MinIO
logs when the transition tier is an AWS S3 bucket.
Ignore such transient errors, let MinIO retry
it when it can.
When minio runs with MINIO_CI_CD=on, it is expected to communicate
with the locally running SUBNET. This is happening in the case of MinIO
via call home functionality. However, the subnet-related functionality inside the
console continues to talk to the SUBNET production URL. Because of this,
the console cannot be tested with a locally running SUBNET.
Set the env variable CONSOLE_SUBNET_URL correctly in such cases.
(The console already has code to use the value of this variable
as the subnet URL)
Optionally allows customers to
- enable an external cache to catch GET/HEAD responses
- enable skipping disks that are slow to respond in GET/HEAD
when we have already achieved a quorum
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
This PR adds a WebSocket grid feature that allows servers to communicate via
a single two-way connection.
There are two request types:
* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
which allows for different combinations of full two-way streams with an initial payload.
Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.
Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.
If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte
slices or use a higher-level generics abstraction.
There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.
The request path can be changed to a new one for any protocol changes.
First, all servers create a "Manager." The manager must know its address
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using:
```
func (m *Manager) Connection(host string) *Connection
```
All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight
requests and responses must also be given for streaming requests.
The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.
* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
performs a single request and returns the result. Any deadline provided on the request is
forwarded to the server, and canceling the context will make the function return at once.
* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
will initiate a remote call and send the initial payload.
```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
// the appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses *must* be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel *must* be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```
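As a rough illustration of the caller-side contract described in the comments above (close `Requests` when done, read `Responses` until an error or close), here is a self-contained sketch. It redeclares `Stream` and `Response` locally, and `fakeEchoStream` is a hypothetical in-process stand-in for `Connection.NewStream`; none of this uses the real grid package.

```go
package main

import "fmt"

// Local copies of the types shown above, so the sketch compiles on its own.
type Response struct {
	Msg []byte
	Err error
}

type Stream struct {
	Responses <-chan Response
	Requests  chan<- []byte
}

// fakeEchoStream is a hypothetical stand-in for a remote handler:
// it echoes every request and closes Responses when Requests is closed.
func fakeEchoStream() *Stream {
	reqs := make(chan []byte, 1)
	resps := make(chan Response, 1)
	go func() {
		defer close(resps)
		for req := range reqs {
			resps <- Response{Msg: append([]byte("echo:"), req...)}
		}
	}()
	return &Stream{Responses: resps, Requests: reqs}
}

// sendAll sends every message, closes Requests to signal the end of the
// stream, and reads Responses until an error arrives or the channel closes.
func sendAll(st *Stream, msgs [][]byte) ([][]byte, error) {
	go func() {
		defer close(st.Requests)
		for _, m := range msgs {
			st.Requests <- m
		}
	}()
	var out [][]byte
	for resp := range st.Responses {
		if resp.Err != nil {
			return out, resp.Err
		}
		out = append(out, resp.Msg)
	}
	return out, nil
}

func main() {
	out, err := sendAll(fakeEchoStream(), [][]byte{[]byte("a"), []byte("b")})
	fmt.Printf("%q %v\n", out, err)
}
```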
There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
With an odd number of drives per erasure set, the write quorum is
half + 1; however, the decommissioning listing will still list those
objects and does not consider them stale.
Fix it by using (N+1)/2 formula.
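The difference only shows up with integer division on odd drive counts. The sketch below illustrates the arithmetic alone; the function name is hypothetical, and how the value is used in the decommissioning listing is not shown.

```go
package main

import "fmt"

// halfRoundedUp implements the (N+1)/2 formula with integer division.
func halfRoundedUp(n int) int {
	return (n + 1) / 2
}

func main() {
	for _, n := range []int{4, 5, 6, 7} {
		// For even n both expressions agree; for odd n, n/2 is one lower.
		fmt.Printf("drives=%d  n/2=%d  (n+1)/2=%d\n", n, n/2, halfRoundedUp(n))
	}
}
```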
Co-authored-by: Anis Elleuch <anis@min.io>
The immediate transition use case is mostly used to fill the warm
backend with a lot of data when a new deployment is created.
Currently, if the transition queue is full, the transition will be
deferred to the scanner; change this behavior by blocking the PUT request
until the transition queue has room for a new transition task.
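A blocking enqueue of this kind could look like the sketch below, assuming the queue is a buffered channel and the request carries a context. `transitionQueue` and `enqueue` are hypothetical names used for illustration only.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// transitionQueue is a hypothetical bounded queue of transition tasks.
type transitionQueue struct {
	tasks chan string
}

// enqueue blocks until the queue has room or the context is done.
func (q *transitionQueue) enqueue(ctx context.Context, task string) error {
	select {
	case q.tasks <- task:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoBlocking fills a queue of capacity 1, shows that a second enqueue
// blocks until its context expires, then succeeds once a slot is freed.
func demoBlocking() (firstErr, secondErr, thirdErr error) {
	q := &transitionQueue{tasks: make(chan string, 1)}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	firstErr = q.enqueue(ctx, "obj-1")  // fits immediately
	secondErr = q.enqueue(ctx, "obj-2") // blocks until the timeout
	<-q.tasks                           // a worker picks up obj-1
	thirdErr = q.enqueue(context.Background(), "obj-3")
	return firstErr, secondErr, thirdErr
}

func main() {
	fmt.Println(demoBlocking())
}
```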
Currently if the object does not exist in quorum disks of an erasure
set, the dangling code is never called because the returned error will
be errFileNotFound or errFileVersionNotFound;
With this commit, when errFileNotFound or errFileVersionNotFound is
returned while trying to calculate the quorum of a given object, the
code checks if a disk returned nil, which means a stale object exists on
that disk; that will trigger the deleteIfDangling() function.
This commit splits the liveness and readiness
handler into two separate handlers. In K8S, a
liveness probe is used to determine whether the
pod is in "live" state and functioning at all.
In contrast, the readiness probe is used to
determine whether the pod is ready to serve
requests.
A failing liveness probe causes pod restarts while
a failing readiness probe causes k8s to stop routing
traffic to the pod. Hence, a liveness probe should
be as robust as possible while a readiness probe
should be used for load balancing.
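The split can be illustrated with two independent handlers, as in this sketch. The `/live` and `/ready` paths and the `ready` flag are assumptions for the example; they are not the routes or readiness logic used by MinIO.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready is a hypothetical flag set once the server can serve requests.
var ready atomic.Bool

// livenessHandler only reports that the process is up and responding.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readinessHandler reports 503 until the server is ready for traffic.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if !ready.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/live", livenessHandler)   // hypothetical path
	mux.HandleFunc("/ready", readinessHandler) // hypothetical path
	return mux
}

// probe issues an in-process GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println("live:", probe(mux, "/live"), "ready:", probe(mux, "/ready"))
	ready.Store(true)
	fmt.Println("live:", probe(mux, "/live"), "ready:", probe(mux, "/ready"))
}
```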
Ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Signed-off-by: Andreas Auernhammer <github@aead.dev>
This patch takes care of loading the bucket configs of failed buckets
during the periodic refresh. This makes sure the event notifiers and
remote bucket targets are properly initialized.
users might use MinIO on NFS, GPFS that provide dynamic
inodes and may not even have a concept of free inodes.
to allow users to use MinIO on top of GPFS relax the
free inode check.
* creating a byte buffer for SFTP file segments
* Adding an error condition for when there are
remaining segments in the queue
* Simplification of the queue using a map
it is possible that ILM or Deletes got triggered on batch
of objects that we are attempting to batch replicate, ignore
this scenario as valid behavior.
sendfile implementation to perform DMA on all platforms
Go stdlib already supports sendfile/splice implementations
for
- Linux
- Windows
- *BSD
- Solaris
Along with this change, however, O_DIRECT for reads() must be
removed as well, since we need to use the sendfile() implementation.
The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO; however, it would
seem that by avoiding buffer copies from user-space to kernel space,
this issue is not a problem anymore.
There is no Go-based memory allocation required, and the
page-cache is not referenced back to MinIO either. This page-cache
reference is fully owned by the kernel at this point, which
essentially should solve the problem of page-cache build-up.
With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
`monitorAndConnectEndpoints` will continue to attempt to reconnect offline disks.
Since disks were never closed, a `MarkOffline` would continue to try to check these disks forever.
Close previous disks.
replace io.Discard usage to fix NUMA copy() latencies
On NUMA systems, copying with the 8K buffer allocated via
io.Discard leads to a large latency build-up; every
```
copy(new8kbuf, largebuf)
```
can incur up to 1ms worth of latency on NUMA systems
due to memory sharding across NUMA nodes.
Fix various regressions from #18029
* If context is canceled the token is never returned. This will lead to scanner being unable to save and deadlocking.
* Fix backup not being able to get any data (hr empty)
* Reduce backup timeout.
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6
Bonus: fixes an issue where objects that have not yet tiered
are not assumed to be of the 'STANDARD' storage-class; this
should be conditional based on the object's
metadata, not a default assumption.
This PR also does some cleanup in terms of implementation.
fixes #18070
https://github.com/minio/minio/pull/18307 partially removed the duplicate upload id check.
While I can't really see how ListDir can return duplicate entries, let's re-add it, since it is a cheap sanity check.
There can be rare situations where errors seen in bucket metadata
load on startup or subsequent metadata updates can result in missing
replication remotes.
Attempt a refresh of remote targets backed by a good replication config
lazily in 5 minute intervals if there ever occurs a situation where
remote targets go AWOL.
Since relaxing the quorum errors across pools
for ListBuckets() and GetBucketInfo(), we hit a
situation where loading IAM could potentially
return an error for the second pool that the server
is not initialized.
We need to handle this, let the pool come online
and retry transparently - this PR fixes that.
x-amz-signed-headers is meant for HTTP headers only
not for query params, using that to verify things
further can lead to failure.
The generated presigned URL with custom metadata
is already kosher (tamper proof).
fixes #18281
`resourceMetricsMap` has no protection against concurrent reads and writes.
Add a mutex and don't use maps from the last iteration.
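A minimal sketch of that pattern, assuming a plain `map[string]float64`: guard the map with a mutex and store a fresh copy on every update instead of reusing the previous iteration's map. The `resourceMetrics` type here is hypothetical and unrelated to the actual `resourceMetricsMap` layout.

```go
package main

import (
	"fmt"
	"sync"
)

// resourceMetrics is a hypothetical mutex-guarded metrics store.
type resourceMetrics struct {
	mu     sync.RWMutex
	values map[string]float64
}

// update stores a fresh copy of latest, so readers never share a map
// with the collector's next iteration.
func (r *resourceMetrics) update(latest map[string]float64) {
	fresh := make(map[string]float64, len(latest))
	for k, v := range latest {
		fresh[k] = v
	}
	r.mu.Lock()
	r.values = fresh
	r.mu.Unlock()
}

// get reads a single value under the read lock.
func (r *resourceMetrics) get(name string) (float64, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.values[name]
	return v, ok
}

func main() {
	var r resourceMetrics
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.update(map[string]float64{"cpu": float64(i)})
			r.get("cpu")
		}(i)
	}
	wg.Wait()
	_, ok := r.get("cpu")
	fmt.Println("cpu present:", ok)
}
```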
Bug introduced in #18057
Fixes #18271
globalDeploymentID was being read while it was being set.
Fixes race:
```
WARNING: DATA RACE
Write at 0x0000079605a0 by main goroutine:
github.com/minio/minio/cmd.connectLoadInitFormats()
github.com/minio/minio/cmd/prepare-storage.go:269 +0x14f0
github.com/minio/minio/cmd.waitForFormatErasure()
github.com/minio/minio/cmd/prepare-storage.go:294 +0x21d
...
Previous read at 0x0000079605a0 by goroutine 105:
github.com/minio/minio/cmd.newContext()
github.com/minio/minio/cmd/utils.go:817 +0x31e
github.com/minio/minio/cmd.adminMiddleware.func1()
github.com/minio/minio/cmd/admin-router.go:110 +0x96
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
github.com/minio/minio/cmd.setBucketForwardingMiddleware.func1()
github.com/minio/minio/cmd/generic-handlers.go:460 +0xb1a
net/http.HandlerFunc.ServeHTTP()
net/http/server.go:2136 +0x47
...
```
currently the default for all drives is 512, which is a lot
for HDDs; recent testing has revealed that moving this to 32
for HDDs seems like a fair value.
Introducing a new version of healthinfo struct for adding this info is
not correct. It needs to be implemented differently without adding a new
version.
This reverts commit 8737025d940f80360ed4b3686b332db5156f6659.
There is a fundamental race condition in `newErasureServerPools`, where setObjectLayer is
called before the poolMeta has been loaded/populated.
We add a placeholder value to this field but disable all saving of the value, so we don't risk
overwriting the value on disk. Once the value has been loaded or created, it is replaced with
the proper value, which will also be saved.
Also fixes various accesses of `poolMeta` that were done without locks.
We make the `poolMeta.IsSuspended` return false, even if we shouldn't risk out-of-bounds
reads anymore.
if erasure upgrade is needed rely on the in-memory
values, instead of performing a "DiskInfo()" call.
https://brendangregg.com/blog/2016-09-03/sudden-disk-busy.html
for HDDs these are problematic, let's avoid this because
there is no value in "being" absolutely strict here
in terms of parity. We are okay to increase parity
as we see fit based on the in-memory online/offline ratio.
Several callers to putObjectTar may be fighting to set sc. Move the write out of the loop.
Use static resp, and request elements.
Fixes tests with -race:
```
WARNING: DATA RACE
Read at 0x00c01cd680e0 by goroutine 691354:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2130 +0x149
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
Previous write at 0x00c01cd680e0 by goroutine 691352:
github.com/minio/minio/cmd.objectAPIHandlers.PutObjectExtractHandler.func1()
e:/gopath/src/github.com/minio/minio/cmd/object-handlers.go:2131 +0x15d
github.com/minio/minio/cmd.untar.func1()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:250 +0x2b6
github.com/minio/minio/cmd.untar.func8()
e:/gopath/src/github.com/minio/minio/cmd/untar.go:261 +0xa4
```
Calling unfreezeServices twice results in panic:
```
panic: "POST /minio/peer/v32/signalservice?signal=4&sub-sys=": close of nil channel
goroutine 14703 [running]:
runtime/debug.Stack()
runtime/debug/stack.go:24 +0x65
github.com/minio/minio/cmd.setCriticalErrorHandler.func1.1()
github.com/minio/minio/cmd/generic-handlers.go:549 +0x8e
panic({0x27c3020, 0x4c9b370})
runtime/panic.go:884 +0x212
github.com/minio/minio/cmd.unfreezeServices()
github.com/minio/minio/cmd/service.go:112 +0xc7
github.com/minio/minio/cmd.(*peerRESTServer).SignalServiceHandler(0x0?, {0x4cb6af0, 0xc010b96420}, 0xc01affab00)
github.com/minio/minio/cmd/peer-rest-server.go:837 +0x13a
net/http.HandlerFunc.ServeHTTP(...)
```
If the function was called a second time `val` would not be nil, but the returned channel `ch` would be, causing the panic.
Check the channel isn't nil and also use Swap for an atomic swap instead of 2 separate operations (though we are in a mutex).
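The idea of swapping atomically and checking for nil before closing can be sketched as below. This uses `atomic.Pointer` and a hypothetical `freezer` type rather than the actual globals in the MinIO source, and it assumes callers serialize freeze and unfreeze as the note about the mutex suggests.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// freezer is a hypothetical holder for the "services frozen" channel.
type freezer struct {
	ch atomic.Pointer[chan struct{}]
}

// freeze installs a new channel that waiters can block on.
func (f *freezer) freeze() <-chan struct{} {
	ch := make(chan struct{})
	f.ch.Store(&ch)
	return ch
}

// unfreeze swaps the stored channel out in one atomic step and closes it
// only if one was present, so a second call is a harmless no-op.
func (f *freezer) unfreeze() bool {
	ch := f.ch.Swap(nil)
	if ch == nil || *ch == nil {
		return false
	}
	close(*ch)
	return true
}

func main() {
	var f freezer
	waiters := f.freeze()
	fmt.Println("first unfreeze:", f.unfreeze())
	fmt.Println("second unfreeze:", f.unfreeze())
	<-waiters // returns immediately because the channel is closed
	fmt.Println("waiters released")
}
```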
Disk level O_DIRECT support checking at xl storage initialization was
conditional on a config setting being enabled. (This never took effect
because config initialization happens after ObjectLayer is ready.) This
is not necessary as the config setting is dynamic - O_DIRECT should be
enabled via runtime config. So we need to do the disk level support
check regardless of the config setting.
- Trace needs higher buffered channels than 4000 to ensure
when we run `mc admin trace -a` it captures all information
sufficiently.
- Listen event notification needs the event channel to be
`apiRequestsMaxPerNode` * number of nodes
Currently, the retry is not fully used when there is no backup copy of
the data usage; use 5 retry attempts when we don't have any valid data,
new or backup, unless we have seen an un-recognized error.
comment in the code provides more detailed explanation
on what this PR entails and its assumptions.
this PR reduces the amount of listing() by an order
of magnitude; however, there are other such calls that
still need further optimization, which shall be done
in subsequent PRs.
Add a new endpoint for "resource" metrics `/v2/metrics/resource`
This should return system metrics related to drives, network, CPU and
memory. Except for drives, other metrics should have corresponding "avg"
and "max" values also.
Reuse the real-time feature to capture the required data,
introducing CPU and memory metrics in it.
Collect the data every minute and keep updating the average and max values
accordingly, returning the latest values when the API is called.
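Tracking the latest, average, and maximum values per metric could be done along these lines. The `metricStat` type is a hypothetical illustration; the real collector and its one-minute ticker are not shown.

```go
package main

import "fmt"

// metricStat is a hypothetical accumulator for one resource metric.
type metricStat struct {
	latest float64
	max    float64
	sum    float64
	count  int
}

// record folds one new sample into the running latest/avg/max values.
func (m *metricStat) record(v float64) {
	m.latest = v
	if m.count == 0 || v > m.max {
		m.max = v
	}
	m.sum += v
	m.count++
}

// avg returns the mean of all samples recorded so far.
func (m *metricStat) avg() float64 {
	if m.count == 0 {
		return 0
	}
	return m.sum / float64(m.count)
}

func main() {
	var cpu metricStat
	// In a real collector each sample would arrive once per minute.
	for _, sample := range []float64{10, 30, 20} {
		cpu.record(sample)
	}
	fmt.Println("latest:", cpu.latest, "avg:", cpu.avg(), "max:", cpu.max)
}
```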
without this the rename2() can rename the previous dataDir
causing issues for different versions of the object, only
latest version is preserved due to this bug.
Added healing code to ensure recovery of such content.
not checking w.Close() can prematurely make us
think that the w.Write() actually succeeded; apparently
Write() may or may not return an error, and sometimes
only during a Close() call on the fd do we see the
error from Write() propagate.
If Fdatasync(w) on the FD returns an error, then
Close() error handling is less of a concern; however, it may
happen that fdatasync() does not return an error whereas
Close() does.
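A small sketch of the pattern, assuming a plain `*os.File`: sync first, then still surface the error from Close(). Note that it uses `(*os.File).Sync`, which is fsync rather than fdatasync, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeSynced writes data, syncs it, and reports the first error seen,
// including one that only shows up on Close().
func writeSynced(path string, data []byte) (err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	defer func() {
		// Do not drop the Close() error; keep an earlier error if any.
		if cerr := f.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	if _, err = f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}

// roundTrip writes data to a temporary file and reads it back.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "close-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "part.1")
	if err := writeSynced(path, []byte(data)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println(got, err)
}
```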
Currently, setting a new tiering target returns an error when a bucket
is versioned and the tiering credentials do not have authorization to
specify a version-id when reading or removing a specific version.
Since tiering does not require versioning anymore, avoid doing versioned
operations when performing checklist ops while adding a new tiering
configuration.
Do not error out when a provided marker is before or after the prefix, but instead just ignore it if before and return an empty list when after.
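The rule can be sketched over a sorted slice of keys as below. `listWithMarker` is a hypothetical helper written for illustration; it is not the listing code from MinIO and ignores delimiters and pagination.

```go
package main

import (
	"fmt"
	"strings"
)

// listWithMarker returns the keys under prefix that sort after marker.
// A marker outside the prefix is ignored when it sorts before the prefix
// and yields an empty list when it sorts after it.
func listWithMarker(sortedKeys []string, prefix, marker string) []string {
	if marker != "" && !strings.HasPrefix(marker, prefix) {
		if marker < prefix {
			marker = "" // before the prefix: ignore the marker
		} else {
			return nil // after the prefix: nothing left to list
		}
	}
	var out []string
	for _, k := range sortedKeys {
		if strings.HasPrefix(k, prefix) && k > marker {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	keys := []string{"a/1", "b/1", "b/2", "c/1"}
	fmt.Println(listWithMarker(keys, "b/", "a/9")) // marker before prefix
	fmt.Println(listWithMarker(keys, "b/", "c/0")) // marker after prefix
	fmt.Println(listWithMarker(keys, "b/", "b/1")) // marker inside prefix
}
```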
Fixes #18093