Do listings with prefix filter when bloom filter is dirty.
This will forward the prefix filter to the lister which will make it
only scan the folders/objects with the specified prefix.
If we have a clean bloom filter we try to build a more generally
useful cache so in that case, we will list all objects/folders.
Similar to #10775 for fewer memory allocations, since we use
getOnlineDisks() extensively for listing we should optimize it
further.
Additionally, remove all unused walkers from the storage layer
WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers
used by `io.CopyBuffer` even if they are pooled.
Since all writers appear to write byte buffers, just send those
instead and write directly. The files are opened through the `os`
package so they have no special properties anyway.
This removes the alloc and copy for each operation.
REST sends content length so a precise alloc can be made.
this reduces allocations in order of magnitude
Also, revert "erasure: delete dangling objects automatically (#10765)"
affects list caching should be investigated.
Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
It was observed in VMware vsphere environment during a
pod replacement, `mc admin info` might report incorrect
offline nodes for the replaced drive. This issue eventually
goes away but requires quite a lot of time for all servers
to be in sync.
This PR fixes this behavior properly.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
When crawling never use a disk we know is healing.
Most of the change involves keeping track of the original endpoint on xlStorage
and this also fixes DiskInfo.Endpoint never being populated.
Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is
Healing, skipping` once on a cycle (no more often than every 5m).
this is to detect situations of corruption disk
format etc errors quickly and keep the disk online
in such scenarios for requests to fail appropriately.
- admin info node offline check is now quicker
- admin info now doesn't duplicate the code
across doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors when
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors.
- Add changes to ensure remote disks are not
incorrectly taken online if their order has
changed or are incorrect disks.
- Bring changes to peer to detect disconnection
with separate Health handler, to avoid a
rather expensive call GetLocakDiskIDs()
- Follow up on the same changes for Lockers
as well
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
GetDiskID() in storage rest client does not really issue a REST request
to the remote disk, but returns an in-memory value instead.
However, GetDiskID() should return an error when format.json is not
found or for other similar issues (unmounted disks, etc..)
GetDiskID() is only called when formatting disks and getting storage
informatio, hence this commit should not have a performance degradation.
Shuffling arguments that we pass to MinIO server are supported. However,
when that happens, Prometheus returns wrong information about disks usage
and online/offline status.
The commit fixes the issue by avoiding relying on xl.endpoints since
it is not ordered.
The `keepHTTPResponseAlive` would cause errors to be
returned with status OK.
- Add '32' as a filler byte until a response is ready
- '0' to indicate the response is ready to be consumed
- '1' to indicate response has an error which needs
to be returned to the caller
Clear out 'file not found' errors from dir walker, since it may be
in a folder that has been deleted since it was scanned.
- Add conservative timeouts upto 3 minutes
for internode communication
- Add aggressive timeouts of 30 seconds
for gateway communication
Fixes#9105Fixes#8732Fixes#8881Fixes#8376Fixes#9028
Bulk delete API was using cleanupObjectsBulk() which calls posix
listing and delete API to remove objects internal files in the
backend (xl.json and parts) one by one.
Add DeletePrefixes in the storage API to remove the content
of a directory in a single call.
Also use a remove goroutine for each disk to accelerate removal.
Upgrades between releases are failing due to strict
rule to avoid rolling upgrades, it is enough to
bump up APIs between versions to allow for quorum
failure and wait times. Authentication failures are
catastrophic in nature which leads to server not
be able to upgrade properly.
Fixes#9021Fixes#8968
multi-delete API failed with write quorum errors
under following situations
- list of files requested for delete doesn't exist
anymore can lead to quorum errors and failure
- due to usage of query param for paths, for really
long paths MinIO server rejects these requests as
malformed as unexpected.
This was reproduced with warp
JWT parsing is simplified by using a custom claim
data structure such as MapClaims{}, also writes
a custom Unmarshaller for faster unmarshalling.
- Avoid as much reflections as possible
- Provide the right types for functions as much
as possible
- Avoid strings.Join, strings.Split to reduce
allocations, rely on indexes directly.
instead perform a liveness check call to
verify if server is online and print relevant
errors.
Also introduce a StorageErr string error type
instead of errors.New() deprecate usage of
VerifyFileError, DeleteFileError for gob,
change in datastructure also requires bump in
storage REST version to v13.
Fixes#8811
Admin data usage info API returns the following
(Only FS & XL, for now)
- Number of buckets
- Number of objects
- The total size of objects
- Objects histogram
- Bucket sizes
This PR adds code to appropriately handle versioning issues
that come up quite constantly across our API changes. Currently
we were also routing our requests wrong which sort of made it
harder to write a consistent error handling code to appropriately
reject or honor requests.
This PR potentially fixes issues
- old mc is used against new minio release which is incompatible
returns an appropriate for client action.
- any older servers talking to each other, report appropriate error
- incompatible peer servers should report error and reject the calls
with appropriate error
This change is related to larger config migration PR
change, this is a first stage change to move our
configs to `cmd/config/` - divided into its subsystems
- Heal if the part.1 is truncated from its original size
- Heal if the part.1 fails while being verified in between
- Heal if the part.1 fails while being at a certain offset
Other cleanups include make sure to flush the HTTP responses
properly from storage-rest-server, avoid using 'defer' to
improve call latency. 'defer' incurs latency avoid them
in our hot-paths such as storage-rest handlers.
Fixes#8319
errors.errorString() cannot be marshalled by gob
encoder, so using a slice of []error would fail
to be encoded. This leads to no errors being
generated instead gob.Decoder on the storage-client
would see an io.EOF
To avoid such bugs introduce a typed error for
handling such translations and register this type
for gob encoding support.
VerifyFile in the distributed setup does not work with
the non streaming highway hash. The reason is that the
internode mux router did not expect `storageRESTBitrotHash`
parameter.
posix.VerifyFile() doesn't know how to check if a file
is corrupted if that file is empty. We do have the part
size in xl.json so we pass it to VerifyFile to return
an error so healing empty parts can work properly.
When MinIO is behind a proxy, proxies end up killing
clients when no data is seen on the connection, adding
the right content-type ensures that proxies do not come
in the way.
This allows for canonicalization of the strings
throughout our code and provides a common space
for all these constants to reside.
This list is rather non-exhaustive but captures
all the headers used in AWS S3 API operations
With these changes we are now able to peak performances
for all Write() operations across disks HDD and NVMe.
Also adds readahead for disk reads, which also increases
performance for reads by 3x.
This is necessary to avoid connection build up between servers
unexpectedly for example in a situation where 16 servers are
talking to each other and one server now allows a maximum of
15*4096 = 61440 idle connections
Will be kept in pool. Such a large pool is perhaps inefficient for
many reasons and also affects overall system resources.
This PR also reduces idleConnection timeout from 120 secs to 60 secs.
Bulk delete at storage level in Multiple Delete Objects API
In order to accelerate bulk delete in Multiple Delete objects API,
a new bulk delete is introduced in storage layer, which will accept
a list of objects to delete rather than only one. Consequently,
a new API is also need to be added to Object API.
Other listing optimizations include
- remove double sorting while filtering object entries
- improve error message when upload-id is not in quorum
- use jsoniter for full unmarshal json, instead of gjson
- remove unused code
In distributed mode, use REST API to acquire and manage locks instead
of RPC.
RPC has been completely removed from MinIO source.
Since we are moving from RPC to REST, we cannot use rolling upgrades as the
nodes that have not yet been upgraded cannot talk to the ones that have
been upgraded.
We expect all minio processes on all nodes to be stopped and then the
upgrade process to be completed.
Also force http1.1 for inter-node communication
This commit fixes another privilege escalation issue
abusing the inter-node communication of distributed
servers to obtain/modify the server configuration.
The inter-node communication is authenticated using
JWT-Tokens. Further, IAM users accessing the cluster
via the web UI also get a JWT token and the browser
will add this "user" JWT token to each the request.
Now, a user can extract that JWT token an can craft
HTTP POST requests for the inter-node communication
API endpoint. Since the server accepts ANY valid
JWT token it also accepts inter-node commands from
an authenticated user such that the user can execute
arbitrary commands bypassing the IAM policy engine
and impersonate other users, change its own IAM policy
or extract the admin access/secret key.
This is fixed by only accepting "admin" JWT tokens
(tokens containing the admin access key - and therefore
were generated with the admin secret key). Consequently,
only the admin user can execute such inter-node commands.
registering notFound handler more than once causes
gorilla mux to return error for all registered paths
greater than > 8. This might be a bug in the gorilla/mux
but we shouldn't be using it this way. NotFound handler
should be only registered once per root router.
Fixes#6915
Rolling update doesn't work properly because Storage REST API has
a new API WriteAll() but without API version number increase.
Also be sure to return 404 for unknown http paths.
xl.json is the source of truth for all erasure
coded objects, without which we won't be able to
read the objects properly. This PR enables sync
mode for writing `xl.json` such all writes go hit
the disk and are persistent under situations such
as abrupt power failures on servers running Minio.