minio

Commit Graph

Author	SHA1	Message	Date
Klaus Post	e72429c79c	Add sizes to traces (#19851 ) added to storage and grid traces. Can provide more context for traces that aren't HTTP. Others may apply.	2024-05-31 22:17:37 -07:00
Harshavardhana	d1c58fc2eb	remove older deploymentID fix behavior to speed up startup (#19497 ) since mid 2018 we do not have any deployments without deployment-id, it is time to put this code to rest, this PR removes this old code as it's no longer valuable. on setups with 1000's of drives these are all quite expensive operations.	2024-04-15 01:25:46 -07:00
Harshavardhana	a207bd6790	turn-off Nlink readdir() optimization for NFS/CIFS (#19420 ) fixes #19418 fixes #19416	2024-04-05 08:17:08 -07:00
Anis Eleuch	95bf4a57b6	logging: Add subsystem to log API (#19002 ) Create new code paths for multiple subsystems in the code. This will make maintaining this easier later. Also introduce bugLogIf() for errors that should not happen in the first place.	2024-04-04 05:04:40 -07:00
Harshavardhana	35deb1a8e2	do not block on send channels under high load (#19090 ) all send channels must compete with `ctx` if not they will perpetually stay alive.	2024-02-20 15:00:35 -08:00
Harshavardhana	80ca120088	remove checkBucketExist check entirely to avoid fan-out calls (#18917 ) Each Put, List, Multipart operations heavily rely on making GetBucketInfo() call to verify if bucket exists or not on a regular basis. This has a large performance cost when there are tons of servers involved. We did optimize this part by vectorizing the bucket calls, however it's not enough; beyond 100 nodes this becomes fairly visible in terms of performance.	2024-01-30 12:43:25 -08:00
Harshavardhana	1d3bd02089	avoid close 'nil' panics if any (#18890 ) brings a generic implementation that prints a stack trace for 'nil' channel closes(), if not safely closes it.	2024-01-28 10:04:17 -08:00
Harshavardhana	6347fb6636	add missing proper error return in WalkDir() (#18884 ) without this the caller might end up returning incorrect errors and not ignoring the drive properly.	2024-01-27 16:13:41 -08:00
Klaus Post	961b0b524e	Do not require restart when a disk is unreachable during node boot (#18576 ) A disk that is not able to initialize when an instance is started will never have a handler registered, which means a user will need to restart the node after fixing the disk; This will also prevent showing the wrong 'upgrade is needed.' error message in that case. When the disk is still failing, print an error every 30 minutes; Disk reconnection will be retried every 30 seconds. Co-authored-by: Anis Elleuch <anis@min.io>	2023-12-01 12:01:14 -08:00
Klaus Post	51aa59a737	perf: websocket grid connectivity for all internode communication (#18461 )

This PR adds a WebSocket grid feature that allows servers to communicate via a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`, which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no server/client relation since both sides can initiate and handle requests. Which server initiates the request is decided deterministically on the server names.

Requests are made through a mux client and server, which handles message passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try to reconnect.

Registered handlers can operate directly on byte slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address as well as all remote addresses. This will manage all connections.

To get a connection to any remote, ask the manager to provide it given the remote address using.

```
func (m Manager) Connection(host string) Connection
```

All serverside handlers must also be registered on the manager. This will make sure that all incoming requests are served. The number of in-flight requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued to the connection will be sent to the remote.

* `func (c Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)` performs a single request and returns the result. Any deadline provided on the request is forwarded to the server, and canceling the context will make the function return at once.
* `func (c Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st Stream, err error)` will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses must be read by the caller.
// If the call is canceled through the context,
// The appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses must be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel must be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```

There are generic versions of the server/client handlers that allow the use of type safe implementations for data types that support msgpack marshal/unmarshal.	2023-11-20 17:09:35 -08:00
Maxim Tkachenko	ec30bb89a4	simplify channel send() in WalkDir() (#18186 )	2023-10-09 17:27:55 -07:00
Harshavardhana	b1c1f02132	use buffers for pathJoin, to re-use buffers. (#17960 )

```
benchmark                       old ns/op     new ns/op     delta
BenchmarkPathJoin/PathJoin-8    79.6          55.3          -30.53%

benchmark                       old allocs    new allocs    delta
BenchmarkPathJoin/PathJoin-8    2             1             -50.00%

benchmark                       old bytes     new bytes     delta
BenchmarkPathJoin/PathJoin-8    48            24            -50.00%
```	2023-08-31 17:58:48 -07:00
Harshavardhana	7cafdc0512	fix: skip access checks further for known buckets (#17934 )	2023-08-28 15:16:41 -07:00
Harshavardhana	0153f96a20	add deadlines for readMetadata() in listing (#17776 ) Bonus: also skip spending time looking for xl.json - Listing() - Delete()	2023-08-01 21:52:31 -07:00
Harshavardhana	14e1ace552	remove serializing WalkDir() across all buckets/prefixes on SSDs (#17707 ) slower drives get knocked off because they are too slow via active monitoring, we do not need to block calls arbitrarily. Serializing adds latencies for already slow calls, remove it for SSDs/NVMEs Also, add a selection with context when writing to `out <-` channel, to avoid any potential blocks.	2023-07-24 09:30:19 -07:00
Kaan Kabalak	21fbe88e1f	Print certain log messages once per error (#17484 )	2023-06-24 20:29:13 -07:00
Klaus Post	839b9c9271	Reduce allocations in Walkdir (#17036 )	2023-04-15 10:25:25 -07:00
Harshavardhana	e0f4dd6027	remove unnecessary logs from WalkDir(), PutObject() (#16818 )	2023-03-15 11:52:23 -07:00
Harshavardhana	37134e42d4	ignore io.EOF, io.ErrUnexpectedEOF on xl.meta reads in WalkDir() (#16625 )	2023-02-15 07:12:48 -08:00
Klaus Post	f713436dd0	Fix truncated list response on deleted replicated objects (#16504 )	2023-01-30 09:13:53 -08:00
Harshavardhana	23b329b9df	remove gateway completely (#15929 )	2022-10-24 17:44:15 -07:00
Klaus Post	eee1ce305c	When listing, do not count delete markers (#15689 ) When limiting listing do not count delete, since they may be discarded. Extend limit, since we may be discarding the forward-to marker. Fix directories always being sent to resolve, since they didn't return as match.	2022-09-14 12:11:27 -07:00
Klaus Post	ff9a74b91f	Add fast max-keys=1 support for Listing (#15670 ) Add a listing option to stop when the limit is reached. This can be used by stateless listings for fast results.	2022-09-09 08:13:06 -07:00
Klaus Post	037fe4afdc	Add listing block reuse (#15579 ) When streaming results, pool metadata slices when sent.	2022-08-24 09:11:16 -07:00
Anis Elleuch	9201870f6c	Remove unnecessary code in WalkDir() (#15168 ) Recalculating forward is useless. It is never used and it will be computed again when calling scanDir() again.	2022-06-27 10:26:56 -07:00
Harshavardhana	5d23be6242	fix: ignore printing io.EOF during WalkDir() on concurrently modified objects (#15100 )	2022-06-17 08:23:47 -07:00
Klaus Post	b890bbfa63	Add local disk health checks (#14447 )

The main goal of this PR is to solve the situation where disks stop responding to operations. This generally causes an FD build-up and eventually will crash the server.

This adds detection of hung disks, where calls on disk get stuck.

We add functionality to `xlStorageDiskIDCheck` where it keeps track of the number of concurrent requests on a given disk.

A total number of 100 operations are allowed. If this limit is reached we will block (but not reject) new requests, but we will monitor the state of the disk.

If no requests have been completed or updated within a 15-second window, we mark the disk as offline. Requests that are blocked will be unblocked and return an error as "faulty disk".

New requests will be rejected until the disk is marked OK again.

Once a disk has been marked faulty, a check will run every 5 seconds that will attempt to write and read back a file. As long as this fails the disk will remain faulty.

To prevent lots of long-running requests to mark the disk faulty we implement a callback feature that allows updating the status as parts of these operations are running.

We add a reader and writer wrapper that will update the status of each successful read/write operation. This should allow fine enough granularity that a slow, but still operational disk will not reach 15 seconds where 50 operations have not progressed.

Note that errors themselves are not enough to mark a disk faulty. A nil (or io.EOF) error will mark a disk as "good".

* Make concurrent disk setting configurable via `_MINIO_DISK_MAX_CONCURRENT`.
* de-couple IsOnline() from disk health tracker

  The purpose of IsOnline() is to ensure that we reconnect the drive only when the "drive" was

  - disconnected from network we need to validate if the drive is "correct" and is the same drive which belongs to this server.
  - drive was replaced we have to format it - we support hot swapping of the drives.

  IsOnline() is not meant for taking the drive offline when it is hung, it is not useful we can let the drive be online instead "return" errors for relevant calls.
* return errFaultyDisk for DiskInfo() call

Co-authored-by: Harshavardhana <harsha@minio.io>

Possible future Improvements:

* Unify the REST server and local xlStorageDiskIDCheck. This would also improve stats significantly.
* Allow reads/writes to be aborted by the context.
* Add usage stats, concurrent count, blocked operations, etc.	2022-03-09 11:38:54 -08:00
Anis Elleuch	84c690cb07	storage: Use request.Form and avoid mux matching (#13858 ) request.Form uses less memory allocation and avoids gorilla mux matching with weird characters in parameters such as '\n' - Remove Queries() to avoid matching - Ensure r.ParseForm is called to populate fields - Add a unit test for object names with '\n'	2021-12-09 08:38:46 -08:00
Klaus Post	c897b6a82d	fix: missing entries on first list resume (#13627 ) On first list resume or when specifying a custom markers entries could be missed in rare cases. Do conservative truncation of entries when forwarding. Replaces #13619	2021-11-10 10:41:21 -08:00
Klaus Post	4f3317effe	Close stream on panic (#13605 ) Always close streamHTTPResponse on panic on main thread to avoid write/flush after response handler has returned.	2021-11-08 08:41:27 -08:00
Harshavardhana	5ed781a330	check for context canceled after competing for locks (#13239 ) once we have competed for locks, verify if the context is still valid - this is to ensure that we do not start readdir() or read() calls on the drives on canceled connections.	2021-09-17 14:11:01 -07:00
Harshavardhana	66fcd02aa2	de-couple walkMu and walkReadMu for some granularity (#13231 ) This commit brings two locks instead of single lock for WalkDir() calls on top of `c25816eabc`. The main reason is to avoid contention between readMetadata() and ListDir() calls, ListDir() can take time on prefixes that are huge for readdir() but this shouldn't end up blocking all readMetadata() operations, this allows for more room for I/O while not overly penalizing all listing operations.	2021-09-17 12:14:12 -07:00
Harshavardhana	0f01e7ef0f	fix: check for xl.meta as directory fallback (#13023 )

Objects uploaded in this format for example

```
mc cp /etc/hosts alias/bucket/foo/bar/xl.meta
mc ls -r alias/bucket/foo/bar
```

Won't list the object, handle this scenario.	2021-08-21 00:12:29 -07:00
Klaus Post	c25816eabc	xl walk: Limit walk concurrent IO (#12885 ) We are observing heavy system loads, potentially locking the system up for periods when concurrent listing operations are performed. We place a per-disk lock on walk IO operations. This will minimize the impact of concurrent listing operations on the entire system and de-prioritize them compared to other operations. Single list operations should remain largely unaffected.	2021-08-18 18:10:36 -07:00
Harshavardhana	ee028a4693	listObjects optimized to handle max-keys=1 when prefix is object (#13000 ) Some applications albeit poorly written rather than using headObject rely on listObjects to check for existence of object, this unusual request always has prefix=(to actual object) and max-keys=1 handle this situation specially such that we can avoid readdir() on the top level parent to avoid sorting and skipping, ensuring that such type of listObjects() always behaves similar to a headObject() call.	2021-08-18 18:05:05 -07:00
Harshavardhana	9c65168312	fix: all levels deep flat key match (#12996 ) this addresses a regression from #12984 which only addresses flat key from single level deep at bucket level. added extra tests as well to cover all these scenarios.	2021-08-18 07:40:53 -07:00
Harshavardhana	654a6e9871	always set the filter to skip navigating baseDir (#12984 ) baseDir is empty if the top level prefix does not end with `/` this causes large recursive listings without any filtering, to fix this filtering make sure to set the filter prefix appropriately. also do not navigate folders at top level that do not match the filter prefix, entries don't need to match prefix since they are never prefixed with the prefix anyways.	2021-08-17 07:43:24 -07:00
Klaus Post	89febdb3d6	Reuse small buffers (#12948 ) When reading metadata allow reuse of buffers in certain cases. Take the low-hanging fruit. Reduce GC overhead when listing.	2021-08-12 14:27:22 -07:00
Harshavardhana	a2cd3c9a1d	use ParseForm() to allow query param lookups once (#12900 )

```
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkURLQueryForm
BenchmarkURLQueryForm-4    247099363    4.809 ns/op      0 B/op    0 allocs/op
BenchmarkURLQuery
BenchmarkURLQuery-4          2517624    462.1 ns/op    432 B/op    4 allocs/op
PASS
ok    github.com/minio/minio/cmd    3.848s
```	2021-08-07 22:43:01 -07:00
Harshavardhana	e124d88788	optimize listing operation concurrency (#12728 ) - remove use of getOnlineDisks() instead rely on fallbackDisks() when disk return errors like diskNotFound, unformattedDisk use other fallback disks to list from, instead of paying the price for checking getOnlineDisks() - optimize getDiskID() further to avoid large write locks when looking formatLastCheck time window This new change allows for a more relaxed fallback for listing allowing for more tolerance and also eventually gain more consistency in results even if using '3' disks by default.	2021-07-24 22:03:38 -07:00
Klaus Post	05aebc52c2	feat: Implement listing version 3.0 (#12605 ) Co-authored-by: Harshavardhana <harsha@minio.io>	2021-07-05 15:34:41 -07:00
Anis Elleuch	f30c996d48	trace: Add bucket/prefix to WalkDir() tracing (#12510 ) Bonus, replace os.* API with os-instrumented.go	2021-06-15 14:34:26 -07:00
Harshavardhana	1f262daf6f	rename all remaining packages to internal/ (#12418 ) This is to ensure that there are no projects that try to import `minio/minio/pkg` into their own repo. Any such common packages should go to `https://github.com/minio/pkg`	2021-06-01 14:59:40 -07:00
Harshavardhana	0287711dc9	fix: implement readMetadata common function for re-use (#12353 ) Previous PR #12351 added functions to read from the reader stream to reduce memory usage, use the same technique in few other places where we are not interested in reading the data part.	2021-05-21 11:41:25 -07:00
Klaus Post	9d1b6fb37d	Add XL reader without data (#12351 ) Add XL metadata reader that reads metadata only on larger files. Use for scanning and listing for now.	2021-05-21 09:10:54 -07:00
Harshavardhana	d501c5e38b	add missing responseBody drain (#12147 ) Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-26 08:59:54 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Klaus Post	2623338dc5	Inline small file data in xl.meta file (#11758 )	2021-03-29 17:00:55 -07:00
Klaus Post	9efcb9e15c	Fix listPathRaw/WalkDir cancelation (#11905 ) In #11888 we observe a lot of running WalkDir calls. There doesn't appear to be any listeners for these calls, so they should be aborted. Ensure that WalkDir aborts when upstream cancels the request. Fixes #11888	2021-03-26 11:18:30 -07:00
Anis Elleuch	0eb146e1b2	add additional metrics per disk API latency, API call counts (#11250 ) ``` mc admin info --json ``` provides these details for now; we shall eventually expose this at the Prometheus level. Co-authored-by: Harshavardhana <harsha@minio.io>	2021-03-16 20:06:57 -07:00


64 Commits