minio

Commit Graph

Author	SHA1	Message	Date
Harshavardhana	6f16d1cb2c	do not count context canceled as timeout errors (#18975 )	2024-02-05 18:16:13 -08:00
Harshavardhana	99fde2ba85	deprecate disk tokens, instead rely on deadlines and active monitoring (#18947 ) disk tokens usage is not necessary anymore with the implementation of deadlines for storage calls and active monitoring of the drive for I/O timeouts. Functionality kicking off a bad drive is still supported, it's just that we do not have to serialize I/O in the manner tokens would do.	2024-02-02 10:10:54 -08:00
Harshavardhana	f25cbdf43c	use all the available nr_requests for NVMe (#18920 )	2024-01-30 14:10:06 -08:00
Harshavardhana	80ca120088	remove checkBucketExist check entirely to avoid fan-out calls (#18917 ) Each Put, List, Multipart operations heavily rely on making GetBucketInfo() call to verify if bucket exists or not on a regular basis. This has a large performance cost when there are tons of servers involved. We did optimize this part by vectorizing the bucket calls, however its not enough, beyond 100 nodes and this becomes fairly visible in terms of performance.	2024-01-30 12:43:25 -08:00
Harshavardhana	486e2e48ea	enable xattr capture by default (#18911 ) - healing must not set the write xattr because that is the job of active healing to update. what we need to preserve is permanent deletes. - remove older env for drive monitoring and enable it accordingly, as a global value.	2024-01-29 23:03:58 -08:00
Harshavardhana	1d3bd02089	avoid close 'nil' panics if any (#18890 ) brings a generic implementation that prints a stack trace for 'nil' channel closes(), if not safely closes it.	2024-01-28 10:04:17 -08:00
Harshavardhana	74851834c0	further bootstrap/startup optimization for reading 'format.json' (#18868 ) - Move RenameFile to websockets - Move ReadAll that is primarily is used for reading 'format.json' to to websockets - Optimize DiskInfo calls, and provide a way to make a NoOp DiskInfo call.	2024-01-25 12:45:46 -08:00
Klaus Post	4a6c97463f	Fix all racy use of NewDeadlineWorker (#18861 ) AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion, which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream. Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value. Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT. Fixes races like: ``` WARNING: DATA RACE Read at 0x00c130b29d10 by goroutine 470237: github.com/minio/minio/cmd.(xlStorageDiskIDCheck).ReadVersion() github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611 github.com/minio/minio/cmd.readFileInfo() github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122 github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1() github.com/minio/minio/cmd/erasure-object.go:809 +0x27a github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2() github.com/minio/minio/cmd/erasure-object.go:828 +0x61 Previous write at 0x00c130b29d10 by goroutine 470298: github.com/minio/minio/cmd.(xlStorageDiskIDCheck).ReadVersion.func1() github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244 github.com/minio/minio/internal/ioutil.(DeadlineWorker).Run.func1() github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33 WARNING: DATA RACE Write at 0x00c0ba6e6c00 by goroutine 94507: github.com/minio/minio/cmd.(xlStorageDiskIDCheck).StatVol.func1() github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104 github.com/minio/minio/internal/ioutil.(DeadlineWorker).Run.func1() github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33 Previous read at 0x00c0ba6e6c00 by goroutine 94463: github.com/minio/minio/cmd.(xlStorageDiskIDCheck).StatVol() github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e github.com/minio/minio/cmd.getBucketInfoLocal.func1() github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122 github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1() ``` Probably back from #17701	2024-01-24 10:08:31 -08:00
Harshavardhana	52229a21cb	avoid reload of 'format.json' over the network under normal conditions (#18842 )	2024-01-23 14:11:46 -08:00
Harshavardhana	7c0673279b	capture I/O in waiting and total tokens in diskMetrics (#18819 ) This is needed for the subsequent changes in ServerUpdate(), ServerRestart() etc.	2024-01-18 11:17:43 -08:00
Harshavardhana	e5c8794b8b	avoid disk monitoring leaks under various conditions (#18777 ) - HealFormat() was leaking healthcheck goroutines for disks, we are only interested in enabling healthcheck for the newly formatted disk, not for existing disks. - When disk is a root-disk a random disk monitor was leaking while we ignored the drive. - When loading the disk for each erasure set, we were leaking goroutines for the prepare-storage.go disks which were replaced via the globalLocalDrives slice - avoid disk monitoring utilizing health tokens that would cause exhaustion in the tokens, prematurely which were meant for incoming I/O. This is ensured by avoiding writing O_DIRECT aligned buffer instead write 2048 worth of content only as O_DSYNC, which is sufficient.	2024-01-12 01:48:36 -08:00
Anis Eleuch	7705605b5a	scanner: Add a config to disable short sleep between objects scan (#18734 ) Add a hidden configuration under the scanner sub section to configure if the scanner should sleep between two objects scan. The configuration has only effect when there is no drive activity related to s3 requests or healing. By default, the code will keep the current behavior which is doing sleep between objects. To forcefully enable the full scan speed in idle mode, you can do this: `mc admin config set myminio scanner idle_speed=full`	2024-01-04 15:07:17 -08:00
Anis Eleuch	3f4488c589	scanner: Allow full throttle if there is no parallel disk ops (#18109 )	2024-01-02 13:51:24 -08:00
Harshavardhana	a50ea92c64	feat: introduce list_quorum="auto" to prefer quorum drives (#18084 ) NOTE: This feature is not retro-active; it will not cater to previous transactions on existing setups. To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment variable as part of systemd service or k8s configmap. Once this has been enabled, you need to also set `list_quorum`. ``` ~ mc admin config set alias/ api list_quorum=auto` ``` A new debugging tool is available to check for any missing counters.	2023-12-29 15:52:41 -08:00
Klaus Post	6c89a81af4	Fix CreateFile shared buffer corruption. (#18652 ) `(xlStorageDiskIDCheck).CreateFile` wraps the incoming reader in `xioutil.NewDeadlineReader`. The wrapped reader is handed to `(xlStorage).CreateFile`. This performs a Read call via `writeAllDirect`, which reads into an `ODirectPool` buffer. `(*DeadlineReader).Read` spawns an async read into the buffer. If a timeout is hit while reading, the read operation returns to `writeAllDirect`. The operation returns an error and the buffer is reused. However, if the async `Read` call unblocks, it will write to the now recycled buffer. Fix: Remove the `DeadlineReader` - it is inherently unsafe. Instead, rely on the network timeouts. This is not a disk timeout, anyway. Regression in https://github.com/minio/minio/pull/17745	2023-12-14 10:51:57 -08:00
Harshavardhana	05bb655efc	avoid caching metrics for timeout errors per drive (#18584 ) Bonus: combine the loop for drive/REST registration.	2023-12-04 11:54:13 -08:00
Anis Eleuch	b7d11141e1	rename Force to Immediate for clarity (#18540 )	2023-11-28 22:35:16 -08:00
jiuker	be02333529	feat: drive sub-sys to max timeout reload (#18501 )	2023-11-27 09:15:06 -08:00
Harshavardhana	a4cfb5e1ed	return errors if dataDir is missing during HeadObject() (#18477 ) Bonus: allow replication to attempt Deletes/Puts when the remote returns quorum errors of some kind, this is to ensure that MinIO can rewrite the namespace with the latest version that exists on the source.	2023-11-20 21:33:47 -08:00
Klaus Post	51aa59a737	perf: websocket grid connectivity for all internode communication (#18461 ) This PR adds a WebSocket grid feature that allows servers to communicate via a single two-way connection. There are two request types: * Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small roundtrips with small payloads. * Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`, which allows for different combinations of full two-way streams with an initial payload. Only a single stream is created between two machines - and there is, as such, no server/client relation since both sides can initiate and handle requests. Which server initiates the request is decided deterministically on the server names. Requests are made through a mux client and server, which handles message passing, congestion, cancelation, timeouts, etc. If a connection is lost, all requests are canceled, and the calling server will try to reconnect. Registered handlers can operate directly on byte slices or use a higher-level generics abstraction. There is no versioning of handlers/clients, and incompatible changes should be handled by adding new handlers. The request path can be changed to a new one for any protocol changes. First, all servers create a "Manager." The manager must know its address as well as all remote addresses. This will manage all connections. To get a connection to any remote, ask the manager to provide it given the remote address using. ``` func (m Manager) Connection(host string) Connection ``` All serverside handlers must also be registered on the manager. This will make sure that all incoming requests are served. The number of in-flight requests and responses must also be given for streaming requests. The "Connection" returned manages the mux-clients. Requests issued to the connection will be sent to the remote. * `func (c Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)` performs a single request and returns the result. Any deadline provided on the request is forwarded to the server, and canceling the context will make the function return at once. `func (c Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st Stream, err error)` will initiate a remote call and send the initial payload. ```Go // A Stream is a two-way stream. // All responses must be read by the caller. // If the call is canceled through the context, //The appropriate error will be returned. type Stream struct { // Responses from the remote server. // Channel will be closed after an error or when the remote closes. // All responses must be read by the caller until either an error is returned or the channel is closed. // Canceling the context will cause the context cancellation error to be returned. Responses <-chan Response // Requests sent to the server. // If the handler is defined with 0 incoming capacity this will be nil. // Channel must be closed to signal the end of the stream. // If the request context is canceled, the stream will no longer process requests. Requests chan<- []byte } type Response struct { Msg []byte Err error } ``` There are generic versions of the server/client handlers that allow the use of type safe implementations for data types that support msgpack marshal/unmarshal.	2023-11-20 17:09:35 -08:00
Harshavardhana	91d8bddbd1	use sendfile/splice implementation to perform DMA (#18411 ) sendfile implementation to perform DMA on all platforms Go stdlib already supports sendfile/splice implementations for - Linux - Windows - *BSD - Solaris Along with this change however O_DIRECT for reads() must be removed as well since we need to use sendfile() implementation The main reason to add O_DIRECT for reads was to reduce the chances of page-cache causing OOMs for MinIO, however it would seem that avoiding buffer copies from user-space to kernel space this issue is not a problem anymore. There is no Go based memory allocation required, and neither the page-cache is referenced back to MinIO. This page- cache reference is fully owned by kernel at this point, this essentially should solve the problem of page-cache build up. With this now we also support SG - when NIC supports Scatter/Gather https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)	2023-11-10 10:10:14 -08:00
Harshavardhana	483389f2e2	set diskMaxConcurrent to 32 if nrRequests is lower	2023-10-24 17:21:12 -07:00
Harshavardhana	2dc917e87f	maxConcurrent must be set only once per node (#18303 )	2023-10-23 21:42:36 -07:00
Harshavardhana	f91b257f50	choose different max_concurrent requests per drive based on HDD/NVMe (#18254 ) currently the default for all drives is 512, which is a lot for HDDs the recent testing has revealed moving this to 32 for HDDs seems like a fair value.	2023-10-16 17:18:13 -07:00
Harshavardhana	9878031cfd	fix: change DISK_ to DRIVE_ for some drive related envs (#18005 )	2023-09-11 12:19:22 -07:00
Aditya Manthramurthy	1c99fb106c	Update to minio/pkg/v2 (#17967 )	2023-09-04 12:57:37 -07:00
Harshavardhana	124e28578c	remove strict persistence requirements for List() .metacache objects (#17917 ) .metacache objects are transient in nature, and are better left to use page-cache effectively to avoid using more IOPs on the disks. this allows for incoming calls to be not taxed heavily due to multiple large batch listings.	2023-08-25 07:58:11 -07:00
Harshavardhana	c45bc32d98	skip disks under scanning when healing disks (#17822 ) Bonus: - avoid calling DiskInfo() calls when missing blocks instead heal the object using MRF operation. - change the max_sleep to 250ms beyond that we will not stop healing.	2023-08-09 12:51:47 -07:00
Harshavardhana	21cdd2bf5d	avoid overwriting metrics on success, save it in defer (#17780 )	2023-08-01 22:19:56 -07:00
Harshavardhana	a7a7533190	add new errors for Disks with timeouts (#17770 )	2023-08-01 12:47:50 -07:00
Harshavardhana	b0f0e53bba	fix: make sure to correctly initialize health checks (#17765 ) health checks were missing for drives replaced since - HealFormat() would replace the drives without a health check - disconnected drives when they reconnect via connectEndpoint() the loop also loses health checks for local disks and merges these into a single code. - other than this separate cleanUp, health check variables to avoid overloading them with similar requirements. - also ensure that we compete via context selector for disk monitoring such that the canceled disks don't linger around longer waiting for the ticker to trigger. - allow disabling active monitoring.	2023-08-01 10:54:26 -07:00
Harshavardhana	81be718674	fix: optimize DiskInfo() call avoid metrics when not needed (#17763 )	2023-07-31 15:20:48 -07:00
Harshavardhana	5e5bdf5432	capture total errors data availability and any timeout errors (#17748 )	2023-07-29 23:26:26 -07:00
Harshavardhana	731e03fe5a	add ReadFileStream deadline for disk call (#17745 ) timeout the reader side if hung via disk max timeout	2023-07-28 15:37:53 -07:00
Harshavardhana	47dcfcbdd4	introduce deadlines on READ operations (#17724 )	2023-07-27 07:33:05 -07:00
Harshavardhana	a7c71e4c6b	protect disk monitoring to avoid busy loop configuration (#17723 )	2023-07-25 20:02:22 -07:00
Harshavardhana	e7b60c4d65	Add slow drive timeouts to match with active disk monitoring (#17701 ) allow active disk-monitoring to be configurable, and use these add deadlines in various call layers for various syscalls.	2023-07-25 16:58:31 -07:00
Klaus Post	4f89e5bba9	Add active disk health checks (#17539 ) Add check every 2 minutes to see if a write+read operation can complete. If disk is unresponsive for 2 minutes or returns errFaultyDisk, take it offline.	2023-07-13 11:41:55 -07:00
Klaus Post	e20aab25ec	Check for progress before we reach the limit (#17552 )	2023-07-07 00:13:57 -07:00
Klaus Post	6efcf9c982	Do lockless last minute latency metrics (#17576 ) Collect metrics in one second and accumulate lockless before sending upstream.	2023-07-05 10:40:45 -07:00
Aditya Manthramurthy	5a1612fe32	Bump up madmin-go and pkg deps (#17469 )	2023-06-19 17:53:08 -07:00
Harshavardhana	e372e4e592	add nodeName to the log while taking drive offline (#17124 )	2023-05-03 15:05:45 -07:00
jiuker	6e359c586e	fix: close chan before return in scanner usage updates (#16960 )	2023-04-04 10:51:05 -07:00
Klaus Post	fd6622458b	Add detailed scanner trace output and notifications (#16668 )	2023-02-21 09:33:33 -08:00
Aditya Manthramurthy	a30cfdd88f	Bump up madmin-go to v2 (#16162 )	2022-12-06 13:46:50 -08:00
Anis Elleuch	c84e2939e4	trace: Publish storage layer errors (#16153 )	2022-12-01 12:10:54 -08:00
Klaus Post	cc1d8f0057	Check for abandoned data when healing (#16122 )	2022-11-28 10:20:55 -08:00
Harshavardhana	0c81f1bdb3	indicate how long it took to bring the drive online (#15835 )	2022-10-11 11:33:56 -07:00
Harshavardhana	2d9b5a65f1	verify RenameData() versions to be consistent (#15649 ) xl.meta gets written and never rolled back, however we definitely need to validate the state that is persisted on the disk, if there are inconsistencies - more than write quorum we should return an error to the client - if write quorum was achieved however there are inconsistent xl.meta's we should simply trigger an MRF on them	2022-09-05 16:51:37 -07:00
Harshavardhana	8eec49304d	use logger.Info instead of logger.LogIf	2022-08-08 16:13:58 -07:00

1 2

98 Commits