minio

Commit Graph

Author	SHA1	Message	Date
Klaus Post	51aa59a737	perf: websocket grid connectivity for all internode communication (#18461 )

This PR adds a WebSocket grid feature that allows servers to communicate via a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small roundtrips with small payloads.
* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`, which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no server/client relation since both sides can initiate and handle requests. Which server initiates the request is decided deterministically on the server names.

Requests are made through a mux client and server, which handles message passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try to reconnect. Registered handlers can operate directly on byte slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address as well as all remote addresses. This will manage all connections. To get a connection to any remote, ask the manager to provide it given the remote address using:

```
func (m Manager) Connection(host string) Connection
```

All serverside handlers must also be registered on the manager. This will make sure that all incoming requests are served. The number of in-flight requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued to the connection will be sent to the remote.

* `func (c Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)` performs a single request and returns the result. Any deadline provided on the request is forwarded to the server, and canceling the context will make the function return at once.
* `func (c Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st Stream, err error)` will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses must be read by the caller.
// If the call is canceled through the context,
// the appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses must be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel must be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```

There are generic versions of the server/client handlers that allow the use of type safe implementations for data types that support msgpack marshal/unmarshal.	2023-11-20 17:09:35 -08:00
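The stream contract described in this commit message (close `Requests` to end the stream, read `Responses` until an error arrives or the channel closes) can be sketched as follows. This is only an illustration under stated assumptions: `Stream` and `Response` are local copies of the shapes quoted in the message, while `newEchoStream` and `collect` are hypothetical helpers that stand in for a remote handler and a caller. None of this is the actual grid package.

```go
package main

import (
	"errors"
	"fmt"
)

// Response and Stream are local copies of the shapes quoted in the commit message.
type Response struct {
	Msg []byte
	Err error
}

type Stream struct {
	Responses <-chan Response
	Requests  chan<- []byte
}

// newEchoStream is a hypothetical in-process stand-in for a remote handler.
// It echoes each request and reports an error for an empty one.
func newEchoStream() Stream {
	reqs := make(chan []byte)
	resps := make(chan Response)
	go func() {
		defer close(resps)
		for r := range reqs {
			if len(r) == 0 {
				resps <- Response{Err: errors.New("empty request")}
				for range reqs {
					// Discard the rest so the sender never blocks.
				}
				return
			}
			resps <- Response{Msg: r}
		}
	}()
	return Stream{Responses: resps, Requests: reqs}
}

// collect sends msgs, closes Requests to signal the end of the stream,
// and reads Responses until an error is seen or the channel is closed.
func collect(st Stream, msgs ...[]byte) ([][]byte, error) {
	go func() {
		defer close(st.Requests)
		for _, m := range msgs {
			st.Requests <- m
		}
	}()
	var out [][]byte
	for resp := range st.Responses {
		if resp.Err != nil {
			return out, resp.Err
		}
		out = append(out, resp.Msg)
	}
	return out, nil
}

func main() {
	got, err := collect(newEchoStream(), []byte("a"), []byte("b"))
	fmt.Println(len(got), err)
}
```

Sending from a separate goroutine keeps the caller free to drain responses, which matters because both channels in this sketch are unbuffered.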
Harshavardhana	6829ae5b13	completely remove drive caching layer from gateway days (#18217 ) This has already been deprecated for close to a year now.	2023-10-11 21:18:17 -07:00
Harshavardhana	b0f0e53bba	fix: make sure to correctly initialize health checks (#17765 ) health checks were missing for drives replaced since - HealFormat() would replace the drives without a health check - disconnected drives when they reconnect via connectEndpoint() the loop also loses health checks for local disks and merges these into a single code. - other than this separate cleanUp, health check variables to avoid overloading them with similar requirements. - also ensure that we compete via context selector for disk monitoring such that the canceled disks don't linger around longer waiting for the ticker to trigger. - allow disabling active monitoring.	2023-08-01 10:54:26 -07:00
Klaus Post	4f89e5bba9	Add active disk health checks (#17539 ) Add check every 2 minutes to see if a write+read operation can complete. If disk is unresponsive for 2 minutes or returns errFaultyDisk, take it offline.	2023-07-13 11:41:55 -07:00
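The idea in this commit message, a periodic write+read probe that marks a disk as failed when it errors or does not finish in time, might look roughly like the sketch below. All names here (`probe`, `withTimeout`, `errProbeTimeout`, the `.health-probe` file name) are assumptions made for illustration, and the local `errFaultyDisk` only borrows the name mentioned in the message. It is not MinIO's implementation.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

var (
	errFaultyDisk   = errors.New("faulty disk")       // local stand-in for the error named in the message
	errProbeTimeout = errors.New("disk probe timeout") // hypothetical
)

// probe writes a small file under dir, reads it back, and removes it.
func probe(dir string) error {
	p := filepath.Join(dir, ".health-probe")
	want := []byte("ok")
	if err := os.WriteFile(p, want, 0o600); err != nil {
		return err
	}
	defer os.Remove(p)
	got, err := os.ReadFile(p)
	if err != nil {
		return err
	}
	if !bytes.Equal(got, want) {
		return errFaultyDisk
	}
	return nil
}

// withTimeout runs fn and returns errProbeTimeout if it does not finish in time.
// The buffered channel lets a late fn complete without blocking forever.
func withTimeout(timeout time.Duration, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case err := <-done:
		return err
	case <-t.C:
		return errProbeTimeout
	}
}

func main() {
	dir, err := os.MkdirTemp("", "probe")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	// A real monitor would repeat this on a ticker (every 2 minutes per the message).
	err = withTimeout(2*time.Minute, func() error { return probe(dir) })
	fmt.Println("probe result:", err)
}
```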
Klaus Post	ecc932d5dd	Clean entire tmp-old on restart (#15979 )	2022-10-31 07:27:50 -07:00
Harshavardhana	23b329b9df	remove gateway completely (#15929 )	2022-10-24 17:44:15 -07:00
Poorna	426c902b87	site replication: fix healing of bucket deletes. (#15377 ) This PR changes the handling of bucket deletes for site replicated setups to hold on to deleted bucket state until it syncs to all the clusters participating in site replication.	2022-07-25 17:51:32 -07:00
Klaus Post	e1a0a1e73c	fs: Return prefix as listing marker if no objects (#14143 ) Fixes #14132	2022-01-20 10:55:18 -08:00
Harshavardhana	f6190d6751	Add single drive support for directory prefixes in Listing (#13829 ) This fixes the compatibility issue with Hadoop 3.3.1 fixes #13710	2021-12-03 18:08:40 -08:00
Harshavardhana	661b263e77	add gocritic/ruleguard checks back again, cleanup code. (#13665 ) - remove some duplicated code - reported a bug, separately fixed in #13664 - using strings.ReplaceAll() when needed - using filepath.ToSlash() use when needed - remove all non-Go style comments from the codebase Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>	2021-11-16 09:28:29 -08:00
Harshavardhana	4545ecad58	ignore swapped drives instead of throwing errors (#13655 ) - add checks such that swapped disks are detected and ignored - never used for normal operations. - implement `unrecognizedDisk` to be ignored with all operations returning `errDiskNotFound`. - also add checks such that we do not load unexpected disks while connecting automatically. - additionally humanize the values when printing the errors. Bonus: fixes handling of non-quorum situations in getLatestFileInfo(), that does not work when 2 drives are down, currently this function would return errors incorrectly.	2021-11-15 09:46:55 -08:00
Harshavardhana	951b1e6a7a	fix: Optimize listing calls for NFS mounts (#13159 ) --no-compat should allow for some optimized behavior for NFS mounts by removing Stat() operations.	2021-09-08 08:15:42 -07:00
Klaus Post	ef99438695	fs: Return faster on no ListObjects results (#12525 ) When no results are sent `result.end` is never sent, so the list becomes hot until the list is full. Break immediately when channel is closed. Fixes #12518	2021-06-17 08:16:31 -07:00
Harshavardhana	1f262daf6f	rename all remaining packages to internal/ (#12418 ) This is to ensure that there are no projects that try to import `minio/minio/pkg` into their own repo. Any such common packages should go to `https://github.com/minio/pkg`	2021-06-01 14:59:40 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	75741dbf4a	xl: remove cleanupDir instead use Delete() (#11880 ) use a single call to remove directly at disk instead of doing recursively at network layer.	2021-03-24 09:08:05 -07:00
Anis Elleuch	0eb146e1b2	add additional metrics per disk API latency, API call counts (#11250 ) ``` mc admin info --json ``` provides these details for now; we shall eventually expose this at Prometheus level. Co-authored-by: Harshavardhana <harsha@minio.io>	2021-03-16 20:06:57 -07:00
Klaus Post	3ff5f55dcb	Fetch fileinfo concurrently (#11700 ) For non-erasure setups fetch up to 10 fileinfos concurrently. Fixes #11625	2021-03-08 11:30:43 -08:00
Harshavardhana	9ccc483df6	[feat]: change erasure coding default block size from 10MiB to 1MiB (#11721 )

major performance improvements in range GETs to avoid large read amplification when ranges are tiny and random

```
-------------------
Operation: GET
Operations: 142014 -> 339421
Duration: 4m50s -> 4m56s
* Average: +139.41% (+1177.3 MiB/s) throughput, +139.11% (+658.4) obj/s
* Fastest: +125.24% (+1207.4 MiB/s) throughput, +132.32% (+612.9) obj/s
* 50% Median: +139.06% (+1175.7 MiB/s) throughput, +133.46% (+660.9) obj/s
* Slowest: +203.40% (+1267.9 MiB/s) throughput, +198.59% (+753.5) obj/s
```

TTFB from 10MiB BlockSize

```
* First Access TTFB: Avg: 81ms, Median: 61ms, Best: 20ms, Worst: 2.056s
```

TTFB from 1MiB BlockSize

```
* First Access TTFB: Avg: 22ms, Median: 21ms, Best: 8ms, Worst: 91ms
```

Full object reads however do see a slight change which won't be noticeable in real world, so not doing any comparisons

TTFB still had improvements with full object reads with 1MiB

```
* First Access TTFB: Avg: 68ms, Median: 35ms, Best: 11ms, Worst: 1.16s
```

v/s TTFB with 10MiB

```
* First Access TTFB: Avg: 388ms, Median: 98ms, Best: 20ms, Worst: 4.156s
```

This change should affect all new uploads, previous uploads should continue to work with business as usual. But dramatic improvements can be seen with these changes.	2021-03-06 14:09:34 -08:00
Harshavardhana	76e2713ffe	fix: use buffers only when necessary for io.Copy() (#11229 )

Use separate sync.Pool for writes/reads

Avoid passing buffers for io.CopyBuffer(): if the reader or writer implements io.WriterTo or io.ReaderFrom respectively, then it's useless for sync.Pool to allocate buffers on its own since that will be completely ignored by the io.CopyBuffer Go implementation.

Improve this wherever we see this to be optimal. This allows us to be more efficient on memory usage.

```
// copyBuffer is the actual implementation of Copy and CopyBuffer.
// if buf is nil, one is allocated.
func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
	// If the reader has a WriteTo method, use it to do the copy.
	// Avoids an allocation and a copy.
	if wt, ok := src.(WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Similarly, if the writer has a ReadFrom method, use it to do the copy.
	if rt, ok := dst.(ReaderFrom); ok {
		return rt.ReadFrom(src)
	}
```

From readahead package

```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
	if a.err != nil {
		return 0, a.err
	}
	n = 0
	for {
		err = a.fill()
		if err != nil {
			return n, err
		}
		n2, err := w.Write(a.cur.buffer())
		a.cur.inc(n2)
		n += int64(n2)
		if err != nil {
			return n, err
		}
```	2021-01-06 09:36:55 -08:00
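The point of this commit message can be shown with a small, self-contained sketch: fetch a copy buffer only when `io.CopyBuffer` would actually use it. The helpers `needsBuffer`, `copyMaybeBuffered`, `plainReader`, `plainWriter`, and `demo` are hypothetical names for illustration and are not taken from the MinIO code. The interface checks mirror the ones in the quoted `copyBuffer` excerpt.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// plainReader and plainWriter wrap a value so that only Read or Write is
// visible, hiding any WriteTo/ReadFrom fast path on the wrapped type.
type plainReader struct{ r io.Reader }

func (p plainReader) Read(b []byte) (int, error) { return p.r.Read(b) }

type plainWriter struct{ w io.Writer }

func (p plainWriter) Write(b []byte) (int, error) { return p.w.Write(b) }

// needsBuffer reports whether io.CopyBuffer would use a caller-supplied buffer.
func needsBuffer(dst io.Writer, src io.Reader) bool {
	if _, ok := src.(io.WriterTo); ok {
		return false
	}
	if _, ok := dst.(io.ReaderFrom); ok {
		return false
	}
	return true
}

// copyMaybeBuffered calls getBuf (for example a sync.Pool lookup) only
// when the buffer will not be ignored.
func copyMaybeBuffered(dst io.Writer, src io.Reader, getBuf func() []byte) (int64, error) {
	if !needsBuffer(dst, src) {
		return io.Copy(dst, src)
	}
	return io.CopyBuffer(dst, src, getBuf())
}

// demo runs one fast-path copy and one buffered copy and returns the
// combined output along with the number of buffers that were requested.
func demo() (string, int) {
	calls := 0
	getBuf := func() []byte { calls++; return make([]byte, 32*1024) }
	var dst bytes.Buffer
	// *strings.Reader implements io.WriterTo, so no buffer is requested.
	if _, err := copyMaybeBuffered(&dst, strings.NewReader("hello"), getBuf); err != nil {
		panic(err)
	}
	// With both sides wrapped, neither fast path applies and a buffer is used.
	src := plainReader{strings.NewReader(" world")}
	if _, err := copyMaybeBuffered(plainWriter{&dst}, src, getBuf); err != nil {
		panic(err)
	}
	return dst.String(), calls
}

func main() {
	fmt.Println(demo())
}
```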
Harshavardhana	17a5ff51ff	fix: move context timeout closer to network for Delete calls (#10897 ) allowing for disconnects to be limited to the drive themselves instead of disconnecting all drives.	2020-11-13 16:56:45 -08:00
Klaus Post	a982baff27	ListObjects Metadata Caching (#10648 ) Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c Gist of improvements: * Cross-server caching and listing will use the same data across servers and requests. * Lists can be arbitrarily resumed at a constant speed. * Metadata for all files scanned is stored for streaming retrieval. * The existing bloom filters controlled by the crawler is used for validating caches. * Concurrent requests for the same data (or parts of it) will not spawn additional walkers. * Listing a subdirectory of an existing recursive cache will use the cache. * All listing operations are fully streamable so the number of objects in a bucket no longer dictates the amount of memory. * Listings can be handled by any server within the cluster. * Caches are cleaned up when out of date or superseded by a more recent one.	2020-10-28 09:18:35 -07:00
Harshavardhana	029758cb20	fix: retain the previous UUID for newly replaced drives (#10759 ) only newly replaced drives get the new `format.json`, this avoids disks reloading their in-memory reference format, ensures that drives are online without reloading the in-memory reference format. keeping reference format intact means UUIDs never change once they are formatted.	2020-10-26 10:29:29 -07:00
Harshavardhana	6a8c62f9fd	make sure to preserve UUID from reference format (#10748 ) reference format should be source of truth for inconsistent drives which reconnect, add them back to their original position remove automatic fix for existing offline disk uuids	2020-10-24 13:23:08 -07:00
Harshavardhana	18063bf25c	fix: cleanup old directory handling code (#10633 ) we don't need them anymore, remove legacy code.	2020-10-06 12:03:57 -07:00
Klaus Post	2d58a8d861	Add storage layer contexts (#10321 ) Add context to all (non-trivial) calls to the storage layer. Contexts are propagated through the REST client. - `context.TODO()` is left in place for the places where it needs to be added to the caller. - `endWalkCh` could probably be removed from the walkers, but no changes so far. The "dangerous" part is that now a caller disconnecting will propagate down, so a "delete" operation will now be interrupted. In some cases we might want to disconnect this functionality so the operation completes if it has started, leaving the system in a cleaner state.	2020-09-04 09:45:06 -07:00
Harshavardhana	d19b434ffc	fix: bring back delayed leaf detection in listing (#10346 )	2020-08-25 12:26:48 -07:00
Klaus Post	17a1eda702	Disregard healing disks in crawling (#10349 ) When crawling never use a disk we know is healing. Most of the change involves keeping track of the original endpoint on xlStorage and this also fixes DiskInfo.Endpoint never being populated. Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is Healing, skipping` once on a cycle (no more often than every 5m).	2020-08-25 10:55:15 -07:00
Harshavardhana	35212b673e	add unformatted disk as part of the error list (#10128 ) these errors should be ignored for quorum error calculation to ensure that we don't prematurely return unformatted disk error as part of API calls	2020-07-24 13:16:11 -07:00
Harshavardhana	4915433bd2	Support bucket versioning (#9377 ) - Implement a new xl.json 2.0.0 format to support, this moves the entire marshaling logic to POSIX layer, top layer always consumes a common FileInfo construct which simplifies the metadata reads. - Implement list object versions - Migrate to siphash from crchash for new deployments for object placements. Fixes #2111	2020-06-12 20:04:01 -07:00
Anis Elleuch	9baeda781a	fix storage info output with unordered endpoints arguments (#9610 ) Shuffling arguments that we pass to MinIO server are supported. However, when that happens, Prometheus returns wrong information about disks usage and online/offline status. The commit fixes the issue by avoiding relying on xl.endpoints since it is not ordered.	2020-05-19 14:27:20 -07:00
Harshavardhana	bd032d13ff	migrate all bucket metadata into a single file (#9586 ) this is a major overhaul that migrates all bucket metadata related configs into a single object '.metadata.bin'. this allows for faster bootups across 1000's of buckets and also keeps the code simple enough for future work and additions. Additionally also fixes #9396, #9394	2020-05-19 13:53:54 -07:00
Harshavardhana	6ac48a65cb	fix: use unused cacheMetrics code in prometheus (#9588 ) remove all other unused/dead code	2020-05-13 08:15:26 -07:00
Harshavardhana	a1de9cec58	cleanup object-lock/bucket tagging for gateways (#9548 ) This PR is to ensure that we call the relevant object layer APIs for necessary S3 API level functionalities allowing gateway implementations to return proper errors as NotImplemented{} This allows for all our tests in mint to behave appropriately and can be handled appropriately as well.	2020-05-08 13:44:44 -07:00
Harshavardhana	27d716c663	simplify usage of mutexes and atomic constants (#9501 )	2020-05-03 22:35:40 -07:00
Anis Elleuch	0af62d35a0	xl: Implement posix.DeletePrefixes to enhance delete perf (#9100 ) Bulk delete API was using cleanupObjectsBulk() which calls posix listing and delete API to remove objects internal files in the backend (xl.json and parts) one by one. Add DeletePrefixes in the storage API to remove the content of a directory in a single call. Also use a remove goroutine for each disk to accelerate removal.	2020-03-11 08:56:36 -07:00
Harshavardhana	6f66f1a910	close channel upon error in Walk()'er (#9042 )	2020-02-25 19:58:58 -08:00
Harshavardhana	23a8411732	Add a generic Walk()'er to list a bucket, optionally prefix (#9026 ) This generic Walk() is used by the likes of Lifecycle, or KMS to rotate keys, or any other functionality which relies on this functionality.	2020-02-25 21:22:28 +05:30
Klaus Post	9990464cd5	Fix recursive deep scan of buckets (#8900 )	2020-01-30 17:20:07 +05:30
Harshavardhana	cc02bf0442	Remove old ListenBucketNotification API (#8645 )	2019-12-13 11:33:11 -08:00
Anis Elleuch	555969ee42	Add data usage collect with its new admin API (#8553 ) Admin data usage info API returns the following (Only FS & XL, for now) - Number of buckets - Number of objects - The total size of objects - Objects histogram - Bucket sizes	2019-12-12 06:02:37 -08:00
Nitish Tiwari	3df7285c3c	Add Support for Cache and S3 related metrics in Prometheus endpoint (#8591 ) This PR adds support for the metrics below - Cache Hit Count - Cache Miss Count - Data served from Cache (in Bytes) - Bytes received from AWS S3 - Bytes sent to AWS S3 - Number of requests sent to AWS S3 Fixes #8549	2019-12-05 23:16:06 -08:00
Harshavardhana	e9b2bf00ad	Support MinIO to be deployed on more than 32 nodes (#8492 ) This PR implements locking from a global entity into a more localized set level entity, allowing for locks to be held only on the resources which are writing to a collection of disks rather than a global level. In this process this PR also removes the top-level limit of 32 nodes to an unlimited number of nodes. This is a precursor change before bringing in bucket expansion.	2019-11-13 12:17:45 -08:00
Harshavardhana	9e7a3e6adc	Extend further validation of config values (#8469 ) - This PR allows config KVS to be validated properly without being affected by ENV overrides, rejects invalid values during set operation - Expands unit tests and refactors the error handling for notification targets, returns error instead of ignoring targets for invalid KVS - Does all the prep-work for implementing safe-mode style operation for MinIO server, introduces a new global variable to toggle safe mode based operations NOTE: this PR itself doesn't provide safe mode operations	2019-10-30 23:39:09 -07:00
Krishna Srinivas	980bf78b4d	Detect underlying disk mount/unmount (#8408 )	2019-10-25 10:37:53 -07:00
Harshavardhana	d48fd6fde9	Remove unused params and functions (#8399 )	2019-10-15 18:35:41 -07:00
Nitish Tiwari	1cd801b2e9	Fix DeleteObjects() to remove renamed objects inside (#8072 )	2019-08-14 11:15:25 -07:00
Harshavardhana	e6d8e272ce	Use const slashSeparator instead of "/" everywhere (#8028 )	2019-08-06 12:08:58 -07:00
Harshavardhana	54eded2e6f	Do not assume all HTTP errors as Network errors (#7983 ) In situations such as a client prematurely disconnecting from the server while uploading data, for example by pressing ctrl-c before all the data is uploaded, a distributed setup prematurely disconnects disks, causing a reconnect loop. This has an adverse effect: we end up leaving a lot of files in the temporary location which ideally should have been cleaned up when Put() prematurely fails. This is also a regression which got introduced in #7610	2019-07-29 14:48:18 -07:00
Krishna Srinivas	a2e904b966	Support any string as delimiter for listing (#7882 )	2019-07-05 14:06:12 -07:00


82 Commits