minio

mirror of https://github.com/minio/minio.git synced 2024-12-27 15:45:55 -05:00

Author	SHA1	Message	Date
Harshavardhana	fde3299bf3	re-use optimized readdir for isDirEmpty() (#10829 ) reduces effective memory usage by an order of magnitude, also increases performance for small objects	2020-11-04 13:05:21 -08:00
Harshavardhana	1a1f00fa15	fix: use internode data for DisksInfo, VolsInfo in message pack (#10821 ) Similar to #10775 for fewer memory allocations, since we use getOnlineDisks() extensively for listing we should optimize it further. Additionally, remove all unused walkers from the storage layer	2020-11-04 10:10:54 -08:00
Klaus Post	37749f4623	Optimize FileInfo(Version) transfer (#10775 ) File Info decoding, in particular, is showing up as a major allocator and time consumer for internode data transfers Switch to message pack for cross-server transfers: ``` MSGP: Size: 945 bytes BenchmarkEncodeFileInfoMsgp-32 1558444 866 ns/op 1.16 MB/s 0 B/op 0 allocs/op BenchmarkDecodeFileInfoMsgp-32 479968 2487 ns/op 0.40 MB/s 848 B/op 18 allocs/op GOB: Size: 1409 bytes BenchmarkEncodeFileInfoGOB-32 333339 3237 ns/op 0.31 MB/s 576 B/op 19 allocs/op BenchmarkDecodeFileInfoGOB-32 20869 57837 ns/op 0.02 MB/s 16439 B/op 428 allocs/op ```	2020-11-02 17:07:52 -08:00
Klaus Post	86e0d272f3	Reduce WriteAll allocs (#10810 ) WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers used by `io.CopyBuffer` even if they are pooled. Since all writers appear to write byte buffers, just send those instead and write directly. The files are opened through the `os` package so they have no special properties anyway. This removes the alloc and copy for each operation. REST sends content length so a precise alloc can be made.	2020-11-02 16:14:31 -08:00
Krishna Srinivas	3a2f89b3c0	fix: add support for O_DIRECT reads for erasure backends (#10718 )	2020-10-30 11:04:29 -07:00
Klaus Post	a982baff27	ListObjects Metadata Caching (#10648 ) Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c Gist of improvements: * Cross-server caching and listing will use the same data across servers and requests. * Lists can be arbitrarily resumed at a constant speed. * Metadata for all files scanned is stored for streaming retrieval. * The existing bloom filters controlled by the crawler is used for validating caches. * Concurrent requests for the same data (or parts of it) will not spawn additional walkers. * Listing a subdirectory of an existing recursive cache will use the cache. * All listing operations are fully streamable so the number of objects in a bucket no longer dictates the amount of memory. * Listings can be handled by any server within the cluster. * Caches are cleaned up when out of date or superseded by a more recent one.	2020-10-28 09:18:35 -07:00
Anis Elleuch	eb95353cb1	fix: Get/HeadObject return 404 on non quorum objects (#10753 )	2020-10-26 10:30:46 -07:00
Anis Elleuch	00124c56d9	erasure: Commit data before xl.meta in RenameData() (#10734 ) This will reduce the chance to have updated xl.meta without data.	2020-10-23 21:54:58 -07:00
Harshavardhana	2042d4873c	rename crawler config option to heal (#10678 )	2020-10-14 13:51:51 -07:00
Klaus Post	03991c5d41	crawler: Remove waitForLowActiveIO (#10667 ) Only use dynamic delays for the crawler. Even though the max wait was 1 second the number of waits could severely impact crawler speed. Instead of relying on a global metric, we use the stateless local delays to keep the crawler running at a speed more adjusted to current conditions. The only case we keep it is before bitrot checks when enabled.	2020-10-13 13:45:08 -07:00
Harshavardhana	a0d0645128	remove safeMode behavior in startup (#10645 ) In almost all scenarios MinIO now is mostly ready for all sub-systems independently, safe-mode is not useful anymore and do not serve its original intended purpose. allow server to be fully functional even with config partially configured, this is to cater for availability of actual I/O v/s manually fixing the server. In k8s like environments it will never make sense to take pod into safe-mode state, because there is no real access to perform any remote operation on them.	2020-10-09 09:59:52 -07:00
Harshavardhana	736e58dd68	fix: handle concurrent lockers with multiple optimizations (#10640 ) - select lockers which are non-local and online to have affinity towards remote servers for lock contention - optimize lock retry interval to avoid sending too many messages during lock contention, reduces average CPU usage as well - if bucket is not set, when deleteObject fails make sure setPutObjHeaders() honors lifecycle only if bucket name is set. - fix top locks to list out always the oldest lockers always, avoid getting bogged down into map's unordered nature.	2020-10-08 12:32:32 -07:00
Harshavardhana	2b4eb87d77	pick disks which are common maximally used (#10600 ) further optimization to ensure that good disks are always used for listing, other than healing we only use disks that are maximally used.	2020-09-29 22:54:02 -07:00
Harshavardhana	00eb6f6bc9	cache DiskInfo at storage layer for performance (#10586 ) `mc admin info` on busy setups will not move HDD heads unnecessarily for repeated calls, provides a better responsiveness for the call overall. Bonus change allow listTolerancePerSet be N-1 for good entries, to avoid skipping entries for some reason one of the disk went offline.	2020-09-29 09:54:41 -07:00
Harshavardhana	66174692a2	add '.healing.bin' for tracking currently healing disk (#10573 ) add a hint on the disk to allow for tracking fresh disk being healed, to allow for restartable heals, and also use this as a way to track and remove disks. There are more pending changes where we should move all the disk formatting logic to backend drives, this PR doesn't deal with this refactor instead makes it easier to track healing in the future.	2020-09-28 19:39:32 -07:00
Harshavardhana	7f9498f43f	fix: ignore faulty drives and continue (#10511 ) drives might return different types of errors handle them individually, and for some errors just log an error and continue	2020-09-18 12:09:05 -07:00
Klaus Post	34859c6d4b	Preallocate (safe) slices when we know the size (#10459 )	2020-09-14 20:44:18 -07:00
Klaus Post	fa01e640f5	Continous healing: add optional bitrot check (#10417 )	2020-09-12 00:08:12 -07:00
Anis Elleuch	af88772a78	lifecycle: NoncurrentVersionExpiration considers noncurrent version age (#10444 ) From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions ``` When specifying the number of days in the NoncurrentVersionTransition and NoncurrentVersionExpiration actions in a Lifecycle configuration, note the following: It is the number of days from when the version of the object becomes noncurrent (that is, when the object is overwritten or deleted), that Amazon S3 will perform the action on the specified object or objects. Amazon S3 calculates the time by adding the number of days specified in the rule to the time when the new successor version of the object is created and rounding the resulting time to the next day midnight UTC. For example, in your bucket, suppose that you have a current version of an object that was created at 1/1/2014 10:30 AM UTC. If the new version of the object that replaces the current version is created at 1/15/2014 10:30 AM UTC, and you specify 3 days in a transition rule, the transition date of the object is calculated as 1/19/2014 00:00 UTC. ```	2020-09-09 18:11:24 -07:00
Klaus Post	2d58a8d861	Add storage layer contexts (#10321 ) Add context to all (non-trivial) calls to the storage layer. Contexts are propagated through the REST client. - `context.TODO()` is left in place for the places where it needs to be added to the caller. - `endWalkCh` could probably be removed from the walkers, but no changes so far. The "dangerous" part is that now a caller disconnecting will propagate down, so a "delete" operation will now be interrupted. In some cases we might want to disconnect this functionality so the operation completes if it has started, leaving the system in a cleaner state.	2020-09-04 09:45:06 -07:00
Harshavardhana	37da0c647e	fix: delete marker compatibility behavior for suspended bucket (#10395 ) - delete-marker should be created on a suspended bucket as `null` - delete-marker should delete any pre-existing `null` versioned object and create an entry `null`	2020-09-02 00:19:03 -07:00
Klaus Post	3e1fb17b70	heal: Check for truncated files (#10399 ) When checking parts we already do a stat for each part. Since we have the on disk size check if it is at least what we expect. When checking metadata check if metadata is 0 bytes.	2020-09-01 12:06:45 -07:00
Harshavardhana	a359e36e35	tolerate listing with only readQuorum disks (#10357 ) We can reduce this further in the future, but this is a good value to keep around. With the advent of continuous healing, we can be assured that namespace will eventually be consistent so we are okay to avoid the necessity to a list across all drives on all sets. Bonus Pop()'s in parallel seem to have the potential to wait too on large drive setups and cause more slowness instead of gaining any performance remove it for now. Also, implement load balanced reply for local disks, ensuring that local disks have an affinity for - cleanupStaleMultipartUploads()	2020-08-26 19:29:35 -07:00
Harshavardhana	d19b434ffc	fix: bring back delayed leaf detection in listing (#10346 )	2020-08-25 12:26:48 -07:00
Klaus Post	17a1eda702	Disregard healing disks in crawling (#10349 ) When crawling never use a disk we know is healing. Most of the change involves keeping track of the original endpoint on xlStorage and this also fixes DiskInfo.Endpoint never being populated. Heal master will print `data-crawl: Disk "http://localhost:9001/data/mindev/data2/xl1" is Healing, skipping` once on a cycle (no more often than every 5m).	2020-08-25 10:55:15 -07:00
Harshavardhana	74116204ce	handle fresh setup with mixed drives (#10273 ) fresh drive setups when one of the drive is a root drive, we should ignore such a root drive and not proceed to format. This PR handles this properly by marking the disks which are root disk and they are taken offline.	2020-08-18 14:37:26 -07:00
Harshavardhana	e4a44f6224	fix: commonPrefixes behavior in ListObjectVersions (#10286 ) ``` $ aws s3api --profile minio --endpoint-url http://localhost:9003 \ list-object-versions --bucket testbucket \ --delimiter / --prefix Veeam/Archive/ { "CommonPrefixes": [ { "Prefix": "Veeam/Archive/003/" } ] } ``` Also add coverage tests similar to ListObjects to catch errors in future, skip these tests in FS mode	2020-08-18 12:19:44 -07:00
Anis Elleuch	51ba1dac49	listing: Fix result when prefix is an object with a slash (#10267 ) In a non recursive mode, issuing a list request where prefix is an existing object with a slash and delimiter is a slash will return entries in the object directory (data dir IDs) ``` $ aws s3api --profile minioadmin --endpoint-url http://localhost:9000 \ list-objects-v2 --bucket testbucket --prefix code_of_conduct.md/ --delimiter '/' { "CommonPrefixes": [ { "Prefix": "code_of_conduct.md/ec750fe0-ea7e-4b87-bbec-1e32407e5e47/" } ] } ``` This commit adds a fast exit track in Walk() in this specific case.	2020-08-14 20:13:24 -07:00
Harshavardhana	6c6137b2e7	add cluster maintenance healthcheck drive heal affinity (#10218 )	2020-08-07 13:22:53 -07:00
Harshavardhana	a20d4568a2	fix: make sure to use uniform drive count calculation (#10208 ) It is possible in situations when server was deployed in asymmetric configuration in the past such as ``` minio server ~/fs{1...4}/disk{1...5} ``` Results in setDriveCount of 10 in older releases but with fairly recent releases we have moved to having server affinity which means that a set drive count ascertained from above config will be now '4' While the object layer make sure that we honor `format.json` the storageClass configuration however was by mistake was using the global value obtained by heuristics. Which leads to prematurely using lower parity without being requested by the an administrator. This PR fixes this behavior.	2020-08-05 13:31:12 -07:00
Harshavardhana	0b8255529a	fix: proxies set keep-alive timeouts to be system dependent (#10199 ) Split the DialContext's one for internode and another for all other external communications especially proxy forwarders, gateway transport etc.	2020-08-04 14:55:53 -07:00
Harshavardhana	019fe69a57	fix: reduce an extra system call for writes instead fail later (#10187 )	2020-08-04 12:09:41 -07:00
Harshavardhana	b16781846e	allow server to start even with corrupted/faulty disks (#10175 )	2020-08-03 18:17:48 -07:00
poornas	c43da3005a	Add support for server side bucket replication (#9882 )	2020-07-21 17:49:56 -07:00
Harshavardhana	a880283593	Send the lower level error directly from GetDiskID() (#10095 ) this is to detect situations of corruption disk format etc errors quickly and keep the disk online in such scenarios for requests to fail appropriately.	2020-07-21 13:54:06 -07:00
Harshavardhana	17747db93f	fix: support healing older content (#10076 ) This PR adds support for healing older content i.e from 2yrs, 1yr. Also handles other situations where our config was not encrypted yet. This PR also ensures that our Listing is consistent and quorum friendly, such that we don't list partial objects	2020-07-17 17:41:29 -07:00
Harshavardhana	e7d7d5232c	fix: admin info output and improve overall performance (#10015 ) - admin info node offline check is now quicker - admin info now doesn't duplicate the code across doing the same checks for disks - rely on StorageInfo to return appropriate errors instead of calling locally. - diskID checks now return proper errors when disk not found v/s format.json missing. - add more disk states for more clarity on the underlying disk errors.	2020-07-13 09:51:07 -07:00
Harshavardhana	1d65ef3201	fix: deletes on older format properly (#10029 ) while we handle all situations for writes and reads on older format, what we didn't cater for properly yet was delete where we only ended up deleting just `xl.meta` - instead we should allow all the deletes to go through for older format without versioning enabled buckets.	2020-07-13 09:01:17 -07:00
Harshavardhana	c0adb52213	sync to disk only upon successful legacy metadata rename (#10018 )	2020-07-11 09:37:34 -07:00
Anis Elleuch	d4af132fc4	lifecycle: Expiry should not delete versions (#9972 ) Currently, lifecycle expiry is deleting all object versions which is not correct, unless noncurrent versions field is specified. Also, only delete the delete marker if it is the only version of the given object.	2020-07-04 20:56:02 -07:00
Anis Elleuch	21a37e3393	fix: ListObjectVersions should return ordered Version & DeleteMarker (#9959 ) The S3 specification says that versions are ordered in the response of list object versions. mc snapshot needs this to know which version comes first especially when two versions have the same exact last-modified field.	2020-07-03 09:15:44 -07:00
Klaus Post	abd999f64a	fix: list object versions in distributed setup (#9958 ) Remove calls to `WalkVersions` was calling the wrong endpoint, so unless quorum could be reached with local disks no results would ever be returned.	2020-07-02 10:29:50 -07:00
Harshavardhana	174f428571	add additional fdatasync before close() on writes (#9947 )	2020-07-01 10:57:23 -07:00
Klaus Post	cae09d8b84	crawler: Wait max 1 second (#9894 ) Add 1-second timeout to crawler wait. This will make the crawler able to run, albeit very, very slowly on high load servers.	2020-06-22 11:57:22 -07:00
Klaus Post	972d876ca9	Do not select zones with <5% free after upload (#9877 ) Looking into full disk errors on zoned setup. We don't take the 5% space requirement into account when selecting a zone. The interesting part is that even considering this we don't know the size of the object the user wants to upload when they do multipart uploads. It seems quite defensive to always upload multiparts to the zone where there is the most space since all load will be directed to a part of the cluster. In these cases we make sure it can at least hold a 1GiB file and we disadvantage fuller zones more by subtracting the expected size before weighing.	2020-06-20 06:36:44 -07:00
Harshavardhana	9626a981bc	fix: Preserve old data appropriately (#9873 ) This PR fixes all the below scenarios and handles them correctly. - existing data/bucket is replaced with new content, no versioning enabled old structure vanishes. - existing data/bucket - enable versioning before uploading any data, once versioning enabled upload new content, old content is preserved. - suspend versioning on the bucket again, now upload content again the old content is purged since that is the default "null" version. Additionally sync data after xl.json -> xl.meta rename(), to avoid any surprises if there is a crash during this rename operation.	2020-06-19 10:58:17 -07:00
Harshavardhana	94424e14d7	fix: rename legacy xl.json to xl.meta properly in ListDir() (#9863 )	2020-06-17 13:58:38 -07:00
Harshavardhana	4915433bd2	Support bucket versioning (#9377 ) - Implement a new xl.json 2.0.0 format to support, this moves the entire marshaling logic to POSIX layer, top layer always consumes a common FileInfo construct which simplifies the metadata reads. - Implement list object versions - Migrate to siphash from crchash for new deployments for object placements. Fixes #2111	2020-06-12 20:04:01 -07:00

48 Commits