minio

Commit Graph

Author	SHA1	Message	Date
Harshavardhana	0287711dc9	fix: implement readMetadata common function for re-use (#12353 ) Previous PR #12351 added functions to read from the reader stream to reduce memory usage, use the same technique in few other places where we are not interested in reading the data part.	2021-05-21 11:41:25 -07:00
Klaus Post	9d1b6fb37d	Add XL reader without data (#12351 ) Add XL metadata reader that reads metadata only on larger files. Use for scanning and listing for now.	2021-05-21 09:10:54 -07:00
Klaus Post	2ca9c533ef	feat: implement in-progress partial bucket updates (#12279 )	2021-05-19 14:38:30 -07:00
Harshavardhana	2daba018d6	reduce allocations on multi-disk clusters (#12311 ) multi-disk clusters initialize buffer pools per disk, this is perhaps expensive and perhaps not useful, for a running server instance. As this may disallow re-use of buffers across sets, this change ensures that buffers across sets can be re-used at drive level, this can reduce quite a lot of memory on large drive setups.	2021-05-17 17:49:48 -07:00
Harshavardhana	2ab9dc7609	do not update bloomFilters for temporary objects	2021-05-15 19:54:07 -07:00
Harshavardhana	4d876d03e8	fix: do not fail upon faulty/non-writable drives gracefully start the server, if there are other drives available - print enough information for administrator to notice the errors in console. Bonus: for really large streams use larger buffer for writes.	2021-05-15 12:57:18 -07:00
Klaus Post	229d83bb75	feat: add dynamic usage cache (#12229 ) A cache structure will be kept with a tree of usages. The cache is a tree structure where each keeps track of its children. An uncompacted branch contains a count of the files only directly at the branch level, and contains link to children branches or leaves. The leaves are "compacted" based on a number of properties. A compacted leaf contains the totals of all files beneath it. A leaf is only scanned once every dataUsageUpdateDirCycles, rarer if the bloom filter for the path is clean and no lifecycles are applied. Skipped leaves have their totals transferred from the previous cycle. A clean leaf will be included once every healFolderIncludeProb for partial heal scans. When selected there is a one in healObjectSelectProb that any object will be chosen for heal scan. Compaction happens when either: - The folder (and subfolders) contains less than dataScannerCompactLeastObject objects. - The folder itself contains more than dataScannerCompactAtFolders folders. - The folder only contains objects and no subfolders. - A bucket root will never be compacted. Furthermore, if a has more than dataScannerCompactAtChildren recursive children (uncompacted folders) the tree will be recursively scanned and the branches with the least number of objects will be compacted until the limit is reached. This ensures that any branch will never contain an unreasonable amount of other branches, and also that small branches with few objects don't take up unreasonable amounts of space. Whenever a branch is scanned, it is assumed that it will be un-compacted before it hits any of the above limits. This will make the branch rebalance itself when scanned if the distribution of objects has changed. TLDR; With current values: No bucket will ever have more than 10000 child nodes recursively. No single folder will have more than 2500 child nodes by itself. All subfolders are compacted if they have less than 500 objects in them recursively. We accumulate the (non-deletemarker) version count for paths as well, since we are changing the structure anyway.	2021-05-11 18:36:15 -07:00
Anis Elleuch	56d4d7b8b1	MRF: Better detection of non stable disks (#12252 ) MRF does not detect when a node is disconnected and reconnected quickly this change will ensure that MRF is alerted by comparing the last disk reconnection timestamp with the last MRF check time. Signed-off-by: Anis Elleuch <anis@min.io> Co-authored-by: Klaus Post <klauspost@gmail.com>	2021-05-11 09:19:15 -07:00
Harshavardhana	764721e2c6	add root_disk threshold detection (#12259 ) as there is no automatic way to detect if there is a root disk mounted on / or /var for the container environments due to how the root disk information is masked inside overlay root inside container. this PR brings an environment variable to set root disk size threshold manually to detect the root disks in such situations.	2021-05-08 15:40:29 -07:00
Klaus Post	254698f126	fix: minor allocation improvements in xlMetaV2 (#12133 )	2021-05-07 09:11:05 -07:00
Nitish Tiwari	776589f0da	Add free inode metric for Prometheus (#12225 )	2021-05-06 12:50:48 -07:00
Harshavardhana	c8050bc079	fix: sleeper behavior in data scanner (#12164 ) do not apply healReplication() for ILM expired, transitioned objects	2021-04-27 08:24:44 -07:00
Poorna Krishnamoorthy	4be0f92067	Fix multipart restore to remove part match (#12161 ) Part ETags are not available after multipart finalizes, removing this check as not useful. Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-26 18:24:06 -07:00
Krishnan Parthasarathi	c829e3a13b	Support for remote tier management (#12090 ) With this change, MinIO's ILM supports transitioning objects to a remote tier. This change includes support for Azure Blob Storage, AWS S3 compatible object storage incl. MinIO and Google Cloud Storage as remote tier storage backends. Some new additions include: - Admin APIs remote tier configuration management - Simple journal to track remote objects to be 'collected' This is used by object API handlers which 'mutate' object versions by overwriting/replacing content (Put/CopyObject) or removing the version itself (e.g DeleteObjectVersion). - Rework of previous ILM transition to fit the new model In the new model, a storage class (a.k.a remote tier) is defined by the 'remote' object storage type (one of s3, azure, GCS), bucket name and a prefix. * Fixed bugs, review comments, and more unit-tests - Leverage inline small object feature - Migrate legacy objects to the latest object format before transitioning - Fix restore to particular version if specified - Extend SharedDataDirCount to handle transitioned and restored objects - Restore-object should accept version-id for version-suspended bucket (#12091) - Check if remote tier creds have sufficient permissions - Bonus minor fixes to existing error messages Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Krishna Srinivas <krishna@minio.io> Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	2ef824bbb2	collapse two distinct calls into single RenameData() call (#12093 ) This is an optimization by reducing one extra system call, and many network operations. This reduction should increase the performance for small file workloads.	2021-04-20 10:44:39 -07:00
Harshavardhana	1456f9f090	fix: preserve shared dataDir during suspend overwrites (#12058 ) CopyObject() when shares dataDir needs to be preserved, and upon versioning suspended overwrites should still preserve the dataDir.	2021-04-15 08:44:05 -07:00
Harshavardhana	928ee1a7b2	remove null version dataDir upon overwrites (#12023 )	2021-04-08 19:55:44 -07:00
Klaus Post	d2ac2f758e	odirectReader: handle EOF correctly (#11998 ) EOF may be sent along with data so queue it up and return it when the buffer is empty. Also, when reading data without direct io don't add a buffer that only results in extra memcopy.	2021-04-07 08:32:59 -07:00
Klaus Post	788a8bc254	Fix disk info race (#11984 ) Protect updated members in xlStorage. ``` WARNING: DATA RACE Write at 0x00c004b4ee78 by goroutine 1491: github.com/minio/minio/cmd.(xlStorage).GetDiskID() d:/minio/minio/cmd/xl-storage.go:590 +0x1078 github.com/minio/minio/cmd.(xlStorageDiskIDCheck).checkDiskStale() d:/minio/minio/cmd/xl-storage-disk-id-check.go:195 +0x84 github.com/minio/minio/cmd.(xlStorageDiskIDCheck).StatVol() d:/minio/minio/cmd/xl-storage-disk-id-check.go:284 +0x16a github.com/minio/minio/cmd.erasureObjects.getBucketInfo.func1() d:/minio/minio/cmd/erasure-bucket.go:100 +0x1a5 github.com/minio/minio/pkg/sync/errgroup.(Group).Go.func1() d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7 Previous read at 0x00c004b4ee78 by goroutine 1087: github.com/minio/minio/cmd.(xlStorage).CheckFile.func1() d:/minio/minio/cmd/xl-storage.go:1699 +0x384 github.com/minio/minio/cmd.(xlStorage).CheckFile() d:/minio/minio/cmd/xl-storage.go:1726 +0x13c github.com/minio/minio/cmd.(xlStorageDiskIDCheck).CheckFile() d:/minio/minio/cmd/xl-storage-disk-id-check.go:446 +0x23b github.com/minio/minio/cmd.erasureObjects.parentDirIsObject.func1() d:/minio/minio/cmd/erasure-common.go:173 +0x194 github.com/minio/minio/pkg/sync/errgroup.(Group).Go.func1() d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7 ```	2021-04-06 11:33:42 -07:00
Harshavardhana	5cce9361bc	fix: avoid an extra rename when there is no dataDir (#11964 ) also perform globalSync() in defer when enabled for RenameData(), to ensure all calls are flushed to disk.	2021-04-05 08:52:28 -07:00
Harshavardhana	d46386246f	api: Introduce metadata update APIs to update only metadata (#11962 ) Current implementation heavily relies on readAllFileInfo but with the advent of xl.meta inlined with data, we cannot easily avoid reading data when we are only interested is updating metadata, this leads to invariably write amplification during metadata updates, repeatedly reading data when we are only interested in updating metadata. This PR ensures that we implement a metadata only update API at storage layer, that handles updates to metadata alone for any given version - given the version is valid and present. This helps reduce the chattiness for following calls.. - PutObjectTags - DeleteObjectTags - PutObjectLegalHold - PutObjectRetention - ReplicateObject (updates metadata on replication status)	2021-04-04 13:32:31 -07:00
Poorna Krishnamoorthy	47c09a1e6f	Various improvements in replication (#11949 ) - collect real time replication metrics for prometheus. - add pending_count, failed_count metric for total pending/failed replication operations. - add API to get replication metrics - add MRF worker to handle spill-over replication operations - multiple issues found with replication - fixes an issue when client sends a bucket name with `/` at the end from SetRemoteTarget API call make sure to trim the bucket name to avoid any extra `/`. - hold write locks in GetObjectNInfo during replication to ensure that object version stack is not overwritten while reading the content. - add additional protection during WriteMetadata() to ensure that we always write a valid FileInfo{} and avoid ever writing empty FileInfo{} to the lowest layers. Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-03 09:03:42 -07:00
Harshavardhana	434e5c0cfe	allow preserving legacyXLv1 with inline data format (#11951 ) current master breaks this important requirement we need to preserve legacyXLv1 format, this is simply ignored and overwritten causing a myriad of issues by leaving stale files on the namespace etc. for now lets still use the two-phase approach of writing to `tmp` and then renaming the content to the actual namespace.	2021-04-01 22:12:03 -07:00
Harshavardhana	204c610d84	do not use dataDir to reference inline data use versionID (#11942 ) versionID is the one that needs to be preserved and as well as overwritten in case of replication, transition etc - dataDir is an ephemeral entity that changes during overwrites - make sure that versionID is used to save the object content. this would break things if you are already running the latest master, please wipe your current content and re-do your setup after this change.	2021-04-01 13:09:23 -07:00
Klaus Post	2623338dc5	Inline small file data in xl.meta file (#11758 )	2021-03-29 17:00:55 -07:00
Harshavardhana	d93c6cb9c7	use Access() instead of Lstat() for frequent use (#11911 ) using Lstat() is causing tiny memory allocations, that are usually wasted and never used, instead we can simply uses Access() call that does 0 memory allocations.	2021-03-29 08:07:23 -07:00
Harshavardhana	d7f32ad649	xl: avoid sending Delete() remote call for fully successful runs an optimization to avoid extra syscalls in PutObject(), adds up to our PutObject response times.	2021-03-24 17:32:12 -07:00
Harshavardhana	410e84d273	xl: add checks for minioTmpMetaBucket in CreateFile	2021-03-24 09:36:10 -07:00
Harshavardhana	75741dbf4a	xl: remove cleanupDir instead use Delete() (#11880 ) use a single call to remove directly at disk instead of doing recursively at network layer.	2021-03-24 09:08:05 -07:00
Ritesh H Shukla	6a2ed44095	fix: optionally enable tracing posix calls	2021-03-23 22:23:08 -07:00
Anis Elleuch	98ff91b484	xl: Reduce usage of isDirEmpty() (#11838 ) When an object is removed, its parent directory is inspected to check if it is empty to remove if that is the case. However, we can use os.Remove() directly since it is only able to remove a file or an empty directory.	2021-03-19 15:42:01 -07:00
Anis Elleuch	4d86384dc7	xl: Remove non needed check for empty dir (#11835 ) RenameData renames xl.meta and data dir and removes the parent directory if empty, however, there is a duplicate check for empty dir, since the parent dir of xl.meta is always the same as the data-dir.	2021-03-19 12:26:53 -07:00
Harshavardhana	b92a220db1	fix: handle weird drives sporadic read O_DIRECT behavior (#11832 ) on freshReads if drive returns errInvalidArgument, we should simply turn-off DirectIO and read normally, there are situations in k8s like environments where the drives behave sporadically in a single deployment and may not have been implemented properly to handle O_DIRECT for reads.	2021-03-18 20:16:50 -07:00
Harshavardhana	51a8619a79	[feat] Add configurable deadline for writers (#11822 ) This PR adds deadlines per Write() calls, such that slow drives are timed-out appropriately and the overall responsiveness for Writes() is always up to a predefined threshold providing applications sustained latency even if one of the drives is slow to respond.	2021-03-18 14:09:55 -07:00
Harshavardhana	60b0f2324e	storage write call path optimizations (#11805 ) - write in o_dsync instead of o_direct for smaller objects to avoid unaligned double Write() situations that may arise for smaller objects < 128KiB - avoid fallocate() as its not useful since we do not use Append() semantics anymore, fallocate is not useful for streaming I/O we can save on a syscall - createFile() doesn't need to validate `bucket` name with a Lstat() call since createFile() is only used to write at `minioTmpBucket` - use io.Copy() when writing unAligned writes to allow usage of ReadFrom() from *os.File providing zero buffer writes().	2021-03-17 09:38:38 -07:00
Anis Elleuch	0eb146e1b2	add additional metrics per disk API latency, API call counts #11250 ) ``` mc admin info --json ``` provides these details, for now, we shall eventually expose this at Prometheus level eventually. Co-authored-by: Harshavardhana <harsha@minio.io>	2021-03-16 20:06:57 -07:00
Klaus Post	fdc2f69218	truncate xl.meta files upon rewrites #11749 ) If the destination files exist and is larger - junk data will be left at the end of the file.	2021-03-09 14:42:24 -08:00
Harshavardhana	d971061305	use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675 )	2021-03-06 09:25:48 -08:00
Klaus Post	fa9cf1251b	Imporve healing and reporting (#11312 ) * Provide information on actively healing, buckets healed/queued, objects healed/failed. * Add concurrent healing of multiple sets (typically on startup). * Add bucket level resume, so restarts will only heal non-healed buckets. * Print summary after healing a disk is done.	2021-03-04 14:36:23 -08:00
Harshavardhana	c6a120df0e	fix: Prometheus metrics to re-use storage disks (#11647 ) also re-use storage disks for all `mc admin server info` calls as well, implement a new LocalStorageInfo() API call at ObjectLayer to lookup local disks storageInfo also fixes bugs where there were double calls to StorageInfo()	2021-03-02 17:28:04 -08:00
Harshavardhana	2f4af09c01	fix: alow changes to readAllData to decrement activeCount()	2021-02-28 20:09:23 -08:00
Harshavardhana	37960cbc2f	fix: avoid writing more content on network with O_DIRECT reads (#11659 ) There was an io.LimitReader was missing for the 'length' parameter for ranged requests, that would cause client to get truncated responses and errors. fixes #11651	2021-02-28 15:33:03 -08:00
Harshavardhana	9171d6ef65	rename all references from crawl -> scanner (#11621 )	2021-02-26 15:11:42 -08:00
Harshavardhana	6386b45c08	[feat] use rename instead of recursive deletes (#11641 ) most of the delete calls today spend time in a blocking operation where multiple calls need to be recursively sent to delete the objects, instead we can use rename operation to atomically move the objects from the namespace to `tmp/.trash` we can schedule deletion of objects at this location once in 15, 30mins and we can also add wait times between each delete operation. this allows us to make delete's faster as well less chattier on the drives, each server runs locally a groutine which would clean this up regularly.	2021-02-26 09:52:27 -08:00
Harshavardhana	b517c791e9	[feat]: use DSYNC for xl.meta writes and NOATIME for reads (#11615 ) Instead of using O_SYNC, we are better off using O_DSYNC instead since we are only ever interested in data to be persisted to disk not the associated filesystem metadata. For reads we ask customers to turn off noatime, but instead we can proactively use O_NOATIME flag to avoid atime updates upon reads.	2021-02-24 00:14:16 -08:00
Harshavardhana	18ec933085	fix: for containers use root-disk detection cleverly (#11593 ) root-disk implemented currently had issues where root disk partitions getting modified might race and provide incorrect results, to avoid this lets rely again back on DeviceID and match it instead. In-case of containers `/data` is one such extra entity that needs to be verified for root disk, due to how 'overlay' filesystem works and the 'overlay' presents a completely different 'device' id - using `/data` as another entity for fallback helps because our containers describe 'VOLUME' parameter that allows containers to automatically have a virtual `/data` that points to the container root path this can either be at `/` or `/var/lib/` (on different partition)	2021-02-22 10:32:21 -08:00
Harshavardhana	8778828a03	fix: read metadata in O_DIRECT if configured and supported (#11594 ) reduce the page-cache pressure completely by moving the entire read-phase of our operations to O_DIRECT, primarily this is going to be very useful for chatty metadata operations such as listing, scanner, ilm, healing like operations to avoid filling up the page-cache upon repeated runs.	2021-02-22 01:36:17 -08:00
Harshavardhana	7875d472bc	avoid notification for non-existent delete objects (#11514 ) Skip notifications on objects that might have had an error during deletion, this also avoids unnecessary replication attempt on such objects. Refactor some places to make sure that we have notified the client before we - notify - schedule for replication - lifecycle etc.	2021-02-10 22:00:42 -08:00
Harshavardhana	0e3211f4ad	fix: server upgrades should have more descriptive error messages (#11476 ) during rolling upgrade, provide a more descriptive error message and discourage rolling upgrade in such situations, allowing users to take action. additionally also rename `slashpath -> pathutil` to avoid a slighly mis-pronounced usage of `path` package.	2021-02-08 10:15:12 -08:00

1 2 3

137 Commits