minio

Commit Graph

Author	SHA1	Message	Date
Klaus Post	229d83bb75	feat: add dynamic usage cache (#12229) A cache structure will be kept with a tree of usages. The cache is a tree structure where each node keeps track of its children. An uncompacted branch contains a count of the files only directly at the branch level, and contains links to child branches or leaves. The leaves are "compacted" based on a number of properties. A compacted leaf contains the totals of all files beneath it. A leaf is only scanned once every dataUsageUpdateDirCycles, rarer if the bloom filter for the path is clean and no lifecycles are applied. Skipped leaves have their totals transferred from the previous cycle. A clean leaf will be included once every healFolderIncludeProb for partial heal scans. When selected there is a one in healObjectSelectProb chance that any object will be chosen for heal scan. Compaction happens when any of the following holds: - The folder (and subfolders) contains fewer than dataScannerCompactLeastObject objects. - The folder itself contains more than dataScannerCompactAtFolders folders. - The folder only contains objects and no subfolders. A bucket root will never be compacted. Furthermore, if a branch has more than dataScannerCompactAtChildren recursive children (uncompacted folders) the tree will be recursively scanned and the branches with the least number of objects will be compacted until the limit is reached. This ensures that any branch will never contain an unreasonable amount of other branches, and also that small branches with few objects don't take up unreasonable amounts of space. Whenever a branch is scanned, it is assumed that it will be un-compacted before it hits any of the above limits. This will make the branch rebalance itself when scanned if the distribution of objects has changed. TLDR; With current values: No bucket will ever have more than 10000 child nodes recursively. No single folder will have more than 2500 child nodes by itself. All subfolders are compacted if they have fewer than 500 objects in them recursively. We accumulate the (non-deletemarker) version count for paths as well, since we are changing the structure anyway.	2021-05-11 18:36:15 -07:00
Harshavardhana	069432566f	update license change for MinIO Signed-off-by: Harshavardhana <harsha@minio.io>	2021-04-23 11:58:53 -07:00
Harshavardhana	a334554f99	fix: add helper for expected path.Clean behavior (#12068) Current usage of path.Clean returns "." for empty strings; instead we need the `""` string as-is. Make relevant changes as needed.	2021-04-15 16:32:13 -07:00
Poorna Krishnamoorthy	47c09a1e6f	Various improvements in replication (#11949) - collect real time replication metrics for prometheus. - add pending_count, failed_count metric for total pending/failed replication operations. - add API to get replication metrics - add MRF worker to handle spill-over replication operations - multiple issues found with replication - fixes an issue when client sends a bucket name with `/` at the end from SetRemoteTarget API call; make sure to trim the bucket name to avoid any extra `/`. - hold write locks in GetObjectNInfo during replication to ensure that object version stack is not overwritten while reading the content. - add additional protection during WriteMetadata() to ensure that we always write a valid FileInfo{} and avoid ever writing empty FileInfo{} to the lowest layers. Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	2021-04-03 09:03:42 -07:00
Harshavardhana	9171d6ef65	rename all references from crawl -> scanner (#11621)	2021-02-26 15:11:42 -08:00
Harshavardhana	cc457f1798	fix: enhance logging in crawler use console.Debug instead of logger.Info (#11179)	2020-12-29 01:57:28 -08:00
Harshavardhana	f714840da7	add _MINIO_SERVER_DEBUG env for enabling debug messages (#11128)	2020-12-17 16:52:47 -08:00
Harshavardhana	df93102235	fix: unwrapping issues with os.Is* functions (#10949) reduces 3 stat calls, reducing the overall startup time significantly.	2020-11-23 08:36:49 -08:00
Klaus Post	6135f072d2	Fix invalidated metacaches (#10784) * Fix caches having EOF marked as a failure. * Simplify cache updates. * Provide context for checkMetacacheState failures. * Log 499 when the client disconnects.	2020-10-30 09:33:16 -07:00
Harshavardhana	4bf90ca67f	fix: handle a crash when AskDisks is set to -1 (#10777)	2020-10-29 09:25:43 -07:00
Klaus Post	a982baff27	ListObjects Metadata Caching (#10648) Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c Gist of improvements: * Cross-server caching and listing will use the same data across servers and requests. * Lists can be arbitrarily resumed at a constant speed. * Metadata for all files scanned is stored for streaming retrieval. * The existing bloom filters controlled by the crawler are used for validating caches. * Concurrent requests for the same data (or parts of it) will not spawn additional walkers. * Listing a subdirectory of an existing recursive cache will use the cache. * All listing operations are fully streamable so the number of objects in a bucket no longer dictates the amount of memory. * Listings can be handled by any server within the cluster. * Caches are cleaned up when out of date or superseded by a more recent one.	2020-10-28 09:18:35 -07:00
Klaus Post	3047121255	dataupdate: Bump to force rescan (#10609) After #10594 let's invalidate the bloom filters to force the next cycles to go through all data. There is a small chance that the linked PR could have caused missing bloom filter data. This will invalidate the current bloom filters and make the crawler go through everything.	2020-09-30 16:10:40 -07:00
Klaus Post	fdf0ae9167	exit data update tracker only upon context completion (#10594) The data update tracker saver would exit if data wasn't updated between cycles.	2020-09-29 13:23:53 -07:00
Harshavardhana	02c1a08a5b	fix: make sure to lock CopyObject for in-place updates (#10492)	2020-09-15 20:44:48 -07:00
Klaus Post	c097ce9c32	continuous healing based on crawler (#10103) Design: https://gist.github.com/klauspost/792fe25c315caf1dd15c8e79df124914	2020-08-24 13:47:01 -07:00
Klaus Post	8e6787a302	Fix TestDataUpdateTracker hanging (#10302) Keep dataUpdateTracker while goroutine is starting. This will ensure the object is updated once `start` returns. Tested with ``` λ go test -cpu=1,2,4,8 -test.run TestDataUpdateTracker -count=1000 PASS ok github.com/minio/minio/cmd 8.913s ``` Fixes #10295	2020-08-20 13:17:42 -07:00
Harshavardhana	9fd836e51f	add dnsStore interface for upcoming operator webhook (#10077)	2020-07-20 12:28:48 -07:00
Klaus Post	1813ff9dfa	Re-add missing bucket bloom filters (#9861)	2020-06-17 08:54:41 -07:00
Klaus Post	43d6e3ae06	merge object lifecycle checks into usage crawler (#9579)	2020-06-12 10:28:21 -07:00
Klaus Post	56e0c6adf8	Track if bloom filter is dirty (#9601) Only save bloom filter on cycles and updates. Fixes #9600	2020-05-14 21:46:36 -07:00
Harshavardhana	b768645fde	fix: unexpected logging with bucket metadata conversions (#9519)	2020-05-04 20:04:06 -07:00
Harshavardhana	5205c9591f	print proper certinfo on console when starting up (#9479) also potentially fix a race in certs.go implementation while accessing tls.Certificate concurrently.	2020-04-30 16:15:29 -07:00
Harshavardhana	498389123e	avoid unnecessary logging on fresh/newly replaced drives (#9470) data usage tracker and crawler seem to be logging non-actionable information on console, which is not useful and is fixed on its own in almost all deployments; let's keep this logging to a minimum.	2020-04-28 01:16:57 -07:00
Klaus Post	073aac3d92	add data update tracking using bloom filter (#9208) By monitoring PUT/DELETE and heal operations it is possible to track changed paths and keep a bloom filter for this data. This can help prioritize paths to scan. The bloom filter can identify paths that have not changed, and the few collisions will only result in a marginal extra workload. This can be implemented on a bucket+(1 prefix level) with reasonable performance. The bloom filter is set to have a false positive rate of 1% at 1M entries. A bloom table of this size is about 2500 bytes when serialized. To not force a full scan of all paths that have changed, cycle bloom filters would need to be kept, so we guarantee that dirty paths have been scanned within cycle runs. Until cycle bloom filters have been collected, all paths are considered dirty.	2020-04-27 10:06:21 -07:00
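
The idea in commit 073aac3d92 (tracking changed paths in a bloom filter at bucket + one prefix level) can be illustrated with a minimal Go sketch. This is not the MinIO implementation: the type dirtyTracker, the helper trackedKey, the FNV-based double hashing, and the filter size are all assumptions chosen for illustration. What it shows is the property the message relies on: a path that was marked is always reported as possibly dirty, while an unmarked path is usually, but not always, reported clean.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// dirtyTracker is a hypothetical, simplified bloom filter over path keys.
type dirtyTracker struct {
	bits []uint64 // bit set, 64 bits per word
	k    uint64   // number of probes per key
}

// newDirtyTracker allocates a filter with at least nBits bits and k probes.
func newDirtyTracker(nBits, k uint64) *dirtyTracker {
	return &dirtyTracker{bits: make([]uint64, (nBits+63)/64), k: k}
}

// trackedKey reduces an object path to "bucket/prefix" (one prefix level).
// Objects stored directly in the bucket root map to the bucket name alone.
func trackedKey(objectPath string) string {
	parts := strings.SplitN(strings.Trim(objectPath, "/"), "/", 3)
	if len(parts) == 3 {
		return parts[0] + "/" + parts[1]
	}
	return parts[0]
}

// positions derives k bit positions from one FNV-1a hash (double hashing).
func (t *dirtyTracker) positions(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h2 := (h1 >> 32) | 1 // odd step so probes differ
	n := uint64(len(t.bits)) * 64
	out := make([]uint64, t.k)
	for i := uint64(0); i < t.k; i++ {
		out[i] = (h1 + i*h2) % n
	}
	return out
}

// MarkDirty records that something under the path's tracked key changed.
func (t *dirtyTracker) MarkDirty(objectPath string) {
	for _, p := range t.positions(trackedKey(objectPath)) {
		t.bits[p/64] |= uint64(1) << (p % 64)
	}
}

// MaybeDirty reports false only if the tracked key was definitely not marked.
func (t *dirtyTracker) MaybeDirty(objectPath string) bool {
	for _, p := range t.positions(trackedKey(objectPath)) {
		if t.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	tr := newDirtyTracker(1<<16, 4)
	tr.MarkDirty("photos/2021/05/cat.jpg") // e.g. after a PUT
	// Same bucket + first prefix level, so this is reported as dirty.
	fmt.Println(tr.MaybeDirty("photos/2021/06/dog.jpg"))
	// Different prefix: clean unless a hash collision occurs.
	fmt.Println(tr.MaybeDirty("photos/2020/old.jpg"))
}
```

A scanner built on such a filter could skip any prefix for which MaybeDirty returns false, and a collision only costs an extra scan of a prefix that had not changed.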

24 Commits
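
The per-folder compaction rules listed in commit 229d83bb75 can be sketched as a small Go predicate. This is an illustration under stated assumptions, not the MinIO code: folderStats and shouldCompact are hypothetical names, and the pairing of the numbers from the "TLDR" (500 objects, 2500 folders) with the constants dataScannerCompactLeastObject and dataScannerCompactAtFolders is inferred from the message text. The recursive limit (dataScannerCompactAtChildren, 10000 in the TLDR) requires walking the whole tree and is not modeled here.

```go
package main

import "fmt"

// Thresholds taken from the "TLDR" of the commit message. Which constant
// each number belongs to is an assumption based on the wording.
const (
	compactLeastObject = 500  // assumed value of dataScannerCompactLeastObject
	compactAtFolders   = 2500 // assumed value of dataScannerCompactAtFolders
)

// folderStats is a hypothetical summary of one folder in the usage tree.
type folderStats struct {
	isBucketRoot     bool
	objectsRecursive int // objects in this folder and all subfolders
	subfolders       int // direct subfolders only
}

// shouldCompact applies the rules from the commit message in order.
func shouldCompact(f folderStats) bool {
	if f.isBucketRoot {
		return false // a bucket root is never compacted
	}
	if f.objectsRecursive < compactLeastObject {
		return true // small subtree: keep only its totals
	}
	if f.subfolders > compactAtFolders {
		return true // too many direct subfolders to track individually
	}
	if f.subfolders == 0 {
		return true // only objects, no subfolders
	}
	return false
}

func main() {
	fmt.Println(shouldCompact(folderStats{objectsRecursive: 120, subfolders: 4}))    // true
	fmt.Println(shouldCompact(folderStats{objectsRecursive: 80000, subfolders: 12})) // false
	fmt.Println(shouldCompact(folderStats{isBucketRoot: true, objectsRecursive: 10})) // false
}
```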