Commit Graph

170 Commits

Author SHA1 Message Date
Poorna
b48bbe08b2 Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Krishnan Parthasarathi
87cb0081ec Retain current and upto NewerNoncurrentVersions versions (#17909)
applyNewerNoncurrentVersionLimit method should pass along versions
unaffected by NewerNoncurrentVersions rule for further ILM evaluation.
2023-08-24 09:26:29 -07:00
Harshavardhana
3a0125fa1f remove unexpected logging from peer calls (#17888)
also make sure RequestID is set for system logs
2023-08-21 14:25:24 -07:00
Anis Eleuch
4c6869cd9a ilm: Fix cleaning non current null versions (#17876) 2023-08-18 12:55:47 -07:00
Harshavardhana
6e860b6dc5 count all versions as part of DeleteAllVersionsAction (#17821) 2023-08-09 08:55:19 -07:00
Krishnan Parthasarathi
0120ff93bc admin-info: add DeleteMarkers count (#17659) 2023-07-18 10:49:40 -07:00
Harshavardhana
24e86d0c59 avoid passing around poolIdx, setIdx instead pass the relevant disks (#17660) 2023-07-17 09:52:05 -07:00
Harshavardhana
3e196fa7b3 fix: ILM newer noncurrent version limit must return correct versions (#17652)
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.

this would cause legitimately expired objects to be missed
from expiration.
2023-07-14 16:42:35 -07:00
Poorna
5e2f8d7a42 replication: Simplify mrf requeueing and add backlog handler (#17171)
Simplify MRF queueing and add backlog handler

- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.

- Change MRF to have each node process only its MRF entries.

- Collect MRF backlog by the node to allow for current backlog visibility
2023-07-12 23:51:33 -07:00
Harshavardhana
82075e8e3a use strconv variants to improve on performance per 'op' (#17626)
```
BenchmarkItoa
BenchmarkItoa-8         	673628088	         1.946 ns/op	       0 B/op	       0 allocs/op
BenchmarkFormatInt
BenchmarkFormatInt-8    	592919769	         2.012 ns/op	       0 B/op	       0 allocs/op
BenchmarkSprint
BenchmarkSprint-8       	26149144	        49.06 ns/op	       2 B/op	       1 allocs/op
BenchmarkSprintBool
BenchmarkSprintBool-8   	26440180	        45.92 ns/op	       4 B/op	       1 allocs/op
BenchmarkFormatBool
BenchmarkFormatBool-8   	1000000000	         0.2558 ns/op	       0 B/op	       0 allocs/op
```
2023-07-11 07:46:58 -07:00
Harshavardhana
f6186965c3 honor DeleteAllVersions in list(), head() calls (#17604) 2023-07-08 15:42:10 -07:00
Harshavardhana
aae6846413 feat: allow expiration of all versions via ILM Expiration action (#17521)
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.

```
<LifecycleConfiguration>
    <Rule>
        <ID>ClassADocRule</ID>
        <Filter>
           <Prefix>classA/</Prefix>
        </Filter>
        <Status>Enabled</Status>
        <Expiration>
             <Days>3650</Days>
	     <ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
        </Expiration>
    </Rule>
    ...
```
2023-06-28 22:12:28 -07:00
Kaan Kabalak
21fbe88e1f Print certain log messages once per error (#17484) 2023-06-24 20:29:13 -07:00
Klaus Post
bf8a68879c fix: Time ILM Actions for scanner info (#17493)
ILM Actions were not timed fix it.
2023-06-23 07:48:36 -07:00
Aditya Manthramurthy
5a1612fe32 Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
Klaus Post
6f2406b0b6 fix: protect ReplicationStats against concurrent map iteration and write crash (#17403) 2023-06-12 09:17:11 -07:00
Krishnan Parthasarathi
3e128c116e Add lifecycle event source to audit log tags (#17248) 2023-05-22 15:28:56 -07:00
jiuker
7d433f16c4 before return make globalScannerMetrics.incTime call (#17230) 2023-05-18 13:45:05 -07:00
Krishnan Parthasarathi
0ec722bc54 Add tags to NewerNoncurrentVersions audit event (#17110) 2023-05-02 12:56:33 -07:00
Krishnan Parthasarathi
e7cac8acef Add tags to auditLogLifecycle (#17081) 2023-04-26 17:49:00 -07:00
Poorna
cd6dec49c0 Add trace support for ilm activity (#16993) 2023-04-11 19:22:32 -07:00
Shubhendu
4c204707fd Correct to remove null version while ILM rule application (#16971)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2023-04-06 14:10:01 -07:00
Harshavardhana
c06e0bfef9 set correct Host: value for replication event notification (#16984) 2023-04-06 10:20:53 -07:00
ferhat elmas
714283fae2 cleanup ignored static analysis (#16767) 2023-03-06 08:56:10 -08:00
Klaus Post
9acf1024e4 Remove bloom filter (#16682)
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.

Also removes a tiny bit of CPU by each write operation.
2023-02-24 09:03:31 +05:30
Klaus Post
fd6622458b Add detailed scanner trace output and notifications (#16668) 2023-02-21 09:33:33 -08:00
Harshavardhana
b66d7dc708 add missing x-amz-id-2 to event notification date (#16646) 2023-02-20 15:41:47 +05:30
Krishnan Parthasarathi
2fa35def2c Fix DeleteObject when only free versions remain (#16289) 2022-12-21 16:24:07 -08:00
Harshavardhana
5d7e8f79ed fix: remove scanner healing with unnecessary logs (#16260) 2022-12-14 16:39:18 -08:00
Aditya Manthramurthy
a30cfdd88f Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Klaus Post
a713aee3d5 Run staticcheck on CI (#16170) 2022-12-05 11:18:50 -08:00
Klaus Post
cc1d8f0057 Check for abandoned data when healing (#16122) 2022-11-28 10:20:55 -08:00
Krishnan Parthasarathi
6eef9b4a23 lifecycle: simplify Eval and HasActiveRules (#16036) 2022-11-10 07:17:45 -08:00
Anis Elleuch
3b1a9b9fdf Use the same lock for the scanner and site replication healing (#15985) 2022-11-08 08:55:55 -08:00
Harshavardhana
b57fbff7c1 ignore background healInfo in single drive setup (#15968) 2022-10-31 07:26:10 -07:00
Anis Elleuch
fc6c794972 Audit dangling object removal (#15933) 2022-10-24 11:35:07 -07:00
Anis Elleuch
ac85c2af76 lifecycle: refactor rules filtering and tagging support (#15914) 2022-10-21 10:46:53 -07:00
Harshavardhana
41e1654f9a remove spurious logging for object not found (#15842) 2022-10-12 04:28:21 -07:00
Harshavardhana
928feb0889 remove unused debug param from evalActionFromLifecycle (#15813) 2022-10-07 10:24:12 -07:00
Harshavardhana
ae4ee95d25 change default lock retry interval to 50ms (#15560)
competing calls on the same object on versioned bucket
mutating calls on the same object may unexpected have
higher delays.

This can be reproduced with a replicated bucket
overwriting the same object writes, deletes repeatedly.

For longer locks like scanner keep the 1sec interval
2022-08-19 16:21:05 -07:00
Poorna
21bf5b4db7 replication: heal proactively upon access (#15501)
Queue failed/pending replication for healing during listing and GET/HEAD
API calls. This includes healing of existing objects that were never
replicated or those in the middle of a resync operation.

This PR also fixes a bug in ListObjectVersions where lifecycle filtering
should be done.
2022-08-09 15:00:24 -07:00
Harshavardhana
0a8b78cb84 fix: simplify passing auditLog eventType (#15278)
Rename Trigger -> Event to be a more appropriate
name for the audit event.

Bonus: fixes a bug in AddMRFWorker() it did not
cancel the waitgroup, leading to waitgroup leaks.
2022-07-12 10:43:32 -07:00
Klaus Post
37a6b2da67 Allow compaction at bucket top level. (#15266)
If more than 1M folders (objects or prefixes) are found at the top level in a bucket allow it to be compacted.

While very suboptimal structure we should limit memory usage at some point.
2022-07-11 07:59:03 -07:00
Harshavardhana
ae92521310 remove unnecessary nAgreed value in partial() func (#15242) 2022-07-07 13:45:34 -07:00
Klaus Post
ac055b09e9 Add detailed scanner metrics (#15161) 2022-07-05 14:45:49 -07:00
Shireesh Anjal
4ce81fd07f Add periodic callhome functionality (#14918)
* Add periodic callhome functionality

Periodically (every 24hrs by default), fetch callhome information and
upload it to SUBNET.

New config keys under the `callhome` subsystem:

enable - Set to `on` for enabling callhome. Default `off`
frequency - Interval between callhome cycles. Default `24h`

* Improvements based on review comments

- Update `enableCallhome` safely
- Rename pctx to ctx
- Block during execution of callhome
- Store parsed proxy URL in global subnet config
- Store callhome URL(s) in constants
- Use existing global transport
- Pass auth token to subnetPostReq
- Use `config.EnableOn` instead of `"on"`

* Use atomic package instead of lock

* Use uber atomic package

* Use `Cancel` instead of `cancel`

Co-authored-by: Harshavardhana <harsha@minio.io>

Co-authored-by: Harshavardhana <harsha@minio.io>
Co-authored-by: Aditya Manthramurthy <donatello@users.noreply.github.com>
2022-06-06 16:14:52 -07:00
Harshavardhana
52221db7ef fix: for unexpected errors in reading versioning config panic (#14994)
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.

In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.

Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
2022-05-31 02:57:57 -07:00
Harshavardhana
6cfb1cb6fd fix: timer usage across codebase (#14935)
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD

this PR fixes all the usage comprehensively
2022-05-17 22:42:59 -07:00
Krishnan Parthasarathi
88dd83a365 lifecycle: Set opts.VersionSuspended when expiring objects (#14902) 2022-05-12 06:09:24 -07:00
Krishnan Parthasarathi
ad8e611098 feat: implement prefix-level versioning exclusion (#14828)
Spark/Hadoop workloads which use Hadoop MR 
Committer v1/v2 algorithm upload objects to a 
temporary prefix in a bucket. These objects are 
'renamed' to a different prefix on Job commit. 
Object storage admins are forced to configure 
separate ILM policies to expire these objects 
and their versions to reclaim space.

Our solution:

This can be avoided by simply marking objects 
under these prefixes to be excluded from versioning, 
as shown below. Consequently, these objects are 
excluded from replication, and don't require ILM 
policies to prune unnecessary versions.

-  MinIO Extension to Bucket Version Configuration
```xml
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"> 
        <Status>Enabled</Status>
        <ExcludeFolders>true</ExcludeFolders>
        <ExcludedPrefixes>
          <Prefix>app1-jobs/*/_temporary/</Prefix>
        </ExcludedPrefixes>
        <ExcludedPrefixes>
          <Prefix>app2-jobs/*/__magic/</Prefix>
        </ExcludedPrefixes>

        <!-- .. up to 10 prefixes in all -->     
</VersioningConfiguration>
```
Note: `ExcludeFolders` excludes all folders in a bucket 
from versioning. This is required to prevent the parent 
folders from accumulating delete markers, especially
those which are shared across spark workloads 
spanning projects/teams.

- To enable version exclusion on a list of prefixes

```
mc version enable --excluded-prefixes "app1-jobs/*/_temporary/,app2-jobs/*/_magic," --exclude-prefix-marker myminio/test
```
2022-05-06 19:05:28 -07:00