Commit Graph

154 Commits

Author SHA1 Message Date
Klaus Post 9667a170de
Add usage cache cleanup and lower forced top compaction (#19719)
Lower forced compaction to 250K entries.

If there is more than 250K entries on the top level force compact it and log an error.
2024-05-10 07:49:50 -07:00
Krishnan Parthasarathi 6c07bfee8a
With retention, skip actions expiring all versions (#19657)
ILM actions due to ExpiredObjectDeleteAllVersions and
DelMarkerExpiration are ignored when object locking is enabled on a
bucket.
Note: This applies to object versions which may not have retention
configured on them. This applies to all object versions in this bucket,
including those created before the retention config was applied.
2024-05-03 04:18:58 -07:00
Krishnan Parthasarathi 7926401cbd
ilm: Handle DeleteAllVersions action differently for DEL markers (#19481)
i.e., this rule element doesn't apply to DEL markers.

This is a breaking change to how ExpiredObejctDeleteAllVersions
functions today. This is necessary to avoid the following highly probable
footgun scenario in the future.

Scenario:
The user uses tags-based filtering to select an object's time to live(TTL). 
The application sometimes deletes objects, too, making its latest
version a DEL marker. The previous implementation skipped tag-based filters
if the newest version was DEL marker, voiding the tag-based TTL. The user is
surprised to find objects that have expired sooner than expected.

* Add DelMarkerExpiration action

This ILM action removes all versions of an object if its
the latest version is a DEL marker.

```xml
<DelMarkerObjectExpiration>
    <Days> 10 </Days>
</DelMarkerObjectExpiration>
```

1. Applies only to objects whose,
  • The latest version is a DEL marker.
  • satisfies the number of days criteria
2. Deletes all versions of this object
3. Associated rule can't have tag-based filtering

Includes,
- New bucket event type for deletion due to DelMarkerExpiration
2024-04-30 18:11:10 -07:00
Harshavardhana 8161411c5d
fix: a crash in RemoveReplication target (#19640)
calling a remote target remove with a perfectly
well constructed ARN can lead to a crash for a bucket
with no replication configured.

This PR fixes, and adds a crash check for ImportMetadata
as well.
2024-04-30 18:09:56 -07:00
Harshavardhana 95c65f4e8f
do not panic on rebalance during server restarts (#19563)
This PR makes a feasible approach to handle all the scenarios
that we must face to avoid returning "panic."

Instead, we must return "errServerNotInitialized" when a
bucketMetadataSys.Get() is called, allowing the caller to
retry their operation and wait.

Bonus fix the way data-usage-cache stores the object.
Instead of storing usage-cache.bin with the bucket as
`.minio.sys/buckets`, the `buckets` must be relative
to the bucket `.minio.sys` as part of the object name.

Otherwise, there is no way to decommission entries at
`.minio.sys/buckets` and their final erasure set positions.

A bucket must never have a `/` in it. Adds code to read()
from existing data-usage.bin upon upgrade.
2024-04-22 10:49:30 -07:00
jiuker f3d6a2dd37
code clean for dynamicSleeper (#19499) 2024-04-15 02:40:19 -07:00
Anis Eleuch c6f8dc431e
Add a warning when the total size of an object versions exceeds 1 TiB (#19435) 2024-04-08 10:45:03 -07:00
Anis Eleuch 95bf4a57b6
logging: Add subsystem to log API (#19002)
Create new code paths for multiple subsystems in the code. This will
make maintaing this easier later.

Also introduce bugLogIf() for errors that should not happen in the first
place.
2024-04-04 05:04:40 -07:00
Harshavardhana c61dd16a1e
fix: avoid fan-out DeletePrefix calls for batch-expire and ILM (#19365) 2024-03-27 20:18:15 -07:00
Krishnan Parthasarathi da81c6cc27
Encode dir obj names before expiration (#19305)
Object names of directory objects qualified for ExpiredObjectAllVersions
must be encoded appropriately before calling on deletePrefix on their
erasure set.

e.g., a directory object and regular objects with overlapping prefixes
could lead to the expiration of regular objects, which is not the 
intention of ILM. 

```
bucket/dir/ ---> directory object
bucket/dir/obj-1
```

When `bucket/dir/` qualifies for expiration, the current implementation would
remove regular objects under the prefix `bucket/dir/`, in this case,
`bucket/dir/obj-1`.
2024-03-21 10:21:35 -07:00
Krishnan Parthasarathi 383489d5d9
Handle zero versions qualified for expiration (#19301)
When objects have more versions than their ILM policy expects to retain
via NewerNoncurrentVersions, but they don't qualify for expiry due to
NoncurrentDays are configured in that rule. 

In this case, applyNewerNoncurrentVersionsLimit method was enqueuing empty 
tasks, which lead to a panic (panic: runtime error: index out of range [0] with
length 0) in newerNoncurrentTask.OpHash method, which assumes the task
to contain at least one version to expire.
2024-03-19 20:10:58 -07:00
Krishnan Parthasarathi a7577da768
Improve expiration of tiered objects (#18926)
- Use a shared worker pool for all ILM expiry tasks
- Free version cleanup executes in a separate goroutine
- Add a free version only if removing the remote object fails
- Add ILM expiry metrics to the node namespace
- Move tier journal tasks to expiryState
- Remove unused on-disk journal for tiered objects pending deletion
- Distribute expiry tasks across workers such that the expiry of versions of
  the same object serialized
- Ability to resize worker pool without server restart
- Make scaling down of expiryState workers' concurrency safe; Thanks
  @klauspost
- Add error logs when expiryState and transition state are not
  initialized (yet)
* metrics: Add missed tier journal entry tasks
* Initialize the ILM worker pool after the object layer
2024-03-01 21:11:03 -08:00
Harshavardhana 9a012a53ef
initialize the disk healer early on (#19143)
This PR fixes a bug that perhaps has been long introduced,
with no visible workarounds. In any deployment, if an entire
erasure set is deleted, there is no way the cluster recovers.
2024-02-27 23:02:14 -08:00
Krishnan Parthasarathi ee158e1610
ilm: Update action count only on success (#19093)
It also fixes a long-standing bug in expiring transitioned objects.
The expiration action was deleting the current version in the case'
of tiered objects instead of adding a delete marker.
2024-02-22 15:00:32 -08:00
Anis Eleuch 8c53a4405a
Add audit for folder excess (#19109)
Also replace ilm:expiry with scanner to avoid user confusion
2024-02-22 08:18:13 -08:00
Harshavardhana effe21f3eb
send correct objectname in audit events for DeleteAll ILM (#19053) 2024-02-14 08:07:58 -08:00
Harshavardhana afd19de5a9
fix: allow configuring excess versions alerting (#19028)
Bonus: enable audit alerts for object versions
beyond the configured value, default is '100'
versions per object beyond which scanner will
alert for each such objects.
2024-02-11 23:41:53 -08:00
Harshavardhana 1d3bd02089
avoid close 'nil' panics if any (#18890)
brings a generic implementation that
prints a stack trace for 'nil' channel
closes(), if not safely closes it.
2024-01-28 10:04:17 -08:00
Harshavardhana dd2542e96c
add codespell action (#18818)
Original work here, #18474,  refixed and updated.
2024-01-17 23:03:17 -08:00
Anis Eleuch 7705605b5a
scanner: Add a config to disable short sleep between objects scan (#18734)
Add a hidden configuration under the scanner sub section to configure if
the scanner should sleep between two objects scan. The configuration has
only effect when there is no drive activity related to s3 requests or
healing.

By default, the code will keep the current behavior which is doing
sleep between objects.

To forcefully enable the full scan speed in idle mode, you can do this:

   `mc admin config set myminio scanner idle_speed=full`
2024-01-04 15:07:17 -08:00
Anis Eleuch 3f4488c589
scanner: Allow full throttle if there is no parallel disk ops (#18109) 2024-01-02 13:51:24 -08:00
Harshavardhana d521c84d55
reduce logging during permission denied errors (#18641)
log them if any only once
2023-12-12 16:11:17 -08:00
Harshavardhana 53ce92b9ca
fix: use the right channel to feed the data in (#18605)
this PR fixes a regression in batch replication
where we weren't sending any data from the Walk()
results due to incorrect channels being used.
2023-12-06 18:17:03 -08:00
Harshavardhana e98172d72d
avoid hot-tier SLA to be tied to warm-tier SLA (#18581)
it is okay if the warm-tier cannot keep up, we should continue
to take I/O at hot-tier, only fail hot-tier or block it when
we are disk full.

Bonus: add metrics counter for these missed tasks, we will
know for sure if one of the node is lagging behind or is
losing too many tasks during transitioning.
2023-12-02 13:02:12 -08:00
Krishnan Parthasarathi 9fbd931058
Skip versions expired by DeleteAllVersionsAction (#18537)
Object versions expired by DeleteAllVersionsAction must not be included
toward data-usage accounting.
2023-11-28 08:39:21 -08:00
Anis Eleuch 1bb7a2a295
Immediate transition ILM to avoid quick deferring to the scanner (#18475)
Immediate transition use case and is mostly used to fill warm
backend with a lot of data when a new deployment is created

Currently, if the transition queue is complete, the transition will be
deferred to the scanner; change this behavior by blocking the PUT request
until the transition queue has a new place for a transition task.
2023-11-17 16:16:46 -08:00
Harshavardhana 877e0cac03
fix: tiering statistics handling a bug in clone() implementation (#18342)
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6

Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.

This PR also does some cleanup in terms of implementation,

fixes #18070
2023-10-30 09:59:51 -07:00
Poorna 9dc29d7687
Avoid ILM expiry on deleted versions that are yet to replicate (#18175)
Fixes #18167
2023-10-06 06:55:15 -06:00
Klaus Post 57f84a8b4c
Add abandoned folder scanning to metrics (#18076)
Include object and versions heal scan times when checking non-empty abandoned folders.

Furthermore don't add delay between healing versions, instead do one per object wait.
2023-09-24 22:15:31 -07:00
Harshavardhana 91ebac0a00
fix: move abandoned parts check after healing not in ILM path (#18087) 2023-09-22 12:07:52 -07:00
Harshavardhana 2add57cfed
apply healing per object at 1024 cycles (#18050)
- we already have MRF for most recent failures
- we trigger healing during HEAD/GET operation

These are enough, also change the default max wait
from 5sec to 1sec for default scanner speed.
2023-09-19 09:24:22 -07:00
Harshavardhana 36385010f5
use optimized pathJoin instead of path.Join (#18042)
this avoids allocations in scanner routine, they are tiny but 
they allocate a lot over many cycles of the scanner.
2023-09-16 19:08:59 -07:00
Aditya Manthramurthy 1c99fb106c
Update to minio/pkg/v2 (#17967) 2023-09-04 12:57:37 -07:00
Harshavardhana 9458485e43
avoid double logging from healing (#17950) 2023-08-30 18:46:04 -07:00
Poorna b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Krishnan Parthasarathi 87cb0081ec
Retain current and upto NewerNoncurrentVersions versions (#17909)
applyNewerNoncurrentVersionLimit method should pass along versions
unaffected by NewerNoncurrentVersions rule for further ILM evaluation.
2023-08-24 09:26:29 -07:00
Harshavardhana 3a0125fa1f
remove unexpected logging from peer calls (#17888)
also make sure RequestID is set for system logs
2023-08-21 14:25:24 -07:00
Anis Eleuch 4c6869cd9a
ilm: Fix cleaning non current null versions (#17876) 2023-08-18 12:55:47 -07:00
Harshavardhana 6e860b6dc5
count all versions as part of DeleteAllVersionsAction (#17821) 2023-08-09 08:55:19 -07:00
Krishnan Parthasarathi 0120ff93bc
admin-info: add DeleteMarkers count (#17659) 2023-07-18 10:49:40 -07:00
Harshavardhana 24e86d0c59
avoid passing around poolIdx, setIdx instead pass the relevant disks (#17660) 2023-07-17 09:52:05 -07:00
Harshavardhana 3e196fa7b3
fix: ILM newer noncurrent version limit must return correct versions (#17652)
objects/versions that are not expired via NewerNoncurrentVersions
must be properly returned to be applied under further ILM actions.

this would cause legitimately expired objects to be missed
from expiration.
2023-07-14 16:42:35 -07:00
Poorna 5e2f8d7a42
replication: Simplify mrf requeueing and add backlog handler (#17171)
Simplify MRF queueing and add backlog handler

- Limit re-tries to 3 to avoid repeated re-queueing. Fall offs
to be re-tried when the scanner revisits this object or upon access.

- Change MRF to have each node process only its MRF entries.

- Collect MRF backlog by the node to allow for current backlog visibility
2023-07-12 23:51:33 -07:00
Harshavardhana 82075e8e3a
use strconv variants to improve on performance per 'op' (#17626)
```
BenchmarkItoa
BenchmarkItoa-8         	673628088	         1.946 ns/op	       0 B/op	       0 allocs/op
BenchmarkFormatInt
BenchmarkFormatInt-8    	592919769	         2.012 ns/op	       0 B/op	       0 allocs/op
BenchmarkSprint
BenchmarkSprint-8       	26149144	        49.06 ns/op	       2 B/op	       1 allocs/op
BenchmarkSprintBool
BenchmarkSprintBool-8   	26440180	        45.92 ns/op	       4 B/op	       1 allocs/op
BenchmarkFormatBool
BenchmarkFormatBool-8   	1000000000	         0.2558 ns/op	       0 B/op	       0 allocs/op
```
2023-07-11 07:46:58 -07:00
Harshavardhana f6186965c3
honor DeleteAllVersions in list(), head() calls (#17604) 2023-07-08 15:42:10 -07:00
Harshavardhana aae6846413
feat: allow expiration of all versions via ILM Expiration action (#17521)
Following extension allows users to specify immediate purge of
all versions as soon as the latest version of this object has
expired.

```
<LifecycleConfiguration>
    <Rule>
        <ID>ClassADocRule</ID>
        <Filter>
           <Prefix>classA/</Prefix>
        </Filter>
        <Status>Enabled</Status>
        <Expiration>
             <Days>3650</Days>
	     <ExpiredObjectAllVersions>true</ExpiredObjectAllVersions>
        </Expiration>
    </Rule>
    ...
```
2023-06-28 22:12:28 -07:00
Kaan Kabalak 21fbe88e1f
Print certain log messages once per error (#17484) 2023-06-24 20:29:13 -07:00
Klaus Post bf8a68879c
fix: Time ILM Actions for scanner info (#17493)
ILM Actions were not timed fix it.
2023-06-23 07:48:36 -07:00
Aditya Manthramurthy 5a1612fe32
Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
Klaus Post 6f2406b0b6
fix: protect ReplicationStats against concurrent map iteration and write crash (#17403) 2023-06-12 09:17:11 -07:00