Commit Graph

210 Commits

Author SHA1 Message Date
Klaus Post
c25816eabc
xl walk: Limit walk concurrent IO (#12885)
We are observing heavy system loads, potentially
locking the system up for periods when concurrent
listing operations are performed.

We place a per-disk lock on walk IO operations.
This will minimize the impact of concurrent listing
operations on the entire system and de-prioritize
them compared to other operations.

Single list operations should remain largely unaffected.
2021-08-18 18:10:36 -07:00
Klaus Post
24722ddd02
Remove inline data hack (#12946)
move the code down to the storage layer,
this logic decouples the inline data from the 
size parameter making it flexible and future
proof.
2021-08-13 08:25:54 -07:00
Klaus Post
89febdb3d6
Reuse small buffers (#12948)
When reading metadata allow reuse of buffers 
in certain cases. Take the low-hanging fruit.

Reduce GC overhead when listing.
2021-08-12 14:27:22 -07:00
Klaus Post
3eac02f676
Use metadata reader in ReadVersion (#12942)
Use `readMetadata` when reading version 
information without data requested. 

Reduces IO on inlined data.

Bonus: Inline compressed data as well when 
compression is enabled.
2021-08-12 10:05:24 -07:00
Harshavardhana
40a2fa8e81
fix: add more optimizations to putMetacacheObject() (#12916)
- avoid extra lookup for 'xl.meta' since we are
  definitely sure that it doesn't exist.

- use this in newMultipartUpload() as well

- also additionally do not write with O_DSYNC
  to avoid loading the drives, instead create
  'xl.meta' for listing operations without
  O_DSYNC since these are ephemeral objects.

- do the same with newMultipartUpload() since
  it gets synced when the PutObjectPart() is
  attempted, we do not need to tax newMultipartUpload()
  instead.
2021-08-10 11:12:22 -07:00
Harshavardhana
035882d292
fix: remove parentIsObject() check (#12851)
we will allow situations such as

```
a/b/1.txt
a/b
```

and

```
a/b
a/b/1.txt
```

we are going to document that this usecase is
not supported and we will never support it, if
any application does this users have to delete
the top level parent to make sure namespace is
accessible at lower level.

rest of the situations where the prefixes get
created across sets are supported as is.
2021-08-03 13:26:57 -07:00
Harshavardhana
bfbdb8f0a8
fix: incorrect O_DIRECT behavior for reads (#12811)
O_DIRECT behavior was broken and it was still
caching all the reads, this change properly fixes
this behavior.
2021-07-28 11:20:16 -07:00
Krishna Srinivas
aa0c28809b
Server side speedtest implementation (#12750) 2021-07-27 12:55:56 -07:00
Harshavardhana
0c666379fe
fix: avoid removing healed parts on dstDataPath (#12795)
destination path and old path will be similar
when healing occurs, this can lead to healed
parts being again purged leading to always an
inconsistent state on an object which might
further cause reduction in quorum eventually.
2021-07-26 15:15:34 -07:00
Harshavardhana
e124d88788
optimize listing operation concurrency (#12728)
- remove use of getOnlineDisks() instead rely on fallbackDisks()
  when disk return errors like diskNotFound, unformattedDisk
  use other fallback disks to list from, instead of paying the
  price for checking getOnlineDisks()

- optimize getDiskID() further to avoid large write locks when
  looking formatLastCheck time window

This new change allows for a more relaxed fallback for listing
allowing for more tolerance and also eventually gain more
consistency in results even if using '3' disks by default.
2021-07-24 22:03:38 -07:00
Krishnan Parthasarathi
29eea52e14
Skip transitioning of object versions if inlined (#12705) 2021-07-16 09:38:27 -07:00
Klaus Post
d6a2fe02d3
Add admin file inspector (#12635)
Download files from *any* bucket/path as an encrypted zip file.

The key is included in the response but can be separated so zip 
and the key doesn't have to be sent on the same channel.

Requires https://github.com/minio/pkg/pull/6
2021-07-09 11:29:16 -07:00
Krishnan Parthasarathi
a1df230518
Add a 'free' version to track deletion of tiered object content (#12470) 2021-06-30 19:32:07 -07:00
Harshavardhana
4669d19f2a
fix: simplify diskMap usage to keep certain checks predictable (#12519)
Bonus: also make sure that we Sanitize() the drives only during
startup of the server, but not during disk reconnects.
2021-06-16 14:26:26 -07:00
Anis Elleuch
f30c996d48
trace: Add bucket/prefix to WalkDir() tracing (#12510)
Bonus, replace os.* API with os-instrumented.go
2021-06-15 14:34:26 -07:00
Harshavardhana
31971906ff
fix: force-delete should just rename to .trash (#12499)
avoid blocking call for force-delete, instead
treat it lazily and delete in background.
2021-06-14 08:04:37 -07:00
Harshavardhana
dd2831c1a0
fix: remove parent dirs in RenameData upon failure (#12452)
- it is possible that during I/O failures we might
  leave partially written directories, make sure
  we purge them after.

- rename current data-dir (null) versionId only after
  the newer xl.meta has been written fully.

- attempt removal once for minioMetaTmpBucket/uuid/
  as this folder is empty if all previous operations
  were successful, this allows avoiding recursive os.Remove()
2021-06-07 09:35:08 -07:00
Anis Elleuch
810af07529
xl: Avoid multi-disks node to exit when one disk fails (#12423)
It makes sense that a node that has multiple disks starts when one
disk fails, returning an i/o error for example. This commit will make this
faulty tolerance available in this specific use case.
2021-06-05 09:10:32 -07:00
Poorna Krishnamoorthy
dbea8d2ee0
Add support for existing object replication. (#12109)
Also adding an API to allow resyncing replication when
existing object replication is enabled and the remote target
is entirely lost. With the `mc replicate reset` command, the
objects that are eligible for replication as per the replication
config will be resynced to target if existing object replication
is enabled on the rule.
2021-06-01 19:59:11 -07:00
Harshavardhana
1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Harshavardhana
81d5688d56
move the dependency to minio/pkg for common libraries (#12397) 2021-05-28 15:17:01 -07:00
Harshavardhana
4840974d7a
fix: inline data upon overwrites should be readable (#12369)
This PR fixes two bugs

- Remove fi.Data upon overwrite of objects from inlined-data to non-inlined-data
- Workaround for an existing bug on disk with latest releases to ignore fi.Data
  and instead read from the disk for non-inlined-data
- Addtionally add a reserved metadata header to indicate data is inlined for
  a given version.
2021-05-25 16:33:06 -07:00
Harshavardhana
ebf75ef10d
fix: remove all unused code (#12360) 2021-05-24 09:28:19 -07:00
Harshavardhana
0287711dc9
fix: implement readMetadata common function for re-use (#12353)
Previous PR #12351 added functions to read from the reader
stream to reduce memory usage, use the same technique in
few other places where we are not interested in reading the
data part.
2021-05-21 11:41:25 -07:00
Klaus Post
9d1b6fb37d
Add XL reader without data (#12351)
Add XL metadata reader that reads metadata only on larger files.

Use for scanning and listing for now.
2021-05-21 09:10:54 -07:00
Klaus Post
2ca9c533ef
feat: implement in-progress partial bucket updates (#12279) 2021-05-19 14:38:30 -07:00
Harshavardhana
2daba018d6
reduce allocations on multi-disk clusters (#12311)
multi-disk clusters initialize buffer pools
per disk, this is perhaps expensive and perhaps
not useful, for a running server instance. As this
may disallow re-use of buffers across sets,

this change ensures that buffers across sets
can be re-used at drive level, this can reduce
quite a lot of memory on large drive setups.
2021-05-17 17:49:48 -07:00
Harshavardhana
2ab9dc7609 do not update bloomFilters for temporary objects 2021-05-15 19:54:07 -07:00
Harshavardhana
4d876d03e8 fix: do not fail upon faulty/non-writable drives
gracefully start the server, if there are other drives
available - print enough information for administrator
to notice the errors in console.

Bonus: for really large streams use larger buffer for
writes.
2021-05-15 12:57:18 -07:00
Klaus Post
229d83bb75
feat: add dynamic usage cache (#12229)
A cache structure will be kept with a tree of usages.
The cache is a tree structure where each keeps track 
of its children.

An uncompacted branch contains a count of the files 
only directly at the branch level, and contains link to 
children branches or leaves.

The leaves are "compacted" based on a number of properties.
A compacted leaf contains the totals of all files beneath it.

A leaf is only scanned once every dataUsageUpdateDirCycles,
rarer if the bloom filter for the path is clean and no lifecycles 
are applied. Skipped leaves have their totals transferred from 
the previous cycle.

A clean leaf will be included once every healFolderIncludeProb 
for partial heal scans. When selected there is a one in 
healObjectSelectProb that any object will be chosen for heal scan.

Compaction happens when either:

- The folder (and subfolders) contains less than dataScannerCompactLeastObject objects.
- The folder itself contains more than dataScannerCompactAtFolders folders.
- The folder only contains objects and no subfolders.
- A bucket root will never be compacted.

Furthermore, if a has more than dataScannerCompactAtChildren recursive 
children (uncompacted folders) the tree will be recursively scanned and the 
branches with the least number of objects will be compacted until the limit 
is reached.

This ensures that any branch will never contain an unreasonable amount 
of other branches, and also that small branches with few objects don't 
take up unreasonable amounts of space.

Whenever a branch is scanned, it is assumed that it will be un-compacted
before it hits any of the above limits. This will make the branch rebalance 
itself when scanned if the distribution of objects has changed.

TLDR; With current values: No bucket will ever have more than 10000 
child nodes recursively. No single folder will have more than 2500 child 
nodes by itself. All subfolders are compacted if they have less than 500 
objects in them recursively.

We accumulate the (non-deletemarker) version count for paths as well, 
since we are changing the structure anyway.
2021-05-11 18:36:15 -07:00
Anis Elleuch
56d4d7b8b1
MRF: Better detection of non stable disks (#12252)
MRF does not detect when a node is disconnected and reconnected quickly
this change will ensure that MRF is alerted by comparing the last disk
reconnection timestamp with the last MRF check time.

Signed-off-by: Anis Elleuch <anis@min.io>

Co-authored-by: Klaus Post <klauspost@gmail.com>
2021-05-11 09:19:15 -07:00
Harshavardhana
764721e2c6
add root_disk threshold detection (#12259)
as there is no automatic way to detect if there
is a root disk mounted on / or /var for the container
environments due to how the root disk information
is masked inside overlay root inside container.

this PR brings an environment variable to set
root disk size threshold manually to detect the
root disks in such situations.
2021-05-08 15:40:29 -07:00
Klaus Post
254698f126
fix: minor allocation improvements in xlMetaV2 (#12133) 2021-05-07 09:11:05 -07:00
Nitish Tiwari
776589f0da Add free inode metric for Prometheus (#12225) 2021-05-06 12:50:48 -07:00
Harshavardhana
c8050bc079
fix: sleeper behavior in data scanner (#12164)
do not apply healReplication() for ILM
expired, transitioned objects
2021-04-27 08:24:44 -07:00
Poorna Krishnamoorthy
4be0f92067
Fix multipart restore to remove part match (#12161)
Part ETags are not available after multipart finalizes, removing this
check as not useful.

Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-04-26 18:24:06 -07:00
Krishnan Parthasarathi
c829e3a13b Support for remote tier management (#12090)
With this change, MinIO's ILM supports transitioning objects to a remote tier.
This change includes support for Azure Blob Storage, AWS S3 compatible object
storage incl. MinIO and Google Cloud Storage as remote tier storage backends.

Some new additions include:

 - Admin APIs remote tier configuration management

 - Simple journal to track remote objects to be 'collected'
   This is used by object API handlers which 'mutate' object versions by
   overwriting/replacing content (Put/CopyObject) or removing the version
   itself (e.g DeleteObjectVersion).

 - Rework of previous ILM transition to fit the new model
   In the new model, a storage class (a.k.a remote tier) is defined by the
   'remote' object storage type (one of s3, azure, GCS), bucket name and a
   prefix.

* Fixed bugs, review comments, and more unit-tests

- Leverage inline small object feature
- Migrate legacy objects to the latest object format before transitioning
- Fix restore to particular version if specified
- Extend SharedDataDirCount to handle transitioned and restored objects
- Restore-object should accept version-id for version-suspended bucket (#12091)
- Check if remote tier creds have sufficient permissions
- Bonus minor fixes to existing error messages

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Krishna Srinivas <krishna@minio.io>
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana
069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Harshavardhana
2ef824bbb2
collapse two distinct calls into single RenameData() call (#12093)
This is an optimization by reducing one extra system call,
and many network operations. This reduction should increase
the performance for small file workloads.
2021-04-20 10:44:39 -07:00
Harshavardhana
1456f9f090
fix: preserve shared dataDir during suspend overwrites (#12058)
CopyObject() when shares dataDir needs to be preserved,
and upon versioning suspended overwrites should still
preserve the dataDir.
2021-04-15 08:44:05 -07:00
Harshavardhana
928ee1a7b2
remove null version dataDir upon overwrites (#12023) 2021-04-08 19:55:44 -07:00
Klaus Post
d2ac2f758e
odirectReader: handle EOF correctly (#11998)
EOF may be sent along with data so queue it up and 
return it when the buffer is empty.

Also, when reading data without direct io don't add a buffer 
that only results in extra memcopy.
2021-04-07 08:32:59 -07:00
Klaus Post
788a8bc254
Fix disk info race (#11984)
Protect updated members in xlStorage.

```
WARNING: DATA RACE
Write at 0x00c004b4ee78 by goroutine 1491:
  github.com/minio/minio/cmd.(*xlStorage).GetDiskID()
      d:/minio/minio/cmd/xl-storage.go:590 +0x1078
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).checkDiskStale()
      d:/minio/minio/cmd/xl-storage-disk-id-check.go:195 +0x84
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
      d:/minio/minio/cmd/xl-storage-disk-id-check.go:284 +0x16a
  github.com/minio/minio/cmd.erasureObjects.getBucketInfo.func1()
      d:/minio/minio/cmd/erasure-bucket.go:100 +0x1a5
  github.com/minio/minio/pkg/sync/errgroup.(*Group).Go.func1()
      d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7

Previous read at 0x00c004b4ee78 by goroutine 1087:
  github.com/minio/minio/cmd.(*xlStorage).CheckFile.func1()
      d:/minio/minio/cmd/xl-storage.go:1699 +0x384
  github.com/minio/minio/cmd.(*xlStorage).CheckFile()
      d:/minio/minio/cmd/xl-storage.go:1726 +0x13c
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).CheckFile()
      d:/minio/minio/cmd/xl-storage-disk-id-check.go:446 +0x23b
  github.com/minio/minio/cmd.erasureObjects.parentDirIsObject.func1()
      d:/minio/minio/cmd/erasure-common.go:173 +0x194
  github.com/minio/minio/pkg/sync/errgroup.(*Group).Go.func1()
      d:/minio/minio/pkg/sync/errgroup/errgroup.go:122 +0xd7
```
2021-04-06 11:33:42 -07:00
Harshavardhana
5cce9361bc
fix: avoid an extra rename when there is no dataDir (#11964)
also perform globalSync() in defer when enabled
for RenameData(), to ensure all calls are flushed
to disk.
2021-04-05 08:52:28 -07:00
Harshavardhana
d46386246f
api: Introduce metadata update APIs to update only metadata (#11962)
Current implementation heavily relies on readAllFileInfo
but with the advent of xl.meta inlined with data, we cannot
easily avoid reading data when we are only interested is
updating metadata, this leads to invariably write
amplification during metadata updates, repeatedly reading
data when we are only interested in updating metadata.

This PR ensures that we implement a metadata only update
API at storage layer, that handles updates to metadata alone
for any given version - given the version is valid and
present.

This helps reduce the chattiness for following calls..

- PutObjectTags
- DeleteObjectTags
- PutObjectLegalHold
- PutObjectRetention
- ReplicateObject (updates metadata on replication status)
2021-04-04 13:32:31 -07:00
Poorna Krishnamoorthy
47c09a1e6f
Various improvements in replication (#11949)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.

- add API to get replication metrics

- add MRF worker to handle spill-over replication operations

- multiple issues found with replication
- fixes an issue when client sends a bucket
 name with `/` at the end from SetRemoteTarget
 API call make sure to trim the bucket name to 
 avoid any extra `/`.

- hold write locks in GetObjectNInfo during replication
  to ensure that object version stack is not overwritten
  while reading the content.

- add additional protection during WriteMetadata() to
  ensure that we always write a valid FileInfo{} and avoid
  ever writing empty FileInfo{} to the lowest layers.

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-04-03 09:03:42 -07:00
Harshavardhana
434e5c0cfe
allow preserving legacyXLv1 with inline data format (#11951)
current master breaks this important requirement
we need to preserve legacyXLv1 format, this is simply
ignored and overwritten causing a myriad of issues
by leaving stale files on the namespace etc.

for now lets still use the two-phase approach of
writing to `tmp` and then renaming the content to
the actual namespace.
2021-04-01 22:12:03 -07:00
Harshavardhana
204c610d84
do not use dataDir to reference inline data use versionID (#11942)
versionID is the one that needs to be preserved and as
well as overwritten in case of replication, transition
etc - dataDir is an ephemeral entity that changes
during overwrites - make sure that versionID is used
to save the object content.

this would break things if you are already running
the latest master, please wipe your current content
and re-do your setup after this change.
2021-04-01 13:09:23 -07:00
Klaus Post
2623338dc5
Inline small file data in xl.meta file (#11758) 2021-03-29 17:00:55 -07:00
Harshavardhana
d93c6cb9c7
use Access() instead of Lstat() for frequent use (#11911)
using Lstat() is causing tiny memory allocations,
that are usually wasted and never used, instead
we can simply uses Access() call that does 0
memory allocations.
2021-03-29 08:07:23 -07:00
Harshavardhana
d7f32ad649 xl: avoid sending Delete() remote call for fully successful runs
an optimization to avoid extra syscalls in PutObject(),
adds up to our PutObject response times.
2021-03-24 17:32:12 -07:00
Harshavardhana
410e84d273 xl: add checks for minioTmpMetaBucket in CreateFile 2021-03-24 09:36:10 -07:00
Harshavardhana
75741dbf4a
xl: remove cleanupDir instead use Delete() (#11880)
use a single call to remove directly at disk
instead of doing recursively at network layer.
2021-03-24 09:08:05 -07:00
Ritesh H Shukla
6a2ed44095 fix: optionally enable tracing posix calls 2021-03-23 22:23:08 -07:00
Anis Elleuch
98ff91b484
xl: Reduce usage of isDirEmpty() (#11838)
When an object is removed, its parent directory is inspected to check if
it is empty to remove if that is the case.

However, we can use os.Remove() directly since it is only able to remove
a file or an empty directory.
2021-03-19 15:42:01 -07:00
Anis Elleuch
4d86384dc7
xl: Remove non needed check for empty dir (#11835)
RenameData renames xl.meta and data dir and removes the parent directory
if empty, however, there is a duplicate check for empty dir, since the
parent dir of xl.meta is always the same as the data-dir.
2021-03-19 12:26:53 -07:00
Harshavardhana
b92a220db1
fix: handle weird drives sporadic read O_DIRECT behavior (#11832)
on freshReads if drive returns errInvalidArgument, we
should simply turn-off DirectIO and read normally, there
are situations in k8s like environments where the drives
behave sporadically in a single deployment and may not
have been implemented properly to handle O_DIRECT for
reads.
2021-03-18 20:16:50 -07:00
Harshavardhana
51a8619a79
[feat] Add configurable deadline for writers (#11822)
This PR adds deadlines per Write() calls, such
that slow drives are timed-out appropriately and
the overall responsiveness for Writes() is always
up to a predefined threshold providing applications
sustained latency even if one of the drives is slow
to respond.
2021-03-18 14:09:55 -07:00
Harshavardhana
60b0f2324e
storage write call path optimizations (#11805)
- write in o_dsync instead of o_direct for smaller
  objects to avoid unaligned double Write() situations
  that may arise for smaller objects < 128KiB
- avoid fallocate() as its not useful since we do not
  use Append() semantics anymore, fallocate is not useful
  for streaming I/O we can save on a syscall
- createFile() doesn't need to validate `bucket` name
  with a Lstat() call since createFile() is only used
  to write at `minioTmpBucket`
- use io.Copy() when writing unAligned writes to allow
  usage of ReadFrom() from *os.File providing zero
  buffer writes().
2021-03-17 09:38:38 -07:00
Anis Elleuch
0eb146e1b2
add additional metrics per disk API latency, API call counts #11250)
```
mc admin info --json
```

provides these details, for now, we shall eventually 
expose this at Prometheus level eventually. 

Co-authored-by: Harshavardhana <harsha@minio.io>
2021-03-16 20:06:57 -07:00
Klaus Post
fdc2f69218
truncate xl.meta files upon rewrites #11749)
If the destination files exist and is larger - junk data will be left at the end of the file.
2021-03-09 14:42:24 -08:00
Harshavardhana
d971061305
use listPathRaw for HealObjects() instead of expensive WalkVersions() (#11675) 2021-03-06 09:25:48 -08:00
Klaus Post
fa9cf1251b
Imporve healing and reporting (#11312)
* Provide information on *actively* healing, buckets healed/queued, objects healed/failed.
* Add concurrent healing of multiple sets (typically on startup).
* Add bucket level resume, so restarts will only heal non-healed buckets.
* Print summary after healing a disk is done.
2021-03-04 14:36:23 -08:00
Harshavardhana
c6a120df0e
fix: Prometheus metrics to re-use storage disks (#11647)
also re-use storage disks for all `mc admin server info`
calls as well, implement a new LocalStorageInfo() API
call at ObjectLayer to lookup local disks storageInfo

also fixes bugs where there were double calls to StorageInfo()
2021-03-02 17:28:04 -08:00
Harshavardhana
2f4af09c01 fix: alow changes to readAllData to decrement activeCount() 2021-02-28 20:09:23 -08:00
Harshavardhana
37960cbc2f
fix: avoid writing more content on network with O_DIRECT reads (#11659)
There was an io.LimitReader was missing for the 'length'
parameter for ranged requests, that would cause client to
get truncated responses and errors.

fixes #11651
2021-02-28 15:33:03 -08:00
Harshavardhana
9171d6ef65
rename all references from crawl -> scanner (#11621) 2021-02-26 15:11:42 -08:00
Harshavardhana
6386b45c08
[feat] use rename instead of recursive deletes (#11641)
most of the delete calls today spend time in
a blocking operation where multiple calls need
to be recursively sent to delete the objects,
instead we can use rename operation to atomically
move the objects from the namespace to `tmp/.trash`

we can schedule deletion of objects at this
location once in 15, 30mins and we can also add
wait times between each delete operation.

this allows us to make delete's faster as well
less chattier on the drives, each server runs locally
a groutine which would clean this up regularly.
2021-02-26 09:52:27 -08:00
Harshavardhana
b517c791e9
[feat]: use DSYNC for xl.meta writes and NOATIME for reads (#11615)
Instead of using O_SYNC, we are better off using O_DSYNC
instead since we are only ever interested in data to be
persisted to disk not the associated filesystem metadata.

For reads we ask customers to turn off noatime, but instead
we can proactively use O_NOATIME flag to avoid atime updates
upon reads.
2021-02-24 00:14:16 -08:00
Harshavardhana
18ec933085
fix: for containers use root-disk detection cleverly (#11593)
root-disk implemented currently had issues where root
disk partitions getting modified might race and provide
incorrect results, to avoid this lets rely again back on
DeviceID and match it instead.

In-case of containers `/data` is one such extra entity that
needs to be verified for root disk, due to how 'overlay'
filesystem works and the 'overlay' presents a completely
different 'device' id - using `/data` as another entity
for fallback helps because our containers describe 'VOLUME'
parameter that allows containers to automatically have a
virtual `/data` that points to the container root path this
can either be at `/` or `/var/lib/` (on different partition)
2021-02-22 10:32:21 -08:00
Harshavardhana
8778828a03
fix: read metadata in O_DIRECT if configured and supported (#11594)
reduce the page-cache pressure completely by moving
the entire read-phase of our operations to O_DIRECT,
primarily this is going to be very useful for chatty
metadata operations such as listing, scanner, ilm, healing
like operations to avoid filling up the page-cache upon
repeated runs.
2021-02-22 01:36:17 -08:00
Harshavardhana
7875d472bc
avoid notification for non-existent delete objects (#11514)
Skip notifications on objects that might have had
an error during deletion, this also avoids unnecessary
replication attempt on such objects.

Refactor some places to make sure that we have notified
the client before we

- notify
- schedule for replication
- lifecycle etc.
2021-02-10 22:00:42 -08:00
Harshavardhana
0e3211f4ad
fix: server upgrades should have more descriptive error messages (#11476)
during rolling upgrade, provide a more descriptive error
message and discourage rolling upgrade in such situations,
allowing users to take action.

additionally also rename `slashpath -> pathutil` to avoid
a slighly mis-pronounced usage of `path` package.
2021-02-08 10:15:12 -08:00
Harshavardhana
c9b0f595b9
support directory objects in listing in certain scenarios (#11452)
When a directory object is presented as a `prefix`
param our implementation tend to only list objects
present common to the `prefix` than the `prefix` itself,
to mimic AWS S3 like flat key behavior this PR ensures
that if `prefix` is directory object, it should be
automatically considered to be part of the eventual
listing result.

fixes #11370
2021-02-05 10:12:25 -08:00
Harshavardhana
f108873c48
fix: replication metadata comparsion and other fixes (#11410)
- using miniogo.ObjectInfo.UserMetadata is not correct
- using UserTags from Map->String() can change order
- ContentType comparison needs to be removed.
- Compare both lowercase and uppercase key names.
- do not silently error out constructing PutObjectOptions
  if tag parsing fails
- avoid notification for empty object info, failed operations
  should rely on valid objInfo for notification in all
  situations
- optimize copyObject implementation, also introduce a new 
  replication event
- clone ObjectInfo() before scheduling for replication
- add additional headers for comparison
- remove strings.EqualFold comparison avoid unexpected bugs
- fix pool based proxying with multiple pools
- compare only specific metadata

Co-authored-by: Poorna Krishnamoorthy <poornas@users.noreply.github.com>
2021-02-03 20:41:33 -08:00
Anis Elleuch
b3f81e75f6
xl: Make it clear when to create delete marker for a non existant object (#11423) 2021-02-03 10:33:43 -08:00
Anis Elleuch
6ef678663e
xl: Create a delete-marker when no other version exists (#11362)
Currently, it is not possible to create a delete-marker when xl.meta
does not exist (no version is created for that object yet). This makes a
problem for replication and mc mirroring with versioning enabled.

This also follows S3 specification.
2021-02-01 13:23:50 -08:00
Anis Elleuch
65aa2bc614
ilm: Remove object in HEAD/GET if having an applicable ILM rule (#11296)
Remove an object on the fly if there is a lifecycle rule with delete
expiry action for the corresponding object.
2021-02-01 09:52:11 -08:00
Anis Elleuch
e9ac7b0fb7
heal: Remove empty directories (#11354)
Since the introduction of __XLDIR__, an empty directory does not have a
meaning anymore in erasure mode. Make healing removes it wherever it
finds it.
2021-01-27 02:19:28 -08:00
Harshavardhana
43f973c4cf
fix: check for O_DIRECT support for reads and writes (#11331)
In-case user enables O_DIRECT for reads and backend does
not support it we shall proceed to turn it off instead
and print a warning. This validation avoids any unexpected
downtimes that users may incur.
2021-01-22 15:38:21 -08:00
Harshavardhana
d1a8f0b786
fix possible crashes on deleteMarker replication (#11308)
Delete marker can have `metaSys` set to nil, that
can lead to crashes after the delete marker has
been healed.

Additionally also fix isObjectDangling check
for transitioned objects, that do not have parts
should be treated similar to Delete marker.
2021-01-20 13:12:12 -08:00
Harshavardhana
b5049d541f
fix: reduce an extra readdir() attempted on non-legacy setups (#11301)
to verify moving content and preserving legacy content,
we have way to detect the objects through readdir()
this path is not necessary for most common cases on
newer setups, avoid readdir() to save multiple system
calls.

also fix the CheckFile behavior for most common
use case i.e without legacy format.
2021-01-19 10:01:06 -08:00
Harshavardhana
3ca6330661
fix: optimize parentDirIsObject by moving isObject to storage layer (#11291)
For objects with `N` prefix depth, this PR reduces `N` such network
operations by converting `CheckFile` into a single bulk operation.

Reduction in chattiness here would allow disks to be utilized more
cleanly, while maintaining the same functionality along with one
extra volume check stat() call is removed.

Update tests to test multiple sets scenario
2021-01-18 12:25:22 -08:00
Harshavardhana
f903cae6ff
Support variable server pools (#11256)
Current implementation requires server pools to have
same erasure stripe sizes, to facilitate same SLA
and expectations.

This PR allows server pools to be variadic, i.e they
do not have to be same erasure stripe sizes - instead
they should have SLA for parity ratio.

If the parity ratio cannot be guaranteed by the new
server pool, the deployment is rejected i.e server
pool expansion is not allowed.
2021-01-16 12:08:02 -08:00
Harshavardhana
1a5775e2e8
enable small and large file optimization (#11260)
- for large objects we found that 1MiB block for
  r/w respectively.
- for small objects we found that 128KiB block for
  r/w respectively.
2021-01-12 10:20:39 -08:00
Harshavardhana
e4e117faab
fix: enable xl.json to xl.meta only if legacy drive is found (#11255)
another optimization is renameLegacyMetadata() never needs
to validate bucket with os.Stat() again, leading to reduction
in one extra syscall.
2021-01-11 02:27:04 -08:00
Harshavardhana
4593b146be
fix: print errors only when metacache status has errors (#11248) 2021-01-08 16:52:19 +05:30
Harshavardhana
f21d650ed4
fix: readData in bulk call using messagepack byte wrappers (#11228)
This PR refactors the way we use buffers for O_DIRECT and
to re-use those buffers for messagepack reader writer.

After some extensive benchmarking found that not all objects
have this benefit, and only objects smaller than 64KiB see
this benefit overall.

Benefits are seen from almost all objects from

1KiB - 32KiB

Beyond this no objects see benefit with bulk call approach
as the latency of bytes sent over the wire v/s streaming
content directly from disk negate each other with no
remarkable benefits.

All other optimizations include reuse of msgp.Reader,
msgp.Writer using sync.Pool's for all internode calls.
2021-01-07 19:27:31 -08:00
Harshavardhana
76e2713ffe
fix: use buffers only when necessary for io.Copy() (#11229)
Use separate sync.Pool for writes/reads

Avoid passing buffers for io.CopyBuffer()
if the writer or reader implement io.WriteTo or io.ReadFrom
respectively then its useless for sync.Pool to allocate
buffers on its own since that will be completely ignored
by the io.CopyBuffer Go implementation.

Improve this wherever we see this to be optimal.

This allows us to be more efficient on memory usage.
```
   385  // copyBuffer is the actual implementation of Copy and CopyBuffer.
   386  // if buf is nil, one is allocated.
   387  func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
   388  	// If the reader has a WriteTo method, use it to do the copy.
   389  	// Avoids an allocation and a copy.
   390  	if wt, ok := src.(WriterTo); ok {
   391  		return wt.WriteTo(dst)
   392  	}
   393  	// Similarly, if the writer has a ReadFrom method, use it to do the copy.
   394  	if rt, ok := dst.(ReaderFrom); ok {
   395  		return rt.ReadFrom(src)
   396  	}
```

From readahead package
```
// WriteTo writes data to w until there's no more data to write or when an error occurs.
// The return value n is the number of bytes written.
// Any error encountered during the write is also returned.
func (a *reader) WriteTo(w io.Writer) (n int64, err error) {
	if a.err != nil {
		return 0, a.err
	}
	n = 0
	for {
		err = a.fill()
		if err != nil {
			return n, err
		}
		n2, err := w.Write(a.cur.buffer())
		a.cur.inc(n2)
		n += int64(n2)
		if err != nil {
			return n, err
		}
```
2021-01-06 09:36:55 -08:00
Harshavardhana
d0027c3c41
do not use large buffers if not necessary (#11220)
without this change, there is a performance
regression for small objects GETs, this makes
the overall speed to go back to pre '59d363'
commit days.
2021-01-04 18:51:52 -08:00
Harshavardhana
c4b1d394d6
erasure: avoid io.Copy in hotpaths to reduce allocation (#11213) 2021-01-03 16:27:34 -08:00
Harshavardhana
c4131c2798
feat: Small object optimization read data in single bulk call (#11207) 2021-01-03 11:27:57 -08:00
Anis Elleuch
c9d502e6fa
parentDirIsObject() to return quickly with inexistant parent (#11204)
Rewrite parentIsObject() function. Currently if a client uploads
a/b/c/d, we always check if c, b, a are actual objects or not.

The new code will check with the reverse order and quickly quit if 
the segment doesn't exist.

So if a, b, c in 'a/b/c' does not exist in the first place, then returns
false quickly.
2021-01-02 12:01:29 -08:00
Anis Elleuch
677e80c0f8
xl: Remove check-dir in ReadVersion (#11200)
The only purpose of check-dir flag in
ReadVersion is to return 404 when
an object has xl.meta but without data.

This is causing an extract call to the disk 
which can be penalizing in case of busy system
where disks receive many concurrent access.
2021-01-02 10:35:57 -08:00
Anis Elleuch
a317d220ed
xl-storage: Do not stat bucket assuming the object exists (#11201)
In HEAD/GET, only STAT the bucket if the 
object does not exist to return the correct
error response.
2021-01-01 09:44:36 -08:00
Harshavardhana
cc457f1798
fix: enhance logging in crawler use console.Debug instead of logger.Info (#11179) 2020-12-29 01:57:28 -08:00
Harshavardhana
445a9bd827
fix: heal optimizations in crawler to avoid multiple healing attempts (#11173)
Fixes two problems

- Double healing when bitrot is enabled, instead heal attempt
  once in applyActions() before lifecycle is applied.

- If applyActions() is successful and getSize() returns proper
  value, then object is accounted for and should be removed
  from the oldCache namespace map to avoid double heal attempts.
2020-12-28 10:31:00 -08:00
Harshavardhana
c19e6ce773
avoid a crash in crawler when lifecycle is not initialized (#11170)
Bonus for static buffers use bytes.NewReader instead of
bytes.NewBuffer, to use a more reader friendly implementation
2020-12-26 22:58:06 -08:00
Harshavardhana
a773cf48d8
fix: overlapping object and prefix rejected (#11130)
fixes #11129
2020-12-18 08:51:09 -08:00
Harshavardhana
3e83643320
lifecycle improvements and additional debug logging (#11096)
Bonus change fix browser assets
2020-12-13 12:05:54 -08:00