Commit Graph

372 Commits

Author SHA1 Message Date
Klaus Post
4a6c97463f
Fix all racy use of NewDeadlineWorker (#18861)
AlmosAll uses of NewDeadlineWorker, which relied on secondary values, were used in a racy fashion,
which could lead to inconsistent errors/data being returned. It also propagates the deadline downstream.

Rewrite all these to use a generic WithDeadline caller that can return an error alongside a value.

Remove the stateful aspect of DeadlineWorker - it was racy if used - but it wasn't AFAICT.

Fixes races like:

```
WARNING: DATA RACE
Read at 0x00c130b29d10 by goroutine 470237:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:702 +0x611
  github.com/minio/minio/cmd.readFileInfo()
      github.com/minio/minio/cmd/erasure-metadata-utils.go:160 +0x122
  github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.1()
      github.com/minio/minio/cmd/erasure-object.go:809 +0x27a
  github.com/minio/minio/cmd.erasureObjects.getObjectFileInfo.func1.2()
      github.com/minio/minio/cmd/erasure-object.go:828 +0x61

Previous write at 0x00c130b29d10 by goroutine 470298:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).ReadVersion.func1()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:698 +0x244
  github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
      github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33

WARNING: DATA RACE
Write at 0x00c0ba6e6c00 by goroutine 94507:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol.func1()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:419 +0x104
  github.com/minio/minio/internal/ioutil.(*DeadlineWorker).Run.func1()
      github.com/minio/minio/internal/ioutil/ioutil.go:141 +0x33

Previous read at 0x00c0ba6e6c00 by goroutine 94463:
  github.com/minio/minio/cmd.(*xlStorageDiskIDCheck).StatVol()
      github.com/minio/minio/cmd/xl-storage-disk-id-check.go:422 +0x47e
  github.com/minio/minio/cmd.getBucketInfoLocal.func1()
      github.com/minio/minio/cmd/peer-s3-server.go:275 +0x122
  github.com/minio/pkg/v2/sync/errgroup.(*Group).Go.func1()
```

Probably back from #17701
2024-01-24 10:08:31 -08:00
Harshavardhana
52229a21cb
avoid reload of 'format.json' over the network under normal conditions (#18842) 2024-01-23 14:11:46 -08:00
Harshavardhana
dd2542e96c
add codespell action (#18818)
Original work here, #18474,  refixed and updated.
2024-01-17 23:03:17 -08:00
Harshavardhana
21d60eab7c
remove all older unused APIs (#18769) 2024-01-17 20:41:23 -08:00
Anis Eleuch
a47fc75c26
xl: Remove wrong wording for errCorruptedFormat (#18775)
Also add errCorruptedBackend to make it easier to differentiate between
corrupted content or something else wrong in the backend drive
2024-01-12 14:48:44 -08:00
Harshavardhana
e5c8794b8b
avoid disk monitoring leaks under various conditions (#18777)
- HealFormat() was leaking healthcheck goroutines for
  disks, we are only interested in enabling healthcheck
  for the newly formatted disk, not for existing disks.

- When disk is a root-disk a random disk monitor was
  leaking while we ignored the drive.

- When loading the disk for each erasure set, we were
  leaking goroutines for the prepare-storage.go disks
  which were replaced via the globalLocalDrives slice

- avoid disk monitoring utilizing health tokens that
  would cause exhaustion in the tokens, prematurely
  which were meant for incoming I/O. This is ensured
  by avoiding writing O_DIRECT aligned buffer instead
  write 2048 worth of content only as O_DSYNC, which is
  sufficient.
2024-01-12 01:48:36 -08:00
Anis Eleuch
3f4488c589
scanner: Allow full throttle if there is no parallel disk ops (#18109) 2024-01-02 13:51:24 -08:00
Harshavardhana
a50ea92c64
feat: introduce list_quorum="auto" to prefer quorum drives (#18084)
NOTE: This feature is not retro-active; it will not cater to previous transactions
on existing setups. 

To enable this feature, please set ` _MINIO_DRIVE_QUORUM=on` environment
variable as part of systemd service or k8s configmap. 

Once this has been enabled, you need to also set `list_quorum`. 

```
~ mc admin config set alias/ api list_quorum=auto` 
```

A new debugging tool is available to check for any missing counters.
2023-12-29 15:52:41 -08:00
Poorna
e79b289325
fix datadir missing check on HeadObject (#18646)
versions pending purge in replication were seeing a errFileCorrupt
that prevents permanent deletion after replication.

Regression from PR#18477
2023-12-13 14:54:01 -08:00
Harshavardhana
d521c84d55
reduce logging during permission denied errors (#18641)
log them if any only once
2023-12-12 16:11:17 -08:00
Harshavardhana
65f34cd823
fix: remove ODirectReader entirely since we do not need it anymore (#18619) 2023-12-09 10:17:51 -08:00
Anis Eleuch
b7d11141e1
rename Force to Immediate for clarity (#18540) 2023-11-28 22:35:16 -08:00
Krishnan Parthasarathi
9fbd931058
Skip versions expired by DeleteAllVersionsAction (#18537)
Object versions expired by DeleteAllVersionsAction must not be included
toward data-usage accounting.
2023-11-28 08:39:21 -08:00
Anis Eleuch
9cb94eb4a9
cleaning up will delete instead of rename to trash with full disk err (#18534)
moveToTrash() function moves a folder to .trash, for example, when 
doing some object deletions: a data dir that has many parts will be 
renamed to the trash folder; However, ENOSPC is a valid error from 
rename(), and it can cripple a user trying to free some space in an 
entire disk situation.

Therefore, this commit will try to do a recursive delete in that case.
2023-11-27 17:36:02 -08:00
jiuker
be02333529
feat: drive sub-sys to max timeout reload (#18501) 2023-11-27 09:15:06 -08:00
Harshavardhana
fba883839d
feat: bring new HDD related performance enhancements (#18239)
Optionally allows customers to enable 

- Enable an external cache to catch GET/HEAD responses 
- Enable skipping disks that are slow to respond in GET/HEAD 
  when we have already achieved a quorum
2023-11-22 13:46:17 -08:00
Harshavardhana
a4cfb5e1ed
return errors if dataDir is missing during HeadObject() (#18477)
Bonus: allow replication to attempt Deletes/Puts when
the remote returns quorum errors of some kind, this is
to ensure that MinIO can rewrite the namespace with the
latest version that exists on the source.
2023-11-20 21:33:47 -08:00
Klaus Post
51aa59a737
perf: websocket grid connectivity for all internode communication (#18461)
This PR adds a WebSocket grid feature that allows servers to communicate via 
a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
  roundtrips with small payloads.

* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
  which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically on the server names.

Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte 
slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address 
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it given
the remote address using.

```
func (m *Manager) Connection(host string) *Connection
```

All serverside handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight 
requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.

* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
   performs a single request and returns the result. Any deadline provided on the request is
   forwarded to the server, and canceling the context will make the function return at once.

* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
   will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
//The appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses *must* be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel *must* be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```

There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
2023-11-20 17:09:35 -08:00
Harshavardhana
91d8bddbd1
use sendfile/splice implementation to perform DMA (#18411)
sendfile implementation to perform DMA on all platforms

Go stdlib already supports sendfile/splice implementations
for

- Linux
- Windows
- *BSD
- Solaris

Along with this change however O_DIRECT for reads() must be
removed as well since we need to use sendfile() implementation

The main reason to add O_DIRECT for reads was to reduce the
chances of page-cache causing OOMs for MinIO, however it would
seem that avoiding buffer copies from user-space to kernel space
this issue is not a problem anymore.

There is no Go based memory allocation required, and neither
the page-cache is referenced back to MinIO. This page-
cache reference is fully owned by kernel at this point, this
essentially should solve the problem of page-cache build up.

With this now we also support SG - when NIC supports Scatter/Gather
https://en.wikipedia.org/wiki/Gather/scatter_(vector_addressing)
2023-11-10 10:10:14 -08:00
Harshavardhana
ac8c43fe9c
fix: allow missing hot-tier accounting (#18345) 2023-10-30 14:42:11 -07:00
Harshavardhana
877e0cac03
fix: tiering statistics handling a bug in clone() implementation (#18342)
Tiering statistics have been broken for some time now, a regression
was introduced in 6f2406b0b6

Bonus fixes an issue where the objects are not assumed to be
of the 'STANDARD' storage-class for the objects that have
not yet tiered, this should be conditional based on the object's
metadata not a default assumption.

This PR also does some cleanup in terms of implementation,

fixes #18070
2023-10-30 09:59:51 -07:00
Harshavardhana
e1e33077e8
fix: tests and resync replication status (#18244) 2023-10-13 17:03:34 -07:00
Harshavardhana
409c391850
implement helpers to get relevant info instead of FileInfo() (#18228) 2023-10-12 15:29:59 -07:00
Aditya Manthramurthy
2b4531f069
fix: O_DIRECT is on only for multi-disk setups (#18194)
Disable it for single disk/unsupported platforms
2023-10-09 17:08:40 -07:00
Aditya Manthramurthy
4bda4e4e2b
fix: check for disk-level O_DIRECT support (#18173)
Disk level O_DIRECT support checking at xl storage initialization was
conditional on a config setting being enabled. (This never took effect
because config initialization happens after ObjectLayer is ready.) This
is not necessary as the config setting is dynamic - O_DIRECT should be
enabled via runtime config. So we need to do the disk level support
check regardless of the config setting.
2023-10-05 20:54:49 -06:00
Harshavardhana
cdeab19673
fix: always check error upon w.Close() in Write() (#18111)
not checking w.Close() can prematurely make us
think that the w.Write() actually succeeded, apparently
Write() may or may not return an error but sometimes
only during a Close() call to the fd we may see the
error from Write() propagate.

Fdatasync(w) on the FD would return an error requiring
Close() error handling is less of a concern, however it may
happen such that fdatasync() did not return an error, where
as Close() would.
2023-09-26 11:04:00 -07:00
Harshavardhana
ac3a19138a
fix: set scanning details locally to avoid cached values (#18092)
atomic variable results such as scanning must not use
cached values, instead rely on real-time information.
2023-09-25 08:26:29 -07:00
Harshavardhana
8b8be2695f
optimize mkdir calls to avoid base-dir Mkdir attempts (#18021)
Currently we have IOPs of these patterns

```
[OS] os.Mkdir play.min.io:9000 /disk1 2.718µs
[OS] os.Mkdir play.min.io:9000 /disk1/data 2.406µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys 4.068µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp 2.843µs
[OS] os.Mkdir play.min.io:9000 /disk1/data/.minio.sys/tmp/d89c8ceb-f8d1-4cc6-b483-280f87c4719f 20.152µs
```

It can be seen that we can save quite Nx levels such as
if your drive is mounted at `/disk1/minio` you can simply
skip sending an `Mkdir /disk1/` and `Mkdir /disk1/minio`.

Since they are expected to exist already, this PR adds a way
for us to ignore all paths upto the mount or a directory which
ever has been provided to MinIO setup.
2023-09-13 08:14:36 -07:00
Harshavardhana
1df5e31706
optimize MRF replication queue to avoid memory leaks (#18007) 2023-09-11 20:59:11 -07:00
Anis Eleuch
41de53996b
heal: calculate the number of workers based on NRRequests (#17945) 2023-09-11 14:48:54 -07:00
Harshavardhana
3995355150
avoid repeated large allocations for large parts (#17968)
objects with 10,000 parts and many of them can
cause a large memory spike which can potentially
lead to OOM due to lack of GC.

with previous PR reducing the memory usage significantly
in #17963, this PR reduces this further by 80% under
repeated calls.

Scanner sub-system has no use for the slice of Parts(),
it is better left empty.

```
benchmark                            old ns/op     new ns/op     delta
BenchmarkToFileInfo/ToFileInfo-8     295658        188143        -36.36%

benchmark                            old allocs     new allocs     delta
BenchmarkToFileInfo/ToFileInfo-8     61             60             -1.64%

benchmark                            old bytes     new bytes     delta
BenchmarkToFileInfo/ToFileInfo-8     1097210       227255        -79.29%
```
2023-09-02 07:49:24 -07:00
Harshavardhana
8208bcb896
remove all unnecessary logging, logOnce when absolutely needed (#17965) 2023-09-01 16:19:18 -07:00
Poorna
b48bbe08b2
Add additional info for replication metrics API (#17293)
to track the replication transfer rate across different nodes,
number of active workers in use and in-queue stats to get
an idea of the current workload.

This PR also adds replication metrics to the site replication
status API. For site replication, prometheus metrics are
no longer at the bucket level - but at the cluster level.

Add prometheus metric to track credential errors since uptime
2023-08-30 01:00:59 -07:00
Harshavardhana
7cafdc0512
fix: skip access checks further for known buckets (#17934) 2023-08-28 15:16:41 -07:00
Harshavardhana
8a57b6bced
use renameat2 Linux extension syscall (#17757)
this is a faster and safer alternative
on newer kernel versions.
2023-08-27 09:57:11 -07:00
Harshavardhana
124e28578c
remove strict persistence requirements for List() .metacache objects (#17917)
.metacache objects are transient in nature, and are better left to
use page-cache effectively to avoid using more IOPs on the disks.

this allows for incoming calls to be not taxed heavily due to
multiple large batch listings.
2023-08-25 07:58:11 -07:00
Harshavardhana
dde1a12819
fix: validate incoming uploadID to be base64 encoded (#17865)
Bonus fixes include

- do not have to write final xl.meta (renameData) does this
  already, saves some IOPs.

- make sure to purge the multipart directory properly using
  a recursive delete, otherwise this can easily pile up and
  rely on the stale uploads cleanup.

fixes #17863
2023-08-17 09:37:55 -07:00
Harshavardhana
6e860b6dc5
count all versions as part of DeleteAllVersionsAction (#17821) 2023-08-09 08:55:19 -07:00
Harshavardhana
cb089dcb52
error out by default beyond 10000 versions per object (#17803)
```
You've exceeded the limit on the number of versions you can create on this object
```
2023-08-04 10:40:21 -07:00
Harshavardhana
0153f96a20
add deadlines for readMetadata() in listing (#17776)
Bonus: also skip spending time looking for xl.json

- Listing()
- Delete()
2023-08-01 21:52:31 -07:00
Harshavardhana
81be718674
fix: optimize DiskInfo() call avoid metrics when not needed (#17763) 2023-07-31 15:20:48 -07:00
Harshavardhana
73edd5b8fd
introduce 'mc admin config set alias/ api odirect=on' (#17753)
change disable_odirect=off -> odirect=on to make it
easier to understand, instead of making it double
negative.
2023-07-31 00:12:53 -07:00
Harshavardhana
f13cfcb83e
allow disabling O_DIRECT for write ops (#17751)
on really slow systems, O_DIRECT simply kills the drives
allow for a way to disable them.
2023-07-29 15:17:56 -07:00
Harshavardhana
731e03fe5a
add ReadFileStream deadline for disk call (#17745)
timeout the reader side if hung via disk max timeout
2023-07-28 15:37:53 -07:00
Harshavardhana
47dcfcbdd4
introduce deadlines on READ operations (#17724) 2023-07-27 07:33:05 -07:00
Harshavardhana
b28bcad11b
avoid Access() calls on known bucket paths (#17719) 2023-07-26 11:31:40 -07:00
Poorna
f95129894d
Use decrypted object size while computing object size summary (#17717)
Corrects an issue with encrypted versioned objects being reported under
`unversioned` bin in the object version histogram
2023-07-24 17:13:25 -07:00
Harshavardhana
14e1ace552
remove serializing WalkDir() across all buckets/prefixes on SSDs (#17707)
slower drives get knocked off because they are too slow via 
active monitoring, we do not need to block calls arbitrarily.

Serializing adds latencies for already slow calls, remove
it for SSDs/NVMEs

Also, add a selection with context when writing to `out <-`
channel, to avoid any potential blocks.
2023-07-24 09:30:19 -07:00
Krishnan Parthasarathi
0120ff93bc
admin-info: add DeleteMarkers count (#17659) 2023-07-18 10:49:40 -07:00
Harshavardhana
24e86d0c59
avoid passing around poolIdx, setIdx instead pass the relevant disks (#17660) 2023-07-17 09:52:05 -07:00
Kaan Kabalak
f64d62b01d
Fix style of logOnceIf calls w/unique identifiers (#17631) 2023-07-11 13:17:45 -07:00
Harshavardhana
82075e8e3a
use strconv variants to improve on performance per 'op' (#17626)
```
BenchmarkItoa
BenchmarkItoa-8         	673628088	         1.946 ns/op	       0 B/op	       0 allocs/op
BenchmarkFormatInt
BenchmarkFormatInt-8    	592919769	         2.012 ns/op	       0 B/op	       0 allocs/op
BenchmarkSprint
BenchmarkSprint-8       	26149144	        49.06 ns/op	       2 B/op	       1 allocs/op
BenchmarkSprintBool
BenchmarkSprintBool-8   	26440180	        45.92 ns/op	       4 B/op	       1 allocs/op
BenchmarkFormatBool
BenchmarkFormatBool-8   	1000000000	         0.2558 ns/op	       0 B/op	       0 allocs/op
```
2023-07-11 07:46:58 -07:00
Kaan Kabalak
bd6842d917
Further print log messages once per error (#17618) 2023-07-10 07:59:57 -07:00
Kaan Kabalak
21fbe88e1f
Print certain log messages once per error (#17484) 2023-06-24 20:29:13 -07:00
Aditya Manthramurthy
5a1612fe32
Bump up madmin-go and pkg deps (#17469) 2023-06-19 17:53:08 -07:00
Harshavardhana
1443b5927a
allow quorum fileInfo to pick same parityBlocks (#17454)
Bonus: allow replication to proceed for 503 errors such as
with error code SlowDownRead
2023-06-18 18:20:15 -07:00
jiuker
5a21b1f353
fix: Delete dir failed when .DS_Store in it (#17352) 2023-06-06 10:12:06 -07:00
Klaus Post
bb6f4d7633
Remove redundant checkFormatJSON logging (#17134) 2023-05-04 07:28:37 -07:00
Klaus Post
e8c0a50862
optimization use small blocks up to 64KB (#17107) 2023-05-01 09:47:49 -07:00
Harshavardhana
8c874884fc
fix: do not copy context in DiskInfo cache (#17085) 2023-04-26 12:13:54 -07:00
Anis Eleuch
224d9a752f
fix: the race in healing tracker code (#17048) 2023-04-18 14:49:56 -07:00
Krishnan Parthasarathi
6877578bbc
Update minio_node_bucket_scans_finished metrics (#17006) 2023-04-11 19:21:34 -07:00
ferhat elmas
056ca0c68e
refactor: reuse open file in storage interface (#16970) 2023-04-09 23:09:28 -07:00
Krishnan Parthasarathi
25f7a8e406
Indicate RenameData is called by healObject (#16997) 2023-04-09 10:25:37 -07:00
Shubhendu
4c204707fd
Correct to remove null version while ILM rule application (#16971)
Signed-off-by: Shubhendu Ram Tripathi <shubhendu@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2023-04-06 14:10:01 -07:00
jiuker
5fa3665074
add isSysErrNotDir to openFileNoSync (#16958) 2023-04-04 08:00:08 -07:00
Harshavardhana
b984bf8d1a
allow expiration of all versions during Listing() (#16757) 2023-03-09 15:15:30 -08:00
ferhat elmas
714283fae2
cleanup ignored static analysis (#16767) 2023-03-06 08:56:10 -08:00
Klaus Post
9acf1024e4
Remove bloom filter (#16682)
Removes the bloom filter since it has so limited usability, often gets saturated anyway and adds a bunch of complexity to the scanner.

Also removes a tiny bit of CPU by each write operation.
2023-02-24 09:03:31 +05:30
Klaus Post
fd6622458b
Add detailed scanner trace output and notifications (#16668) 2023-02-21 09:33:33 -08:00
Klaus Post
6a04067514
fix: tweak read buffer size to reduce over-reading (#16338) 2023-01-01 08:14:20 -08:00
Krishnan Parthasarathi
2fa35def2c
Fix DeleteObject when only free versions remain (#16289) 2022-12-21 16:24:07 -08:00
Harshavardhana
20ef5e7a6a
avoid double deletes() when no more versions (#16206) 2022-12-12 01:40:04 -08:00
Klaus Post
3eb2d086b2
Replace filepathx with fork (#16192) 2022-12-08 10:42:44 -08:00
Aditya Manthramurthy
a30cfdd88f
Bump up madmin-go to v2 (#16162) 2022-12-06 13:46:50 -08:00
Anis Elleuch
1bae32dc96
xl: Delete older data-dir when replacing an existing version-id (#16176) 2022-12-06 13:43:18 -08:00
Klaus Post
cc1d8f0057
Check for abandoned data when healing (#16122) 2022-11-28 10:20:55 -08:00
Krishnan Parthasarathi
6eef9b4a23
lifecycle: simplify Eval and HasActiveRules (#16036) 2022-11-10 07:17:45 -08:00
Klaus Post
ecc932d5dd
Clean entire tmp-old on restart (#15979) 2022-10-31 07:27:50 -07:00
Harshavardhana
23b329b9df
remove gateway completely (#15929) 2022-10-24 17:44:15 -07:00
Harshavardhana
58a8275e84
do not assume invalid buf to be non-xl.meta (#15843) 2022-10-17 09:39:21 -07:00
Anis Elleuch
6e84283c66
fix: ignoring O_DIRECT in case of erasure single disk (#15734)
fixes #15733 
fixes #15735
2022-09-22 10:41:06 -07:00
Klaus Post
ff12080ff5
Remove deprecated io/ioutil (#15707) 2022-09-19 11:05:16 -07:00
Harshavardhana
2d9b5a65f1
verify RenameData() versions to be consistent (#15649)
xl.meta gets written and never rolled back, however
we definitely need to validate the state that is
persisted on the disk, if there are inconsistencies

- more than write quorum we should return an error
  to the client

- if write quorum was achieved however there are
  inconsistent xl.meta's we should simply trigger
  an MRF on them
2022-09-05 16:51:37 -07:00
Harshavardhana
97376f6e8f
improve performance for inlined data (#15603)
inlined data often is bigger than the allowed
O_DIRECT alignment, so potentially we can write
'xl.meta' without O_DSYNC instead we can rely on
O_DIRECT + fdatasync() instead.

This PR allows O_DIRECT on inlined data that
would gain the benefits of performing O_DIRECT,
eventually performing an fdatasync() at the end.

Performance boost can be observed here for small
objects < 128KiB. The performance boost is mainly
seen on HDD, and marginal on NVMe setups.
2022-08-29 11:19:29 -07:00
Anis Elleuch
5682685c80
Introduce disk io stats metrics (#15512) 2022-08-16 07:13:49 -07:00
Harshavardhana
3bd9615d0e
fix: log if there is readDir() failure with ListBuckets (#15461)
This is actionable and must be logged.

Bonus: also honor umask by using 0o666 for all Open() syscalls.
2022-08-04 07:23:05 -07:00
Harshavardhana
043aaa792d
fix: intrument os.OpenFile differently for Reads and Writes (#15449)
allows us to trace latency for READs or WRITEs
2022-08-01 13:22:43 -07:00
Harshavardhana
7725425e05
fix: fork os.MkdirAll to optimize cases where parent exists (#15379)
a/b/c/d/ where `a/b/c/` exists results in additional syscalls
such as an Lstat() call to verify if the `a/b/c/` exists
and its a directory.

We do not need to do this on MinIO since the parent prefixes
if exist, we can simply return success without spending
additional syscalls.

Also this implementation attempts to simply use Access() calls
to avoid os.Stat() calls since the latter does memory allocation
for things we do not need to use.

Access() is simpler since we have a predictable structure on
the backend and we know exactly how our path structures are.
2022-07-24 00:43:11 -07:00
Klaus Post
69bf39f42e
fix: make complete multipart uploads faster encrypted/compressed backends (#15375)
- Only fetch the parts we need and abort as soon as one is missing.
- Only fetch the number of parts requested by "ListObjectParts".
2022-07-21 16:47:58 -07:00
Klaus Post
f939d1c183
Independent Multipart Uploads (#15346)
Do completely independent multipart uploads.

In distributed mode, a lock was held to merge each multipart 
upload as it was added. This lock was highly contested and 
retries are expensive (timewise) in distributed mode.

Instead, each part adds its metadata information uniquely. 
This eliminates the per object lock required for each to merge.
The metadata is read back and merged by "CompleteMultipartUpload" 
without locks when constructing final object.

Co-authored-by: Harshavardhana <harsha@minio.io>
2022-07-19 08:35:29 -07:00
Praveen raj Mani
b49fc33cb3
purge objects immediately with x-minio-force-delete in DeleteObject and DeleteBucket API (#15148) 2022-07-11 09:15:54 -07:00
Klaus Post
ac055b09e9
Add detailed scanner metrics (#15161) 2022-07-05 14:45:49 -07:00
Harshavardhana
bd099f5e71
fix: change timedValue to return the previously cached value (#15169)
fix: change timedvalue to return previous cached value

caller can interpret the underlying error and decide
accordingly, places where we do not interpret the
errors upon timedValue.Get() - we should simply use
the previously cached value instead of returning "empty".

Bonus: remove some unused code
2022-06-25 08:50:16 -07:00
Harshavardhana
d55efc791f
relax O_DIRECT in single drive mode if unsupported (#15045) 2022-06-07 06:44:01 -07:00
Harshavardhana
df9eeb7f8f
fix: do not log concurrently when multiple disks return errors (#15044)
since the values inside 'context' are mutated internally by
logger, make sure to log serially upon errors not concurrently.
2022-06-06 15:15:11 -07:00
Harshavardhana
5afdc56796
allow single drive mode to run on root disk (#15037)
for practical reasons, allow root disk based installs for single drive mode.
2022-06-03 12:53:42 -07:00
Harshavardhana
52221db7ef
fix: for unexpected errors in reading versioning config panic (#14994)
We need to make sure if we cannot read bucket metadata
for some reason, and bucket metadata is not missing and
returning corrupted information we should panic such
handlers to disallow I/O to protect the overall state
on the system.

In-case of such corruption we have a mechanism now
to force recreate the metadata on the bucket, using
`x-minio-force-create` header with `PUT /bucket` API
call.

Additionally fix the versioning config updated state
to be set properly for the site replication healing
to trigger correctly.
2022-05-31 02:57:57 -07:00
Anis Elleuch
ca69e54cb6
tests: Fix sporadic failure of TestXLStorageDeleteFile (#14911)
The test expects from DeleteFile to return errDiskNotFound when the disk
is not available. It calls os.RemoveAll() to remove one disk after XL storage
initialization. However, this latter contains some goroutines which can
race with os.RemoveAll() and then the test fails sporadically with
returning random errors.

The commit will tweak the initialization routine of the XL storage to
only run deletion of temporary and metacache data in the  background,
so TestXLStorageDeleteFile won't fail anymore.
2022-05-12 15:24:58 -07:00
Anis Elleuch
df50eda811
Add number of versions in server info API (#14812)
The goal is to show the number of versions in the server info API.
2022-04-25 22:04:10 -07:00