Commit Graph

53 Commits

Author SHA1 Message Date
Harshavardhana
a17f14f73a
separate lock from common grid to avoid epoll contention (#20180)
epoll contention on TCP causes latency build-up when
we have high volume ingress. This PR is an attempt to
relieve this pressure.

upstream issue https://github.com/golang/go/issues/65064
It seems to be a deeper problem; we haven't yet tried the fix
provided in that issue, but this change helps even without
changing the compiler.

Of course, this is a workaround for now, hoping for a
more comprehensive fix from Go runtime.
2024-07-29 11:10:04 -07:00
Klaus Post
51aa59a737
perf: websocket grid connectivity for all internode communication (#18461)
This PR adds a WebSocket grid feature that allows servers to communicate via 
a single two-way connection.

There are two request types:

* Single requests, which are `[]byte => ([]byte, error)`. This is for efficient small
  roundtrips with small payloads.

* Streaming requests which are `[]byte, chan []byte => chan []byte (and error)`,
  which allows for different combinations of full two-way streams with an initial payload.

Only a single stream is created between two machines - and there is, as such, no
server/client relation since both sides can initiate and handle requests. Which server
initiates the request is decided deterministically based on the server names.

Requests are made through a mux client and server, which handles message
passing, congestion, cancelation, timeouts, etc.

If a connection is lost, all requests are canceled, and the calling server will try
to reconnect. Registered handlers can operate directly on byte 
slices or use a higher-level generics abstraction.

There is no versioning of handlers/clients, and incompatible changes should
be handled by adding new handlers.

The request path can be changed to a new one for any protocol changes.

First, all servers create a "Manager." The manager must know its address 
as well as all remote addresses. This will manage all connections.
To get a connection to any remote, ask the manager to provide it, given
the remote address, using:

```
func (m *Manager) Connection(host string) *Connection
```

All server-side handlers must also be registered on the manager. This will
make sure that all incoming requests are served. The number of in-flight 
requests and responses must also be given for streaming requests.

The "Connection" returned manages the mux-clients. Requests issued
to the connection will be sent to the remote.

* `func (c *Connection) Request(ctx context.Context, h HandlerID, req []byte) ([]byte, error)`
   performs a single request and returns the result. Any deadline provided on the request is
   forwarded to the server, and canceling the context will make the function return at once.

* `func (c *Connection) NewStream(ctx context.Context, h HandlerID, payload []byte) (st *Stream, err error)`
   will initiate a remote call and send the initial payload.

```Go
// A Stream is a two-way stream.
// All responses *must* be read by the caller.
// If the call is canceled through the context,
// the appropriate error will be returned.
type Stream struct {
	// Responses from the remote server.
	// Channel will be closed after an error or when the remote closes.
	// All responses *must* be read by the caller until either an error is returned or the channel is closed.
	// Canceling the context will cause the context cancellation error to be returned.
	Responses <-chan Response

	// Requests sent to the server.
	// If the handler is defined with 0 incoming capacity this will be nil.
	// Channel *must* be closed to signal the end of the stream.
	// If the request context is canceled, the stream will no longer process requests.
	Requests chan<- []byte
}

type Response struct {
	Msg []byte
	Err error
}
```
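
A hedged usage sketch of how the pieces above fit together; `m`, `handlerLockRequest`, and the payload handling are illustrative assumptions, not part of the grid API described in this PR:

```Go
// Sketch only: assumes an already-configured *Manager "m", a registered
// HandlerID "handlerLockRequest", and a stream handler declared with a
// non-zero incoming capacity (otherwise st.Requests would be nil).
func callRemote(ctx context.Context, m *Manager, host string, payload []byte) error {
	conn := m.Connection(host)

	// Single round trip: the context deadline is forwarded to the remote.
	if _, err := conn.Request(ctx, handlerLockRequest, payload); err != nil {
		return err
	}

	// Two-way stream: send the initial payload, push a follow-up request,
	// then close Requests to signal the end of the stream.
	st, err := conn.NewStream(ctx, handlerLockRequest, payload)
	if err != nil {
		return err
	}
	st.Requests <- []byte("follow-up") // assumes the declared incoming queue has room
	close(st.Requests)

	for r := range st.Responses { // all responses *must* be read
		if r.Err != nil {
			return r.Err
		}
		_ = r.Msg // handle response payload
	}
	return nil
}
```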

There are generic versions of the server/client handlers that allow the use of type
safe implementations for data types that support msgpack marshal/unmarshal.
2023-11-20 17:09:35 -08:00
Anis Eleuch
6d0bc5ab1e
prometheus: Fix internode stats (#17594)
Internode stats calculation was done inside the S3 handlers; fix it by moving it
to the internode handlers.

Remove admin stats since it is not used.
2023-07-08 07:35:11 -07:00
Klaus Post
ff5988f4e0
Reduce allocations (#17584)
* Reduce allocations

* Add stringsHasPrefixFold which can compare string prefixes, while ignoring case and not allocating (a sketch follows this list).
* Reuse all msgp.Readers
* Reuse metadata buffers when not reading data.

* Make type safe. Make buffer 4K instead of 8.

* Unslice
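
For reference, a minimal sketch of what an allocation-free, case-insensitive prefix check like `stringsHasPrefixFold` could look like (illustrative, not the exact code added by this PR):

```Go
package main

import (
	"fmt"
	"strings"
)

// stringsHasPrefixFold reports whether s begins with prefix, ignoring case,
// without allocating the way strings.HasPrefix(strings.ToLower(s), p) would:
// slicing a string does not copy it, and strings.EqualFold compares
// case-insensitively in place.
func stringsHasPrefixFold(s, prefix string) bool {
	return len(s) >= len(prefix) && strings.EqualFold(s[:len(prefix)], prefix)
}

func main() {
	fmt.Println(stringsHasPrefixFold("X-Amz-Meta-Foo", "x-amz-meta-")) // true
}
```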
2023-07-06 16:02:08 -07:00
Harshavardhana
31b0decd46
migrate to minio/mux from gorilla/mux (#16456) 2023-01-23 16:42:47 +05:30
Anis Elleuch
a35ef155fc
return appropriate error status code in the lock handler (#15950) 2022-10-26 09:51:26 -07:00
Harshavardhana
f1abb92f0c
feat: Single drive XL implementation (#14970)
Main motivation is to move towards a common backend format
for all different types of modes in MinIO, allowing for
simpler code and predictable behavior across all features.

This PR also brings features such as versioning, replication,
and transitioning to single drive setups.
2022-05-30 10:58:37 -07:00
Harshavardhana
6cfb1cb6fd
fix: timer usage across codebase (#14935)
it seems in some places we have been wrongly using the
timer.Reset() function, nicely exposed by an example
shared by @donatello https://go.dev/play/p/qoF71_D1oXD

this PR fixes all the usage comprehensively
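
For context, the usual safe pattern (a generic sketch, not the exact loops touched by this PR) is to Reset a timer only after its channel has fired, so the next interval is scheduled relative to when the work finished and the channel is never left un-drained:

```Go
package main

import (
	"context"
	"time"
)

func maintenanceLoop(ctx context.Context, interval time.Duration, work func()) {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			work()
			// Reset only after the timer fired and its channel was read;
			// resetting a still-armed timer without Stop/drain is the
			// classic misuse of timer.Reset().
			timer.Reset(interval)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	maintenanceLoop(ctx, 50*time.Millisecond, func() { /* periodic work */ })
}
```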
2022-05-17 22:42:59 -07:00
Klaus Post
779060bc16
Locker: Improve Refresh speed (#13430)
Refresh was doing a linear scan of all locked resources. This was adding 
up to significant delays in locking on high load systems with long 
running requests.

Add a secondary index for O(log(n)) UID -> resource lookups. 
Multiple resources are stored in consecutive strings.
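
A simplified illustration of that secondary index (types and method names are made up for the sketch; the real lock server packs multiple resources per UID differently):

```Go
package main

import "fmt"

// lockIndex keeps the primary resource table plus a UID -> resources index,
// so Refresh can jump straight to a caller's own locks instead of scanning
// every locked resource.
type lockIndex struct {
	byResource map[string][]string // resource -> owner UIDs
	byUID      map[string][]string // UID -> resources held under that UID
}

func newLockIndex() *lockIndex {
	return &lockIndex{
		byResource: map[string][]string{},
		byUID:      map[string][]string{},
	}
}

func (l *lockIndex) lock(uid string, resources ...string) {
	for _, r := range resources {
		l.byResource[r] = append(l.byResource[r], uid)
	}
	l.byUID[uid] = append(l.byUID[uid], resources...)
}

// refresh touches only the caller's resources: a direct lookup via byUID
// instead of a linear scan over byResource.
func (l *lockIndex) refresh(uid string) bool {
	_, ok := l.byUID[uid] // bump expiry on these resources here
	return ok
}

func main() {
	idx := newLockIndex()
	idx.lock("uid-1", "bucket/object-a", "bucket/object-b")
	fmt.Println(idx.refresh("uid-1")) // true
}
```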

Bonus fixes:

 * On multiple Unlock entries unlock the write locks we can.
 * Fix `expireOldLocks` skipping checks on entry after expiring one.
 * Return fast on canTakeUnlock/canTakeLock.
 * Prealloc some places.
2021-10-15 03:12:13 -07:00
Harshavardhana
ffd497673f
internode lockArgs should use messagepack (#13329)
it would seem like using `bufio.Scan()` is very
slow for heavy concurrent I/O, i.e. when r.Body
is slow; instead, use a proper
binary exchange format to marshal and unmarshal
the LockArgs data structure in a cleaner way.

this PR increases performance of the locking
sub-system for tiny repeated read lock requests
on the same object.

```
BenchmarkLockArgs
BenchmarkLockArgs-4              6417609               185.7 ns/op            56 B/op          2 allocs/op
BenchmarkLockArgsOld
BenchmarkLockArgsOld-4           1187368              1015 ns/op            4096 B/op          1 allocs/op
```
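
As an illustration of the binary exchange idea (not the msgp-generated code this PR actually uses, and with a trimmed-down field set), a LockArgs-like struct can be round-tripped with the `github.com/tinylib/msgp/msgp` primitives:

```Go
package main

import (
	"fmt"

	"github.com/tinylib/msgp/msgp"
)

// lockArgsLite is a hypothetical, trimmed-down stand-in for LockArgs.
type lockArgsLite struct {
	UID      string
	Resource string
}

func (l lockArgsLite) marshal(b []byte) []byte {
	// Encode the fields as a fixed-size msgpack array.
	b = msgp.AppendArrayHeader(b, 2)
	b = msgp.AppendString(b, l.UID)
	b = msgp.AppendString(b, l.Resource)
	return b
}

func (l *lockArgsLite) unmarshal(b []byte) error {
	var err error
	if _, b, err = msgp.ReadArrayHeaderBytes(b); err != nil {
		return err
	}
	if l.UID, b, err = msgp.ReadStringBytes(b); err != nil {
		return err
	}
	l.Resource, _, err = msgp.ReadStringBytes(b)
	return err
}

func main() {
	in := lockArgsLite{UID: "uid-1", Resource: "bucket/object"}
	var out lockArgsLite
	if err := out.unmarshal(in.marshal(nil)); err != nil {
		panic(err)
	}
	fmt.Println(out.UID, out.Resource)
}
```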
2021-09-30 11:53:01 -07:00
Harshavardhana
a2cd3c9a1d
use ParseForm() to allow query param lookups once (#12900)
```
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkURLQueryForm
BenchmarkURLQueryForm-4         247099363                4.809 ns/op           0 B/op          0 allocs/op
BenchmarkURLQuery
BenchmarkURLQuery-4              2517624               462.1 ns/op           432 B/op          4 allocs/op
PASS
ok      github.com/minio/minio/cmd      3.848s
```
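
The benchmark compares re-parsing the raw query on every lookup against parsing it once; a minimal sketch of the pattern in a handler (standard net/http, parameter names illustrative):

```Go
package main

import (
	"fmt"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// r.URL.Query() re-parses the query string and allocates a url.Values
	// on every call, which adds up when several parameters are inspected.
	_ = r.URL.Query().Get("uploadId")

	// ParseForm parses once and populates r.Form; later lookups are
	// cheap map reads.
	if err := r.ParseForm(); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintln(w, r.Form.Get("uploadId"), r.Form.Get("partNumber"))
}

func main() {
	http.HandleFunc("/", handler)
	_ = http.ListenAndServe(":8080", nil)
}
```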
2021-08-07 22:43:01 -07:00
Harshavardhana
1f262daf6f
rename all remaining packages to internal/ (#12418)
This is to ensure that there are no projects
that try to import `minio/minio/pkg` into
their own repo. Any such common packages should
go to `https://github.com/minio/pkg`
2021-06-01 14:59:40 -07:00
Anis Elleuch
0b34dfb479
lock: Timeout Unlock RPC call (#12213)
RPC unlock call needs to be timed out; otherwise this can block
indefinitely.
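
A generic sketch of the shape of this fix (names hypothetical, not the actual dsync client): bound the unlock RPC with a context deadline so an unresponsive peer cannot block the caller forever.

```Go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// unlockRPC stands in for the network call to a remote locker.
func unlockRPC(ctx context.Context, resource string) error {
	select {
	case <-time.After(5 * time.Second): // simulate an unresponsive peer
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func unlockWithTimeout(resource string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return unlockRPC(ctx, resource)
}

func main() {
	err := unlockWithTimeout("bucket/object", 100*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```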

Signed-off-by: Anis Elleuch <anis@min.io>
2021-05-11 02:11:29 -07:00
Harshavardhana
069432566f update license change for MinIO
Signed-off-by: Harshavardhana <harsha@minio.io>
2021-04-23 11:58:53 -07:00
Anis Elleuch
7be7109471
locking: Add Refresh for better locking cleanup (#11535)
Co-authored-by: Anis Elleuch <anis@min.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-03-03 18:36:43 -08:00
Harshavardhana
2a79ea0332
isServerResolvable: it's sufficient to check the server is reachable (#11609)
using isServerResolvable for expiration can lead to chicken-and-egg
problems; a lock might be expired while the server is still booting up,
causing valid locks to be perpetually expired.
2021-02-22 16:29:53 -08:00
Harshavardhana
88c1bb0720
fix: improper ticker usage in goroutines (#11468)
- lock maintenance loop was incorrectly sleeping
  as well as misusing its ticker, leading to
  extra expiration routines getting triggered
  that could flood the network.

- multipart upload cleanup should be based on
  timer instead of ticker, to ensure that long
  running jobs don't get triggered twice.

- make sure to get right lockers for object name
2021-02-05 19:23:48 -08:00
Harshavardhana
da55a05587
fix aggressive expiration detection (#11446)
for some flaky networks this may be too fast of a value;
choose a defensive value, and let this be addressed
properly in a new refactor of dsync with renewal logic.

Also enable a faster fallback delay to cater for misconfigured
IPv6 servers

refer
 - https://golang.org/pkg/net/#Dialer
 - https://tools.ietf.org/html/rfc6555
2021-02-04 16:56:40 -08:00
Harshavardhana
9cdd981ce7
fix: expire locks only on participating lockers (#11335)
additionally, add a new ForceUnlock API to
allow forcibly unlocking locks if possible.
2021-01-25 10:01:27 -08:00
Harshavardhana
4550ac6fff
fix: refactor locks to apply them uniquely per node (#11052)
This refactor is done for a few reasons below

- to avoid deadlocks in scenarios when the number
  of nodes is smaller than the actual erasure stripe
  count, where N participating local lockers
  can lead to deadlocks across systems.

- avoids expiry routines running 1000s of separate
  network operations and routes per disk, whereas
  each of them is still accessing one single
  local entity.

- it is ideal to have a single globalLockServer
  per instance.

- In a 32-node deployment however, each server
  group is still concentrated towards the
  same set of lockers that participate during
  the write/read phase, unlike the previous minio/dsync
  implementation - this potentially avoids sending
  32 requests; instead we will send at most as many
  requests as there are unique nodes participating in a
  write/read phase.

- reduces overall chattiness on smaller setups.
2020-12-10 07:28:37 -08:00
Harshavardhana
4ec45753e6 rename server sets to server pools 2020-12-01 13:50:33 -08:00
Harshavardhana
029758cb20
fix: retain the previous UUID for newly replaced drives (#10759)
only newly replaced drives get the new `format.json`;
this avoids disks reloading their in-memory reference
format and ensures that drives stay online without
reloading the in-memory reference format.

keeping the reference format intact means UUIDs
never change once they are formatted.
2020-10-26 10:29:29 -07:00
Harshavardhana
d9db7f3308
expire lockers if lockers are offline (#10749)
lockers currently might leave stale locks
in unknown ways, waiting for downed lockers.

the locker check interval is high enough to safely
clean up stale locks.
2020-10-24 13:23:16 -07:00
Harshavardhana
ad726b49b4
rename zones to serverSets to avoid terminology conflict (#10679)
we are bringing in availability zones, so we should avoid
using 'zones' for the server expansion concept.
2020-10-15 14:28:50 -07:00
Harshavardhana
a3ba8188d7 fix: allow locker to be niladic 2020-10-12 14:23:44 -07:00
Harshavardhana
2760fc86af
Bump default idleConnsPerHost to control conns in time_wait (#10653)
This PR fixes a hang which occurs quite commonly at higher concurrency
by allowing the following changes

- allowing fewer connections in time_wait allows faster socket opens
- lower the idle connection timeout to ensure that we let the kernel
  reclaim the time_wait connections quickly
- increase somaxconn to 4096 instead of 2048 to allow larger tcp
  syn backlogs.

fixes #10413
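
The client-side knobs involved look roughly like this (values illustrative, not necessarily the defaults this PR settles on; somaxconn itself is a kernel sysctl, outside this snippet):

```Go
package main

import (
	"net/http"
	"time"
)

func newInternodeTransport() *http.Transport {
	return &http.Transport{
		// Keep a bounded pool of reusable connections per host so requests
		// stop opening new sockets that then pile up in time_wait.
		MaxIdleConnsPerHost: 16,
		// Drop idle connections quickly so the kernel can reclaim
		// time_wait sockets sooner.
		IdleConnTimeout: 15 * time.Second,
	}
}

func main() {
	client := &http.Client{Transport: newInternodeTransport()}
	_ = client
}
```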
2020-10-12 14:19:46 -07:00
Harshavardhana
a0d0645128
remove safeMode behavior in startup (#10645)
In almost all scenarios MinIO now is
mostly ready for all sub-systems
independently; safe-mode is not useful
anymore and does not serve its original
intended purpose.

allow the server to be fully functional
even with a partially configured config;
this is to cater for availability of actual
I/O v/s manually fixing the server.

In k8s-like environments it will never make
sense to take a pod into a safe-mode state,
because there is no real access to perform
any remote operation on them.
2020-10-09 09:59:52 -07:00
Harshavardhana
eafa775952
fix: add lock ownership to expire locks (#10571)
- Add owner information for expiry, locking, and unlocking a resource
- TopLocks now returns locks in quorum by default, and provides
  a way to capture stale locks as well with `?stale=true`
- Simplify the quorum handling for locks to avoid deriving it from the storage
  class, because there were challenges making it consistent
  across all situations.
- And other tiny simplifications to reset locks.
2020-09-25 19:21:52 -07:00
Harshavardhana
90cff10e2b avoid crash if disks are not initialized 2020-09-23 12:00:29 -07:00
Harshavardhana
d2a3f92452
fix: health handler for lockers (#10280) 2020-08-18 07:27:41 -07:00
Harshavardhana
a20d4568a2
fix: make sure to use uniform drive count calculation (#10208)
It is possible in situations when the server was deployed
in an asymmetric configuration in the past, such as

```
minio server ~/fs{1...4}/disk{1...5}
```

This results in a setDriveCount of 10 in older releases,
but with fairly recent releases we have moved to
having server affinity, which means that the set drive
count ascertained from the above config will now be '4'.

While the object layer makes sure that we honor
`format.json`, the storageClass configuration however
was by mistake using the global value obtained
by heuristics, which leads to prematurely using
lower parity without it being requested by an
administrator.

This PR fixes this behavior.
2020-08-05 13:31:12 -07:00
Harshavardhana
fe157166ca
fix: Pass context all the way down to the network call in lockers (#10161)
Context timeouts might race with each other when timeouts are lower,
i.e. when two lock attempts happen very quickly on the same resource
and the servers are still trying to establish quorum.

This situation can lead to locks being held which wouldn't be unlocked
and subsequent lock attempts would fail.

This would require a complete server restart. One way this issue can
happen is when a server is booting up and we are trying
to hold a 'transaction.lock' in quick bursts with a short timeout.
2020-07-29 23:15:34 -07:00
Harshavardhana
7ed1077879
Add a custom healthcheck function for online status (#9858)
- Add changes to ensure remote disks are not
  incorrectly taken online if their order has
  changed or if they are the wrong disks.
- Bring changes to peers to detect disconnection
  with a separate Health handler, to avoid the
  rather expensive GetLocalDiskIDs() call.
- Follow up with the same changes for Lockers
  as well
2020-06-17 14:49:26 -07:00
Harshavardhana
4915433bd2
Support bucket versioning (#9377)
- Implement a new xl.json 2.0.0 format to support versioning;
  this moves the entire marshaling logic to the POSIX
  layer, and the top layer always consumes a common FileInfo
  construct which simplifies the metadata reads.
- Implement list object versions
- Migrate from crchash to siphash for object placement
  in new deployments.

Fixes #2111
2020-06-12 20:04:01 -07:00
Harshavardhana
cfc9cfd84a
fix: various optimizations, idiomatic changes (#9179)
- acquire a single leader lock for all background operations
  - healing, crawling and applying lifecycle policies.

- simplify lifecycle to avoid network calls, which was a
  bug in the implementation - we should hold a leader and
  do everything from there, since we have access to the entire
  namespace.

- make listing and walking not interfere, by slowing themselves
  down like the crawler.

- effectively use global context everywhere to ensure
  proper shutdown, in cache, lifecycle, healing

- don't read `format.json` for prometheus metrics in
  StorageInfo() call.
2020-03-22 12:16:36 -07:00
Harshavardhana
c9212819af
fix: lock maintenance should honor quorum (#9138)
The staleness of a lock should be determined by
a quorum number of entries returning stale;
this allows for situations when locks are held
while nodes are down - we don't accidentally
clear locks when they are valid
and correct.
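
The rule can be sketched as: only treat a lock as stale when at least a quorum of the participating lockers report it stale (a simplified illustration, not the actual maintenance code):

```Go
package main

import "fmt"

// shouldExpire reports whether a lock may be cleared: only when a quorum of
// the participating lockers consider it stale, so a lock that merely looks
// stale because some nodes are down is left alone.
func shouldExpire(staleReports, lockers int) bool {
	quorum := lockers/2 + 1
	return staleReports >= quorum
}

func main() {
	fmt.Println(shouldExpire(2, 5)) // false: only 2 of 5 lockers report stale
	fmt.Println(shouldExpire(3, 5)) // true: quorum reached
}
```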

Also, lock maintenance should be run by all servers,
not one server; stale lock cleanup needs to run outside
the requirement of holding distributed locks.

Thanks @klauspost for reproducing this issue
2020-03-15 11:55:52 -07:00
kannappanr
c7ca791c58
fix: lock expiry on zoned setups (#9084)
lock ownership is limited to endpoints on the first zone,
as we do not hold locks on other zones in an expanded
setup. The current code unintentionally expired active locks
when it couldn't see ownership from the secondary zone,
which leads to unexpected bugs as locking fails to work
as expected.
2020-03-04 16:06:17 -08:00
Harshavardhana
ab7d3cd508
fix: Speed up multi-object delete by taking bulk locks (#8974)
Change distributed locking to allow taking bulk locks
across objects, reducing what is usually 1000 calls to 1.

Also allows for situations where multiple clients send
delete requests to objects with the following names

```
{1,2,3,4,5}
```

```
{5,4,3,2,1}
```

will block on each other and ensure that we do not fail
either request.
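
One way to see why the two orderings above block instead of deadlocking: a single bulk lock can acquire the per-object locks in a canonical (sorted) order, so both callers contend on the same first name (a simplified sketch, not the actual dsync API):

```Go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// bulkLocker hands out one mutex per object name and always acquires them
// in sorted order, so callers locking {1,2,3,4,5} and {5,4,3,2,1} wait for
// each other instead of deadlocking.
type bulkLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newBulkLocker() *bulkLocker { return &bulkLocker{locks: map[string]*sync.Mutex{}} }

func (b *bulkLocker) lockAll(names []string) (unlock func()) {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	var held []*sync.Mutex
	var last string
	for i, n := range sorted {
		if i > 0 && n == last {
			continue // skip duplicate names
		}
		last = n
		b.mu.Lock()
		m, ok := b.locks[n]
		if !ok {
			m = &sync.Mutex{}
			b.locks[n] = m
		}
		b.mu.Unlock()
		m.Lock()
		held = append(held, m)
	}
	return func() {
		for _, m := range held {
			m.Unlock()
		}
	}
}

func main() {
	bl := newBulkLocker()
	unlock := bl.lockAll([]string{"5", "4", "3", "2", "1"})
	defer unlock()
	fmt.Println("locked 1..5 in canonical order")
}
```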
2020-02-21 11:29:57 +05:30
Harshavardhana
b00cda8ad4 Avoid running lock maintenance from all nodes (#8737)
Co-Authored-By: Krishnan Parthasarathi <krisis@users.noreply.github.com>
2020-01-03 23:11:07 +05:30
Harshavardhana
10b2f15f6f Add randomized sleep times for lock checkers (#8628) 2019-12-11 10:57:05 -08:00
Harshavardhana
720442b1a2
Add lock expiry handler to expire state locks (#8562) 2019-11-25 16:39:43 -08:00
Harshavardhana
c3771df641
Add bootstrap REST handler for verifying server config (#8550) 2019-11-22 12:45:13 -08:00
Harshavardhana
347b29d059 Implement bucket expansion (#8509) 2019-11-19 17:42:27 -08:00
Harshavardhana
e9b2bf00ad Support MinIO to be deployed on more than 32 nodes (#8492)
This PR moves locking from a global entity to
a more localized, set-level entity, allowing for locks
to be held only on the resources which are writing
to a collection of disks rather than at a global level.

In this process this PR also removes the top-level
limit of 32 nodes, allowing an unlimited number of nodes. This
is a precursor change before bringing in bucket expansion.
2019-11-13 12:17:45 -08:00
Harshavardhana
4e63e0e372 Return appropriate errors for API version changes across REST APIs (#8480)
This PR adds code to appropriately handle versioning issues
that come up quite frequently across our API changes. Currently
we were also routing our requests wrongly, which made it
harder to write consistent error handling code to appropriately
reject or honor requests.

This PR potentially fixes these issues:

 - old mc is used against a new minio release which is incompatible;
   an appropriate error is returned for client action.
 - any older servers talking to each other report an appropriate error
 - incompatible peer servers should report an error and reject the calls
   with an appropriate error
2019-11-04 09:30:59 -08:00
Harshavardhana
6a4ef2e48e Initialize configs correctly, move notification config (#8367)
This PR also removes deprecated tests, adds checks
to avoid races reproduced on CI/CD.
2019-10-09 11:41:15 +05:30
Praveen raj Mani
e48005ddc7 Add more context to rpc version mismatch errors (#8271)
Fixes #5665
2019-10-03 00:08:12 -07:00
Krishna Srinivas
2ab0681c0c Do not ignore Lock()'s return value (#8142) 2019-08-28 16:12:57 -07:00
Harshavardhana
e6d8e272ce
Use const slashSeparator instead of "/" everywhere (#8028) 2019-08-06 12:08:58 -07:00
Harshavardhana
b52b90412b Avoid data-transfer in distributed locking (#8004) 2019-08-05 11:45:30 -07:00