Add some design docs for distributed setup (#7950)

Harshavardhana 2019-07-22 19:18:11 -07:00 committed by Nitish Tiwari
parent 38bc3a45db
commit 87e6533cf3
2 changed files with 61 additions and 37 deletions


@ -1,28 +1,26 @@
# Large Bucket Support Design Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io) # Distributed Server Design Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)
This document explains the design approach, advanced use cases and limits of the MinIO distributed server.
This document explains the design approach, advanced use cases and limits of the large bucket feature. If you're looking to get started with large bucket support, we suggest you go through the [getting started document](https://github.com/minio/minio/blob/master/docs/large-bucket/README.md) first.
## Command-line ## Command-line
``` ```
NAME: NAME:
minio server - Start object storage server. minio server - start object storage server.
USAGE: USAGE:
minio server [FLAGS] DIR1 [DIR2..] minio server [FLAGS] DIR1 [DIR2..]
minio server [FLAGS] DIR{1...64} minio server [FLAGS] DIR{1...64}
DIR: DIR:
DIR points to a directory on a filesystem. When you want to combine multiple drives DIR points to a directory on a filesystem. When you want to combine
into a single large system, pass one directory per filesystem separated by space. multiple drives into a single large system, pass one directory per
You may also use a `...` convention to abbreviate the directory arguments. Remote filesystem separated by space. You may also use a '...' convention
directories in a distributed setup are encoded as HTTP(s) URIs. to abbreviate the directory arguments. Remote directories in a
distributed setup are encoded as HTTP(s) URIs.
``` ```
## Common usage ## Common usage
Standalone erasure coded configuration with 4 sets with 16 disks each. Standalone erasure coded configuration with 4 sets with 16 disks each.
``` ```
minio server dir{1...64} minio server dir{1...64}
``` ```
@ -33,38 +31,21 @@ Distributed erasure coded configuration with 64 sets with 16 disks each.
minio server http://host{1...16}/export{1...64} minio server http://host{1...16}/export{1...64}
``` ```
## Other usages ## Architecture
### Advanced use cases with multiple ellipses Expansion of ellipses and choice of erasure sets based on this expansion is an automated process in MinIO. Here are some of the details of our underlying erasure coding behavior.
Standalone erasure coded configuration with 4 sets with 16 disks each, which spans disks across controllers. - MinIO uses the [Reed-Solomon](https://github.com/klauspost/reedsolomon) erasure coding scheme, which has a total shard maximum of 256, i.e. 128 data and 128 parity. MinIO's design goes beyond this limitation by making some practical architecture choices.
```
minio server /mnt/controller{1...4}/data{1...16}
```
Standalone erasure coded configuration with 16 sets 16 disks per set, across mnts, across controllers. - An erasure set is a single erasure coding unit within a MinIO deployment. An object is sharded within an erasure set. The erasure set size is calculated automatically from the number of disks. MinIO supports an unlimited number of disks, but each erasure set has a minimum of 4 and a maximum of 16 disks.
```
minio server /mnt{1...4}/controller{1...4}/data{1...16}
```
Distributed erasure coded configuration with 2 sets 16 disks per set across hosts. - We limited an erasure set to 16 drives because erasure coding across more than 16 shards can become chatty and offers no performance advantage. Additionally, a 16-drive erasure set tolerates the loss of 8 disks per object by default, which is plenty in any practical scenario.
```
minio server http://host{1...32}/disk1
```
Distributed erasure coded configuration with rack level redundancy, 32 sets in total, 16 disks per set. - The choice of erasure set size is automatic, based on the number of disks available. For example, with 32 servers and 32 disks per server there are 1024 disks in total, and 16 becomes the erasure set size. The size is picked from the acceptable erasure set sizes *4, 6, 8, 10, 12, 14, 16*, using the greatest common divisor (GCD) of the disk counts to find the sizes that divide them evenly.
```
minio server http://rack{1...4}-host{1...8}.example.net/export{1...16}
```
Distributed erasure coded configuration with no rack level redundancy but with redundancy within the rack. Here we split the arguments per rack: 32 sets in total, 16 disks per set. - *If the total number of disks has several acceptable divisors, the algorithm chooses the erasure set size that yields the fewest erasure sets*. In the example with 1024 disks, 4, 8 and 16 are the acceptable divisors. With 16 disks per set we get 64 sets, with 8 disks per set we get 128 sets, and with 4 disks per set we get 256 sets. So the algorithm automatically chooses 64 sets, which is *16 × 64 = 1024* disks in total.
```
minio server http://rack1-host{1...8}.example.net/export{1...16} http://rack2-host{1...8}.example.net/export{1...16} http://rack3-host{1...8}.example.net/export{1...16} http://rack4-host{1...8}.example.net/export{1...16}
```
### Expected expansion for double ellipses - In this algorithm, we also make sure that we spread the disks out evenly. MinIO server expands ellipses passed as arguments. Here is a sample expansion to demonstrate the process.
MinIO server internally expands ellipses passed as arguments. Here is a sample expansion to demonstrate the process
``` ```
minio server http://host{1...4}/export{1...8} minio server http://host{1...4}/export{1...8}
@ -106,6 +87,50 @@ Expected expansion
> http://host4/export8 > http://host4/export8
``` ```
A noticeable trait of this expansion is that it chooses unique hosts such that the erasure code is efficient across drives and hosts.
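The set size selection described above can be sketched in a few lines of Go. This is a simplified illustration, not MinIO's implementation: `chooseSetSize` and `acceptableSetSizes` are hypothetical names, and the sketch works on a single total disk count rather than on the expanded argument patterns.

```go
package main

import "fmt"

// acceptableSetSizes lists the erasure set sizes named in this guide.
var acceptableSetSizes = []int{4, 6, 8, 10, 12, 14, 16}

// chooseSetSize returns the largest acceptable set size that divides
// totalDisks evenly, together with the resulting number of sets.
// Picking the largest size yields the fewest sets.
// (Hypothetical helper, written for illustration only.)
func chooseSetSize(totalDisks int) (setSize, setCount int, err error) {
	for i := len(acceptableSetSizes) - 1; i >= 0; i-- {
		size := acceptableSetSizes[i]
		if totalDisks >= size && totalDisks%size == 0 {
			return size, totalDisks / size, nil
		}
	}
	return 0, 0, fmt.Errorf("no acceptable erasure set size divides %d disks", totalDisks)
}

func main() {
	size, count, err := chooseSetSize(1024)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d sets of %d disks\n", count, size) // 64 sets of 16 disks
}
```

For 1024 disks the candidates are 4, 8 and 16, and the sketch settles on 16, which matches the 64 sets worked out above.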
- The erasure set for an object is chosen during `PutObject()`. The object name is used to find the right erasure set, as in the following pseudocode.
```go
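import "hash/crc32"

// crcHashMod hashes the key to an index in [0, cardinality); -1 if cardinality <= 0.
func crcHashMod(key string, cardinality int) int {
	if cardinality <= 0 {
		return -1
	}
	keyCrc := crc32.Checksum([]byte(key), crc32.IEEETable)
	return int(keyCrc % uint32(cardinality))
}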
```
The input key is the object name specified in `PutObject()`, and the function returns an index in the range `[0, cardinality)`. This index identifies the erasure set where the object will reside. The function is deterministic: for a given object name and number of erasure sets, the index returned is always the same.
- Write and read quorum need to be satisfied only within the erasure set that holds the object. Healing is also done per object, within the erasure set that contains it.
- MinIO does erasure coding at the object level, not at the volume level, unlike other object storage vendors. This allows applications to choose a different storage class for each object upload by setting `x-amz-storage-class` to `STANDARD` or `REDUCED_REDUNDANCY`, which makes effective use of the cluster's capacity. These choices can also be enforced with IAM policies, so that clients upload with the correct HTTP headers.
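To make the object placement step concrete, here is a small runnable sketch built around the `crcHashMod` function shown earlier. The object names and the set count of 64 are assumptions chosen for illustration; only `crcHashMod` itself comes from this guide.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// crcHashMod follows the function shown in this guide, with a guard
// against a non-positive cardinality.
func crcHashMod(key string, cardinality int) int {
	if cardinality <= 0 {
		return -1
	}
	keyCrc := crc32.Checksum([]byte(key), crc32.IEEETable)
	return int(keyCrc % uint32(cardinality))
}

func main() {
	const setCount = 64 // e.g. 1024 disks arranged as 64 sets of 16

	// Hypothetical object names, used only for illustration.
	names := []string{"photos/2019/a.jpg", "photos/2019/b.jpg", "backup.tar"}
	for _, name := range names {
		fmt.Printf("%-20s -> erasure set %d\n", name, crcHashMod(name, setCount))
	}
}
```

Because the index is the checksum modulo the number of sets, the same name always maps to the same set as long as the set count stays the same. A different set count would generally produce a different index for the same name.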
## Other usages
### Advanced use cases with multiple ellipses
Standalone erasure coded configuration with 4 sets with 16 disks each, which spans disks across controllers.
```
minio server /mnt/controller{1...4}/data{1...16}
```
Standalone erasure coded configuration with 16 sets, 16 disks per set, across mounts and controllers.
```
minio server /mnt{1...4}/controller{1...4}/data{1...16}
```
Distributed erasure coded configuration with 2 sets, 16 disks per set across hosts.
```
minio server http://host{1...32}/disk1
```
Distributed erasure coded configuration with rack level redundancy, 32 sets in total, 16 disks per set.
```
minio server http://rack{1...4}-host{1...8}.example.net/export{1...16}
```
Distributed erasure coded configuration with no rack level redundancy but with redundancy within the rack. Here we split the arguments per rack: 32 sets in total, 16 disks per set.
```
minio server http://rack1-host{1...8}.example.net/export{1...16} http://rack2-host{1...8}.example.net/export{1...16} http://rack3-host{1...8}.example.net/export{1...16} http://rack4-host{1...8}.example.net/export{1...16}
```
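The disk counts quoted for these patterns can be checked with a small expansion sketch. This is not MinIO's own parser: `expand` is a hypothetical helper that handles only numeric three-dot ranges, and the ordering it produces (leftmost range varying fastest) is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// ellipsis matches a numeric range such as {1...16} (three dots).
var ellipsis = regexp.MustCompile(`\{(\d+)\.\.\.(\d+)\}`)

// expand returns every endpoint produced by the {a...b} ranges in arg.
// The leftmost range varies fastest; that ordering is an assumption here.
func expand(arg string) []string {
	loc := ellipsis.FindStringSubmatchIndex(arg)
	if loc == nil {
		return []string{arg} // no range left to expand
	}
	prefix := arg[:loc[0]]
	// The regexp guarantees digits only, so conversion errors are ignored.
	start, _ := strconv.Atoi(arg[loc[2]:loc[3]])
	end, _ := strconv.Atoi(arg[loc[4]:loc[5]])

	var out []string
	for _, tail := range expand(arg[loc[1]:]) {
		for v := start; v <= end; v++ {
			out = append(out, prefix+strconv.Itoa(v)+tail)
		}
	}
	return out
}

func main() {
	eps := expand("http://host{1...4}/export{1...8}")
	fmt.Println(len(eps), eps[0], eps[len(eps)-1])
	// 32 http://host1/export1 http://host4/export8

	total := len(expand("http://rack{1...4}-host{1...8}.example.net/export{1...16}"))
	fmt.Println(total, total/16) // 512 32
}
```

The rack example expands to 4 × 8 × 16 = 512 endpoints, which at 16 disks per set gives the 32 sets stated above.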
## Backend `format.json` changes ## Backend `format.json` changes
`format.json` has new fields `format.json` has new fields
@ -186,5 +211,5 @@ type formatXLV2 struct {
## Limits ## Limits
- Minimum of 4 disks are needed for erasure coded configuration. - Minimum of 4 disks are needed for any erasure coded configuration.
- Maximum of 32 distinct nodes are supported in distributed configuration. - Maximum of 32 distinct nodes are supported in distributed configuration.


@ -67,7 +67,6 @@ __NOTE:__ `{1...n}` shown have 3 dots! Using only 2 dots `{1..32}` will be inter
To test this setup, access the MinIO server via browser or [`mc`](https://docs.min.io/docs/minio-client-quickstart-guide). To test this setup, access the MinIO server via browser or [`mc`](https://docs.min.io/docs/minio-client-quickstart-guide).
## Explore Further ## Explore Further
- [MinIO Large Bucket Support Guide](https://docs.min.io/docs/minio-large-bucket-support-quickstart-guide)
- [MinIO Erasure Code QuickStart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide) - [MinIO Erasure Code QuickStart Guide](https://docs.min.io/docs/minio-erasure-code-quickstart-guide)
- [Use `mc` with MinIO Server](https://docs.min.io/docs/minio-client-quickstart-guide) - [Use `mc` with MinIO Server](https://docs.min.io/docs/minio-client-quickstart-guide)
- [Use `aws-cli` with MinIO Server](https://docs.min.io/docs/aws-cli-with-minio) - [Use `aws-cli` with MinIO Server](https://docs.min.io/docs/aws-cli-with-minio)