Commit Graph

123 Commits

Author SHA1 Message Date
Harshavardhana fb96779a8a Add large bucket support for erasure coded backend (#5160)
This PR implements an object layer which
combines input erasure sets of XL layers
into a unified namespace.

This object layer extends the existing
erasure coded implementation, it is assumed
in this design that providing > 16 disks is
a static configuration as well i.e if you started
the setup with 32 disks with 4 sets 8 disks per
pack then you would need to provide 4 sets always.

Some design details and restrictions:

- Objects are distributed using consistent ordering
  to a unique erasure coded layer.
- Each pack has its own dsync so locks are synchronized
  properly at pack (erasure layer).
- Each pack still has a maximum of 16 disks
  requirement, you can start with multiple
  such sets statically.
- Static sets set of disks and cannot be
  changed, there is no elastic expansion allowed.
- Static sets set of disks and cannot be
  changed, there is no elastic removal allowed.
- ListObjects() across sets can be noticeably
  slower since List happens on all servers,
  and is merged at this sets layer.

Fixes #5465
Fixes #5464
Fixes #5461
Fixes #5460
Fixes #5459
Fixes #5458
Fixes #5460
Fixes #5488
Fixes #5489
Fixes #5497
Fixes #5496
2018-02-15 17:45:57 -08:00
poornas 4f73fd9487 Unify gateway and object layer. (#5487)
* Unify gateway and object layer. Bring bucket policies into
object layer.
2018-02-09 15:19:30 -08:00
Harshavardhana 0c880bb852 Deprecate and remove in-memory object caching (#5481)
in-memory caching cannot be cleanly implemented
without the access to GC which Go doesn't naturally
provide. At times we have seen that object caching
is more of an hindrance rather than a boon for
our use cases.

Removing it completely from our implementation
  related to #5160 and #5182
2018-02-02 10:17:13 -08:00
Harshavardhana 1ebbc2ce88 Make sure to convert the disk errors to object errors (#5480)
Fixes a bug introduced in the directory support PR, with
this fix s3fs works properly.
2018-02-02 14:04:15 +05:30
Krishna Srinivas 2afd196c83 Quorum based listing for XL (#5475)
fixes #5380
2018-02-01 10:47:49 -08:00
Harshavardhana 3ea28e9771 Support creating directories on erasure coded backend (#5443)
This PR continues from #5049 where we started supporting
directories for erasure coded backend
2018-01-30 08:13:13 +05:30
poornas 0bb6247056 Move nslocking from s3 layer to object layer (#5382)
Fixes #5350
2018-01-13 10:04:52 +05:30
kannappanr 20584dc08f
Remove unnecessary errors printed on the console (#5386)
Some of the errors printed on server console can be
removed as those error message is unnecessary.

Fixes #5385
2018-01-11 11:42:05 -08:00
Nitish Tiwari 1e5fb4b79a
Fix storage class related issues (#5338)
- Update startup banner to print storage class in capitals. This
makes it easier to identify different storage classes available.

- Update response metadata to not send STANDARD storage class.
This is in accordance with AWS S3 behaviour.

- Update minio-go library to bring in storage class related
changes. This is needed to make transparent translation of
storage class headers for Minio S3 Gateway.
2018-01-04 11:44:45 +05:30
Harshavardhana c0721164be Automatically set goroutines based on shardSize (#5346)
Update reedsolomon library to enable feature to automatically
set number of go-routines based on the input shard size,
since shard size is sort of a constant in Minio for
objects > 10MiB (default blocksize)

klauspost reported around 15-20% improvement in performance
numbers on older systems such as AVX and SSE3

```
name                  old speed      new speed      delta
Encode10x2x10000-8    5.45GB/s ± 1%  6.22GB/s ± 1%  +14.20%    (p=0.000 n=9+9)
Encode100x20x10000-8  1.44GB/s ± 1%  1.64GB/s ± 1%  +13.77%  (p=0.000 n=10+10)
Encode17x3x1M-8       10.0GB/s ± 5%  12.0GB/s ± 1%  +19.88%  (p=0.000 n=10+10)
Encode10x4x16M-8      7.81GB/s ± 5%  8.56GB/s ± 5%   +9.58%   (p=0.000 n=10+9)
Encode5x2x1M-8        15.3GB/s ± 2%  19.6GB/s ± 2%  +28.57%   (p=0.000 n=9+10)
Encode10x2x1M-8       12.2GB/s ± 5%  15.0GB/s ± 5%  +22.45%  (p=0.000 n=10+10)
Encode10x4x1M-8       7.84GB/s ± 1%  9.03GB/s ± 1%  +15.19%    (p=0.000 n=9+9)
Encode50x20x1M-8      1.73GB/s ± 4%  2.09GB/s ± 4%  +20.59%   (p=0.000 n=10+9)
Encode17x3x16M-8      10.6GB/s ± 1%  11.7GB/s ± 4%  +10.12%   (p=0.000 n=8+10)
```
2018-01-03 13:47:22 -08:00
Nitish Tiwari 545a9e4a82 Fix storage class related issues (#5322)
- Add storage class metadata validation for request header
- Change storage class header values to be consistent with AWS S3
- Refactor internal method to take only the reqd argument
2017-12-27 10:06:16 +05:30
Nitish Tiwari 1a3dbbc9dd
Add x-amz-storage-class support (#5295)
This adds configurable data and parity options on a per object
basis. To use variable parity

- Users can set environment variables to cofigure variable
parity

- Then add header x-amz-storage-class to putobject requests
with relevant storage class values

Fixes #4997
2017-12-22 16:58:13 +05:30
Harshavardhana 8efa82126b
Convert errors tracer into a separate package (#5221) 2017-11-25 11:58:29 -08:00
Harshavardhana 5eb210dd2e Set etag properly to calculated value if available (#5106)
Fixes #5100
2017-10-24 12:25:42 -07:00
Harshavardhana 1d8a8c63db Simplify data verification with HashReader. (#5071)
Verify() was being called by caller after the data
has been successfully read after io.EOF. This disconnection
opens a race under concurrent access to such an object.
Verification is not necessary outside of Read() call,
we can simply just do checksum verification right inside
Read() call at io.EOF.

This approach simplifies the usage.
2017-10-22 11:00:34 +05:30
Frank Wessels f598f4fd1b Fix typo in comment (#5088) 2017-10-20 15:08:15 +05:30
A. Elleuch b919462610 fix: Avoid teeing data into a null cache buffer (#5070)
In some cases, Cache manager returns ErrCacheFull error when creating a
new cache buffer but the code still sends object data to nil cache buffer data.
2017-10-18 14:42:10 -07:00
Harshavardhana 0b546ddfd4 Return errors in PutObject()/PutObjectPart() if input size is -1. (#5015)
Amazon S3 API expects all incoming stream has a content-length
set it was superflous for us to support object layer which supports
unknown sized stream as well, this PR removes such requirements
and explicitly error out if input stream is less than zero.
2017-10-06 09:38:01 -07:00
Andreas Auernhammer 02af37a394 optimize memory allocs during reconstruct (#4964)
The reedsolomon library now avoids allocations during reconstruction.
This change exploits that to reduce memory allocs and GC preasure during
healing and reading.
2017-09-27 10:29:42 -07:00
Andreas Auernhammer 79ba4d3f33 refactor ObjectLayer PutObject and PutObjectPart (#4925)
This change refactor the ObjectLayer PutObject and PutObjectPart
functions. Instead of passing an io.Reader and a size to PUT operations
ObejectLayer expects an HashReader.
A HashReader verifies the MD5 sum (and SHA256 sum if required) of the object.
This change updates all all PutObject(Part) calls and removes unnecessary code
in all ObjectLayer implementations.

Fixes #4923
2017-09-19 12:40:27 -07:00
Harshavardhana db5af1b126 fix: tests error conditions should be used properly. (#4833) 2017-08-23 17:58:52 -07:00
Harshavardhana 2e6ee68409 fix: [minor] Avoid unnecessary typecasting. (#4828)
We don't need to typecast identifiers from
their base to type to same type again. This
is not a bug and compiler is fine to skip
it but it is better to avoid if not needed.
2017-08-18 11:45:16 -07:00
Andreas Auernhammer 85fcee1919 erasure: simplify XL backend operations (#4649) (#4758)
This change provides new implementations of the XL backend operations:
 - create file
 - read   file
 - heal   file
Further this change adds table based tests for all three operations.

This affects also the bitrot algorithm integration. Algorithms are now
integrated in an idiomatic way (like crypto.Hash).
Fixes #4696
Fixes #4649
Fixes #4359
2017-08-14 18:08:42 -07:00
Frank Wessels 46897b1100 Name return values to prevent the need (and unnecessary code bloat) (#4576)
This is done to explicitly instantiate objects for every return statement.
2017-06-21 19:53:09 -07:00
Anis Elleuch af8071c86a xl: Fix rare freeze after many disk/network errors (#4438)
xl.storageDisks is sometimes passed to some low-level XL functions. Some disks in
xl.storageDisks are set to nil when they encounter some errors. This means all
elements in xl.storageDisks will be nil after some time which lead to an unusable XL.
2017-06-14 17:14:27 -07:00
Aditya Manthramurthy 8975da4e84 Add new ReadFileWithVerify storage-layer API (#4349)
This is an enhancement to the XL/distributed-XL mode. FS mode is
unaffected.

The ReadFileWithVerify storage-layer call is similar to ReadFile with
the additional functionality of performing bit-rot checking. It
accepts additional parameters for a hashing algorithm to use and the
expected hex-encoded hash string.

This patch provides significant performance improvement because:

1. combines the step of reading the file (during
erasure-decoding/reconstruction) with bit-rot verification;

2. limits the number of file-reads; and

3. avoids transferring the file over the network for bit-rot
verification.

ReadFile API is implemented as ReadFileWithVerify with empty hashing
arguments.

Credits to AB and Harsha for the algorithmic improvement.

Fixes #4236.
2017-05-16 14:21:52 -07:00
Harshavardhana 155a90403a fs/erasure: Rename meta 'md5Sum' as 'etag'. (#4319)
This PR also does backend format change to 1.0.1
from 1.0.0.  Backward compatible changes are still
kept to read the 'md5Sum' key. But all new objects
will be stored with the same details under 'etag'.

Fixes #4312
2017-05-14 12:05:51 -07:00
Harshavardhana 298b470f69 fs/erasure: Ignore objects with / even for DeleteObject() (#4303)
Additionally GetObject() also returns errFileNotFound similar
to HeadObject().

Fixes #4302
2017-05-09 14:32:24 -07:00
Bala FA 1c97dcb10a Add UTCNow() function. (#3931)
This patch adds UTCNow() function which returns current UTC time.

This is equivalent of UTCNow() == time.Now().UTC()
2017-03-18 11:28:41 -07:00
Anis Elleuch a5e60706a2 xl,fs: Return 404 if object ends with a separator (#3897)
HEAD Object for FS and XL was returning invalid object name when
an object name has a trailing slash separator, this PR changes the
behavior and will always return 404 object not found, this guarantees
a better compatibility with S3 spec.
2017-03-13 22:20:46 -07:00
Anis Elleuch a2eae54d11 xl: Respect min. space by checking PrepareFile err (#3867)
It was possible to upload a big file which overcomes the minimal
disk space limit in XL, PrepareFile was actually checking for disk
space but we weren't checking its returned error. This patch fixes
this behavior.
2017-03-07 14:48:56 -08:00
Anis Elleuch dce0345f8f Set disk to nil after write which needs quorum (#3795)
Ignore a disk which wasn't able to successfully perform an action to
avoid eventual perturbations when the disk comes back in the middle
of write change.
2017-02-26 11:58:32 -08:00
Harshavardhana bcc5b6e1ef xl: Rename getOrderedDisks as shuffleDisks appropriately. (#3796)
This PR is for readability cleanup

- getOrderedDisks as shuffleDisks
- getOrderedPartsMetadata as shufflePartsMetadata

Distribution is now a second argument instead being the
primary input argument for brevity.

Also change the usage of type casted int64(0), instead
rely on direct type reference as `var variable int64` everywhere.
2017-02-24 09:20:40 -08:00
Harshavardhana cc28765025 xl/multipart: Make sure to delete temp renamed object. (#3785)
Existing objects before overwrites are renamed to
temp location in completeMultipart. We make sure
that we delete it even if subsequenty calls fail.

Additionally move verifying of parent dir is a
file earlier to fail the entire operation.

Ref #3784
2017-02-21 19:43:44 -08:00
Harshavardhana 7ea1de8245 copyObject: Be case sensitive for windows only server. (#3766)
For case sensitive platforms we should honor case.

Fixes #3765

```
1) python s3cmd -c s3cfg_localminio put logo.png s3://testbucket/xyz/etc2/logo.PNG

2) python s3cmd -c s3cfg_localminio ls s3://testbucket/xyz/etc2/
2017-02-18 10:58     22059   s3://testbucket/xyz/etc2/logo.PNG

3) python s3cmd -c s3cfg_localminio cp s3://testbucket/xyz/etc2/logo.PNG s3://testbucket/xyz/etc2/logo.png
remote copy: 's3://testbucket/xyz/etc2/logo.PNG' -> 's3://testbucket/xyz/etc2/logo.png'

4) python s3cmd -c s3cfg_localminio ls s3://testbucket/xyz/etc2/
2017-02-18 10:58     22059   s3://testbucket/xyz/etc2/logo.PNG
2017-02-18 11:10     22059   s3://testbucket/xyz/etc2/logo.png
```
2017-02-18 13:41:59 -08:00
Harshavardhana 22909c849e objcache: Return io.ReaderAt to avoid Seeking and Reading. (#3735) 2017-02-11 17:17:58 -08:00
Harshavardhana 6a6c930f5b xl: Abort multipart upload should honor quorum properly. (#3670)
Current implementation didn't honor quorum properly and didn't
handle the errors generated properly. This patch addresses that
and also moves common code `cleanupMultipartUploads` into xl
specific private function.

Fixes #3665
2017-02-01 11:16:17 -08:00
Harshavardhana 1b30a3be2b xl/utils: getPartSizeFromIdx should return error. (#3669) 2017-01-31 15:34:49 -08:00
Anis Elleuch e9394dc22d xl PutObject: Split object into parts (#3651)
For faster time-to-first-byte when we try to download a big object
2017-01-30 15:44:42 -08:00
Krishna Srinivas b288eaddb3 xl: bit-rot algo was not set in get-object. (#3652)
fixes #3650
2017-01-30 14:25:28 -08:00
Anis Elleuch e1bc99e4fe xl: Fix GET of an empty multiparted object (#3646)
GetObject returns unsatisfied range error when we try to download an object
uploaded using multipart mechanism.
2017-01-27 10:51:02 -08:00
Harshavardhana 51fa4f7fe3 Make PutObject a nop for an object which ends with "/" and size is '0' (#3603)
This helps majority of S3 compatible applications while not returning
an error upon directory create request.

Fixes #2965
2017-01-20 16:33:01 -08:00
Harshavardhana 98a6a2bcab obj: Return objectInfo for CompleteMultipartUpload(). (#3587)
This patch avoids doing GetObjectInfo() in similar way
how we did for PutOject().
2017-01-16 19:23:43 -08:00
Harshavardhana 1c699d8d3f fs: Re-implement object layer to remember the fd (#3509)
This patch re-writes FS backend to support shared backend sharing locks for safe concurrent access across multiple servers.
2017-01-16 17:05:00 -08:00
Harshavardhana 69559aa101 objAPI: Implement CopyObject API. (#3487)
This is written so that to simplify our handler code
and provide a way to only update metadata instead of
the data when source and destination in CopyObject
request are same.

Fixes #3316
2016-12-26 16:29:26 -08:00
Harshavardhana 15b4c49621 fs/xl: Simplify bucket metadata reading. (#3486)
ObjectLayer GetObject() now returns the entire object
if starting offset is 0 and length is negative. This
also allows to simplify handler layer code where
we always had to use GetObjectInfo() before proceeding
to read bucket metadata files examples `policy.json`.

This also reduces one additional call overhead.
2016-12-21 11:29:32 -08:00
Harshavardhana faa6b1e925 vendorize deps for snappy, blake2b and sha256 (#3476)
Bring in new optimization and portability changes.

Fixes https://github.com/minio/minio-go/issues/578
2016-12-19 19:32:55 -08:00
Harshavardhana 4daa0d2cee lock: Moving locking to handler layer. (#3381)
This is implemented so that the issues like in the
following flow don't affect the behavior of operation.

```
GetObjectInfo()
.... --> Time window for mutation (no lock held)
.... --> Time window for mutation (no lock held)
GetObject()
```

This happens when two simultaneous uploads are made
to the same object the object has returned wrong
info to the client.

Another classic example is "CopyObject" API itself
which reads from a source object and copies to
destination object.

Fixes #3370
Fixes #2912
2016-12-10 16:15:12 -08:00
Harshavardhana b363709c11 caching: Optimize memory allocations. (#3405)
This change brings in changes at multiple places

 - Reuse buffers at almost all locations ranging
   from rpc, fs, xl, checksum etc.
 - Change caching behavior to disable itself
   under low memory conditions i.e < 8GB of RAM.
 - Only objects cached are of size 1/10th the size
   of the cache for example if 4GB is the cache size
   the maximum object size which will be cached
   is going to be 400MB. This change is an
   optimization to cache more objects rather
   than few larger objects.
 - If object cache is enabled default GC
   percent has been reduced to 20% in lieu
   with newly found behavior of GC. If the cache
   utilization reaches 75% of the maximum value
   GC percent is reduced to 10% to make GC
   more aggressive.
 - Do not use *bytes.Buffer* due to its growth
   requirements. For every allocation *bytes.Buffer*
   allocates an additional buffer for its internal
   purposes. This is undesirable for us, so
   implemented a new cappedWriter which is capped to a
   desired size, beyond this all writes rejected.

Possible fix for #3403.
2016-12-08 20:35:07 -08:00
Harshavardhana ff4ce0ee14 fs/xl: Combine input checks into re-usable functions. (#3383)
Repeated code around both object layers are moved
and combined into simple re-usable functions.
2016-12-01 23:15:17 -08:00
Bala FA 1d4ac4b084 Rename getUUID() into mustGetUUID() (#3320)
In case of UUID generation failure mustGetUUID() will panic than
infinitely trying in for loop.
2016-11-22 16:52:37 -08:00
Harshavardhana 5197649081 utils: reduceErrs returns and validates quorum errors. (#3300)
This is needed as explained by @krisis

Lets say we have following errors.

```
[]error{nil, errFileNotFound, errDiskAccessDenied, errDiskAccesDenied}
```

Since the last two errors are filtered, the maximum is nil,
depending on map order.

Let's say we get nil from reduceErr. Clearly at this point
we don't have quorum nodes agreeing about the data and since
GetObject only requires N/2 (Read quorum) and isDiskQuorum
would have returned true. This is problematic and can lead to
undersiable consequences.

Fixes #3298
2016-11-21 01:47:26 -08:00
Krishnan Parthasarathi eed9ab0464 XL: pickValidXLMeta should return error instead of panic'ing (#3277) 2016-11-20 20:56:44 -08:00
Harshavardhana 0b9f0d14a1 auth/rpc: Take remote disk offline after maximum allowed attempts. (#3288)
Disks when are offline for a long period of time, we should
ignore the disk after trying Login upto 5 times.

This is to reduce the network chattiness, this also reduces
the overall time spent on `net.Dial`.

Fixes #3286
2016-11-20 16:57:12 -08:00
Anis Elleuch ffbee70e04 Avoid removing 'tmp' directory inside '.minio.sys' (#3294) 2016-11-20 14:25:43 -08:00
Aditya Manthramurthy dd0698d14c Improve namespace lock API: (#3203)
- abstract out instrumentation information.
- use separate lockInstance type that encapsulates the nsMutex, volume,
  path and opsID as the frontend or top-level lock object.
2016-11-09 10:58:41 -08:00
Anis Elleuch a47ce7ab22 Add support of fallocate for FS and XL backends (#3032) 2016-10-29 12:44:44 -07:00
Krishnan Parthasarathi 6fc81dc162 Delete temp object/part when PutObject{,Part} fails (#3004) 2016-10-19 22:52:03 -07:00
Anis Elleuch 2208992e6a More informative message when erasure fails to read a part of an object (#2989) 2016-10-18 13:09:26 -07:00
Harshavardhana 39331b6b4e xl: GetCheckSumInfo() shouldn't fail if hash not available. (#2984)
In a multipart upload scenario disks going down and coming backup
can lead to certain parts missing on the disk/server which was
going down. This is a valid case since these blocks can be
missing and should be healed through heal operation. But we are
not supposed to fail prematurely since we have enough data on
the other disks as well within read-quorum.

This fix relaxes previous assumption, fixes a major corruption
issue reproduced by @vadmeste.

Fixes #2976
2016-10-18 11:13:25 -07:00
Krishnan Parthasarathi b89609dc2e XL: Filter out md5Sum from user defined headers (#2962) 2016-10-17 08:41:33 -07:00
Harshavardhana fee3f99a6e xl: heal bucket should validate if bucket exists first. (#2953)
Fixes #2944
2016-10-17 02:10:23 -07:00
Harshavardhana f22862aa28 heal: Refactor heal command. (#2901)
- return errors for heal operation through rpc replies.
  - implement rotating wheel for healing status.

Fixes #2491
2016-10-14 19:57:40 -07:00
Krishna Srinivas 0320a77dc0 HealBucket: create the bucket if it is missing in one of the disks. (#2924) 2016-10-14 11:12:17 -07:00
Harshavardhana 6494b77d41 server: Add more elaborate startup messages. (#2731)
These messages based on our prep stage during XL
and prints more informative message regarding
drive information.

This change also does a much needed refactoring.
2016-10-05 12:48:07 -07:00
Krishna Srinivas 61a18ed48f sha256: Verify sha256 along with md5sum, signature is verified on the request early. (#2813) 2016-10-02 15:51:49 -07:00
Aditya Manthramurthy 10d2ef5449 Remove comments relating to deprecated MINIO_DEBUG envvar (#2797) 2016-09-27 18:28:46 -07:00
Krishnan Parthasarathi 669783f875 Purge stale object cache entry (#2770) 2016-09-23 19:55:28 -07:00
Karthic Rao 8bd78fbdfb performance: gjson parsing for readXLMeta, listParts, getObjectInfo. (#2631)
- Using gjson for constructing xlMetaV1{} in realXLMeta.
- Test for parsing constructing xlMetaV1{} using gjson.
- Changes made since benchmarks showed 30-40% improvement in speed.
- Follow up comments in issue https://github.com/minio/minio/issues/2208
  for more details.
- gjson parsing of parts from xl.json for listParts.
- gjson parsing of statInfo from xl.json for getObjectInfo.
- Vendorizing gjson dependency.
2016-09-13 21:18:30 -07:00
Krishna Srinivas b4e4846e9f PutObject: object layer now returns ObjectInfo instead of md5sum to avoid extra GetObjectInfo call. (#2599)
From the S3 layer after PutObject we were calling GetObjectInfo for bucket notification. This can
be avoided if PutObjectInfo returns ObjectInfo.

fixes #2567
2016-09-13 21:18:30 -07:00
Krishna Srinivas 9358ee011b logging: Print stack trace in case of errors.
fixes #1827
2016-09-13 21:18:30 -07:00
Karthic Rao 07d232c7b4 instrumentation: instrumentation for locks. (#2584)
- Instrumentation for locks.
- Detailed test coverage.
- Adding RPC control handler to fetch lock instrumentation.
- RPC control handlers suite tests with a test RPC server.
2016-09-13 21:18:30 -07:00
Harshavardhana bccf549463 server: Move all the top level files into cmd folder. (#2490)
This change brings a change which was done for the 'mc'
package to allow for clean repo and have a cleaner
github drop in experience.
2016-08-18 16:23:42 -07:00