ddea0bdf11
``` CSV parsing, BEFORE: BenchmarkReaderBasic-12 2842 407533 ns/op 397860 B/op 957 allocs/op BenchmarkReaderReplace-12 2718 429914 ns/op 397844 B/op 957 allocs/op BenchmarkReaderReplaceTwo-12 2718 435556 ns/op 397855 B/op 957 allocs/op BenchmarkAggregateCount_100K-12 171 6798974 ns/op 16667102 B/op 308077 allocs/op BenchmarkAggregateCount_1M-12 19 65657411 ns/op 168057743 B/op 3146610 allocs/op BenchmarkSelectAll_10M-12 1 20882119900 ns/op 2758799896 B/op 41978762 allocs/op CSV parsing, AFTER: BenchmarkReaderBasic-12 3721 312549 ns/op 101920 B/op 338 allocs/op BenchmarkReaderReplace-12 3776 318810 ns/op 101993 B/op 340 allocs/op BenchmarkReaderReplaceTwo-12 3610 330967 ns/op 102012 B/op 341 allocs/op BenchmarkAggregateCount_100K-12 295 4149588 ns/op 3553623 B/op 103261 allocs/op BenchmarkAggregateCount_1M-12 30 37746503 ns/op 33827931 B/op 1049435 allocs/op BenchmarkSelectAll_10M-12 1 17608495800 ns/op 1416504040 B/op 21007082 allocs/op ~ benchcmp old.txt new.txt benchmark old ns/op new ns/op delta BenchmarkReaderBasic-12 407533 312549 -23.31% BenchmarkReaderReplace-12 429914 318810 -25.84% BenchmarkReaderReplaceTwo-12 435556 330967 -24.01% BenchmarkAggregateCount_100K-12 6798974 4149588 -38.97% BenchmarkAggregateCount_1M-12 65657411 37746503 -42.51% BenchmarkSelectAll_10M-12 20882119900 17608495800 -15.68% benchmark old allocs new allocs delta BenchmarkReaderBasic-12 957 338 -64.68% BenchmarkReaderReplace-12 957 340 -64.47% BenchmarkReaderReplaceTwo-12 957 341 -64.37% BenchmarkAggregateCount_100K-12 308077 103261 -66.48% BenchmarkAggregateCount_1M-12 3146610 1049435 -66.65% BenchmarkSelectAll_10M-12 41978762 21007082 -49.96% benchmark old bytes new bytes delta BenchmarkReaderBasic-12 397860 101920 -74.38% BenchmarkReaderReplace-12 397844 101993 -74.36% BenchmarkReaderReplaceTwo-12 397855 102012 -74.36% BenchmarkAggregateCount_100K-12 16667102 3553623 -78.68% BenchmarkAggregateCount_1M-12 168057743 33827931 -79.87% BenchmarkSelectAll_10M-12 2758799896 1416504040 -48.66% ``` ``` BenchmarkReaderHuge/97K-12 2200 540840 ns/op 184.32 MB/s 1604450 B/op 687 allocs/op BenchmarkReaderHuge/194K-12 1522 752257 ns/op 265.04 MB/s 2143135 B/op 1335 allocs/op BenchmarkReaderHuge/389K-12 1190 947858 ns/op 420.69 MB/s 3221831 B/op 2630 allocs/op BenchmarkReaderHuge/778K-12 806 1472486 ns/op 541.61 MB/s 5201856 B/op 5187 allocs/op BenchmarkReaderHuge/1557K-12 426 2575269 ns/op 619.36 MB/s 9101330 B/op 10233 allocs/op BenchmarkReaderHuge/3115K-12 286 4034656 ns/op 790.66 MB/s 12397968 B/op 16099 allocs/op BenchmarkReaderHuge/6230K-12 172 6830563 ns/op 934.05 MB/s 16008416 B/op 26844 allocs/op BenchmarkReaderHuge/12461K-12 100 11409467 ns/op 1118.39 MB/s 22655163 B/op 48107 allocs/op BenchmarkReaderHuge/24922K-12 66 19780395 ns/op 1290.19 MB/s 35158559 B/op 90216 allocs/op BenchmarkReaderHuge/49844K-12 34 37282559 ns/op 1369.03 MB/s 60528624 B/op 174497 allocs/op ``` |
||
---|---|---|
.. | ||
README.md | ||
select.py |
README.md
Select API Quickstart Guide
Traditional retrieval of objects is always as whole entities, i.e GetObject for a 5 GiB object, will always return 5 GiB of data. S3 Select API allows us to retrieve a subset of data by using simple SQL expressions. By using Select API to retrieve only the data needed by the application, drastic performance improvements can be achieved.
You can use the Select API to query objects with following features:
- CSV, JSON and Parquet - Objects must be in CSV, JSON, or Parquet format.
- UTF-8 is the only encoding type the Select API supports.
- GZIP or BZIP2 - CSV and JSON files can be compressed using GZIP or BZIP2. The Select API supports columnar compression for Parquet using GZIP, Snappy, LZ4. Whole object compression is not supported for Parquet objects.
- Server-side encryption - The Select API supports querying objects that are protected with server-side encryption.
Type inference and automatic conversion of values is performed based on the context when the value is un-typed (such as when reading CSV data). If present, the CAST function overrides automatic conversion.
1. Prerequisites
- Install MinIO Server from here.
- Familiarity with AWS S3 API.
- Familiarity with Python and installing dependencies.
2. Install boto3
Install aws-sdk-python
from AWS SDK for Python official docs here
3. Example
As an example, let us take a gzip compressed CSV file. Without S3 Select, we would need to download, decompress and process the entire CSV to get the data you needed. With Select API, can use a simple SQL expression to return only the data from the CSV you’re interested in, instead of retrieving the entire object. Following Python example shows how to retrieve the first column Location
from an object containing data in CSV format.
Please replace endpoint_url
,aws_access_key_id
, aws_secret_access_key
, Bucket
and Key
with your local setup in this select.py
file.
#!/usr/bin/env/env python3
import boto3
s3 = boto3.client('s3',
endpoint_url='http://localhost:9000',
aws_access_key_id='minio',
aws_secret_access_key='minio123',
region_name='us-east-1')
r = s3.select_object_content(
Bucket='mycsvbucket',
Key='sampledata/TotalPopulation.csv.gz',
ExpressionType='SQL',
Expression="select * from s3object s where s.Location like '%United States%'",
InputSerialization={
'CSV': {
"FileHeaderInfo": "USE",
},
'CompressionType': 'GZIP',
},
OutputSerialization={'CSV': {}},
)
for event in r['Payload']:
if 'Records' in event:
records = event['Records']['Payload'].decode('utf-8')
print(records)
elif 'Stats' in event:
statsDetails = event['Stats']['Details']
print("Stats details bytesScanned: ")
print(statsDetails['BytesScanned'])
print("Stats details bytesProcessed: ")
print(statsDetails['BytesProcessed'])
4. Run the Program
Upload a sample dataset to MinIO using the following commands.
$ curl "https://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2017_TotalPopulationBySex.csv" > TotalPopulation.csv
$ mc mb myminio/mycsvbucket
$ gzip TotalPopulation.csv
$ mc cp TotalPopulation.csv.gz myminio/mycsvbucket/sampledata/
Now let us proceed to run our select example to query for Location
which matches United States
.
$ python3 select.py
840,United States of America,2,Medium,1950,1950.5,79233.218,79571.179,158804.395
840,United States of America,2,Medium,1951,1951.5,80178.933,80726.116,160905.035
840,United States of America,2,Medium,1952,1952.5,81305.206,82019.632,163324.851
840,United States of America,2,Medium,1953,1953.5,82565.875,83422.307,165988.190
....
....
....
Stats details bytesScanned:
6758866
Stats details bytesProcessed:
25786743
For a more detailed SELECT SQL reference, please see here
5. Explore Further
- Use
mc
with MinIO Server - Use
mc sql
with MinIO Server - Use
minio-go
SDK with MinIO Server - Use
aws-cli
with MinIO Server - Use
s3cmd
with MinIO Server - The MinIO documentation website
6. Implementation Status
- Full AWS S3 SELECT SQL syntax is supported.
- All operators are supported.
- All aggregation, conditional, type-conversion and string functions are supported.
- JSON path expressions such as
FROM S3Object[*].path
are not yet evaluated. - Large numbers (outside of the signed 64-bit range) are not yet supported.
- The Date functions
DATE_ADD
,DATE_DIFF
,EXTRACT
andUTCNOW
along with type conversion usingCAST
to theTIMESTAMP
data type are currently supported. - AWS S3's reserved keywords list is not yet respected.
- CSV input fields (even quoted) cannot contain newlines even if
RecordDelimiter
is something else.