Commit Graph

72 Commits

Author SHA1 Message Date
Anis Elleuch 9902c9baaa
sql: Add support of escape quote in CSV (#9231)
This commit modifies the CSV parser, a fork of the Go CSV
parser, to support a custom quote escape character.

The quote escape character is used to escape the quote
character when a csv field contains a quote character
as part of data.
2020-04-01 15:39:34 -07:00
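
As an illustration of the quote escape behaviour described in 9902c9baaa, here is a minimal sketch of how a quoted field might be unescaped once the escape character is configurable. The function `unquoteField` and its parameters are hypothetical and are not taken from the forked parser; the sketch only handles a single, already isolated field.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// unquoteField is a hypothetical helper (not MinIO's parser code). It strips
// the surrounding quote characters from a quoted field and turns every
// escape+quote pair inside it into a literal quote character.
func unquoteField(field string, quote, escape byte) (string, error) {
	if len(field) < 2 || field[0] != quote || field[len(field)-1] != quote {
		return "", errors.New("field is not quoted")
	}
	inner := field[1 : len(field)-1]
	var b strings.Builder
	for i := 0; i < len(inner); i++ {
		if inner[i] == escape && i+1 < len(inner) && inner[i+1] == quote {
			b.WriteByte(quote)
			i++ // skip the quote that was just consumed
			continue
		}
		b.WriteByte(inner[i])
	}
	return b.String(), nil
}

func main() {
	// Standard CSV style: the quote character escapes itself.
	s, _ := unquoteField(`"say ""hi"""`, '"', '"')
	fmt.Println(s)

	// Custom escape character: a backslash escapes the quote.
	s, _ = unquoteField(`"say \"hi\""`, '"', '\\')
	fmt.Println(s)
}
```

Both calls yield the same field value, which is the point of the option: the data is identical, only the escaping convention of the input differs.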
Klaus Post 8d98662633
re-implement data usage crawler to be more efficient (#9075)
Implementation overview: 

https://gist.github.com/klauspost/1801c858d5e0df391114436fdad6987b
2020-03-18 16:19:29 -07:00
Anis Elleuch 35ecc04223
Support configurable quote character parameter in Select (#8955) 2020-03-13 22:09:34 -07:00
Harshavardhana 603cf2a8bb
fix: broken gzip handling with Select API (#9128)
This PR fixes a regression introduced in a1c7c9ea73
2020-03-12 15:34:11 -07:00
Aditya Manthramurthy cec8cdb35e
S3Select: Handle array selection in from clause (#9076) 2020-03-10 22:34:58 -07:00
ebozduman a1c7c9ea73
Matches s3 invalid compression format error for 'mc sql' (#9067) 2020-03-05 19:34:04 -08:00
Klaus Post e4020fb41f
SIMDJSON S3 select input (#8401) 2020-02-13 14:03:52 -08:00
Anis Elleuch de924605a1
Import CSV parser library (#8927)
The CSV library code is imported from Go 1.13.6
2020-02-07 16:25:36 +05:30
Bruce Wang c476b27a65 Comment typo "index max" to "index map" (#8700) 2019-12-24 21:57:43 -08:00
Klaus Post bf3a97d3aa S3 Select: Concurrent LINES delimited json parsing (#8610)
The speedup is ~5x on a 6 core CPU
2019-12-09 06:55:31 -08:00
Klaus Post f1e2e1cc9e S3 Select: Mismatched types don't match (#8608)
When comparing for equality, if types cannot be matched, they don't match.
2019-12-06 07:24:41 -08:00
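
The rule from f1e2e1cc9e can be sketched as an equality check that returns false, rather than an error, when the two operands have different types. `valuesEqual` is a hypothetical function written for this illustration; the real evaluator may attempt type conversions before it reaches this point.

```go
package main

import "fmt"

// valuesEqual is a hypothetical comparison helper. Values of different
// dynamic types are never equal; no error is raised for the mismatch.
func valuesEqual(a, b interface{}) bool {
	switch x := a.(type) {
	case int64:
		y, ok := b.(int64)
		return ok && x == y
	case float64:
		y, ok := b.(float64)
		return ok && x == y
	case string:
		y, ok := b.(string)
		return ok && x == y
	case bool:
		y, ok := b.(bool)
		return ok && x == y
	}
	return false
}

func main() {
	fmt.Println(valuesEqual(int64(3), int64(3))) // same type, same value
	fmt.Println(valuesEqual(int64(3), "3"))      // mismatched types
}
```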
Harshavardhana 5d3d57c12a
Start using error wrapping with fmt.Errorf (#8588)
Use fatih/errwrap to fix all the code to use
error wrapping with fmt.Errorf()
2019-12-02 09:28:01 -08:00
Klaus Post 1c90a6bd49 S3 Select: Convert CSV data to JSON (#8464) 2019-11-09 09:10:35 -08:00
Klaus Post 26e760ee62 Fix JSON Close data race. (#8486)
The JSON stream library has no safe way of aborting while

Since we cannot expect the caller to safely handle "Read" and "Close" calls we must handle this.

Also any Read error returned from upstream will crash the server. We preserve the errors and instead always return io.EOF upstream, but send the error on Close.

`readahead v1.3.1` handles Read after Close better.

Updates to `progressReader` are mostly to ensure safety.

Fixes #8481
2019-11-05 14:20:37 -08:00
Klaus Post 38e6d911ea S3 Select: Detect full object (#8456)
Check if select is `SELECT s.* from S3Object s` and forward it to All

Fixes #8371 and makes this case run significantly faster.
2019-10-30 13:46:55 +05:30
Klaus Post 51456e6adc Select: Support Square Bracket Lists (#8457)
Allows for S3 compatible `SELECT * from s3object s WHERE id IN [3,2]`

Fixes #8422
2019-10-30 11:34:40 +05:30
Harshavardhana d48fd6fde9
Remove unused params and functions (#8399) 2019-10-15 18:35:41 -07:00
Klaus Post 002ac82631 S3 Select: Add parser support for lists. (#8329) 2019-10-06 07:52:45 -07:00
Klaus Post c1a17c2561 S3 Select: Aggregate AVG/SUM as float (#8326)
Force sum/average to be calculated as a float.

As noted in #8221

> run SELECT AVG(CAST (Score as int)) FROM S3Object on

```
Name,Score
alice,80
bob,81
```

> AWS S3 gives 80.5 and MinIO gives 80.

This also makes overflows much more unlikely.
2019-09-27 16:12:03 -07:00
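
The discrepancy quoted in c1a17c2561 comes down to integer versus floating-point division. The sketch below reproduces the two results for the scores 80 and 81; `avgInt` and `avgFloat` are hypothetical helpers, not the functions used in the SQL evaluator.

```go
package main

import "fmt"

// avgInt accumulates and divides as integers, so the fraction is truncated.
func avgInt(scores []int64) int64 {
	var sum int64
	for _, s := range scores {
		sum += s
	}
	return sum / int64(len(scores))
}

// avgFloat accumulates and divides as float64, keeping the fraction.
func avgFloat(scores []int64) float64 {
	var sum float64
	for _, s := range scores {
		sum += float64(s)
	}
	return sum / float64(len(scores))
}

func main() {
	scores := []int64{80, 81}
	fmt.Println(avgInt(scores))   // 80
	fmt.Println(avgFloat(scores)) // 80.5
}
```

Accumulating in float64 also explains the remark about overflows: the sum can grow far beyond the int64 range, at the cost of precision for very large values.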
Klaus Post 1c5b05c130 S3 select: Fix output conversion on select * (#8303)
Fixes #8268
2019-09-27 12:33:14 -07:00
Klaus Post be313f1758 S3 Select: Workaround java buffer size (#8312)
Updates #7475

The Java implementation has a 128KB buffer and a message must be emitted before that is used. #7475 therefore limits the message size to 128KB. But up to 256 bytes are written to the buffer in each call. This means we must emit a message while it is still shorter than 128KB.

Therefore we change the limit to 128KB minus 256 bytes.
2019-09-26 04:56:20 +05:30
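
The arithmetic behind the new limit in be313f1758 is small enough to write out. The constant and function names below are hypothetical; only the numbers (128KB buffer, up to 256 bytes per write) come from the commit message.

```go
package main

import "fmt"

const (
	javaBufferSize  = 128 * 1024 // 128KB buffer in the Java implementation
	maxWritePerCall = 256        // upper bound on bytes added by one write

	// Flush threshold: if fewer than this many bytes were buffered before
	// a write, the buffer stays below javaBufferSize after the write.
	maxRecordMessageLength = javaBufferSize - maxWritePerCall
)

// shouldFlush reports whether a message must be emitted now.
func shouldFlush(buffered int) bool {
	return buffered >= maxRecordMessageLength
}

func main() {
	fmt.Println(maxRecordMessageLength) // 130816
	fmt.Println(shouldFlush(130815), shouldFlush(130816))
}
```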
Klaus Post 520552ffa9 S3 select: flush when reaching limit (#8279)
Add missing flush when reaching select limit.
2019-09-20 11:00:17 -07:00
Klaus Post dac1cf5a9a S3 Select: Parsing tweaks (#8261)
* Don't output empty lines.
* Trim whitespace from byte to int/float/bool conversions.
2019-09-17 17:21:23 -07:00
Klaus Post c9b8bd8de2 S3 Select: optimize output (#8238)
Queue output items and reuse them.
Remove the unneeded type system in sql and just use the Go type system.

In the best case this is more than an order of magnitude speedup:

```
BenchmarkSelectAll_1M-12    	       1	1841049400 ns/op	274299728 B/op	 4198522 allocs/op
BenchmarkSelectAll_1M-12    	      14	  84833400 ns/op	169228346 B/op	 3146541 allocs/op
```
2019-09-17 05:56:27 +05:30
Klaus Post 017456df63 Wait clearing the close channel (#8250)
Close channel should not be nilled before goroutines have exited.

Fixes potential hang on closing.
2019-09-16 16:18:01 -07:00
Klaus Post ddea0bdf11 Concurrent CSV parsing and reduce S3 select allocations (#8200)
```
CSV parsing, BEFORE:
BenchmarkReaderBasic-12         	    2842	    407533 ns/op	  397860 B/op	     957 allocs/op
BenchmarkReaderReplace-12       	    2718	    429914 ns/op	  397844 B/op	     957 allocs/op
BenchmarkReaderReplaceTwo-12    	    2718	    435556 ns/op	  397855 B/op	     957 allocs/op
BenchmarkAggregateCount_100K-12    	     171	   6798974 ns/op	16667102 B/op	  308077 allocs/op
BenchmarkAggregateCount_1M-12    	      19	  65657411 ns/op	168057743 B/op	 3146610 allocs/op
BenchmarkSelectAll_10M-12    	       1	20882119900 ns/op	2758799896 B/op	41978762 allocs/op

CSV parsing, AFTER:
BenchmarkReaderBasic-12         	    3721	    312549 ns/op	  101920 B/op	     338 allocs/op
BenchmarkReaderReplace-12       	    3776	    318810 ns/op	  101993 B/op	     340 allocs/op
BenchmarkReaderReplaceTwo-12    	    3610	    330967 ns/op	  102012 B/op	     341 allocs/op
BenchmarkAggregateCount_100K-12    	     295	   4149588 ns/op	 3553623 B/op	  103261 allocs/op
BenchmarkAggregateCount_1M-12    	      30	  37746503 ns/op	33827931 B/op	 1049435 allocs/op
BenchmarkSelectAll_10M-12    	       1	17608495800 ns/op	1416504040 B/op	21007082 allocs/op

~ benchcmp old.txt new.txt
benchmark                           old ns/op       new ns/op       delta
BenchmarkReaderBasic-12             407533          312549          -23.31%
BenchmarkReaderReplace-12           429914          318810          -25.84%
BenchmarkReaderReplaceTwo-12        435556          330967          -24.01%
BenchmarkAggregateCount_100K-12     6798974         4149588         -38.97%
BenchmarkAggregateCount_1M-12       65657411        37746503        -42.51%
BenchmarkSelectAll_10M-12           20882119900     17608495800     -15.68%

benchmark                           old allocs     new allocs     delta
BenchmarkReaderBasic-12             957            338            -64.68%
BenchmarkReaderReplace-12           957            340            -64.47%
BenchmarkReaderReplaceTwo-12        957            341            -64.37%
BenchmarkAggregateCount_100K-12     308077         103261         -66.48%
BenchmarkAggregateCount_1M-12       3146610        1049435        -66.65%
BenchmarkSelectAll_10M-12           41978762       21007082       -49.96%

benchmark                           old bytes      new bytes      delta
BenchmarkReaderBasic-12             397860         101920         -74.38%
BenchmarkReaderReplace-12           397844         101993         -74.36%
BenchmarkReaderReplaceTwo-12        397855         102012         -74.36%
BenchmarkAggregateCount_100K-12     16667102       3553623        -78.68%
BenchmarkAggregateCount_1M-12       168057743      33827931       -79.87%
BenchmarkSelectAll_10M-12           2758799896     1416504040     -48.66%
```

```
BenchmarkReaderHuge/97K-12         	    2200	    540840 ns/op	 184.32 MB/s	 1604450 B/op	     687 allocs/op
BenchmarkReaderHuge/194K-12        	    1522	    752257 ns/op	 265.04 MB/s	 2143135 B/op	    1335 allocs/op
BenchmarkReaderHuge/389K-12        	    1190	    947858 ns/op	 420.69 MB/s	 3221831 B/op	    2630 allocs/op
BenchmarkReaderHuge/778K-12        	     806	   1472486 ns/op	 541.61 MB/s	 5201856 B/op	    5187 allocs/op
BenchmarkReaderHuge/1557K-12       	     426	   2575269 ns/op	 619.36 MB/s	 9101330 B/op	   10233 allocs/op
BenchmarkReaderHuge/3115K-12       	     286	   4034656 ns/op	 790.66 MB/s	12397968 B/op	   16099 allocs/op
BenchmarkReaderHuge/6230K-12       	     172	   6830563 ns/op	 934.05 MB/s	16008416 B/op	   26844 allocs/op
BenchmarkReaderHuge/12461K-12      	     100	  11409467 ns/op	1118.39 MB/s	22655163 B/op	   48107 allocs/op
BenchmarkReaderHuge/24922K-12      	      66	  19780395 ns/op	1290.19 MB/s	35158559 B/op	   90216 allocs/op
BenchmarkReaderHuge/49844K-12      	      34	  37282559 ns/op	1369.03 MB/s	60528624 B/op	  174497 allocs/op
```
2019-09-13 14:18:35 -07:00
Yao Zongyou 18fedc67d5 friendly prompt for s3select MalformedXML error (#8171)
partly fix #7911
2019-09-09 21:33:27 -07:00
Yao Zongyou ec9bfd3aef speed up the performance of s3select on csv (#7945) 2019-08-31 00:07:40 -07:00
Kanagaraj M 12353caf35 Fix: Support Unicode delimiters in s3 select (#7931) 2019-07-17 19:10:17 +01:00
Yao Zongyou c4f480a839 fix csv read bug (#7885) 2019-07-05 12:08:56 -07:00
Yao Zongyou 60831e3299 aggregation functions' argument may already have been cast to numeric (#7876) 2019-07-05 10:38:38 -07:00
Yao Zongyou 037319066f fix unicode support related bugs in s3select (#7877) 2019-07-05 09:43:10 -07:00
Ryan Tam bd56f80250 Fix ignored alias for aggregate result in S3 Select (#7849)
The SQL parser as it stands right now ignores alias for aggregate
result, e.g. `SELECT COUNT(*) AS thing FROM s3object` doesn't actually
return record like `{"thing": 42}`, it returns a record like `{"_1": 42}`.
Column alias for aggregate result is supported in AWS's S3 Select, so
this commit fixes that by respecting the `expr.As` in the expression.

Also improve test for S3 select

On top of testing a simple `SELECT` query, we want to test a few more
"advanced" queries (e.g. aggregation).

Convert existing tests into table driven tests[1], and add the new test
cases with "advanced" queries into them.

[1] - https://github.com/golang/go/wiki/TableDrivenTests
2019-07-03 16:34:54 -07:00
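
Both halves of bd56f80250 (respecting the alias, and table-driven tests) can be illustrated together. The sketch below uses a hypothetical `columnName` helper that falls back to the positional `_N` name when no alias is given, and checks it with a table of cases. It runs as a plain program; in a real `_test.go` file the loop body would typically use `t.Run` and `t.Errorf` from the `testing` package.

```go
package main

import "fmt"

// columnName is a hypothetical helper: it returns the alias when one is
// given and the positional name (_1, _2, ...) otherwise.
func columnName(alias string, position int) string {
	if alias != "" {
		return alias
	}
	return fmt.Sprintf("_%d", position)
}

func main() {
	// Each row of the table is one test case.
	tests := []struct {
		name     string
		alias    string
		position int
		want     string
	}{
		{"alias is respected", "thing", 1, "thing"},
		{"no alias falls back to position", "", 1, "_1"},
		{"second column without alias", "", 2, "_2"},
	}
	for _, tc := range tests {
		got := columnName(tc.alias, tc.position)
		status := "ok"
		if got != tc.want {
			status = "FAIL"
		}
		fmt.Printf("%s: %s (got %q, want %q)\n", status, tc.name, got, tc.want)
	}
}
```

Adding a new "advanced" query case then only means adding one more row to the table.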
Yao Zongyou 941fed8e4a s3Select: call Close on error to release the read lock (#7830) 2019-06-25 13:30:48 -07:00
Yao Zongyou 55092bede1 add timestamp compare support (#7832) 2019-06-25 11:05:37 -07:00
Yao Zongyou 90a3b830f4 fix typo and the string representation of the time.Time value (#7831) 2019-06-25 09:54:14 -07:00
Yao Zongyou 23b9df0694 Fix s3select TRIM function's nil pointer dereference bug (#7817) 2019-06-24 16:59:33 -07:00
Joe Stevens a19cf063b5 Fixes for multiplatform dev and testing from forks (#7734)
Add support for correct dependency URLs on all platforms

only build mountinfo.go on linux

make testfile path relative to support fork work
2019-06-04 00:59:40 -07:00
kannappanr 5ecac91a55
Replace Minio refs in docs with MinIO and links (#7494) 2019-04-09 11:39:42 -07:00
Aditya Manthramurthy b1b1d77893 Set S3 Select record message length to 128KiB (#7475)
- Previously this limit was a little more than 1MiB, and it broke
  compatibility with AWS SDK Java causing a buffer overflow error.
2019-04-04 00:41:52 -07:00
Kirill Motkov 3d29ab4059 Rewrite if-else chains to switch statements (#7382) 2019-03-18 07:46:20 -07:00
Harshavardhana 91d85a0d53
Fix stale locks held by SelectParquet API (#7364)
Vendorize upstream parquet-go to fix this issue.
2019-03-13 20:33:18 -07:00
Aditya Manthramurthy e463386921 Add JSON Path expression evaluation support (#7315)
- Includes support for FROM clause JSON path
2019-03-09 08:13:37 -08:00
Aditya Manthramurthy f4879ed96d Use jstream to serialize records to JSON format in S3Select (#7318)
- Also, switch to jstream to generate internal record representation
  from CSV/JSON readers

- This fixes a bug in which JSON output objects have their keys
  reversed from the order they are specified in the Select columns.

- Also includes a fix for tests.
2019-03-07 00:20:10 -08:00
Harshavardhana 2520e535a0
Allow lazyQuotes for certain types of CSV (#7278)
Set lazyQuotes to true, to allow a quote to appear
in an unquoted field and a non-doubled quote to
appear in a quoted field.
2019-02-24 06:51:02 -08:00
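
The effect of the option in 2520e535a0 can be seen with the standard library's `encoding/csv`, whose `Reader` has a `LazyQuotes` field with the semantics described above. The helper `parseLine` is a hypothetical name; how MinIO wires the option into S3 Select is not shown here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLine reads a single CSV record with LazyQuotes set as requested.
func parseLine(line string, lazy bool) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.LazyQuotes = lazy
	return r.Read()
}

func main() {
	line := `a "quoted" word,b`

	// Strict mode rejects a bare quote inside an unquoted field.
	_, err := parseLine(line, false)
	fmt.Println("strict error:", err != nil)

	// Lazy mode keeps the quotes as part of the field data.
	rec, err := parseLine(line, true)
	fmt.Println(len(rec), rec[0], err)
}
```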
Aditya Manthramurthy 80a351633f Update vendorized bcicen/jstream (#7257)
- Includes an error handling fix that is waiting to be merged upstream
- Uses order-preserving (un)marshalling for JSON objects.
2019-02-20 23:59:23 -08:00
Aditya Manthramurthy 8a405cab2f COUNT() function in select should return an int (#7243) 2019-02-13 16:32:59 -08:00
Harshavardhana df35d7db9d Introduce staticcheck for stricter builds (#7035) 2019-02-13 18:29:36 +05:30
Aditya Manthramurthy ee5b3622a5 Evaluate where clause in aggregation queries (#7235) 2019-02-12 13:54:26 -08:00
Harshavardhana 85e939636f Fix JSON parser handling for certain objects (#7162)
This PR also adds some comments and simplifies
the code. The main change ensures that the
cached buffer is honored.

Added unit tests as well

Fixes #7141
2019-02-07 08:04:42 +05:30