I'm seeing what is possible performance-wise in the current C++ before
trying out Go and Rust implementations.
* use the Google Benchmark framework with some real data.
* use release builds. I hadn't done this in a while, and a few compile
errors showed up only in release mode. Update the README to suggest
using a release build.
* optimize the varint decoder and SampleIndexIterator to branch less.
* enable link-time optimization for release builds.
* add some support for feedback-directed optimization. Ideally "make"
would automatically produce the "generate" build outputs with a
different object/library/executable suffix, run the generate
benchmark, and then produce the "use" builds. This is not that fancy;
you have to run an arcane command:

    alias cmake='cmake -DCMAKE_BUILD_TYPE=Release'
    cmake -DPROFILE_GENERATE=true -DPROFILE_USE=false .. && \
      make recording-bench && \
      src/recording-bench && \
      cmake -DPROFILE_GENERATE=false -DPROFILE_USE=true .. && \
      make recording-bench && \
      perf stat -e cycles,instructions,branches,branch-misses \
        src/recording-bench --benchmark_repetitions=5

That said, the results are dramatic - at least 50% improvement. (The
results weren't stable before: small tweaks to the code caused huge
shifts in performance, presumably from branch alignment or some
similar code-layout effect.)
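
For reference, a minimal sketch of what "branch less" can look like for a
varint decoder. This is not the code in this commit: the names
DecodeVarint32 and Decoded are hypothetical, the 32-bit limit is an
assumption, and the caller is assumed to guarantee that five bytes are
readable at the input pointer (e.g. by padding the buffer). Instead of
exiting the loop at the first byte without a continuation bit, it always
runs five iterations and masks out bytes past the end, so the only loop
branch has a fixed trip count that the compiler can unroll.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the decoded value and how many bytes it used.
struct Decoded {
  uint32_t value;
  size_t length;
};

// Hypothetical decoder for a little-endian base-128 varint of up to 32 bits.
// Assumption: at least 5 bytes are readable at p, even if the varint is
// shorter. Malformed input (continuation bit set on byte 5) is not detected.
inline Decoded DecodeVarint32(const uint8_t* p) {
  uint32_t result = 0;
  uint32_t more = 1;  // 1 while every earlier byte had its continuation bit set
  size_t length = 0;
  for (int i = 0; i < 5; ++i) {
    const uint32_t b = p[i];
    const uint32_t mask = 0u - more;  // all ones while decoding, else zero
    result |= ((b & 0x7fu) << (7 * i)) & mask;
    length += more;
    more &= b >> 7;  // stays 0 once a terminating byte has been seen
  }
  return {result, length};
}
```

Whether this beats an early-exit loop depends on the length distribution of
the data and on what the compiler emits, so it needs measuring with the
benchmark rather than assuming.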