Observing Memory Level Parallelism with JMH

Quite some time ago I observed that breaking a cache-inefficient shuffle algorithm into short stages could improve throughput: when cache misses were likely, throughput improved as a function of stage length. The implementations benchmarked were as follows, where op is either precomputed (a closure over an array of…
Continue reading Observing Memory Level Parallelism with JMH
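As a rough illustration of the kind of measurement involved (a minimal sketch, not the post's actual benchmarked implementations), the JMH harness below times a gather over a random index array, where large sizes make cache misses likely. The class and field names (GatherBenchmark, data, indices) are hypothetical.

```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class GatherBenchmark {

    @Param({"1024", "1048576"})
    int size;

    int[] data;
    int[] indices;

    @Setup(Level.Trial)
    public void setup() {
        data = new int[size];
        indices = new int[size];
        for (int i = 0; i < size; i++) {
            data[i] = ThreadLocalRandom.current().nextInt();
            indices[i] = ThreadLocalRandom.current().nextInt(size);
        }
    }

    @Benchmark
    public int gather() {
        int sum = 0;
        for (int i = 0; i < size; i++) {
            // random access: when size exceeds cache, each load is likely a miss,
            // and throughput depends on how many misses can be in flight at once
            sum += data[indices[i]];
        }
        return sum;
    }
}
```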

Mixing Vector and Scalar Instructions

I saw an interesting tweet from one of the developers of Pilosa this week, reporting performance improvements from unrolling a bitwise reduction in Go. This surprised me because Go enjoys a reputation as a high-performance language, and it certainly has great support for concurrency, but compilers should unroll loops as standard…
Continue reading Mixing Vector and Scalar Instructions
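To illustrate the kind of transformation being reported (a hypothetical Java sketch, not Pilosa's Go code), the snippet below contrasts a straightforward bitwise reduction with a manually unrolled version that keeps independent accumulators.

```java
public final class BitwiseReduction {

    // Straightforward reduction: a single accumulator forms one long dependency chain.
    public static long andCardinality(long[] left, long[] right) {
        int length = Math.min(left.length, right.length);
        long count = 0;
        for (int i = 0; i < length; i++) {
            count += Long.bitCount(left[i] & right[i]);
        }
        return count;
    }

    // Unrolled by four with independent accumulators, exposing
    // instruction-level parallelism the compiler may not create on its own.
    public static long andCardinalityUnrolled(long[] left, long[] right) {
        int length = Math.min(left.length, right.length);
        long c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        int i = 0;
        for (; i + 3 < length; i += 4) {
            c0 += Long.bitCount(left[i]     & right[i]);
            c1 += Long.bitCount(left[i + 1] & right[i + 1]);
            c2 += Long.bitCount(left[i + 2] & right[i + 2]);
            c3 += Long.bitCount(left[i + 3] & right[i + 3]);
        }
        for (; i < length; i++) {
            c0 += Long.bitCount(left[i] & right[i]);
        }
        return c0 + c1 + c2 + c3;
    }
}
```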

Limiting Factors in a Dot Product Calculation

The dot product is a simple calculation that reduces two vectors to the sum of their element-wise products. It has a variety of applications and is used heavily in neural networks, linear regression, and search. What are the constraints on its computational performance? The combination of its computational simplicity and its streaming nature…
Continue reading Limiting Factors in a Dot Product Calculation
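As a hedged illustration of the shapes such a calculation can take (hypothetical names, not the post's benchmark code), the sketch below contrasts a naive dot product with an unrolled variant whose independent accumulators break the single dependency chain.

```java
public final class DotProduct {

    // One accumulator: each addition depends on the previous one,
    // so the latency of the add (or FMA) chain can be the limiting factor.
    public static float dot(float[] left, float[] right) {
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum += left[i] * right[i];
        }
        return sum;
    }

    // Four accumulators hide arithmetic latency; once the arrays no longer
    // fit in cache, memory bandwidth tends to take over as the limit.
    public static float dotUnrolled(float[] left, float[] right) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        for (; i + 3 < left.length; i += 4) {
            s0 += left[i]     * right[i];
            s1 += left[i + 1] * right[i + 1];
            s2 += left[i + 2] * right[i + 2];
            s3 += left[i + 3] * right[i + 3];
        }
        for (; i < left.length; i++) {
            s0 += left[i] * right[i];
        }
        return s0 + s1 + s2 + s3;
    }
}
```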