Author Archives: Richard Startin

Tuning Spark Back Pressure by Simulation

Spark back pressure, which can be enabled by setting spark.streaming.backpressure.enabled=true, will dynamically resize batches so as to avoid queue build up. It is implemented using a Proportional Integral Derivative (PID) algorithm. This algorithm has some interesting properties, including the lack of guarantee of a stable fixed point. This can manifest itself not just in transient overshoot, but in

Concise Binary Object Representation

Concise Binary Object Representation (CBOR) defined by RFC 7049 is a binary, typed, self describing serialisation format. In contrast with JSON, it is binary and distinguishes between different sizes of primitive type properly. In contrast with Avro and Protobuf, it is self describing and can be used without a schema. It goes without saying for all binary

Perpetual Kerberos Login in Hadoop

Kerberos is the only real option for securing an Hadoop cluster. When deploying custom services into a cluster with Kerberos enabled, authentication can quickly become a cross-cutting concern. Kerberos Basics First, a brief introduction to basic Kerberos mechanisms. In each realm there is a Key Distribution Centre (KDC) which issues different types of tickets. A KDC has two

Co-locating Spark Partitions with HBase Regions

HBase scans can be accelerated if they start and stop on a single region server. IO costs can be reduced further if the scan is executed on the same machine as the region server. This article is about extending the Spark RDD abstraction to load an RDD from an HBase table so each partition is co-located with a region server. This

HBase Connection Management

I have built several web applications recently using Apache HBase as a backend data store. This article addresses some of the design concerns and approaches made in efficiently managing HBase connections. One of the first things I noticed about the HBase client API was how long it takes to create the connection. HBase connection creation is