Voting systems can be roughly divided into three (or more recently four [CLW08]) types according to their central cryptographic components. One of these components are mixnets [CHA81]. Until recently, mixnets have been too slow for use in elections in the hundreds of thousands of votes. This changed with efficient proofs of shuffle such as those proposed by Wikstrom [TW10] and Groth [BG12]. Despite these advances it is still important to carefully develop efficient implementations in order to achieve systems that can scale beyond toy use cases. This is even more important in the current age of multi-core cpus whose extra performance is not available automatically, but must be deliberately exploited with parallel code.

In this post we show benchmarks obtained from parallelization work on a mixnet based voting prototype in order to achieve large scale election performance starting from a naive implementation. This is joint work with the Bern voting group behind the unicrypt voting library [BFH2] as well as the univote2 voting system [BFH3] which served as a reference.

The prototype implements the protocol described previously here. Benchmarks were run for two types of runs, mixing phase and full election cycles. For each we briefly describe the main steps of the protocol, followed by the results.

## Mixing phase

During the mixing phase:

- Each authority gets the serialized votes from the bulletin board (BB) and converts them to objects
- Each authority shuffles the votes and generates proofs (Terelius-Wikstrom offline and online parts)
- Each authority serializes the mixed votes and proofs
- Each authority posts the serialized data to the BB
- The BB converts the posted data to objects
- The BB verifies the proofs

# Results

All results for two authorities and 2048 bit modulus.

## 1.1 Setup A: mixnet0.1 + unicrypt2.2

- mixnet0.1 base version
- unicrypt2.2 base version, includes isPrime performance fix
- max votes tested 5k
- max votes / hour 5709.969702
- total scaling factor 1.234471975

As expected, scaling is limited (1.23x) as cpu utilization without parallelism is limited to one core.

## 1.2 Setup B: mixnet0.2 + unicrypt2.2_GMP

- mixnet0.2 adds some parallelism at the protocol level
- unicrypt2.2_GMP modpow provided by libgmp native implementation
- max votes tested 10k
- max votes / hour 23847.21882
- total scaling factor 1.222546883

The use of native libgmp implementation yields an approximately 5x performance improvement up to 24k votes per hour, but cpu scaling is still limited to 1.22x as this version is still using only one core.

## 1.3 Setup C: mixnet0.2 + unicrypt2.2*GMP*MPS

- mixnet0.2 as before
- unicrypt2.2
*GMP*MPS uses mpservice plus two other optimizations - max votes tested 400k
- max votes / hour 255545.6965
- total scaling factor 13.10075599 (uses Setup B as estimator for 1 core performance)

Whereas the previous versions had very limited performance scaling (1.23x), this version achieves a total scaling of 13x with cpu utilization of all cores. Notwithstanding, we still see a trend of diminishing returns as cores as the curve slope begins to fall. This pattern is typical of practical parallelization that rarely reaches theoretical limits as serial bottlenecks begin to dominate performance.

## 1.4 Setup C Clustered

In this setup we use a 4 machine cluster, 1 machine carries out the election, the other 3 are modular exponentiation slaves.

- mixnet0.2 as before
- unicrypt2.2
*GMP*MPS uses mpservice cluster plus two other optimizations - max votes tested 500k
- max votes / hour 449697.8031
- total scaling factor 23.05412013 (uses Setup B as estimator for 1 core performance)

The code manages utilize all cores across a 3-cpu cluster, achieving a total scaling of 23x. Again, the pattern of diminishing returns is present, and becomes even more apparent as additional bottlenecks from network overheads enter the picture.

## Full election cycle

In a full election cyle:

- Each authority gets the serialized votes from the BB and converts them to objects
- Each authority shuffles the votes and generates proofs (TW offline and online parts)
- Each authority serializes the mixed votes and proofs
- Each authority posts the serialized data to the BB
- The BB converts the posted data to objects
- The BB verifies the proofs
- Each authority get the final mixed votes from the BB and converts them to objects
- Each authority computes partial decryptions and proofs
- Each authority serializes decryptions and proofs
- The BB converts the posted data to objects
- The BB verifies the proofs
- The BB combines the decryptions and decrypts the votes

# Results

All results for two authorities and 2048 bit modulus.

## 2.1 Setup A: mixnet0.1 + unicrypt2.2

- mixnet0.1 base version
- unicrypt2.2 base version, includes isPrime performance fix
- max votes tested 5k
- max votes / hour 4439.581732
- total scaling factor

## 2.2 Setup B: mixnet0.2 + unicrypt2.2_GMP

- mixnet0.2 adds some parallelism at the protocol level
- unicrypt2.2_GMP modpow provided by libgmp native implementation
- max votes tested 10k
- max votes / hour 19477.7791
- total scaling factor

## 2.3 Setup C: mixnet0.2 + unicrypt2.2*GMP*MPS

- mixnet0.2 as before
- unicrypt2.2
*GMP*MPS uses mpservice plus two other optimizations - max votes tested 400k
- max votes / hour 195172.2693
- total scaling factor

## 2.4 Setup C Clustered

In this setup we use a 4 machine cluster, 1 machine carries out the election, the other 3 are modular exponentiation slaves.

- mixnet0.2 as before
- unicrypt2.2
*GMP*MPS uses mpservice cluster plus two other optimizations - max votes tested 500k
- max votes / hour 337420.432
- total scaling factor

## Implementation highlights

Effort has gone into making the prototype reasonably performant. This involves changes to the mixnet prototype code as well as changes to the underlying unicrypt library. As a result, total scaling with respect to the original implementation reaches almost 200x in the best case, using a clustered setup. The following are the main optimization areas.

## Parallel modular exponentiation: record/replay

Voting systems are inherently very parallelizable as much of the processing is done per-vote in an independent way. The bulk of computation in public key cryptography and derived voting systems is modular exponentiation. The task is then to parallelize these costly operations. Unfortunately, extracting parallelism from code that was not designed with it as a central concern from the very beginning is usually very difficult or outright not practical. In this particular case the difficulty is manifested as:

- the are many different callstacks using modpow, since it is a basic low level operation
- modpows occur deep in the callstack, whereas the context where they can be made parallel (loops) is much higher up. For example:

- modpow results are typically used immediately after the their calculation in the code, which makes the vectorization harder; a boundary needs to be created to collect each modpow before it is used.

On way to solve these problems would be to do a full rewrite with parallelization in mind from the beginning, but this is simply not practical. The approach used instead was to construct an automatic parallelism extraction mechanism that works by attaching “bridge” objects to each execution thread. When activated, these bridge objects monitor the threads at specific code blocks and intercept modular exponentiation calls. The calls are collected (*record*), and computed in bulk via the mpservice. MPService computes modular exponentiations transparently across cores and machines in a cluster using scala parallel collections and akka. Its interface is:

```
trait ModPowService {
// compute modular exponentiation for a list of inputs
def compute(work: Array[ModPow]): Array[BigInteger]
// compute modular exponentiation for a list of inputs with common modulus
def compute(work: Array[ModPow2], mod: BigInteger): Array[BigInteger]
}
```

Once the modpows are computed, they are collected and *replayed* back to the inspected thread along the specified code block. Because both java8 and scala support lambdas it is possible to represent these code blocks as closures. The extractor is then a higher order function. For automatic extraction to work, these closures **must** be purely functional.

The parallelization work on the code is then:

- Identify all the stack traces with modpows that scale with number of votes
- For each stack trace, identify or convert suitable interception point
- Convert the code block into a closure and apply extractor
- Measure gains and losses due to replay

Here’s an example of applying the record/replay extraction mechanism for the modpow calculation in the stack trace above:

```
MPBridge.ex(() -> {
for (int i : this.getAllIndices()) {
elements[i] = this.getAt(i).apply(element.getAt(i), randomByteSequence);
}
return elements;
}, "2");
```

The closure (() -> {}) wraps the original code. Extraction and mpservice are the main scaling feature providing ~90% of the speed up seen in this graph (copied from above):

## Clustered modular exponentiation

When configured, mpservice computes exponentiations across an akka cluster. There is probably a lot of room for improvement here tuning akka parameters. Also, the bulk model of extraction and computation could be replaced with a streaming one to offset part of the network delays.

## Native libgmp modpow implementation

The jna-gmp library is used to speed up modular exponentiation through libgmp’s fast native implementation.

## Protocol level parallelism

Protocol overlaps where computations occur simultaneously are simulated with Futures and Future composition.

The implementation covers parallelism at the mixing and decryption phase. It is worth noting that parallelization at the permutation phase (offline TW) allows partial hiding of this loop which is a real performance killer:

```
// CANT BE PARALLELIZED
for (int i = 0; i < this.size; i++) {
Element c_i_1 = i == 0 ? h : cs[i -
cs[i] = temp[i].apply(c_i_1.selfApply(ePrimeV.getAt(i))); // [2n]
}
```

Original, unsplit version:

```
for (int i = 0; i < this.size; i++) {
Element c_i_1 = i == 0 ? h : cs[i - 1];
cs[i] = g.selfApply(rV.getAt(i)).apply(c_i_1.selfApply(ePrimeV.getAt(i))); // [2n]
if (i > 0) {
ds[i] = rV.getAt(i).apply(ds[i - 1].selfApply(ePrimeV.getAt(i)));
}
}
```

Note that although protocol level parallelism is simulated with futures, machine resources are split, whereas in a real scenario authority machines have all the hardware to themselves. This means that permutation shuffle and decryption should be somewhat faster than these benchmarks suggest.

## Parallel generator computation

The computation of random generators with a deterministic random byte sequence involves calculating hashes sequentially on a single thread which incurs small but noticeable performance penalty. This is done in parallel seeding the sequences with different numbers:

```
def getIndependentGenerators[E <: Element[_]](group: AbstractCyclicGroup[E, _], skip: Int, size: Int): java.util.List[E] = {
val split = generatorParallelism
val total = size + skip
val a = Array.fill(total % split)((total / split) + 1)
val b = Array.fill(split - (total % split))(total / split)
val c = a ++ b
val seedLength = CTR_DRBG.getFactory().getSeedByteLength()
val converter = ByteArrayToBigInteger.getInstance(seedLength)
val rds = c.zipWithIndex.map{ case (value, index) =>
// 1000: we want to leave room for generators not to overlap
val seed = java.math.BigInteger.valueOf(index * (total / split) *1000).mod(MathUtil.powerOfTwo(CTR_DRBG.getFactory().getSeedByteLength()))
val r = DeterministicRandomByteSequence.getInstance(CTR_DRBG.getFactory(), converter.reconvert(seed))
(r, value)
}
val items = rds.par.flatMap { case (d, i) =>
val sequence = group.getIndependentGenerators(d).limit(i)
sequence.toList
}
items.drop(skip).toList
}
```

## Other possible optimizations

Performance results suggest that a large portion of computations are now executed in parallel. However a large factor of improvement probably remains. Some ideas:

- Stream modpows instead of waiting till bulk
- Use gmp Legendre symbol for membership checks in quadratic residue groups
- Allow extracting modpows from parallel collections
- MPIR/use gmp for multiplying
- JVM/gc/parallelism/akka tuning
- Convert non modpow loops in unicrypt to parallel

# Future work

The plan is to go from prototype to fully functional system suitable for real world production use. One of the first steps already underway is to enable remoting, that is, distribute mixnet authorities over the network. In this prototype these communications were simulated with local method calls. Likewise, the bulletin board was simulated with a simple invocation interface, and will have to be carefully implemented as part of a fully featured system.

# References

[BG12] http://www0.cs.ucl.ac.uk/staff/J.Groth/MinimalShuffle.pdf

[TW10] http://www.csc.kth.se/~dog/research/TW10Conf.pdf

[CLW08] http://www.internetsociety.org/sites/default/files/chow_0.pdf

[BFH2] https://github.com/bfh-evg/unicrypt

[BFH3] https://github.com/bfh-evg/univote2/blob/development/doc/report/report.pdf

[CHA81] http://www.csl.mtu.edu/cs6461/www/Reading/08/Chaum-ACMCOMM81.pdf