Sunday, October 19, 2008

Tutorial 11b: Efficient generic programming

Specialization is a programming technique used to improve the performance of generic code. It consists of specifying specific implementations for one or several specific instantiations of generic code.

Specialization using pattern matching

The "runge_kutta_4" function shown in these tutorials contains the following lines:

old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)


This is fine when dealing with floats, but with 3-dimensional vectors and larger types, it may be excessively slow due to the large amounts of intermediate values produced. Indeed, this code is the same as:

tmp1 = v2 + v3
tmp2 = 2.0 * tmp1
tmp3 = tmp2 + v4
tmp5 = v1 + tmp3
tmp6 = 1.0 / 6.0
tmp7 = tmp6 * tmp5
old + tmp7


Of these 7 intermediate values, 6 could be replaced by one.

What if we had a function to add 6 vectors creating only one value, the final result?

let add6 ((v0: Vec3d), (v1: Vec3d), (v2: Vec3d), (v3: Vec3d), (v4: Vec3d), (v5: Vec3d)) =
let x = v0.x + v1.x + v2.x + v3.x + v4.x + v5.x
let y = v0.y + v1.y + v2.y + v3.y + v4.y + v5.y
let z = v0.z + v1.z + v2.z + v3.z + v4.z + v5.z

Vec3d(x, y ,z)


Admittedly, it's not very pretty, and it would be nice if the user of "runge_kutta_4" was not forced to provide such a function.

F# has a parametric type called "option" useful to represent this kind of optional values. The C++ counter-part is to use a pointer, and let the null value denote the case when no value is provided.

The dictionary of operations is modified as follows:

type VecOps<'vector, 'scalar> = {
up: 'vector
add: 'vector * 'vector -> 'vector
add6: Add6<'vector> option
scale: 'scalar * 'vector -> 'vector
}


(By the way, another equivalent way of writing the declaration of add6 would be
add6: 'vector Add6 option
)

An instance of the dictionary using add6 is created as follows:

let Vec3d_ops = {
add = plus_Vec3d;
scale = scale_Vec3d;
add6 = Some add6;
up = Vec3d(1.0)
}


Or, to leave out add6:

let Vec3d_ops = {
add = plus_Vec3d;
scale = scale_Vec3d;
add6 = None;
up = Vec3d(1.0)
}


The last part of the "runge_kutta_4" function is modified to check if the dictionary of operations contains an "add6":

match vec_ops.add6 with
// No addMany available, use pair-wise additions (will create many intermediate vectors).
| None ->
old + (1.0/6.0) * (v1 + 2.0 * (v2 + v3) + v4)
// addMany available, use it.
| Some add ->
let tmp = add (v1, v2, v2, v3, v3, v4)
old + (1.0/6.0) * tmp


This is a new F# construct, called "pattern-matching". Basically, it means:

"Check the value of vec_ops.add6. If it is equal to None, then do the sum the usual way. otherwise, if the value of vec_ops.add6 is of the form Some add, then do the some by using 'add'".

I think pattern matching is a very nice construct. It allows you to match a value with fairly complex expressions called patterns. I won't dig deeper into patterns yet, let me just say that even though they look complex, the code generated by the compiler is as optimized as possible.

Optimization troubles

Unfortunately, not everything runs as fine as I had hoped. If I look at the generated code, I see that the optimizations that the F# compiler performed in the first part of Tutorial 11 have now disappeared. The problem seems to be that the compiler no longer sees through the record used for the dictionary of operations, and is therefore unable to inline the addition and scale operations. It is also unable to see if vec_ops.add6 is constantly "None" or if it has a constant function value.

As it turns out, the problem has nothing to do with pattern matching. Indeed, if I removed the "up" field from the record, or if use multiple parameters instead of a record, the generated code looks very nice, as shown below.

When an "add6" function is provided:



When no "add6" function is provided:




Was it worth it?
I get a 25% performance boost thanks to "add6" when using 3d vectors. On the other hand, I get a 65% performance drop when summing floats. This seems to be a clear case where tailoring generic code to specific types really helps to achieve the best performance.
There is a link to the full code available at the bottom of this page, if you want to try it out for yourself. I have made an habit of messing up my benchmark measurements, so don't take my word!

Conclusion

Time to conclude the part of these Tutorials dedicated to generic programming and low-level optimization.

To summarize:
1) Make sure your computations on floats are correct. NaN values kill performance.
2) Avoid curried functions when performance is critical, especially for methods.
3) Inline generic functions to avoid indirect calls to operations. However, do not take the habit of inlining all small functions. The compiler or the CLR do that for you.
4) Keep dictionaries of operations small, or avoid them altogether. Alternatives include static type constraints, shown in Tutorial 9, or function parameters.

Regarding item 4), I hope this is a problem that will be fixed in later releases of the F# compiler.

Source code
The complete source code used in this tutorial is available on google code (with comments, for a change).

No comments: