emperor ([personal profile] emperor) wrote 2005-05-30 11:39 am

Nargery

I've been thinking for a while now that I should maybe re-organise some of my work code. The question turns out to be slightly less trivial than I thought...

Essentially, I have a library of useful graph/network analysis and simulation tools, written in C. There are currently three ways you can represent a graph in this library:

  1. an n-by-n array of integers

  2. an n-by-n array of bytes

  3. an n-array of adjacency lists


...though I might want to add more later. Currently you choose between these at compile time (it's a switch in the Makefile), and then matrix.h does things like:
#ifdef S_MATRIX
#include "smatrix.h"
#elif defined I_MATRIX
#include "imatrix.h"
...

Each of those header files then #defines various useful operations, e.g.:
/*is bit x of a set?*/
#define MAT_TEST(m,a,x) (m[(a)][SM_BYTE(x)] & SM_BIT(x))
/*set bit x of a*/
#define MAT_SET(m,a,x) (m[(a)][SM_BYTE(x)] |= SM_BIT(x))


The various functions I've written call MAT_SET and so on without having to care which graph representation is in use (since generally it's only an efficiency issue); in the odd case where a function needs a particular representation, it can check which #define you used (again, this is currently a compile-time thing).

There are a few obvious problems. Firstly, for this library to be useful to more people than just me (which I think would be a good aim), people shouldn't have to re-build the entire library frequently (consider multi-user systems). Secondly, it is occasionally useful to be able to transform between the different structures (for example, reading in some data formats is best done using an adjacency-list representation), and currently you can only do that with a bit of hackery.
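As a rough sketch of what a first-class conversion might look like, assuming an adjacency-list form and an n-by-n byte-matrix form (all the type and field names here are invented for illustration; the real structures differ, and error checking is omitted):

#include <stdlib.h>

typedef struct edge {
    int to;
    struct edge *next;
} edge_t;

typedef struct {
    int n;
    edge_t **adj;          /* adj[i] is the list of neighbours of i */
} alist_t;

typedef struct {
    int n;
    unsigned char **m;     /* m[i][j] != 0 iff there is an edge i->j */
} bmatrix_t;

/* Build a byte-matrix copy of an adjacency-list graph. */
bmatrix_t *alist_to_bmatrix(const alist_t *g)
{
    bmatrix_t *b = malloc(sizeof *b);
    b->n = g->n;
    b->m = malloc(g->n * sizeof *b->m);
    for (int i = 0; i < g->n; i++) {
        b->m[i] = calloc(g->n, 1);
        for (edge_t *e = g->adj[i]; e; e = e->next)
            b->m[i][e->to] = 1;
    }
    return b;
}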

I can think of a few possibilities for going ahead, and I was also wondering what my readers (many of whom are far better at this coding malarky than I am) thought:

Firstly, I could carry on as I am doing, which is currently the path of least effort. I'm already finding myself having to do unnecessary work as a result, though, and I'd like to find something better.

Secondly, I could have differently-named operations for each representation, so you could say SMAT_TEST or IMAT_TEST, and so on. You'd still need some mechanism for choosing the default, though, and some mechanism for a library function to know which sort of structure it had been passed. Maybe the matrix_t structure could be a union of the other types, with a flag to say which it was? That has the problem that the TEST and SET functions are often macros. I guess you could compile your code three times into one vast library, and then have a piece of code to work out which graph representation you were using and call the relevant version. This all seems terribly complicated, though...
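For concreteness, a minimal sketch of the tagged-union idea; the names and layouts here are hypothetical, and the run-time dispatch is precisely the bit the macro versions would struggle with:

struct edge { int to; struct edge *next; };   /* adjacency-list cell */

typedef enum { REP_INT, REP_BYTE, REP_ALIST } rep_t;

typedef struct {
    rep_t rep;                     /* which representation is in use */
    union {
        int           **imat;      /* n-by-n array of integers   */
        unsigned char **bmat;      /* n-by-n array of bytes      */
        struct edge   **alist;     /* n-array of adjacency lists */
    } u;
} matrix_t;

/* Run-time dispatch on the flag; a simple macro can't easily do this. */
int mat_test(const matrix_t *m, int a, int x)
{
    switch (m->rep) {
    case REP_INT:  return m->u.imat[a][x] != 0;
    case REP_BYTE: return m->u.bmat[a][x] != 0;
    case REP_ALIST:
        for (struct edge *e = m->u.alist[a]; e; e = e->next)
            if (e->to == x)
                return 1;
        return 0;
    }
    return 0;
}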

If I seem a little obsessed with all this, it's because which data structure I use makes a noticeable difference to run-time for different algorithms - some graph algorithms are very expensive, and I'm working with large graphs, hence the desire to get this sort of thing right.

[if anyone wants to see the complete header files, do ask; they just seem a bit large to dump into LJ]

[identity profile] ewx.livejournal.com 2005-05-30 11:24 am (UTC)(link)

For your library you can decorate the function names with the representation, and the user-facing header file can pick the right implementation for whichever representation is selected by defining the publicly visible function name to the appropriate decorated one.
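A sketch of what that decoration scheme might look like (the smat_/imat_ prefixes and struct names are just for illustration):

struct smatrix;                 /* opaque representation types */
struct imatrix;

/* smatrix.h: the decorated interface for one representation */
int smat_test(const struct smatrix *m, int a, int x);
void smat_set(struct smatrix *m, int a, int x);

/* imatrix.h: the decorated interface for another */
int imat_test(const struct imatrix *m, int a, int x);
void imat_set(struct imatrix *m, int a, int x);

/* matrix.h: the user-facing header picks the public names */
#if defined S_MATRIX
#  include "smatrix.h"
#  define mat_test smat_test
#  define mat_set  smat_set
#elif defined I_MATRIX
#  include "imatrix.h"
#  define mat_test imat_test
#  define mat_set  imat_set
#endif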

However, having callers know the representation, at the level of having macros that need to understand it, etc, is going to lose you type-safety at link time.

My usual approach for differing types with a common interface is the same as that used by Berkeley DB; the object representing the graph contains, or contains a pointer to, a table of function pointers, and all the knowledge of the underlying representation is hidden behind that. The only time your program ever knows the representation is when it requests the library to create it a new graph.
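A minimal sketch of that shape, with invented names; the only part callers ever see is the constructor they choose:

typedef struct graph graph_t;

typedef struct {
    int  (*test)(const graph_t *g, int a, int x);
    void (*set)(graph_t *g, int a, int x);
    void (*destroy)(graph_t *g);
} graph_ops_t;

struct graph {
    const graph_ops_t *ops;   /* table for this representation  */
    void *rep;                /* representation-specific data   */
};

/* Thin wrappers so callers just say graph_test(g, a, x). */
static inline int graph_test(const graph_t *g, int a, int x)
{
    return g->ops->test(g, a, x);
}

static inline void graph_set(graph_t *g, int a, int x)
{
    g->ops->set(g, a, x);
}

/* Each representation would provide its own constructor, e.g.
 * graph_t *graph_new_bitmatrix(int n);
 * graph_t *graph_new_adjlist(int n);
 * which fills in ops with that representation's functions. */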


[identity profile] ewx.livejournal.com 2005-05-30 11:25 am (UTC)(link)
...I see IWJ is suggesting the same idea as my final paragraph on IRC, in fact ;-)

Re: via #chiark

[identity profile] deliberateblank.livejournal.com 2005-05-30 02:48 pm (UTC)(link)
One issue with converting from macros to functions is that these operations (essentially single-bit testing and manipulation) are fairly trivial, such that the function call overhead can be many times higher than the cost of the operation itself. If you end up calling them a *lot* (like at the bottom of a nested loop), this overhead can become crippling to performance. Of course, whether this is an issue for you or not depends on what exactly the client code is doing - i.e. you need to test it on some real-world problems to find out.

Personally I'd go with the function-name decoration, with macros choosing a default set of undecorated names, but make the functions inlineable. If you put a key value in each matrix structure, higher-level functions can choose the right low-level implementation without needing to be duplicated for each one (or requiring an extra type parameter).
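A sketch of how that could fit together; the names, key values and 8-bits-per-byte packing here are all assumptions made for illustration:

typedef enum { MAT_KEY_BITS, MAT_KEY_INTS } mat_key_t;

typedef struct {
    mat_key_t key;              /* which representation this matrix uses */
    int n;
    union {
        unsigned char **bits;   /* bit-packed rows  */
        int **ints;             /* full integer rows */
    } u;
} mat_t;

/* Decorated, inlineable low-level operations. */
static inline int smat_test(const mat_t *m, int a, int x)
{
    return (m->u.bits[a][x >> 3] >> (x & 7)) & 1;   /* bit x of row a */
}

static inline int imat_test(const mat_t *m, int a, int x)
{
    return m->u.ints[a][x] != 0;
}

/* Compile-time choice of the undecorated default. */
#ifdef S_MATRIX
#  define mat_test smat_test
#else
#  define mat_test imat_test
#endif

/* Generic code can consult the key rather than being duplicated
 * for each representation. */
static inline int mat_test_any(const mat_t *m, int a, int x)
{
    return m->key == MAT_KEY_BITS ? smat_test(m, a, x)
                                  : imat_test(m, a, x);
}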

Re: via #chiark

[identity profile] womble2.livejournal.com 2005-05-30 08:19 pm (UTC)(link)
Those of us who actually know C++ know that virtual function calls are rather more expensive than ordinary function calls, since they generally can't benefit from inlining or branch prediction. The same goes for any indirect branching mechanism.