emperor ([personal profile] emperor) wrote 2005-05-30 11:39 am

Nargery

I've been thinking for a while now that I should maybe re-organise some of my work code. The question turns out to be slightly less trivial than I thought...

Essentially, I have a library of useful graph/network analysis and simulation tools, written in C. There are currently three ways you can represent a graph in this library:

  1. an n-by-n array of integers

  2. an n-by-n array of bytes

  3. an n-array of adjacency lists


...though I might want to add more later. Currently you choose between these at compile time (it's a switch in the Makefile), and then matrix.h does things like:
#ifdef S_MATRIX
#include "smatrix.h"
#elif defined I_MATRIX
#include "imatrix.h"
...

Each of those header files then #defines various useful operations, e.g.:
/*is bit x of a set?*/
#define MAT_TEST(m,a,x) (m[(a)][SM_BYTE(x)] & SM_BIT(x))
/*set bit x of a*/
#define MAT_SET(m,a,x) (m[(a)][SM_BYTE(x)] |= SM_BIT(x))


The various functions I've written call MAT_SET and so on without having to care which graph representation is in use (since generally it's only an efficiency issue); in the odd case where a function needs a particular representation, it can check which #define you used (again, this is currently a compile-time thing).

There are a few obvious problems. Firstly, for this library to be useful to more people than just me (which I think would be a good aim), people shouldn't have to re-build the entire library frequently (consider multi-user systems). Secondly, it is occasionally useful to be able to transform between the different structures (for example, reading in some data formats is best done using an adjacency-list representation), and currently you can only do that with a bit of hackery.
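As a rough sketch of what a first-class conversion might look like, assuming an adjacency-list form and an n-by-n byte-matrix form (all the type and field names here are invented for illustration; the real structures differ, and error checking is omitted):

#include <stdlib.h>

typedef struct edge {
    int to;
    struct edge *next;
} edge_t;

typedef struct {
    int n;
    edge_t **adj;          /* adj[i] is the list of neighbours of i */
} alist_t;

typedef struct {
    int n;
    unsigned char **m;     /* m[i][j] != 0 iff there is an edge i->j */
} bmatrix_t;

/* Build a byte-matrix copy of an adjacency-list graph. */
bmatrix_t *alist_to_bmatrix(const alist_t *g)
{
    bmatrix_t *b = malloc(sizeof *b);
    b->n = g->n;
    b->m = malloc(g->n * sizeof *b->m);
    for (int i = 0; i < g->n; i++) {
        b->m[i] = calloc(g->n, 1);
        for (edge_t *e = g->adj[i]; e; e = e->next)
            b->m[i][e->to] = 1;
    }
    return b;
}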

I can think of a few possibilities for going ahead, and I was also wondering what my readers (many of whom are far better at this coding malarky than I am) thought:

Firstly, I could carry on as I am doing, which is currently the path of least effort. I'm already finding myself having to do unnecessary work as a result, though, and I'd like to find something better.

Secondly, I could have differently-named operations for each representation, so you could say SMAT_TEST or IMAT_TEST, and so on. You'd still need some mechanism for choosing the default, though, and some mechanism for a library function to know which sort of structure it had been passed. Maybe the matrix_t structure could be a union of the other types, with a flag to say which it was? That has the problem that the TEST and SET functions are often macros. I guess you could compile your code three times into one vast library, and then have a piece of code to work out which graph representation you were using and call the relevant version. This all seems terribly complicated, though...
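For concreteness, a minimal sketch of the tagged-union idea; the names and layouts here are hypothetical, and the run-time dispatch is precisely the bit the macro versions would struggle with:

struct edge { int to; struct edge *next; };   /* adjacency-list cell */

typedef enum { REP_INT, REP_BYTE, REP_ALIST } rep_t;

typedef struct {
    rep_t rep;                     /* which representation is in use */
    union {
        int           **imat;      /* n-by-n array of integers   */
        unsigned char **bmat;      /* n-by-n array of bytes      */
        struct edge   **alist;     /* n-array of adjacency lists */
    } u;
} matrix_t;

/* Run-time dispatch on the flag; a simple macro can't easily do this. */
int mat_test(const matrix_t *m, int a, int x)
{
    switch (m->rep) {
    case REP_INT:  return m->u.imat[a][x] != 0;
    case REP_BYTE: return m->u.bmat[a][x] != 0;
    case REP_ALIST:
        for (struct edge *e = m->u.alist[a]; e; e = e->next)
            if (e->to == x)
                return 1;
        return 0;
    }
    return 0;
}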

If I seem a little obsessed with all this, it's because which data structure I use makes a noticeable difference to run-time for different algorithms - some graph algorithms are very expensive, and I'm working with large graphs, hence the desire to get this sort of thing right.

[if anyone wants to see the complete header files, do ask; they just seem a bit large to dump into LJ]

[identity profile] ewx.livejournal.com 2005-05-30 11:24 am (UTC)(link)

For your library you can decorate the function names with the representation, and the user-facing header file can pick the right implementation for whichever representation is selected by defining the publicly visible function name to the appropriate decorated one.
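A sketch of what that decoration scheme might look like (the smat_/imat_ prefixes and struct names are just for illustration):

struct smatrix;                 /* opaque representation types */
struct imatrix;

/* smatrix.h: the decorated interface for one representation */
int smat_test(const struct smatrix *m, int a, int x);
void smat_set(struct smatrix *m, int a, int x);

/* imatrix.h: the decorated interface for another */
int imat_test(const struct imatrix *m, int a, int x);
void imat_set(struct imatrix *m, int a, int x);

/* matrix.h: the user-facing header picks the public names */
#if defined S_MATRIX
#  include "smatrix.h"
#  define mat_test smat_test
#  define mat_set  smat_set
#elif defined I_MATRIX
#  include "imatrix.h"
#  define mat_test imat_test
#  define mat_set  imat_set
#endif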

However, having callers know the representation, at the level of having macros that need to understand it, etc, is going to lose you type-safety at link time.

My usual approach for differing types with a common interface is the same as that used by Berkeley DB; the object representing the graph contains, or contains a pointer to, a table of function pointers, and all the knowledge of the underlying representation is hidden behind that. The only time your program ever knows the representation is when it requests the library to create it a new graph.
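A minimal sketch of that shape, with invented names; the only part callers ever see is the constructor they choose:

typedef struct graph graph_t;

typedef struct {
    int  (*test)(const graph_t *g, int a, int x);
    void (*set)(graph_t *g, int a, int x);
    void (*destroy)(graph_t *g);
} graph_ops_t;

struct graph {
    const graph_ops_t *ops;   /* table for this representation  */
    void *rep;                /* representation-specific data   */
};

/* Thin wrappers so callers just say graph_test(g, a, x). */
static inline int graph_test(const graph_t *g, int a, int x)
{
    return g->ops->test(g, a, x);
}

static inline void graph_set(graph_t *g, int a, int x)
{
    g->ops->set(g, a, x);
}

/* Each representation would provide its own constructor, e.g.
 * graph_t *graph_new_bitmatrix(int n);
 * graph_t *graph_new_adjlist(int n);
 * which fills in ops with that representation's functions. */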


[identity profile] ewx.livejournal.com 2005-05-30 11:25 am (UTC)(link)
...I see IWJ is suggesting the same idea as my final paragraph on IRC, in fact ;-)

Re: via #chiark

[identity profile] deliberateblank.livejournal.com 2005-05-30 02:48 pm (UTC)(link)
One issue with converting from macros to functions is that these operations (essentially single-bit testing and manipulation) are fairly trivial, such that the function call overhead can be many times higher than the cost of the operation itself. If you end up calling them a *lot* (like at the bottom of a nested loop), this overhead can become crippling to performance. Of course, whether this is an issue for you or not depends on what exactly the client code is doing - i.e. you need to test it on some real-world problems to find out.

Personally I'd go with the function-name decoration, with macros choosing a default set of undecorated names, but make the functions inlineable. If you put a key value in each matrix structure, higher-level functions can choose the right low-level implementation without needing to be duplicated for each one (or requiring an extra type parameter).
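A sketch of how that could fit together; the names, key values and 8-bits-per-byte packing here are all assumptions made for illustration:

typedef enum { MAT_KEY_BITS, MAT_KEY_INTS } mat_key_t;

typedef struct {
    mat_key_t key;              /* which representation this matrix uses */
    int n;
    union {
        unsigned char **bits;   /* bit-packed rows  */
        int **ints;             /* full integer rows */
    } u;
} mat_t;

/* Decorated, inlineable low-level operations. */
static inline int smat_test(const mat_t *m, int a, int x)
{
    return (m->u.bits[a][x >> 3] >> (x & 7)) & 1;   /* bit x of row a */
}

static inline int imat_test(const mat_t *m, int a, int x)
{
    return m->u.ints[a][x] != 0;
}

/* Compile-time choice of the undecorated default. */
#ifdef S_MATRIX
#  define mat_test smat_test
#else
#  define mat_test imat_test
#endif

/* Generic code can consult the key rather than being duplicated
 * for each representation. */
static inline int mat_test_any(const mat_t *m, int a, int x)
{
    return m->key == MAT_KEY_BITS ? smat_test(m, a, x)
                                  : imat_test(m, a, x);
}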

Re: via #chiark

[identity profile] womble2.livejournal.com 2005-05-30 08:19 pm (UTC)(link)
Those of us who actually know C++ know that virtual function calls are rather more expensive than ordinary function calls, since they generally can't benefit from inlining or branch prediction. The same goes for any indirect branching mechanism.