On the problem of database functions implementation "fan out"

On part I, we've introduced the problem of DB function fan-out, and we figured every such function is actually a multi-function with 2^n versions, given n inputs. On part II we've seen a few solutions to the problem at hand. On part III we introduced some template metaprogramming techniques and a few classes that will serve as building blocks and will help us understand the solution which will be presented on this part. Besides exploring the solution, we'll also introduce a benchmark to compare it to some of the other solutions we've shown during this multi-part article. We'll wrap up by connecting it back to the subject we started from: databases (or ETL engines). As always, I'll try to shed some light on modern C++ topics that appear on the sideways (at least from my perspective). A lot to accomplish in this last part...so let's get started!

Introducing the SuperFunctor concept, usage & power

On the previous article, I've presented the FunctorInputSetter struct, with its wrapper struct FunctorInputSetterWrapper. Let's recall: given a tuple of inputs and a version number (mask) between 0 and tuple size - 1, I use this mask to get a new tuple of the same length. Bit #i on the mask decides for the ith element in the tuple - type T, whether to include T in the result tuple (i.e.: T represents a constant input element) or to include std::vector<T> in it (i.e.: T represents a vector input element). The tuple of inputs and the version number represent a scalar function input and the multi-function version number respectively. The output represents the right input to that version of the multi-function. Now we'd like to use that struct to achieve the following:

Take a function, like subString with a (scalar) signature of string subString(string, int, int), and design a super-functor class that will implement all 2^3 = 8 versions of the function, using a mask that decides on the version number.
All these different implementations must be created at compile-time, be effective, and most importantly - be written only once and run "polymorphically" for all versions.

Now, I think that already sounds like a difficult task, but is it enough? Let's assume we do have this "magical" code and that it contains only 30-40 rows of code. So we'll have a dedicated SubString class with one scalar version of subString and 30-40 lines of template metaprogramming code that "manufactures" all 8 different versions of this multi-function. Well, impressive, but what about the concat function? or Plus? How about toDate? And what about all other 250 multi-functions of the database? Should each one of them be embedded in its own class and include this "magical" TMP code of 30-40 lines? If we've learned the TMP lesson well, the answer is No. We can make the "magic" code even more magical by using more TMP and avoid the repetitive code per multi-function class! We'll create a single class that will help us implement all multi-functions and all their versions! We'll call our TMP-based class the SuperFunctor!

Before presenting the full solution, let's have a look at its usage. I'd like to show you how simple things can get thanks to that solution. Do you remember the long & complex subString code we saw on both MonetDB & ClickHouse on part II? Do you remember the horrible code repetition we saw in these solutions (and also in the simplest solution presented first)? Bare in mind: this was only for subString. Now imagine this kind of code for 250 database functions and you got yourself a prefect mess to maintain. Now have a look at the SuperFunctor use for subString. First, we implement the basic, scalar version of subString:

Now let's see the initialization of all 8 versions of the subString's SuperFunctor:

What we have here is the declaration of 8 different versions of subString, with substringFunctorX being version #x of the subString multi-function, as presented in the truth table on part III of this article. For example: substringFunctor4 is the subString version with the multi-function input signature of (string, int, vector<int>), as 4 = 0b100 would suggest. Now, in the image above, we can see that we pass the scalar function substring as an argument to the createSuperFunctor template function and use the mask as an explicit template parameter (8 masks in total). We'll soon see how and why this syntax is used, but before that let's agree that this is rather simple.

Now, let's see the use of these different super functor versions. For that assume we have 3 constant variables declared like this (pay attention to the index and type of each, as in the truth table):

and 3 vector variables (again - index and type as in the truth table):

(code in my git includes the full example, including the input vector initialization). For now, let's focus on the usage of the different 8 versions we created above:

Pay close attention please: each version uses the "right" variables, either a vector or a scalar, on each of the 3 function parameters. These are put inside a function call (functor objects has an opeartor() function) that calls the multi-function version very naturally, each call with its own version. These calls might remind you of the simplest example we showed on part I of the article. There, we used 8 isolated, overloaded functions (no templates) that basically duplicated the entire code (each time with a different signature and a bit of a different loop inside). So, please notice that the function calls on our solution are as simple as the ones on the simplest solution! All that was needed is declaring the functors above, using the factory function createSuperFunctor with the mask as a template argument, and the scalar function as a parameter to the function. Then we could use all 8 versions very naturally!

Moreover, if we got confused, trying to call, for instance, subStringFunctor7 with 1 non-vector parameter like that: res7 = substringFunctor7(param1, param2, constant1), we immediately get a compilation error: "error: no match for call to ‘(SuperFunctor...".

Let that sink in for a moment: The compiler protects us from calling any one of the automatically manufactured 8 versions with wrong parameters (any one of the 3)!!!

Amazing. Imagine what a great power to have on your side: a C++ compiler! That might become very handy on multiple multi-functions with 3+ parameters, right?

Later, we'll discuss briefly how the above code could neatly be used inside a database catalog that will easily register all these versions more effectively than the 8 rows shown above (which aren't that bad anyway).

The SuperFunctor class implementation

Now, we present the full SuperFunctor implementation and try to explain its main parts:

createSuperFunctor helper method

First, have a look at lines 76-81. There we define a helper function that will help us construct the SuperFunctor more easily (more on helper functions on C++, their pre-C++17 role and why I used them, down below). The createSuperFunctor function is templated on 3 parameters:

Mask (which we already saw).
Out which is the out parameter type of the db multi-function (e.g.: std::string for subString).
FullInputTypes, which in this case is the variadic type list of the scalar version of the multi-function (e.g.: string, int, int, in case of subString).

The way we get Mask is by explicitly supplying it in the function call, as in: auto substringFunctor2 = createSuperFunctor<0b010>(subString); (Mask=0b010).

The other 2 template parameters are obtained using template function type deduction from the function argument (more on that in the "Helper methods and CTAD" section below). When we declare a function argument: Out (*t) (const FullInputTypes&...), we tell the compiler that we expect a function that returns Out as its return type and gets the types in the variadic template list (FullInputTypes) as input. This is of course much simpler than writing explicitly all these template parameters. Now, this function basically performs two things:

On line 78 - static assert on the mask size - verifies at compile time that the size of the mask isn't larger than it should be (no more than 2^n - 1 when n is the length of the variadic template type). More details on that on the next section.
On Lines 79-80 we use the class template presented on part III of this article: FunctorInputSetterWrapper. We make use of this class to transform the list of "scalar" types in FullInputTypes to a "vectorized" form according to the value of Mask, as explained above. This way, we can supply the SuperFunctor class defined on lines 22-74, with the TransformedType variadic template parameter which it needs (see below). So on lines 79-80, we call SuperFucntor with a full explicit template list specified (all 3 types), and pass the scalar function t as a parameter to the constructor (the same t we got on line 77 as parameter).

SuperFunctor class

Now, let's focus on the SuperFunctor definition on lines 22-74. As before, we use a general type with 3 template parameter types on lines 22-23, and then declare a specific specialization of it on lines 25-74. The class is templated on 3 type parameters (T, Mask & S), which are specialized using 4 template types: Out (as above), variadic type FullInputTypes (scalar function types, as explained above), Mask & another variadic type, TransformedType (the vectorized, transformed version of FullInputTypes, as explained above). We obtain the values for all 4 types by the specialization on line 26:

Out & FullInputTypes are deduced from matching the expression Out (const FullInputTypes&...) to the actual scalar function signature passed on line 79 (which relates to T on line 22). I declared This type of function as Func on line 28 for convenience (the syntax is a bit different from that of the template parameter on line 26, but this is just the C++ syntax).
Mask is explicitly given (relates to Mask type on line 22).
The TransformedType variadic type is given within a std::tuple (std::tuple<TransformedType...>). It relates to S type on line 22.

Now, let's go over the rest of the class quickly:

The constructor on line 30, uses the definition of Func above, to get the actual scalar function as parameter f. It keeps that function as a private field (declared on line 33).
2 templated functions named getSize on lines 45-53, are used to return the size of the vector (line 47) or the size of the scalar (which is always 1, line 52).
Function findSize on lines 55-60 is used to obtain the input size. Yes, we could get it as another parameter to operator() below, but I thought it might be nice to deduce it by myself without troubling the user. This function finds the size of every element in the input (given by TransformedType) and inserts it into a vector of sizes (each one of the inputs is send to getSize template function, using fold expression feature of C++17 on vector's push_back presented on the part III). Then, it returns the max element of this vector, which represents the size of all vectors on input (if there are only scalars, the super functor will return a vector of size 1 with the scalar function applied on the scalar input).
2 templated functions extractValue on lines 35-43 are used to abstract away the functionality of extracting the value of index i from any one of the inputs (either a vector or a scalar), passed as a parameter to each function. So for a vector (line 46) it just returns the value in the index inside the vector, whereas for a scalar (line 37) it just returns the constant value (ignoring the index value).
The only public method is operator() (as usual with functor objects) on lines 63-73. This way we can use the functor like a function, plain and simple (e.g.: subStringFunctor7(param1, param2, param3)). This function defines a vector of results using Out type (line 64) and then finds the input size using findSize (line 65). Then it uses a sophisticated foreach loop on line 68-70 to call the scalar function (held on private field func) on each "row" of the input. For every index i, this function executes the templated extractValue function on each one of the input types by itself. It does it by using the fold expression feature over the variable input, of the variadic type TransformedType. Then, it runs func (which is the scalar function) on the results of all calls to extractValue, thus applying the scalar function on all "scalar" values, extracted from all inputs. The result is pushed immediately to the vector of results. All this happens on line 69, which is the highlight of this solution! Awesome isn't it? The result vector is then returned to the caller.

As promised: we only implemented the functionality of the multi-function once (in the scalar function only), and then created a super functor for it! No code duplication at all, and you don't even need to write that super functor for every multi-function.

It is written once and runs everywhere: for every function, on all its versions. Neat!

Note: The use of private template functions inside the solution don't blow-up the generated code too much - they are created at compile time and each one has a max of n versions with n = size of variadic input.

Static assert on mask size (+ compile error on mask too big)

I'd like to emphasize another one of the benefits we obtain from using TMP. Having another look at the initialization of the functors above, using createSuperFunctor with the additional mask, one could argue, and rightfully so, that we might make a mistake while writing the mask. We saw above that the compiler protects us from calling any one of the 8 versions with wrong parameters. Guess what - it can also prevent you from initializing your class with wrong mask! For that we use the static assert on line 78. We calculate a constant value in compile time, that in case of subString, is equal to 2^3=8 (using std::pow with 2 as base and the size of the variadic input of the function as coefficient). This way we limit Mask to be less than 8 (0 - 7). If we would like (I haven't added that), we can assert that Mask >= 0. The code wouldn't compile anyway, but I think using negative mask is less likely than using a value of mask that is "out of bound" (which could happen easily). Actually, if you think about it using super functor with Mask = 0 is also redundant as we already implement the scalar version anyway, so it doesn't really matter (although we support it!). So, back to the upper bound, if we wrongly use the mask value of 8 = 0b1000 like that: auto myFunctor8 = createSuperFunctor<0b1000>(subString), we'll immediately get a compilation error with the error message from the assert above: "error: static assertion failed: Mask too big.". Thanks compiler, you've been very helpful, mate!

Helper methods and CTAD

An astute reader might have wondered about the use of the helper method createSuperFunctor on line 77. Why do we need it? Why can't we just call SuperFunctor directly without this intermediate call? Let's dive into modern C++ for a bit. Prior to C++17, you always had to specify all template parameters for a class template explicitly. Back then, C++ had no support for automatic deduction of class template parameters from the arguments passed to the class template constructor. However, function templates have always supported the automatic deduction of template parameters based on the arguments passed to the function template. For example, the class template std::pair can hold 2 values of different types. Up to C++17, if we wanted to initiate an std::pair<int, double> we would need to choose one of these options:

Explicitly specify all template parameters on the type definition std::pair<int, double> pair1(2, 3.4);.
Avoid the explicit specification of all template parameters by using a helper function called std::make_pair(). This function "enjoys" the automatic deduction of template parameters based on the values past to it (as any function), so we can easily write: auto pair2 = std::make_pair(2, 3.4); As long as the actual values are well-defined (e.g.: 2 can't be translated to both int and short), this would work.

Now, with C++17 we have a new feature called CTAD - class template argument deduction. The compiler now automatically deduces the template type parameters based on the arguments passed to the constructor. So now we can simply write: std::pair pair3(2, 3.4);.

Therefore such helper functions are not necessary anymore. Or are they? First, we still encounter them with the smart pointer helper functions std::make_unique & std::make_shared (if you pass a T* to their constructors, the compiler would have to choose between deducing <T> or <T[]>...not good enough I guess). Generally speaking, CTAD would work only when all template parameters of a class template could be either deduced from parameters to the constructor or have a default value in the template class definition. Well, in our case we declared a class template with multi-template parameters :

but required our initialization to be very simple: auto substringFunctor2 = createSuperFunctor<0b010>(subString); Hence, we wanted to supply only the Mask as explicit template parameter, and all other template parameters to be automatically deduced from the scalar function passed as parameter! Well, unfortunately this can't be done even with the powerful feature of CTAD, as too many unknown template parameters are left uninitialized. We have to use the plain-old helper method for that, as its type deduction is still much more capable than that of the class template deduction (not sure why. Interesting discussions on stackoverflow...).

Some benchmark figures - examine the performance perspective

As discussed on part II of this article, we have much simpler solutions for the problem. Sure, they have their disadvantages (and plenty of them, as we saw), but some of them, especially the simple, hard-coded one, might serve as a baseline for what we might expect from the multi-function implementation from the performance perspective. As discussed before, it might be a "very nice", "academic" exercise to solve this problem with TMP (this is the common way most professionals treat TMP anyway). This wasn't my purpose here.

I would like to offer that solution as an alternative solution to the problem in a high performance database or ETL engine. Therefore, that won't give us much if the solution is much slower than the baseline solution.

To help put the presented solution in a performance perspective, I've made some basic benchmark tests. This is not the usual tests you would have performed for benchmarking. The purpose here was not to optimize the subString solution as much as we can. In order to do that, we will need to:

Optimize the scalar subString function used on all solutions (highly important. This function is executed millions of times on a multi-million rows vector input).
Run all solutions in a multi-threaded manner, as there is no dependency between row i and row j for i != j. It's a basic solution of mapping the input to multiple threads in parallel (there is no real reduce here) and finishing the task much faster. Nothing new here: I have done that dozens of times before, while implementing new functions on a large scale database.

Well, this is not the purpose of this benchmark. Any solution we've seen could be easily paralleled by using multiple threads / tasks approach, and any one of them will benefit the same from a faster scalar subString implementation (as they all use it the same number of times). The purpose of the benchmark is to show that applying the TMP solution, instead of the baseline solution, will not only benefit us from the functional, code-reuse, testing (and other) perspectives, but that it is also a great contestant from the performance perspective as well. Meaning - it is a very effective solution that won't make you throw it out the window when considering it as a solution to the problem in a real database, or marking it as "pure academic". The only thing you "pay" is the time you spent in learning TMP techniques and applying them. I think it's a reasonable price.

In the benchmark I've looked into the 4 solutions discussed thus far:

The baseline solution - the one we presented on part I, that is basically what was done on both the databases we've analysed on Part II (a.k.a.: all versions use the same for loop on different vectors and calling the scalar version on every row).
The "parametrized" solution shown on Part II - templated class Paramater.
A variant of Parameter that uses std::variant - the class Parameter2 presented on part II.
Our bright, shining, new TMP solution :)

I've run all solutions on the all-vector version of subString (getting 3 vectors as input) with input size of 50M. It was executed on my laptop in a single-threaded manner. As discussed, the absolute numbers introduced below aren't important, but only the relative execution times between the solutions and the baseline solution (therefore it doesn't matter what is the CPU spec of my laptop and the fact that I run that on a single thread). Well, enough said, here are the results for "subString7" version:

So, we can observe that both the parametrized solutions are too slow: the first adds only +26.77% to the running time in comparison to the baseline simple solution (still high), and the 2nd is "outrageously" slow (+142.27% running time). So both are "out of the question" for best-performance replacement to the simple solution. But have a look:

our new TMP solution is only adding 1.5% to the running time of the baseline solution!

I think this analysis shows that from the performance perspective, the new solution is at least worth your while. I'll also mention that the results are consistent, and keep the same ratio on larger inputs (no diminishing performance on any solution on larger inputs).

To conclude that section, let's examine a similar, smaller, comparison of both solutions in "subString1" version, i.e.: the version of subString with a string vector and 2 scalar integers. Here, the gaps gets larger to 4%-5% (still not that bad). The reason for that is simple: "the edge" of the basic solution over the TMP solution is that for any row, it has "direct access" to the parameter values (either vector or constant), whereas the TMP solution must always use 3 calls to extractValue function to extract the value "polymorphically" from either a vector or a constant. When all inputs are vectors, this difference is smaller (because vector access still requires special memory access to multiple cells even if they are consecutive). But when constants are used, the basic solution uses a single-memory cell holding the constant value (which is probably accessible through L1 cache throughout the entire calculation), whereas the TMP solution uses a function call every time. This is less effective. It's not that critical, because these 4-5 percent difference is kept on larger inputs as well (so it's capped). It doesn't rule out that solution of course.

Some words on developing with TMP

Developing a solution in TMP is periodic; everyone of my solutions had a few cycles of development - each one improves/optimizes the previous one by a bit. The solution presented here wasn't written like that out-of-the-box (no way Jose..). I've kept finding better and better implementations even while writing this set of articles! So my message to the amateur TMP developer:

Don't give up when something doesn't work or even doesn't compile - that happens a-l-o-t! Try, try again!

Possible use in a database catalog

As I've mentioned more than once, I strongly believe this solution can be effectively integrated in a real database management system. An important part of any such DBMS is the database catalog. This catalog usually declares what kind of data types the database can handle and which functions it supports. Generally speaking, a database will have a query DSL (e.g.:SQL) that will use a dedicated syntax with function extensions. Every time a user writes a query with a function, the parser will parse the query and check whether it is a valid one. In the parsing phase, we usually don't validate the functions, but only refer to them as 'tokens' or nodes in the parse tree. Later, before creating the query execution plan, every such function name and its parameters are checked against the database catalog. Only if the function appears in the database and with a signature that supports the current use of it in the query, then we can move on to its execution. In order to support that functionality, there is usually a module in code that is responsible for the function registration in the database catalog. If you recall, we've seen an implementation of such DB registration on our discussion of MonetDB on part II of this article.

Now, we had the following initialization of subString before:

This kind of code in a database will usually be written a bit differently. Usually we'll find it as part of 8 different function calls to the db catalog register function. In our case all these functions are actually created during this phase (as with any template class, it must be created at compile time with all relevant types). Now, the code in the image above looks a bit tedious. Why do we need this "boilerplate" code of 8 lines? As always, we should ask ourselves whether we can let the compiler handle this work. The answer is yes. We can easily use a templated register function that will get the subString scalar function as a single function argument. This function will deduce the number of input parameters n in that scalar function and can generate all necessary calls to createSuperFunctor using Mask values from 0 to 2^n - 1. It can also make the necessary type registration to its inner data structures to support both functions' input & output type checking, and function name checking during query execution later.

I haven't added this code here, because I haven't implemented an entire database catalog, and because this would have made this article even longer (oh my god no. It's long enough - you will agree :)). But as discussed above, this solution can be fully integrated with easy, sophisticated db catalog registration from one hand, and query post-parsing, function resolution processing on the other hand.

Wrap up

Well, by this we've concluded all parts of the article 'On the problem of database functions implementation "fan out" '. First, I hope you're still here. I know this part was a long, difficult read. So if you are here, I consider it an accomplishment and you as a faithful, interested reader. Thank you. Second, I hope I've convinced you that I've dealt with a real-world problem, one that actually bothered me when I got the task to implement a database or an ETL engine row-based functions (dozens of them). Third, I hope you could understand most, if not all, of the offered TMP solution and that you found it at least thought-provoking, if not mind-blowing (I couldn't expect that of course). Last, I took the time to discuss some topics that are pure database topics. I also tried to extend the discussion to inclulde a few of the corners in the C++ standard that we've came across (especially the ones dealing with templates, variadic templates, TMP and type deduction). I've introduced some TMP techniques using the latest C++ features (like fold expressions) to make you intrigued and help you eventually understand the SuperFunctor solution. I hope the technical explanation was not too tedious and didn't lack any important parts that are critical to your understanding. TMP isn't easy to grasp, as we all know.

I hope you enjoyed reading my articles. Please hit the comments, so I'll know that at least some of you found them interesting / useful, and that they were actually been read by a human and not only visited :) (Linkedin doesn't give any statistics on that, just #visits. Human comments are the best). I hope to write more content like that soon. Your comments might also suggest potential discussion topics or raise issues for me to research and discuss on future articles. Please comment. Any comments will be good.

Stay healthy. Be kind to one another. Thank you all.

Omer.P

‍