Effective and Efficient Techniques for a Rule-Based Test-Data Generator

This part explains some of the sophisticated software technology that is working behind the scenes in our rule-based test-data generator for form-centric applications. You will see that a simple enumeration of all possible ways to fill in a form is likely doomed to run longer than the age of the universe. Therefore more efficient techniques are needed to make the seemingly impossible possible.

In the previous part of this blog series, we have seen that it is necessary to generate test data containing extreme or special values (ESVs) which are capable of exerting some pressure onto the software under test. We have also seen that it is desirable to squeeze into a single test-data record as many ESVs as possible, in order to reduce the number of records to be fed into the test system. Running functional tests may take considerable time. Therefore we want to achieve a high compression, i.e. a high density of ESVs per data record.

As we mentioned in part 1 of this blog series, the main obstacle in the way of an automated generation of test data for form-centric applications is the presence of cross-field constraints. Before we get to these, let us look at constraints that can be formulated for individual fields alone.

Constraints on single fields — the easy problem

The kind of constraints that one may formulate for a field depends on the type of the latter. Consider a Boolean field. The field may assume only one of two values, true or false. A simple constraint on such a field is that it may not be empty. Additionally there may be attributes governing the external representation of the field values, such as “wahr” and “falsch”, or “1” and “0”.

Next, let’s look at a field representing an amount of Euro with Cents. Again a simple constraint on such a field is that it may not be empty. Other constraints may be that the amount may not be negative, or that the amount may not be zero. Invariably, there will be a constraint on the length of the field, which entails a lower and an upper limit for the amount the field can hold. Further attributes may involve the external representation such as the presence/absence of a positive sign in front of the amount, or the appearance of the decimal separator (a comma in Germany, a dot in the UK or the USA). Working with these constraints can be separated into a first phase, where a valid value is generated, followed by a second phase where an appropriate external representation is constructed. Our generator typically produces extreme and special values (ESVs), as explained in the previous part 2 of this blog series.

For a string field, we always have a maximum length constraint, plus a global constraint describing the character set that may be used. All we need is a simple string generator that looks at the string fields one after the other, and for each field generates data, while obeying the constraints that may exist (see e.g. [2]).

In our form-centric applications, we usually encounter a type of constraint on string fields that is a bit harder to cope with than the length and character set constraints mentioned above: it is the regular expression constraint. For instance in order to specify a valid German zip code number we may use a simple regular expression. Likewise, a regular expression may be devised constraining valid phone numbers, e.g. “(d+[ -])?d+“. A regular expression for valid e-mail addresses is considerably harder to come by.

Matching a given string using a regular expression is a common task in software engineering. The process of generating a valid (or non-valid) string from a regular expression is the inverse of matching. While there are numerous packages that perform matching, it is a lot more difficult to locate a good string-generator that produces a matching (or non-matching) string from a regular expression, while taking into account the side constraints (see e.g. [8]). Fortunately, the theory of finite automata (FA) comes to the rescue, which allows the conversion of any regular expression into a FA [3]. From there it is only a small step to an operational string generator that allows the production of matching or non-matching strings with relatively little effort. Not surprisingly, we have such a string-generator built into our test-data generator.

Figure 1 below shows the transition graph of a finite-state automaton (FA) for the regular expression “ab|cd”. By traversing paths from the start node (marked with an ‘S’) to one of the end nodes (marked with an ‘E’) one can generate the two strings “ab” or “cd” that match the given regular expression.

Transition graph of the corresponding finite-state automaton (FA) for the regular expression ab|cd.

The nice thing is that any valid data item generated for a field not participating in any cross-field constraints can be combined with any valid data item generated for another such field.

Let us now turn to the harder problem, namely to the fields whose values are restricted by cross-field constraints.

Cross-field constraints — the hard problem

Let us begin by looking at a relatively simple problem, namely, how to fill in two fields in an address block of a form. Assume that the fields are the given name and the surname of a person. The single-field constraints that will exist for each field individually may be dealt with in the way described above. Let us assume that there is a single cross-field constraint linking the given name to the surname, and assume that the constraint reads: “if the given name is present, the surname has also to be present”. Formally we might encode this in the assertion

absent(givenName) or present(surname)

Both fields may be empty, but the given name should not occur without the surname.

The simplest approach to solving this miniature constraint satisfaction problem (CSP) is “trial and error”, or “generate and test”. The task of the generator consists in completely filling in the first two columns of the following table:

givenName surname valid
absent absent yes
present absent no
absent present yes
present present yes

Then we use a validator to check whether the constraint is fulfilled.

While this simple strategy is effective and efficient enough for a small number of fields, we quickly run into problems as soon as that number grows. For 10 fields filling in a table like the one above might take as little as 1 microsec, but for 100 fields it might easily take 1000 times the age of the universe. We are witnessing the curse of the combinatorial explosion.

Note that the problem of validating the rows in the truth table only takes polynomial time.

What can be done?

A very successful strategy to cope with the combinatorial explosion is to interleave the data generation with the checking/validation steps. This mixing of steps allows an “early pruning” of the search tree.

Early pruning is a hallmark of practically all CSP-solvers, and, perhaps not surprisingly, CSP-solvers have been used for two decades in the area of automated test data generation (see [4], [5], [7], [8]). Also, the increasing importance of CSPs in main-stream computing is witnessed by the on-going attempt to define a Java-API for CSP-solvers [6].

While early pruning is essential, experience shows that it not sufficient to solve real-world CSPs that occur in the area of automated test data generation. As I mentioned in part 1 of this blog series, it is necessary to partition the CSP into independent components before submitting it to a CSP-solver.

The following figure illustrates this concept. It shows the 32 largest of more than 200 independent connected components. Each little square depicts a field, and a connection between two fields indicates that the two fields occur together in a constraint. This primal constraint graph has been derived from the validation rule base of a large form-centric application developed at mgm tp.

This example for partition the CSP into independent components shows the 32 largest independent connected components of more than 200.

One might think that such a partitioning algorithm would be an integral part of a CSP-solver, but in our experience this is not always the case.

Here is, in brief, how a decomposition of a CSP into independent components can be accomplished:

  1. Initially create a component for each field, and put the field into a set belonging to that component.
  2. Go through the constraints one by one. Identify all fields of the constraint. Identify all components to which these fields belong and merge their sets.
  3. Label the resulting sets.
  4. Go through the constraints one by one again. Identify a field in the constraint. Find its component, and label the constraint with the label of that component.

Clearly, each field that does not participate in any cross-field constraint ends up in its own independent component, where test data generation can take place undisturbed by the rest of the problem.

Implementation and data representation

The bulk of our test data generator has been implemented in Java. We greatly benefit from representing the CSP’s variables and constraints in Lisp data structures. The representation of a constraint in a Lisp-data structure with prefix notation looks like this:

(constraint (
2 "Regel_Zahlen_Vergleich" "4" "Test_Vordruck"
"Zahlen_Kontext" "Nicht_Negative_Zahl"
1 1 1 1 1 1)
(not (and (= 2
(+ num-def_positive_zahl_$v1_$u1_$z1
(>= $t_positive_zahl_$v1_$u1_$z1

As a matter of fact, we have never regretted the decision of choosing a Lisp representation, which was taken early on in the lifetime of the project. There are certain transformations of the original CSP that are easy to carry out, when the variables and constraints are encoded as Lisp data. Also intermediate results can easily be written back to a file for visual inspection.

A lot of configuration information is necessary in order to control our Rule-Based Test Data Generator, such as the number of forms that should be filled in, the maximum number of rows in the fields, etc. but we spare you the detail.

Our test-data generator produces data records for each component individually. Due to the independence of components, the records of the components can be freely combined.

Finally, we obtain a table with the generated test-data records. The figure below shows a portion of this table with several solutions of a test-data generation process. Each row represents a solution; its value consists of a mixture of extreme or special values (such as -99999999999.99 or 0.01), and of values that are the result of satisfying the validation constraints.

A portion of the table with the generated test-data records.

Most of the desired extreme and special values (ESVs) are present in at least one of the records (high test coverage). Since most of the time many ESVs are present in a single record, we attain a high compression rate (i.e. many ESVs in few records).


We are at the end of this blog series on the Rule-Based Test Data Generator (R-TDG). You must be left with the impression that the R-TDG is a complex machinery, and, undoubtedly, it is. Be assured, however, it is in operational use at mgm tp since about two years.

The constraints which the R-TDG needs are served to us on a silver tablet by the rule-based system whose main purpose consists in generating the validators for form-centric applications. This approach is radically different from earlier attempts in the area of automated test-data generation, where the constraints are extracted from source code. To the best of our knowledge the R-TDG is the only working automated test-data generator world-wide that can properly take into account cross-field constraints.


  • [1] Bergmann, V. (2008). Databene Benerator.
  • [2] Brüggemann-Klein, A. (1992). “Regular expressions into finite automata.” Lecture Notes in Computer Science 583: 87-98.
  • [3] DeMillo, R. A. and A. J. Offutt (1991). “Constraint-Based Automatic Test Data Generation.” IEEE Transactions on Software Engineering 17(9): 900-910.
  • [4] Feldman, J. (2011). JSR 331: Constraint Programming API, Java Community Process. 2011
  • [5] Gotlieb, A., B. Botella, et al. (1998). “Automatic test data generation using constraint solving techniques.” ACM SIGSOFT Software Engineering Notes 23 (2): 53–62.
  • [6] Ince, D. C. (1987). “The Automatic Generation of Test Data.” Computer Journal 30 (1): 63-69.
  • [7] msdn (2010). “The Regular Expression Generator”, Visual Studio Team System 2008 Database Developer Center, Microsoft Corporation.
  • [8] Zhan, Y. (2002). Constraint Solving in Test-Data Generation. Lecture Notes in Computer Science 2470. P. V. Hentenryck. Berlin-Heidelberg: 770-771.
Series Navigation<< Producing High-Quality Test Data