Form Validation with Rule Bases

A lot of our web applications contain a large number of forms with hundreds of fields and complex cross-field constraints. mgm’s quality assurance team uses rule bases and automatic form validation to verify the correctness of these apps. This blog series discusses the challenges in generating test data for this verification and explains our automated process for producing masses of test data by utilizing the rule bases.

Software applications for the insurance, tax, or pension sectors and, to some extent, for banking and online shops as well, can be characterized as form-centric. These applications require data input through many complex forms. Each of these forms usually holds many fields with diverse data types and allowed text patterns, such as the pattern for an email address.

mgm develops web applications, some of them with 100+ web forms.

Some of our projects contain 100+ forms (shown in the figure above), often with 1000+ distinct fields, each of which may have multiple rows. Since many of these forms can occur in more than one instance, there are potentially many thousands of individual data items the user may have to provide.

A distinguishing feature of a form-centric application is that the user may enter free-form text and numbers into the fields. Therefore, the user input must be validated before further processing in the backend may take place. Let us now briefly pursue the question of how the input to form-centric applications can be validated.

The validation rules for the user’s input are conveniently specified in a kind of data dictionary. The data dictionary may be decomposed into constraints on the values of individual fields, and a rule base consisting of cross-field constraints, i.e. constraints on multiple fields. In the simplest situation, a constraint validates the input to a single row of a single field. For a currency amount, for example, lower and upper limits may be stated. For a string, a regular expression may be used to define the set of admissible strings. However, input validation is more difficult when the values of several data items have to be checked simultaneously.
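The single-field case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the limits, the email pattern, and the function names are hypothetical and are not taken from mgm's data dictionary.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical single-field constraints; the limits and the pattern are
# illustrative assumptions, not values from a real data dictionary.
AMOUNT_MIN = Decimal("0.00")
AMOUNT_MAX = Decimal("99999.99")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def valid_amount(text: str) -> bool:
    """Accept a currency amount between the lower and the upper limit."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False  # not a number at all
    # Reject NaN and infinities before comparing against the limits.
    return value.is_finite() and AMOUNT_MIN <= value <= AMOUNT_MAX


def valid_email(text: str) -> bool:
    """Accept a string only if the whole input matches the regular expression."""
    return EMAIL_PATTERN.fullmatch(text) is not None
```

Each check looks at exactly one value, which is what makes this kind of constraint easy compared with the cross-field rules discussed next.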

Do we need automated test data generation at all?

This is an almost heretical question, but we shall address it seriously here. Without doubt, form-centric applications, like all other software applications, require some software quality assurance (QA) measures. QA uses a mixture of various techniques, including dynamic methods consisting of manual and repeated automated testing.

Manual testing

Each of our form-based web applications is manually tested well before the final deployment. Many of these tests are positive “test-to-pass” tests, meant to demonstrate that the application functions as specified. Some negative “test-to-fail” tests are also included, each of which provokes a single specific error condition.

The testing of large form-centric applications with their many forms, form instances, and fields with potentially multiple rows is rather tedious, and cannot achieve a high test-case coverage! Also, when it comes to negative tests, it is rather difficult for a human tester to make up a test case that specifically violates a given validation rule. Provoking a specific combination of several error conditions and no other is close to impossible for a human, due to the problem’s high complexity.

Because of the high labour cost involved, manual tests are usually carried out only once towards the end of each release cycle. However, it is very desirable to assess the code quality quasi-continuously in parallel to the development process. This can only be accomplished by automated tests.

Automated testing

An automated end-to-end regression test is performed for each Web application after its deployment in the staging system. Such deployments take place once per week on average, with an increasing frequency towards the end of the release cycle. In addition to functional tests, load tests are also regularly performed.

These kinds of automated tests are positive tests that require valid test data. It should be clear by now that for a form-centric application, such test data can hardly be created manually, particularly since the rule base changes regularly. Any modification of the rule base requires a rapid response, and a manual process for creating and maintaining a database of test data would be too tedious, error-prone, slow, and therefore costly. Thus some means of automatically generating test data is definitely needed.

Examples of Cross-Field Validation Constraints

Consider a simple example where we have two fields from an address block, one for the given name and one for the name. The name can relate to a natural person or to a company. Therefore, a name can occur without a given name, whereas a given name cannot occur on its own. The corresponding condition, a “value presence check”, can be encoded in a simple logical formula:

if present(given_name) then present(name)
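As a minimal sketch, the presence check can be written as a Python predicate. It assumes that a form record is a dictionary mapping field names to strings, and that a missing or blank value counts as "not present"; both assumptions, like the function names, are illustrative only.

```python
def present(value) -> bool:
    """A field counts as present if it holds a non-blank string (an assumption)."""
    return value is not None and value.strip() != ""


def given_name_rule(record: dict) -> bool:
    """Encodes: if present(given_name) then present(name)."""
    # An implication "A then B" is logically equivalent to "(not A) or B".
    return (not present(record.get("given_name"))) or present(record.get("name"))
```

A company record such as `{"name": "ACME Ltd"}` passes the rule, whereas `{"given_name": "Ada"}` alone violates it.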

Another common case for a validation rule spanning several data items concerns the values in all non-empty rows of a single field, expense, and relates them to the value (in the first row) of another field, total_expenses:

total_expenses = expense.1 + expense.2 + expense.3

or equivalently

total_expenses = sum(expense.i)

where the upper limit of the running row index i has to be suitably specified somewhere.
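The sum rule can be sketched in the same way. The snippet below assumes that the rows of the expense field arrive as a list in which empty rows are represented by `None`; this representation and the function name are hypothetical.

```python
from decimal import Decimal


def total_rule(total_expenses: Decimal, expense_rows: list) -> bool:
    """Encodes: total_expenses = sum(expense.i) over all non-empty rows."""
    # Rows left empty by the user are represented as None (an assumption).
    filled = [row for row in expense_rows if row is not None]
    # Start the sum from a Decimal zero so that no float arithmetic is involved.
    return total_expenses == sum(filled, Decimal("0"))
```

Here the upper limit of the row index is implicit in the length of the list, which is one way of "suitably specifying" it.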

Validation Rule Bases

The collection of validation constraints, supplemented with auxiliary information, is referred to as a validation rule base. To give you an indication of the potential size of such collections, the rule base of our biggest project contains 900 basic validation rules. Such rules may already be concrete and thus directly applicable to the data items. However, quite often they are still abstract in the sense that they contain some patterns that need to be concretized.

Let us look at an example again: suppose there are three fields total_expenses, expense_a, and expense_b, and a concrete rule relating three fields, whose constraint states for row 1 of these fields

total_expenses.1 = expense_a.1 + expense_b.1

The same concrete rule for row 2 of those fields would read

total_expenses.2 = expense_a.2 + expense_b.2

If the fields allowed more rows, more concrete, directly applicable rules would be needed. Including all of these in a rule base is rarely advisable, because the resulting rule base would become rather large and difficult to maintain. Instead, all the concrete rules above may be summarized succinctly in a single abstract rule using a wildcard pattern:

total_expenses.i = expense_a.i + expense_b.i

This shorthand means: the rule constraint should be applied for each admissible row index of the relevant fields. It is this abstract form of rule constraints that is usually present in rule bases, because it facilitates building and maintaining them when multiple form instances or multiple field rows play a role. Under these circumstances, going from abstract rules with index patterns to concrete rules, where fields are decorated with specific form and/or row indices, may cause the number of rules to grow substantially. Smaller form-based web applications typically contain about 50 different fields, but many of these potentially have 1000+ rows each.
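Going from an abstract rule to its concrete instances can be illustrated with a small text substitution. This is a sketch only: it assumes that rules are stored as plain strings and that the wildcard is always written as `.i`, neither of which is confirmed for the actual rule base format.

```python
import re

# Matches the row wildcard ".i" directly after a field name, and only when
# "i" stands alone (so a field named "expense.income" would not be touched).
ROW_WILDCARD = re.compile(r"(?<=\w)\.i\b")


def concretize(abstract_rule: str, max_row: int) -> list:
    """Return one concrete rule per admissible row index 1..max_row."""
    return [
        ROW_WILDCARD.sub(f".{row}", abstract_rule)
        for row in range(1, max_row + 1)
    ]


concrete_rules = concretize("total_expenses.i = expense_a.i + expense_b.i", 2)
```

With `max_row=2` this reproduces the two concrete rules shown earlier. The number of concrete rules grows linearly with the number of admissible rows, which is why the abstract form is the one worth maintaining.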

As we shall discuss in a later blog in this series, a rule-based test data generator reads the constraints associated with the rules in the rule base. It then attempts to find suitable data sets that are compatible with these constraints. What makes this a daunting exercise is the sheer number of constraints and their inherent complexity. It is therefore vitally important — even in the Teraflops age — to break down the initial problem into a collection of smaller ones. This is what we look into next.

Decomposing a Validation Rule Base

We might potentially gain a lot, if we could apply the well-known “divide and conquer” strategy to the whole data generation problem. In other words, if we could only decompose the data generation problem into a series of independent problems, we could then solve each of these separately, and subsequently combine solutions of the sub-problems to obtain a solution of the full problem.

Fortunately, “divide and conquer” works for form-based applications. Experience shows that, in each of these cases, the rule base turns out to be decomposable into more manageable independent connected components (see glossary of graph theory), each consisting of interrelated (potentially decorated) fields and rules.

Consider the example above: the two fields name and given_name together with their validation rule make up component #1, whereas the three fields expense_a, expense_b and total_expenses together with their validation rule make up component #2.
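Why the decomposition pays off can be sketched with a deliberately naive generate-and-test search. This is not how the R-TDG works internally; the candidate domains, the lambda constraints, and the function names are all assumptions made for illustration.

```python
from itertools import product


def solve_component(domains: dict, constraints: list):
    """Return the first assignment that satisfies every constraint, or None."""
    fields = list(domains)
    for values in product(*(domains[field] for field in fields)):
        candidate = dict(zip(fields, values))
        if all(check(candidate) for check in constraints):
            return candidate
    return None


def solve_all(components: list):
    """Solve independent components separately and merge the partial solutions."""
    solution = {}
    for domains, constraints in components:
        part = solve_component(domains, constraints)
        if part is None:
            return None  # one unsatisfiable component rules out any full solution
        solution.update(part)
    return solution


# Component #1: presence rule on given_name and name (hypothetical domains).
component_1 = (
    {"given_name": ["Ada", ""], "name": ["", "Lovelace"]},
    [lambda r: r["given_name"] == "" or r["name"] != ""],
)
# Component #2: sum rule on the three expense fields (hypothetical domains).
component_2 = (
    {"expense_a": [10, 20], "expense_b": [5], "total_expenses": [25, 15]},
    [lambda r: r["total_expenses"] == r["expense_a"] + r["expense_b"]],
)
test_data = solve_all([component_1, component_2])
```

Each component has four candidate assignments here, so the separate searches visit at most 4 + 4 = 8 candidates, whereas a search over all five fields at once would face 4 × 4 = 16. For independent components the search spaces add up instead of multiplying, and the gap widens quickly as the number of components grows.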

The relations between the fields can succinctly be represented in the following so-called primal constraint graph:

Primal constraint graph with two independent connected components in a simple rule base.

The primal constraint graph has two independent connected components in our simple rule base. Component 1 consists of a clique of two (undecorated) fields (given_name and name), with a single concrete rule for the validation of their values. Component 2 consists of a clique of three (undecorated) fields (expense_a, expense_b and total_expenses), with a related concrete validation rule.
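Finding the independent connected components is a standard graph problem. The following sketch uses a union-find structure and assumes that each rule is given simply as the collection of field names it mentions; the rule names and the data layout are hypothetical.

```python
def connected_components(rules: dict) -> list:
    """Group fields that are linked, directly or transitively, by shared rules."""
    parent = {}

    def find(field):
        # Register unknown fields as their own root, then walk up to the root.
        parent.setdefault(field, field)
        while parent[field] != field:
            parent[field] = parent[parent[field]]  # path halving
            field = parent[field]
        return field

    for fields in rules.values():
        fields = list(fields)
        if not fields:
            continue
        root = find(fields[0])
        for other in fields[1:]:
            parent[find(other)] = root  # merge the two groups

    groups = {}
    for field in list(parent):
        groups.setdefault(find(field), set()).add(field)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=sorted)


rules = {
    "presence": ("given_name", "name"),
    "sum": ("total_expenses", "expense_a", "expense_b"),
}
components = connected_components(rules)
```

For the simple rule base above, this yields the same two components as the figure: one with the two name fields and one with the three expense fields.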

A real-world primal constraint graph derived from two rules — one with 10 fields, the other one with 8 fields — is shown in the following figure. Both rules share a common field. Each rule entails a so-called clique in the constraint graph, i.e. a fully connected sub-graph of nodes:

Primal constraint graph of a small component of the rule base.

Each node in this graph represents a (possibly decorated) field. Two nodes are connected by an edge if their (possibly decorated) fields occur in the same rule. The graph depicts two rules, one with 10 fields (left), the other one with 8 fields (right). One of the fields is shared by both rules (center).
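The construction rule for the edges can be sketched as follows. The field names are placeholders, since the actual fields behind the figure are not given; only the sizes of the two rules (10 and 8 fields, one of them shared) are taken from the text.

```python
from itertools import combinations


def primal_edges(rules: dict) -> set:
    """Connect every pair of fields that occur together in some rule."""
    edges = set()
    for fields in rules.values():
        # Each rule contributes a clique over its distinct fields.
        for a, b in combinations(sorted(set(fields)), 2):
            edges.add((a, b))
    return edges


# Placeholder field names: 9 + 1 shared on the left, 7 + 1 shared on the right.
left_rule = [f"left_{i}" for i in range(9)] + ["shared"]
right_rule = [f"right_{i}" for i in range(7)] + ["shared"]

edges = primal_edges({"rule_left": left_rule, "rule_right": right_rule})
nodes = {field for edge in edges for field in edge}
```

This gives 17 nodes and 45 + 28 = 73 edges: a clique of n fields has n(n-1)/2 edges, and the two cliques overlap in a single node and therefore share no edge.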

For a practical example, consider our largest web-forms project: its largest independent connected components of the constraint graph are depicted in the next figure. There are about 240 components in total, provided we restrict ourselves to a single instance of each form, and a single row per field. The largest of these components comprises more than 200 (undecorated) fields, and about as many concrete rules.

Collection of the 32 largest of about 240 independent connected components of the primal constraint graph, derived from the rule base of our largest web-based forms application. Only one instance of each form and one row per field are considered here.

mgm’s Rule-Based Test Data Generator (R-TDG)

The quality assurance division of mgm technology partners is increasingly exploiting an elaborate automated process for generating a considerable amount of test data. This data provides nearly complete coverage of input variations for “form-centric” software applications. For the purpose of validating the input to web forms, mgm tp has implemented a sophisticated rule-based system, which will be the topic of another blog article.

Throughout the past two years, we have been busy designing and implementing an innovative software solution that automatically generates valid test data for those applications that rely on automated input validation. The software solution has acquired the name Rule-Based Test Data Generator (R-TDG), because it uses the very same rules that are employed for input validation.

Currently, the prime purpose of the R-TDG is to produce valid test data for so-called positive tests of web applications. However, it can also be used for assessing the quality of a given validation rule base. For example, the R-TDG can assert the consistency (i.e. absence of contradictions) of a given rule base and detect redundancies in it. Surprisingly often, the R-TDG has proven helpful in identifying incompleteness (i.e. a logical gap) in a given rule base. The R-TDG is currently being extended to produce invalid test data sets for use in so-called negative tests.

In the following blogs in this series, I will discuss the features of our rule-based test data generator R-TDG in more detail, the difference between positive and negative tests, and details about how the test data may actually be generated by solving the constraint satisfaction problem associated with a given rule base.