
ATLS

Automatic Translation of First-order Predicate Logic to SQL


Ana-Maria Carcalete
Balarabe Ogbeha

Sebastian Delekta
Yubin Ng
Arjun Saraf
William Smith

Supervisor: Dr Fariba Sadri

Executive Summary
ATLS (pronounced "atlas") is a system for automated translation of first-order predicate logic
queries into Structured Query Language (SQL). The system consists of two components: the command-line translator and the web user interface.
At Imperial College, first-year students are taught logic in the first term and databases, with a focus
on the relational model, in the second. Predicate logic is closer to the relational
model than SQL, so students who have no prior experience with SQL will be able to use ATLS.
People who prefer logic, such as academics, may also choose this system over direct SQL.
To use the system, users connect to a database of their choice. Using a
simple mapping between predicate logic and SQL, and guided by the database schema displayed on
the web interface, they can construct predicate formulas, translate them, and run them on the
connected database.
SQL is the most widely used database querying language; however, queries can often be expressed more succinctly and intuitively in first-order (predicate) logic. In practice,
most Relational Database Management Systems (RDBMSs) implement some variant of the SQL standard,
which means that SQL queries written for different RDBMSs will differ subtly. This is a good
reason to reason about databases in predicate logic: it lets users express queries without
subjecting themselves to the nuances of the specific SQL implementation in use in the RDBMS in
question. Users can then worry only about the correctness of the logical formula, not about the
correctness of the SQL query with regard to a particular RDBMS. ATLS thus forms a bridge between
these two paradigms by allowing users to convert their abstracted predicate formulas into runnable
SQL queries.

Contents

Executive Summary

1 Introduction 3
  1.1 Motivation 3
  1.2 Objectives 3
  1.3 Achievements 4

2 Translation from Predicate Logic to SQL 4
  2.1 Background Research 4
    2.1.1 Datalog 4
    2.1.2 Tuple Relational Calculus 4
  2.2 Syntax 5
  2.3 General interpretation of formulas 6
  2.4 Quantification 7
    2.4.1 Existential quantifier 7
    2.4.2 Universal quantifier 8
  2.5 Negation 9
  2.6 Comparison operators 10
  2.7 Binary connectives 11

3 Design and Implementation 12
  3.1 Choice of tools 12
  3.2 Architecture overview 13
  3.3 Front-end 13
    3.3.1 Design 13
    3.3.2 Implementation 15
    3.3.3 Middleware 16
  3.4 Translation 18
  3.5 Technical challenges 20
  3.6 Risk management 20

4 Project Management 21
  4.1 Design methodology 21
  4.2 Tools 21
  4.3 Team organization 23

5 Evaluation 23
  5.1 Testing and translation correctness 23
  5.2 Usability testing 24
  5.3 Performance Evaluation 26
  5.4 Project management 26
  5.5 Evaluating deliverables 26

6 Conclusion and Future Extensions 27
  6.1 Conclusion 27
  6.2 Front-end 27
  6.3 Translation 28

A Appendices 29
  A.1 Hallway testing 29
  A.2 Sample Translations 29

References 31
1 Introduction

1.1 Motivation

Predicate logic and databases are among the fundamental topics in the study of computer science. Some ideas
about how the former can be used to discuss the latter are straightforward: predicate logic, with appropriately
defined predicates, allows us to express set membership, term equality, or truth values, all of which are
important theoretical ideas underpinning relational databases.
The goal of the project was to implement a tool that allows a user to manage a database using the language of
predicate logic rather than SQL, the de facto standard for relational database management systems. A database
user may find it more intuitive to express a database concept using the language of logic
rather than SQL; e.g., if they are unfamiliar with joins, they may prefer to write queries such as

∃e (e ∈ employee ∧ e.name = n ∧ ∃p (p ∈ project ∧ p.manager = e))

rather than

... FROM employee JOIN project ON employee.id = project.manager.

Three main user groups that can benefit from the proposed tool have been identified:
Academics, who may prefer to query a database using predicate logic simply by reason of their specialization; indeed, our supervisor, a logic lecturer, has suggested that she often forgets the exact SQL rules
needed to retrieve the information she needs.
Students, who may want to verify their familiarity with either predicate logic or SQL, using the other as a
benchmark. It may also be that they are unfamiliar with SQL but want to use a relational database; this
would be the case for first-year students of Computing at Imperial College, who are familiarized with
predicate logic in their first term but do not see SQL until the second.
Working professionals with knowledge of predicate logic, who prefer it over SQL
for database management.

1.2 Objectives

The objectives that we were looking to fulfil can be grouped under three main headings.
Functionality. The tool should provide extensive support for database querying using correspondences
between predicate logic and SQL that are as intuitive as possible. The resulting code should be correct
and free of redundancies, and use as many features of SQL as necessary to make the translation natural.
Portability. The tool should support users on different platforms and web browsers, and produce translations
that can be run on various relational databases.
Ease of use. The tool should be quick to start with, without the need for extensive explanations or going
through tutorials. If required, the installation should be quick and intuitive.
After considering some of the possible design choices (see Section 3), a prioritized list of features was developed:
1. Basic command-line compiler supporting conjunction, quantification, and predicates
2. A graphical user interface
3. Extended compiler supporting joins, negation, NULL values
4. Support for multiple database standards
5. Support for non-logic operations, e.g. sorting
6. SQL query optimizer

7. Support for write operations

1.3 Achievements

The outcome of the project is a web application that allows the user to connect to a database of their choice and
enter a read-only query provided in the language of first-order predicate logic, which is translated to SQL and
executed.
The features include:
A full compiler from predicate logic to SQL, complete with a syntax checker and semantic analyzer, capable of recognizing quantification, negation, joins, and nested subqueries
A portable web interface, working with the major browsers, including a text editor with syntax highlighting and autocompletion
Support for connecting to and querying multiple commercially available relational databases,
including MySQL, SQL Server, Oracle, and Postgres
Offline support for locally hosted databases

2 Translation from Predicate Logic to SQL

2.1 Background Research

2.1.1 Datalog

Datalog is a logic-based programming language that is a subset of Prolog. It is based on first-order logic and, as such,
most first-order logic can be easily converted to Datalog. Once a predicate is in Datalog, conversion to
SQL is relatively easy.
The power of Datalog, or rather deductive databases in general, comes from the fact that deductions can be
made using rules and facts, for example by writing
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).

we can query for all ancestor-descendant pairs with


?- ancestor(A,B).
A = john, B = ann;
% ...

the two-step case of which might correspond (with appropriate attributes added) to the SQL


SELECT parents1.parent, parents2.child
FROM parents AS parents1, parents AS parents2
WHERE parents1.child = parents2.parent

Clearly the expressive power of deduction is far superior in Datalog, not least because rules may be applied recursively. It would be desirable to add the ability for
the user to define their own predicates in this way; however, there is no way of doing this in SQL with read-only
operations. This approach will therefore not be pursued; nonetheless, Datalog provides some hints about how the
queries can be structured and interpreted.
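This correspondence can be checked directly; the following is a minimal sqlite3 sketch (not ATLS output) that runs the self-join above against an invented parents(parent, child) table:

```python
import sqlite3

# Invented sample data: john is a parent of ann, ann of bob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parents (parent TEXT, child TEXT)")
conn.executemany("INSERT INTO parents VALUES (?, ?)",
                 [("john", "ann"), ("ann", "bob")])

# One deduction step: pairs (X, Z) with parent(X, Y) and parent(Y, Z) for some Y.
rows = conn.execute("""
    SELECT parents1.parent, parents2.child
    FROM parents AS parents1, parents AS parents2
    WHERE parents1.child = parents2.parent
""").fetchall()
print(rows)  # [('john', 'bob')]
```

Note that each further ancestor level would require another self-join, whereas the Datalog rule covers all levels at once.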
2.1.2 Tuple Relational Calculus

Tuple Relational Calculus (TRC) is a declarative query language. It constructs queries by stating logical
conditions which all tuples in the result must satisfy.

In TRC, range relations correspond to database tables. There are also tuple variables, which range over the
relations (these correspond to rows in a database table). A general TRC query is of the form:

{t | F(t)},  (1)

where t = (t₁, …, tₘ) is a tuple of variables (or a set thereof) and F(t) is a formula of first-order logic. The
query (1) will return the set of all tuples t such that F(t) evaluates to ⊤.
This information can be used to construct some simple queries; for example, the SQL code
SELECT name, course
FROM students
WHERE age >= 20 AND course = 'Computing'

could be written in TRC as

{s.name, s.course | s ∈ students ∧ s.age ≥ 20 ∧ s.course = "Computing"}.  (2)

The result is the set of tuples T = {τ₁, …} with τᵢ = (nameᵢ, courseᵢ) such that the condition F(τᵢ) ≡ ⊤.
Clearly, more complex conditions F can be supported using the full range of predicate logic syntax, e.g.:

{s.name | s ∈ students ∧ ∀s′(s′ ∈ students → s′.age ≤ s.age)}.  (3)

The theoretical ideas of query interpretation in TRC play a crucial role in the development of our translation
framework, presented in the subsequent sections.
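The equivalence between the TRC query (2) and its SQL counterpart can be illustrated with a small sqlite3 sketch; the students table and its rows are invented for the example:

```python
import sqlite3

# Invented students(name, course, age) relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, course TEXT, age INT)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [("Ada", "Computing", 21),
                  ("Ben", "Computing", 19),
                  ("Cy", "Maths", 25)])

# Every returned tuple satisfies the condition F of the TRC query.
rows = conn.execute("""
    SELECT name, course
    FROM students
    WHERE age >= 20 AND course = 'Computing'
""").fetchall()
print(rows)  # [('Ada', 'Computing')]
```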

2.2 Syntax

The starting point for the formalization of the logic syntax accepted for translation was the usual syntax of
formulae of first-order predicate logic, as found e.g. in [2]. The rules are restricted by specifying a
precise syntax for variables and constants (where constants are decimal numbers or strings in double quotes).
They are also extended by replacing every occurrence of a relation or function symbol in predicate logic with
either a mathematical expression ⟨math-expr⟩ or a predicate expression ⟨predicate-expr⟩ consisting of relation,
attribute, and argument variables. We will also write Qx₁, …, xₙ to abbreviate Qx₁(… Qxₙ(…) …), where
Q ∈ {∃, ∀}.
A partial BNF grammar for those points of interest is:

⟨variable⟩ ::= (non-empty string of alphanumeric characters and _, not starting with _)

⟨var-list⟩ ::= ⟨variable⟩ | ⟨variable⟩ , ⟨var-list⟩

⟨constant⟩ ::= ⟨numeric-constant⟩ | ⟨string-constant⟩

⟨math-bin-op⟩ ::= < | ≤ | > | ≥ | = | ≠

⟨math-var⟩ ::= ⟨numeric-constant⟩ | ⟨variable⟩

⟨math-expr⟩ ::= ⟨math-var⟩ ⟨math-bin-op⟩ ⟨math-var⟩

⟨predicate-expr⟩ ::= ⟨relation⟩ . ⟨attribute⟩ ( ⟨variable⟩ , ⟨var-list⟩ )

⟨quantified-expr⟩ ::= ⟨quantifier⟩ ⟨var-list⟩ ⟨expression⟩

Note that the definition of ⟨predicate-expr⟩ and ⟨var-list⟩ forces the arity of the predicate symbol to be at least 2;
this requirement will resurface soon.

In the examples that follow, consider a simple university administration database consisting of the following
relations:

employee (eid, name, address, department, salary, dob)
project (pid, title, manager, cost)

2.3 General interpretation of formulas

Within Structured Query Language (SQL), originally based on relational algebra and Tuple Relational Calculus,
two distinct languages can be outlined [7]. One is the Data Definition Language (DDL), used for defining data
structures (in particular database schemas) and consisting of such SQL keywords as CREATE, DROP or ALTER.
The other is the Data Manipulation Language (DML), which uses SELECT, INSERT, or DELETE among others. For
the purposes of our project, we have decided to restrict the scope to read-only logical queries that translate
to SQL code of the form SELECT ... FROM .... However, given an appropriately defined extension of the
language of logic accepted for translation (most likely by way of special keyword predicates), support for
database modification could also be achieved.
The way the desired implementation should interpret logic formulas stems from the rules of the tuple relational
calculus: given a logical query, we want to retrieve all tuples that fulfill it. This is also the interpretation in
the target language, SQL. A TRC query to retrieve names and addresses of all employees in the Department of
Computing who earn more than 50,000 will be written as follows.

{e.name, e.address | e ∈ employee ∧ e.department = "Computing" ∧ e.salary > 50000}

An example of a translation might be:
SELECT name, address
FROM employee
WHERE department = 'Computing' AND salary > 50000
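As a sanity check, this translation can be executed with Python's sqlite3 module; the schema follows the employee relation above, and the sample rows are invented:

```python
import sqlite3

# In-memory copy of the example employee relation with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee
    (eid INTEGER PRIMARY KEY, name TEXT, address TEXT,
     department TEXT, salary INT, dob TEXT)""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, "John Doe", "London", "Computing", 60000, "1980-01-01"),
                  (2, "Jane Roe", "London", "Computing", 45000, "1985-05-05"),
                  (3, "Max Poe", "Leeds", "Physics", 70000, "1979-03-03")])

rows = conn.execute("""
    SELECT name, address
    FROM employee
    WHERE department = 'Computing' AND salary > 50000
""").fetchall()
print(rows)  # [('John Doe', 'London')]
```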

Two issues arise when shifting from a TRC query to logic. Firstly, a TRC input is composed of two different
expressions. One is the tuple variable T = {e.name, e.address} of attributes in the desired output, which
corresponds to the SELECT operator in SQL. The other is F = e ∈ employee ∧ …, the condition formula,
corresponding to FROM and WHERE. To preserve the rules of predicate logic as far as we can, we would prefer
the user to worry about one kind of input only. Therefore, we have decided that the output tuple is automatically
instantiated to be the set of types of all free variables in the query, fv(F). Note that this implies that the order
of attributes in the SELECT statement cannot be guaranteed. Also, as the SQL language does not allow empty
SELECT statements, we additionally require that fv(F) ≠ ∅.
This decision requires that the types of variables can be inferred from the query. In the example above, the
inference is by an explicit set operator in e ∈ employee (to find out the table from which a tuple comes)
and dot dereference, as in e.salary. However, predicate logic has no well-devised way of dealing with set
inclusion or ordering. Hence we will need our own interpretation.
One way of resolving this introduces an explicit predicate _from_relation, which would accept a tuple
variable and a relation name and check ownership, e.g. _from_relation(e, employee). However, as we don't
have a dot dereference operator in logic, to extract an attribute from a tuple e we will need another predicate,
say _get(e, name, n), that unifies a fresh variable n with the attribute name from tuple e.
However, the constructed translation is due to work on datasets that already provide a convenient interpretation
of set membership: keys. Consider a relation r = (k, a₁, a₂, …) with a single-column primary key k and
non-key attributes aᵢ, i = 1, 2, …. We express the fact that a variable ν is the value of attribute aᵢ by writing:

r.aᵢ(κ, ν),  (4)

where κ is the key variable. Thus, if we want to instantiate s to be the salary attribute of a tuple whose key is
κ, we write

employee.salary(κ, s).

By default, the first position in a predicate's tuple argument is occupied by the key, hence κ is instantiated
automatically and can be used in any context where we need to refer to the tuple's key. However, we still need
a way of instantiating κ when we don't use any other attributes (such as salary). As a key is a perfectly valid
attribute, we are allowed to use expression (4), writing r.k(κ, κ), or employee.eid(κ, κ), where eid is the
name of the key in the table. This is an acceptable solution, but to eliminate the redundancy and the need to
type out the name of the key, we permit shortening r.k(κ, κ) to just r(κ).¹
The utility of this approach is more evident in the case of keys composed of multiple columns, k = (k₁, …, kₙ).
A request for an attribute is now of the form r.aᵢ(κ₁, …, κₙ, ν). Note that each of the n key variables κᵢ is thus
instantiated. Using the knowledge of the database's schema, we can decide whether the arity of the predicate is
correct: for a relation with an n-column primary key, the arity must be n + 1. As in the n = 1 case above, the
keys can be instantiated by themselves, without the need to refer to any other attribute. This is done either via
n expressions of the form r.kᵢ(κ₁, …, κₙ, κᵢ) for all i, which treats the kᵢ as regular attributes, or more concisely
by r(κ₁, …, κₙ).
We allow replacing any variable that is a predicate argument with a constant. Writing r.a(…, C, …), where C
is a key or attribute constant, is semantically equivalent to r.a(…, ν, …) ∧ ν = C, where ν is a fresh variable
not occurring anywhere else in the query, with the restriction that ν is considered bound.

2.4 Quantification

2.4.1 Existential quantifier

Consider the case of an existential quantification W = ∃x P(x) at the topmost level (i.e., not in a nested formula).
According to the rules borrowed from TRC, the query returns the set of all tuples (with attributes restricted to the
types of fv(W)) such that the query evaluates to true. The variable x is bound and x ∉ fv(W),
so its type will not be returned.
To request the names of all employees, we might call:

∃x(employee.name(x, n))  (5)

Note fv(W) = {n}, whose type is employee.name; this is the only attribute that can be returned. Consider
all possible substitutions of the variable n. If an employee John Doe exists in the relation, then

W[n ↦ "John Doe"] = ∃x(employee.name(x, "John Doe")) ≡ ⊤

with the primary key of the tuple substituted for x. Hence the tuple containing John Doe, restricted to
employee.name, will be returned. A translation to SQL is

SELECT name
FROM employee

The role of ∃ in binding the variables is clear. However, it is partially redundant for just ensuring the existence
of a tuple. After all, the query

employee.name(x, n),  (6)

with fv(W) = {x, n}, also has a truth/falsity interpretation for a particular substitution. E.g.,

W[x ↦ 16f5a3, n ↦ "Jane Roe"] = employee.name(16f5a3, "Jane Roe") ≡ ⊤

if there is a tuple with primary key 16f5a3 and name attribute Jane Roe, and ⊥ otherwise.
The only difference between (5) and (6) is the variable binding: the latter has the key variable x free and
therefore will be translated as

SELECT eid, name
FROM employee

¹This demonstrates the requirement that a query on a database can only exist in a model where the schema of every relation
mentioned in the query is given; otherwise the key name would remain unknown.

Consider now requesting the name and salary of every employee. As the name and salary must come from the
same tuple every time, we have to use the same instantiation of the primary key variable. An example query is

∃x(employee.name(x, n) ∧ employee.salary(x, s)).  (7)

The free variables n and s, whose types evaluate to employee.name and employee.salary, are bound to the same
tuple identified uniquely by x, the primary key variable. For a particular substitution of x there is at most one
possible substitution ⟨n, s⟩ ↦ … for which the query evaluates to true, and those values will be returned. The
translation is

SELECT name, salary
FROM employee

The existential quantification can also occur in a nested formula. We have up to now considered simple query
conditions: name equal to John Doe, or salary greater than 50,000. The existence of some other tuple may also
be such a condition. Suppose we wish to find all dates of birth shared by some employees. This is expressed as

∃x(employee.dob(x, d) ∧ ∃y(employee.dob(y, d) ∧ x ≠ y))  (8)

The nested quantification should be read as "...and there exists a different employee whose birthday is the
same". That the birthday is the same is expressed by the fact that the first occurrence of the variable d is unified
with the nested occurrence, hence a single substitution replaces both. We also need x ≠ y to prevent trivial pairs
of employees with themselves from being considered. Such a nested quantification is most readily expressed
by a nested SELECT:

SELECT employee1.dob
FROM employee AS employee1
WHERE (EXISTS (SELECT employee2.eid,
                      employee2.dob
               FROM employee AS employee2
               WHERE employee1.eid <> employee2.eid
                 AND employee2.dob = employee1.dob))

Note that binding issues dictate that y is not returned. Also, as we are running a query that depends on the result
of a different query run on the same table, the tables have to be aliased.
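A sqlite3 sketch of this translation (sample rows invented) shows its behaviour; note that, as the translation contains no DISTINCT, a shared date of birth is returned once per employee that shares it:

```python
import sqlite3

# Invented rows: John and Jane share a date of birth.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, dob TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "John", "1980-01-01"),
                  (2, "Jane", "1980-01-01"),
                  (3, "Max", "1990-09-09")])

rows = conn.execute("""
    SELECT employee1.dob
    FROM employee AS employee1
    WHERE EXISTS (SELECT employee2.eid, employee2.dob
                  FROM employee AS employee2
                  WHERE employee1.eid <> employee2.eid
                    AND employee2.dob = employee1.dob)
""").fetchall()
print(rows)  # [('1980-01-01',), ('1980-01-01',)]
```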
A query with similar semantics and no nested quantifiers, which allows the variable y to be returned, can also
be constructed. The discussion is deferred to the section that concerns the JOIN operator.
It was remarked at the end of the previous section that variables as predicate arguments can be replaced with
constants. Thus, instead of writing

∃x(employee.name(x, n) ∧ n = "John Doe"),

we may write

∃x(employee.name(x, "John Doe")).

Note that these are not exactly equivalent, as the latter does not have n as a free variable, hence it will not be
returned. Indeed, it does not have n as a variable at all, which means it cannot be referred to in other contexts.
2.4.2 Universal quantifier

Consider a case of negated universal quantification W = ¬∀x P(x). This formula of predicate logic is equivalent
to ∃x ¬P(x). Consider its TRC representation for some set of tuples T = types(fv(W)):

{T | ∃x ¬P(x)}.  (9)

Call the domain of a TRC expression the set of all values that appear in the expression or in any tuple of any
relation occurring in the expression [5]. An expression is safe if its result can only come from its own domain.
The query (9) is therefore unsafe, because, as any query involving outermost negation, it can be satisfied by
an infinitely large answer set. Using non-negated universal quantification on the outermost level, W′ = ∀x P(x),
also produces outermost negation, the result being the complement of the result set of (9). Therefore we
consider any query involving outermost universal quantification to be ill-defined.
However, there is nothing in the way of using universal quantification in nested formulas. Consider the query:
find the names of employees all of whose projects were within the budget of 100,000. This might be expressed as:

∃x(employee.name(x, n) ∧ ∀y(project.manager(y, x) ∧ project.cost(y, c) → c ≤ 100000))  (10)

which is equivalent to

∃x(employee.name(x, n) ∧ ¬∃y(project.manager(y, x) ∧ project.cost(y, c) ∧ ¬(c ≤ 100000)))

where we will refer to the subformula beginning with ¬∃y as (*). It is not immediately clear how to treat (*).
This is discussed in the next section on negation.

2.5 Negation

Consider the (*) expression from the previous section:

¬∃y(project.manager(y, x) ∧ project.cost(y, c) ∧ ¬(c ≤ 100000))  (11)

A straightforward reasoning might suggest that, for every y, at least one of the three predicates project.manager(y, x),
project.cost(y, c), ¬(c ≤ 100000) has to be false to falsify the whole conjunction. However, it is unclear how
to represent in SQL the fact that ¬project.manager(y, x) while keeping it true that project.cost(y, c) and staying true to the desired semantics. We interpret the subformula negated in (11) as saying that there exists a project whose cost is
greater than 100,000. Indeed, we only really wish to negate the comparison c ≤ 100000, not the other two predicates,
which might impact the interpretation of variables, especially if they occur in other predicates in the outer
scope! This is equivalent to saying that the set containing all projects except those with cost less than or equal to
100,000 is not empty. We thus have a way of finding a condition for negating the parenthesized expression in
(11) without breaking unification. This is achieved with SQL's EXCEPT operator:

SELECT project.pid
FROM project
EXCEPT SELECT project.pid
FROM project
WHERE project.cost <= 100000

will select projects with cost greater than 100,000. We can now revisit query (10) and apply this idea to arrive
at:
SELECT employee1.name
FROM employee AS employee1
WHERE (NOT ((EXISTS
(SELECT project1.pid,
project1.cost
FROM project AS project1
WHERE project1.manager = employee1.eid
EXCEPT SELECT project1.pid,
project1.cost
FROM project AS project1
WHERE project1.cost <= 100000
AND project1.manager = employee1.eid))))
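The EXCEPT-based translation can be exercised with a sqlite3 sketch (sample rows invented): John manages a project over budget and is excluded. Note that an employee managing no projects at all also satisfies the universal quantification vacuously:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE project
    (pid INTEGER PRIMARY KEY, title TEXT, manager INT, cost INT)""")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "John"), (2, "Jane")])
conn.executemany("INSERT INTO project VALUES (?, ?, ?, ?)",
                 [(10, "A", 1, 90000),    # within budget
                  (11, "B", 1, 150000),   # over budget -> disqualifies John
                  (12, "C", 2, 80000)])   # within budget

rows = conn.execute("""
    SELECT employee1.name
    FROM employee AS employee1
    WHERE NOT EXISTS
        (SELECT project1.pid, project1.cost
         FROM project AS project1
         WHERE project1.manager = employee1.eid
         EXCEPT
         SELECT project1.pid, project1.cost
         FROM project AS project1
         WHERE project1.cost <= 100000
           AND project1.manager = employee1.eid)
""").fetchall()
print(rows)  # [('Jane',)]
```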

We further use negation to approximate SQL's treatment of NULL (Codd's ω). Although
there is no formally agreed interpretation of NULL, [4] notes that it is closest to "something that might be
present in the [Universe of Discourse], but we do not know its value at present". SQL implements a three-valued
logic, in which some comparisons involving NULL result in the value Unknown. We restrict our treatment to the
simple comparisons x IS (NOT) NULL, which ensures that the computation result is a standard Boolean.
A NULL comparison is made simply by negating the appropriate predicate with the ¬ operator. For example,

∃x, s(employee.name(x, n) ∧ ¬employee.salary(x, s)),

which returns the names of all employees whose salary is missing, translates to

SELECT employee.name
FROM employee
WHERE employee.salary IS NULL

Negations over mathematical operations are translated directly, e.g. ¬(c ≤ 100000) becomes

NOT (c <= 100000)
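A quick sqlite3 sketch of the IS NULL translation (invented rows; Python's None is stored as SQL NULL):

```python
import sqlite3

# Jane's salary is missing (None -> NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, salary INT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "John", 60000), (2, "Jane", None)])

rows = conn.execute("""
    SELECT employee.name
    FROM employee
    WHERE employee.salary IS NULL
""").fetchall()
print(rows)  # [('Jane',)]
```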

2.6 Comparison operators

The supported comparison operators are =, ≠, <, ≤, >, ≥. In a translation to an SQL query, these operators
affect the WHERE clause. Suppose we want to get the names of all employees in the Department of Computing. The query

∃x, d(employee.name(x, n) ∧ employee.department(x, d) ∧ d = "Computing")

will translate to

SELECT employee.name
FROM employee
WHERE employee.department = 'Computing'

Conversely, asking for the names of employees not in the Department will modify the query with … ∧ d ≠ "Computing",
and the SQL will read

-- as above
WHERE employee.department <> 'Computing'

Both of these can be expressed more succinctly as

… ∧ employee.department(x, "Computing")

and

… ∧ ¬employee.department(x, "Computing"),

which modifies the set of explicitly named variables in the query.


The operators can compare any combination of variables and constants. For example, to get
all names that at least two employees have, we may write

∃x, y, n₂(employee.name(x, n₁) ∧ employee.name(y, n₂) ∧ n₁ = n₂ ∧ x ≠ y),

where the comparison is between variables, and which will translate to

SELECT employee1.name
FROM employee AS employee1,
     employee AS employee2
WHERE employee1.name = employee2.name
  AND employee1.eid <> employee2.eid
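Again a sqlite3 sketch with invented rows (the full translator's output may differ in aliasing); the lack of DISTINCT in the translation means a duplicated name appears once per ordered pair of matching employees:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, "John"), (2, "John"), (3, "Jane")])

rows = conn.execute("""
    SELECT employee1.name
    FROM employee AS employee1,
         employee AS employee2
    WHERE employee1.name = employee2.name
      AND employee1.eid <> employee2.eid
""").fetchall()
print(sorted(rows))  # [('John',), ('John',)]
```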

The use of inequality operators is naturally restricted to values that typically support them in databases. The
compiler is not concerned with verifying this usage, so it is handled by a straightforward translation of
≤ to <=, and so on.

2.7 Binary connectives

Predicate logic allows ∧, ∨, →, ↔ as connectives between clauses.
Suppose we wish to find all possible pairs of employee names and project titles. The desired SQL is

SELECT employee.name,
       project.title
FROM employee,
     project

where the comma operator between the table names employee and project realizes the Cartesian product, ×, of relational algebra. This is most intuitively expressed as a conjunction of predicates over different relations:

∃x, y(employee.name(x, n) ∧ project.title(y, t)).  (12)

In modern SQL, CROSS JOIN is typically used instead of the comma [4]. Note that this query does not use the
keys of either relation for the join. More typically, we carry out a join on tables based on some condition. To
find the names of employees together with the projects they manage, we may write

∃x, y(employee.name(x, n) ∧ project.title(y, t) ∧ project.manager(y, x))  (13)

which now includes the condition that the project's manager attribute is equal to the key of the employee tuple. SQL
expresses it with an extra clause:

-- as above
WHERE project.manager = employee.eid

A detail of the implementation is that it consumes the predicates in the conjunction of query (13) from
left to right and considers them pairwise. Hence, when the portion employee.name(x, n) ∧ project.title(y, t) is
analyzed, the result is a simple Cartesian product, as there are no attributes or variables in common, just as in
(12). If we reorder query (13) to read

∃x, y(employee.name(x, n) ∧ project.manager(y, x) ∧ project.title(y, t))  (14)

it will first consider employee.name(x, n) ∧ project.manager(y, x), where the predicates have different attributes
but share a variable, x. This case will return an equi-join, where the join is based on equality constraints only:

SELECT employee.name,
       project.title
FROM employee
JOIN project ON (employee.eid = project.manager)

Even though the semantics of both queries is the same, the resulting translations differ, so we in effect violate
the well-established law that a ∧ b ≡ b ∧ a.
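The equi-join translation of query (14) can likewise be checked with a sqlite3 sketch (invented rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE project (pid INTEGER PRIMARY KEY, title TEXT, manager INT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "John"), (2, "Jane")])
conn.executemany("INSERT INTO project VALUES (?, ?, ?)", [(10, "ATLS", 2)])

# Only employees that actually manage a project survive the inner join.
rows = conn.execute("""
    SELECT employee.name, project.title
    FROM employee
    JOIN project ON employee.eid = project.manager
""").fetchall()
print(rows)  # [('Jane', 'ATLS')]
```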
Consider a conjunction of attributes a₁, a₂ in relations r₁ = (k₁, a₁, …), r₂ = (k₂, a₂, …) whose keys are
k₁, k₂, respectively:

r₁.a₁(v̄₁) ∧ r₂.a₂(v̄₂),  (15)

where v̄₁, v̄₂ are tuples of variables, key or non-key.
If v̄₁ ∩ v̄₂ = ∅, the attributes don't share variables, hence the join is unconstrained and will result in a
Cartesian cross join

FROM r1, r2

between the relations.
If some variable ν ∈ v̄₁ ∩ v̄₂, we need to join the entities that are the types of ν in the two attributes. For example, ν
might be a key variable in r₁, representing k₁, and a non-key variable in r₂, representing a₂, in which
case the join is expressed as

FROM r1 JOIN r2 ON r1.k1 = r2.a2

The interpretation thus depends on the position of ν in each of the tuples v̄₁, v̄₂. Note also that, once again,
we need full knowledge of a table's schema to carry out a translation, as otherwise the key
names are unknown.
The assumption so far was that r1 6= r2 in all cases. However, when r1 = r2 and a1 = a2 , we are potentially running a query on the same table twicehence the SQL code will need aliasing to resolve ambiguities.
Consider the request for all pairs of (distinct) employees that work in the same department.
∃d(employee.department(x, d) ∧ employee.department(y, d) ∧ x ≠ y)

(16)

A possible translation is
SELECT emp1.eid,
emp2.eid
FROM emp AS emp1
JOIN emp AS emp2 ON (emp1.dep = emp2.dep)
WHERE emp1.eid <> emp2.eid
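Alias generation of this kind can be sketched as a per-relation counter; the helper below is illustrative, not the ATLS implementation.

```python
def make_alias_allocator():
    """Return a function that hands out fresh aliases per relation name."""
    counters = {}
    def alias(relation):
        counters[relation] = counters.get(relation, 0) + 1
        return "%s%d" % (relation, counters[relation])
    return alias

alias = make_alias_allocator()
# Each occurrence of a predicate over the same relation gets a fresh alias,
# so emp appears as emp1 and emp2 in the generated FROM clause.
a1, a2 = alias("emp"), alias("emp")
print("FROM emp AS %s JOIN emp AS %s ON (%s.dep = %s.dep)" % (a1, a2, a1, a2))
```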

The other binary connectives, ∨, →, ↔, are not implemented directly. Instead we rely on the laws of predicate logic, whereby P ∨ Q ≡ ¬(¬P ∧ ¬Q), P → Q ≡ ¬P ∨ Q, and P ↔ Q ≡ (P → Q) ∧ (Q → P). The ∨
operator can be used in queries such as
∃d(employee.department(x, d) ∧ (d = 'Computing' ∨ d = 'Mathematics'))

(17)

which returns all employees from either Computing or Mathematics and translates to
SELECT employee.eid
FROM employee
WHERE (employee.department = 'Computing')
OR (employee.department = 'Mathematics')
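These rewriting laws are straightforward to apply over a tree representation. The sketch below uses an illustrative tuple-based encoding of formulas ("or", "imp", "iff", "not", "and" node tags), not the actual ATLS AST.

```python
def eliminate(f):
    """Rewrite ∨, →, ↔ into ¬/∧ form using the laws quoted above."""
    if not isinstance(f, tuple):
        return f  # atomic formula
    op, rest = f[0], [eliminate(a) for a in f[1:]]
    if op == "or":      # P ∨ Q ≡ ¬(¬P ∧ ¬Q)
        return ("not", ("and", ("not", rest[0]), ("not", rest[1])))
    if op == "imp":     # P → Q ≡ ¬P ∨ Q
        return eliminate(("or", ("not", rest[0]), rest[1]))
    if op == "iff":     # P ↔ Q ≡ (P → Q) ∧ (Q → P)
        return ("and", eliminate(("imp", rest[0], rest[1])),
                       eliminate(("imp", rest[1], rest[0])))
    return (op,) + tuple(rest)

print(eliminate(("imp", "P", "Q")))
```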

3 Design and Implementation

3.1 Choice of tools

The initial meeting with the supervisor helped us clarify the goals and settle on some of the questions crucial
to the success of the project. The developed tool was meant to support a users experience in database querying
and managementas there exists a plethora of tools and standards of doing this, the scope of the project
was possibly very large. The tool could be, for example, a wrapper around a command-line interface such as
psql used to query a database, or a plugin for available database management tools, such as MySQL or Toad.
However, as portability was one of our objectives, we did not want to commit to supporting one particular
standard or software piece. Hence we decided to develop a standalone tool.
The next decision was that between a native desktop application or a web interface. Desktop software would
be more appropriate for users who do not want to rely on an Internet connection for proper functionality. In the
present day, this is becoming an increasingly less valid concern. Also, none of us had extensive prior experience
with writing desktop applications, and it would likely consume some time to produce even a rudimentary
version. Conversely, we had prior experience with web applications, which allow excellent portability and easy
continuous deployment of revisions. Hence we decided on the latter.
The next step was deciding on the main implementation language. After a brief discussion, in which the requirement to build a compiler-like tool emerged, we narrowed the choices down to three: Haskell, Python,
and Java. All of us had prior experience with Java; however, we did not know how well it would support a web
application and were not satisfied with the development speed it provides. Haskell had parser and compiler
tools available, but our experience was limited to toy programs and we were not sure how that would impact
the velocity. Even though only one member of the team had prior experience with Python, it was generally
regarded as the easiest general purpose language to pick up, and provides excellent development speed and
support for web programming. We decided to use it for our tool, also as a consequence of learning interests of
team members.

3.2 Architecture overview

Figure 1: Overview of project architecture


Broadly, the project can be split into a user interface and a translator. The core of the tool, the translator, was
reduced to the problem of creating a compiler that takes predicate logic formulas as input and returns SQL
queries as output. Naturally, this task was split into creating a scanner, a parser and the actual translator. This
was different from creating a standard compiler due to the fact that semantic analysis of the formula required a
connection to the user's database in order to check the correctness of the relations, keys, and other details. To
make the tool usable, we decided on creating a user-friendly web interface that allows users to do the following:
- Connect to a database with the relevant details.
- View the database schema.
- Input predicate logic formulas (that can be built from the database schema).
- View the result of translating the query.
- View the result of running the translated query on the connected database.

3.3

Front-end

The user interface for this project is a key element. It provides access to the translator in a more user-friendly
manner than a command line interface. To make it usable by as many people as possible (and with minimum
effort in both development and usage) the user interface is a JavaScript web application.
3.3.1 Design

The first step in designing the user interface was to create a list of functional requirements, features that are
necessary for the interface. The initial list was created by discussing with our supervisor what was required.
The list of requirements evolved over time as designs materialised and usability was evaluated.
The first mocks of the layout of the user interface were done on paper and whiteboards to allow us to quickly
discuss and change them. These designs were then translated into basic HTML and CSS layouts, which exposed
potential problems that were not visible on paper, such as too much negative space or the page being too crowded.
The first implemented design2 was a very simple layout, consisting of two editors, a results panel, and a toolbar.
2 https://gitlab.doc.ic.ac.uk/g1436227/fopl-ui/commit/02567756

When a user accesses the interface, she will be prompted to enter the database connection settings. Once a
connection has been successfully established, she can enter the logic query into the top editor. Logical symbols
can be entered by hitting the buttons displayed at the top of the window. To facilitate editing queries, the symbols
are also found under the expected key shortcuts, e.g. Ctrl+Shift+6 will enter the ∧ symbol (with the lookalike
ASCII caret found under Shift+6). Two processing options are provided: SQL will translate the logical query
into SQL and display it in the bottom text window, and RUN will additionally run the query on the database and
display the results in the result tab to the right of the editors.

Figure 2: Initial design of the user interface


The colours were picked to match the editor's default theme. The next design3 had more spacing to give
better boundaries between the individual areas of the program. It also had adjustable sizes and a tab control to
hide some of the content when it was not needed. As noted in a meeting with our supervisor, the user needs the
ability to continuously refer to the schema of the database, which we somehow missed in the initial design.
Upon connecting to the database, the user now sees the schema, complete with the keys indicated and data
types displayed.

Figure 3: Second iteration of the UI design


3 https://gitlab.doc.ic.ac.uk/g1436227/fopl-ui/commit/e5389978

The final design4 is fully customizable, allowing the user to modify the layout according to their preferences.
This was the best way of laying out the application because everyone had their own opinion of what the
correct layout should be. It was inspired in part by JetBrains IntelliJ IDEA5 and Microsoft Visual Studio6, two popular
integrated development environments. We kept all the functionality of the previous iteration, adding some minor
improvements to the user experience, such as syntax highlighting or tracking the opening parenthesis when
constructing a query. Clicking on an attribute name in the schema will place it in the editor, and autocompletion
suggests the full names of tables, attributes, or previously used variables.

Figure 4: Final iteration of the user interface

3.3.2 Implementation

One of the first decisions made about the user interface was that it would be a web application. This meant
that if any real time user interaction was wanted then JavaScript would have to be involved, even if indirectly.
JavaScript is used to provide AJAX7 , a method of loading data asynchronously after the page has loaded, and
the rich syntax highlighted text editors. The JavaScript could have been abstracted to some other language such
as Dart8 or CoffeeScript9 , both of which compile to JavaScript, however the benefit of them was negated by the
simplicity of the application, the framework being used and an ES6 transpiler10 .
The application was developed using CommonJS11 as a method of splitting the code into separate modules.
This made encapsulation easier because modules could hide internal representations and just provide a public
interface. CommonJS was designed for server side JavaScript so a tool called browserify12 was used to compile
all the modules into a single file that could be used by browsers.
The framework used for the application was AngularJS.13 There are other similar frameworks available, such as Ember.js14 and Backbone.js15, but members of the group already had experience using Angular, giving
advantages in terms of speed of development and avoiding pitfalls. AngularJS provides a modular model-view-controller architecture that allows easy testing and strong bindings to HTML. Components in Angular are called
directives.16
4 https://gitlab.doc.ic.ac.uk/g1436227/fopl-ui/commit/master
5 https://www.jetbrains.com/idea/
6 http://www.visualstudio.com/
7 https://en.wikipedia.org/wiki/Ajax_%28programming%29
8 https://www.dartlang.org/
9 http://coffeescript.org/
10 https://github.com/thlorenz/es6ify
11 http://wiki.commonjs.org/wiki/Modules/1.0
12 http://browserify.org/
13 https://angularjs.org/
14 http://emberjs.com/
15 http://backbonejs.org/
16 https://docs.angularjs.org/guide/directive

The main controller links together shared data (such as the database settings and error list) and events from the
components. This allowed all the components to communicate without being directly coupled to each other. All
AJAX requests were handled through the Api service that wrapped the $http service provided by angular.
The first component is the database settings panel; it collects the connection settings from the user and allows
them to be tested. The settings and their validity are stored in the shared data in the main controller. The
database schema directive watches the settings; when they change it clears the current schema and, if they are
valid, loads the schema based on them. The schema is displayed in a tree control provided by the
angular-ui-tree17 package. When one of the attributes is clicked, it sends a signal to the main controller, which is
relayed to the editor, and the corresponding predicate is inserted. The editor and SQL viewer are small wrapper
directives around angular-ui-ace18 that handle things like resizing. The editor also runs the predicate using the
Api service and tells the main controller about the results and any errors. The error and results directives are
wrappers around the angular-ui-grid19 directive.
We used a text editor called ACE.20 It provided most of the required functionality out of the box but we made
some custom modifications to better suit our needs. We created a regular expression based syntax highlighter
that highlighted the predicate queries and modified the editor code to support mixed-width characters, which
was needed to properly display the logic symbols.
The dynamic layout was implemented using code based on a library called dock-spawn.21 It structures the window as
a series of panels, tab controls and splitters, and allows the panels to be rearranged and resized, providing a
customisable interface. We converted it to the CommonJS module system and fixed some bugs in the modularized code. The panels are easily extensible, which allowed it to be integrated with AngularJS and ACE with no
major problems.
The style sheets are written using SCSS,22 an extension to standard CSS that has features such as variables
and functions. This allowed the styles to use common variables and facilitated making the large changes that
happened during the development of the project.
The development packages are managed using NPM,23 the node package manager. It automatically installs
all the dependencies required to build the frontend assets. The third-party assets are managed using bower,24
another package manager targeted more at client-side packages. The frontend assets are compiled using grunt,25
a task automation tool. Grunt was configured to either produce a release build, that combines all the assets and
minifies them, or to watch the source files and automatically recompile when they change, which is very useful
during development.
3.3.3 Middleware

In between the JavaScript user interface and the Python translator is a lightweight middleware layer that acts as
an adapter, converting AJAX HTTP requests into the relevant invocations of the translator and converting
the results from the translator into JSON for the app to display.
The middleware layer was originally written in PHP26 because it is easy to set up and available from most hosting providers. It implemented a lightweight MVC-style architecture, with a router taking the model (serialized
as JSON), finding the appropriate controller and executing it, returning the result as JSON. Whilst this was easy
to set up and work with initially, we started having problems when more information was being passed between
the components.
17 https://github.com/JimLiu/angular-ui-tree
18 https://github.com/angular-ui/ui-ace
19 https://github.com/angular-ui/ng-grid
20 http://ace.c9.io/
21 https://github.com/coderespawn/dock-spawn/commit/eaf12d58
22 http://sass-lang.com/
23 https://www.npmjs.com/
24 http://bower.io/
25 http://gruntjs.com/
26 http://php.net/

We decided that it would be easier to recode the middleware in Python, allowing us to import the translator
directly. We kept the format of all the requests and responses the same but changed the URLs to be more
descriptive; the modular design of the frontend application made this a very simple change.
For managing connections to user databases and query execution we are using sqlalchemy,27 a dedicated Python
SQL toolkit. It allows effortless management of multiple relational database standards with a single common
interface. We are thus able to support various commercially available relational database systems. As the translator uses only standard (ANSI) SQL, the functionality of the generated queries is not implementation-dependent.
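As an illustration of this single common interface, the same sqlalchemy calls work across backends; the sketch below runs against an in-memory SQLite database, with other standard connection URLs shown in comments for comparison.

```python
from sqlalchemy import create_engine, text

# The backend is selected solely by the connection URL:
# create_engine("postgresql://user:pass@host/db")
# create_engine("mysql://user:pass@host/db")
engine = create_engine("sqlite://")  # in-memory database for the demo

with engine.connect() as conn:
    # The same execute() interface works regardless of the backend dialect.
    conn.execute(text("CREATE TABLE emp (eid INTEGER, name TEXT)"))
    conn.execute(text("INSERT INTO emp VALUES (1, 'Ada')"))
    rows = conn.execute(text("SELECT name FROM emp")).fetchall()
print(rows)
```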
The final middleware layer is a very lightweight Python program using a micro framework called Flask.28
app.py bootstraps the application, maps the HTTP requests to the translator and converts the results to JSON. custom_json.py is a custom JSON serializer that handles the schema and sqlalchemy types. translate.py is a small
wrapper around the translator that changes the translator's output into the correct format.
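A minimal sketch of the shape of this layer, with a stub in place of the real translator; the route name and payload format are assumptions, not the actual ATLS endpoints.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def translate(formula):
    # Stand-in for the real logic-to-SQL translator import.
    return "SELECT ..."

@app.route("/translate", methods=["POST"])
def translate_endpoint():
    # AJAX request in, JSON result out: the adapter role described above.
    formula = request.get_json()["formula"]
    return jsonify({"sql": translate(formula)})

client = app.test_client()
resp = client.post("/translate", json={"formula": "∃d(...)"})
print(resp.get_json())
```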

27 www.sqlalchemy.org
28 http://flask.pocoo.org/

3.4 Translation

The initial attempt at implementing the translation consisted of a simple one-pass interpreter tried on trivial
examples. However, as more involved queries were proposed, it became apparent that support would be needed
for variable analysis, scope resolution, unification, etc. Therefore we decided to implement a compiler from
predicate logic formulas to SQL.
Python is not a very common choice for compiler construction, with languages like C++ or Java being the
preferred choices. Consequently the range of tools available is rather limited. Our choice for the lexer and
parser was PLY, a Python implementation of the familiar lex and yacc tools [1].
A sample definition of lexer tokens is
1  t_LESS = u'<'
2  t_LEQ  = u'\u2264'  # ≤
3
4  def t_STRINGCONSTANT(t):
5      r'(\'[^\']*\'|"[^"]*")'
6      t.type = reserved.get(t.value, 'STRINGCONSTANT')
7      t.value = str(t.value)
8      return t

where t_LESS and t_LEQ are simple fixed-length tokens, and t_STRINGCONSTANT builds a custom token
object according to the recipe in the definition, having recognized the pattern from the regex on line 5.
The PLY parser is capable of LALR(1) parsing, which is sufficient for the purposes of our grammar. Unlike parsing
tools such as Java's ANTLR, PLY does not support automatic generation of abstract syntax trees; it merely
provides a bare-bones syntax recognizer. Therefore, the grammar rules must be accompanied by specifications
of the code that constructs the AST representation. Consider the following code that handles mathematical
expressions.
1  def p_binoperation(p):
2      '''binoperation : identifier binoperator identifier'''
3      p[0] = BinaryOperationNode(p[2], [p[1], p[3]])

The AST node constructor for the binary operations uses the BNF definition ⟨binop⟩ ::= ⟨id⟩ ⟨binoperator⟩ ⟨id⟩,
declared on line 2. The argument p is the array of terminal and nonterminal elements on the right-hand side of
the rule. The elements of the rule are accessible by index. Here, a BinaryOperationNode
is created, using the p[2] operator argument as well as the list of children, identifiers p[1] and p[3]. The
result is stored in the place of the nonterminal symbol on the left-hand side, p[0]. An abstract syntax tree for a
sample query is presented below.
Quantifier {type=∃, ids=[x, s]}
  Binop {type=∧}
    Predicate {relation=emp, attribute=name}
      Variable {id=x}
      Variable {id=n}
    Binop {type=∧}
      Predicate {relation=emp, attribute=salary}
        Variable {id=x}
        Variable {id=s}
      Mathop {operator=>}
        Variable {id=s}
        Numconst {val=50000}

Figure 5: Sample AST for the query ∃x, s(emp.name(x, n) ∧ emp.salary(x, s) ∧ s > 50000).
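The tree in Figure 5 can be reproduced with a few lines of Python; the generic Node class below is a hypothetical simplification, not the actual ATLS node hierarchy.

```python
class Node:
    """Generic AST node: a bag of attributes plus an ordered child list."""
    def __init__(self, **attrs):
        self.attrs = attrs
        self.children = []
    def add(self, *nodes):
        self.children.extend(nodes)
        return self

Variable = lambda name: Node(id=name)

# The tree from Figure 5, built bottom-up under the top-level quantifier.
root = Node(type="∃", ids=["x", "s"]).add(
    Node(type="∧").add(
        Node(relation="emp", attribute="name").add(Variable("x"), Variable("n")),
        Node(type="∧").add(
            Node(relation="emp", attribute="salary").add(Variable("x"), Variable("s")),
            Node(op=">").add(Variable("s"), Node(val=50000)))))

def leaves(n):
    # Count leaf nodes as a quick sanity check of the tree's shape.
    return 1 if not n.children else sum(leaves(c) for c in n.children)

print(leaves(root))
```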
After the AST has been constructed, we run basic semantic analysis over the query. This involves verifying
it against the obtained database schema and determining whether (a) all relations and attributes mentioned in
the query are available, and (b) the predicates have the correct arity. We considered extending the semantic
analysis stage to other possible errors such as type checking, but decided that such errors should be returned on
the database end, as predicate logic is by default not many-sorted.
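A sketch of these two checks, run against an assumed schema dictionary; the names and error messages are illustrative.

```python
def check(schema, predicates):
    """predicates: list of (relation, attribute, arity) triples from the AST."""
    errors = []
    for rel, attr, arity in predicates:
        if rel not in schema:
            errors.append("unknown relation: %s" % rel)
        elif attr not in schema[rel]:
            errors.append("unknown attribute: %s.%s" % (rel, attr))
        elif arity != 2:
            # Every r.a predicate takes a key variable and a value variable.
            errors.append("%s.%s expects 2 arguments, got %d" % (rel, attr, arity))
    return errors

schema = {"emp": {"eid", "name", "salary"}}
print(check(schema, [("emp", "name", 2), ("emp", "dept", 2), ("dept", "head", 2)]))
```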
In order to facilitate variable analysis, a symbol table is generated once the AST construction is complete.
This allows us to uniquely identify the entity that a variable represents. Note that the rules of the grammar
allow writing formulas of the shape

∃x(. . . ∀x(P x) . . .),

that is, reusing a variable symbol in a different binding within a subformula. For this reason the symbol table
is hierarchical: it is constructed on a per-level basis, where the occurrence of a quantifier {∃, ∀} signals the
introduction of a new scope level. A sample top-level symbol table is presented in Table 1.

Identifier    Bound value
emp.name      <PredicateNode@0x14b4f50>
emp.salary    <PredicateNode@0x14b4f90>
n             <VariableNode@0x14b4e90>
s             <BindingVariableNode@0x14b71d0>
x             <BindingVariableNode@0x14b7150>

Table 1: Top-level symbol table for the query from Figure 5.
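The scoping behaviour can be sketched as a chain of dictionaries, with lookups falling back to the enclosing scope; the class below is an illustrative simplification of such a hierarchical table.

```python
class SymbolTable:
    """A scope level; each quantifier opens a child scope."""
    def __init__(self, parent=None):
        self.parent, self.bindings = parent, {}
    def define(self, name, value):
        self.bindings[name] = value
    def lookup(self, name):
        # Resolve in the innermost scope first, then walk outward.
        if name in self.bindings:
            return self.bindings[name]
        return self.parent.lookup(name) if self.parent else None
    def child(self):
        return SymbolTable(self)

top = SymbolTable()
top.define("x", "outer-x")
inner = top.child()        # entered at a nested quantifier rebinding x
inner.define("x", "inner-x")
print(inner.lookup("x"), top.lookup("x"))
```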
It is not feasible to directly use an abstract syntax tree to generate SQL code. This is because the AST, even with
the symbol table, still does not contain all the necessary information about, say, variable resolution. Therefore,
a tree walker generates an intermediate language struct that is then used by the SQL code generator.
Consider Figure 6, the intermediate language struct for the query

∃d(employee.department(x, d) ∧ employee.department(y, d) ∧ x ≠ y ∧ employee.salary(y, s))

(18)

π-list: {emp1@emp}.eid, {emp2@emp}.eid, {emp2@emp}.salary

σ-list:
  ≠
    {emp1@emp}.eid
    {emp2@emp}.eid
  ⋈₁
    emp1@emp
    emp2@emp
where ⋈₁ : {emp1@emp}.dep = {emp2@emp}.dep

Figure 6: An intermediate language struct for query (18)
The IL struct consists of two items. The π-list, corresponding roughly to relational algebra's projection
operator, indicates the predicates that will be part of SQL's SELECT clause. Query (18) asks for the primary keys
of two employees and the salary of the latter, hence the π-list consists of three items. The notation {p@r}.a
denotes attribute a from table r, which is aliased as p.
The σ-list in turn is a list of condition trees that the code generator will use to construct FROM and WHERE.
A condition tree can express various constraints, such as the need to carry out a join or to make sure that two
variables compare in a particular way. The query contains x ≠ y, which expresses that the primary keys of the two
employees are different; this is indicated in the first tree, with ≠ at the root. Also, the two employees are part of
the same department, so we need to carry out a join on the variable d, whose type is employee.department.
This is expressed in the second tree, with the join condition ⋈₁ at the root and the joined relations on the
branches.
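For illustration, the struct of Figure 6 could be represented with Python dataclasses along these lines; the field and class names are assumptions, not the actual ATLS types.

```python
from dataclasses import dataclass, field

@dataclass
class Condition:
    """A σ-tree: an operator over two operands."""
    op: str
    left: str
    right: str

@dataclass
class ILStruct:
    pi_list: list = field(default_factory=list)     # columns for SELECT
    sigma_list: list = field(default_factory=list)  # condition trees

# The IL struct for query (18): three projected columns and two conditions,
# an inequality on the keys and a join on the department attribute.
il = ILStruct(
    pi_list=["{emp1@emp}.eid", "{emp2@emp}.eid", "{emp2@emp}.salary"],
    sigma_list=[
        Condition("!=", "{emp1@emp}.eid", "{emp2@emp}.eid"),
        Condition("join:{emp1@emp}.dep={emp2@emp}.dep", "emp1@emp", "emp2@emp"),
    ])
print(len(il.pi_list), len(il.sigma_list))
```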
The compiler uses the common technique of a stack-based intermediate representation [8], depicted in Figure 7.
The AST is visited in a bottom-up fashion, with a node accepting a Visitor object, dispatching it to the node's
children, and then processing the result of the visits to the children. The Visitor keeps an internal stack of
intermediate language structs. The ∧ node in the query dispatches the visitor to the two children nodes (predicate
nodes). When control is returned, the visitor pops the two IL structs off the stack and combines them in an
appropriate manner. This manner depends on the type of node that is currently visited. In the case of the ∧
node, the σ-trees have to be inspected to resolve clashing variable names, possibly introducing a join or a
WHERE-type condition, and can then be combined to make sure that all conditions from the two children nodes
are satisfied. Now, after aliases have possibly been introduced, they can be applied to the two π-lists, which are then
simply combined. The result is pushed back onto the stack.
[Figure: an ∧ node with children emp.dep(x, d) and emp.dep(y, d); the visitor pops their representations L1, L2 off the stack and pushes the combined struct L3.]

1  # class AstVisitor:
2  def visit_AndNode(self, node):
3      l2 = self.stack.pop()
4      l1 = self.stack.pop()
5      # node-specific processing...
6      l3 = self.combine(l1, l2)
7      self.stack.append(l3)

Figure 7: The visitor consumes the representations L1, L2 of subqueries and combines them into L3
Once the translation of the AST to the intermediate language has been completed, the topmost IL struct accepts
an SQL visitor that produces the final SQL query. Given that most of the semantic processing has been done at
the stage of translation to IL, the bulk of the SQL visitor's functionality is string generation. Nonetheless, it is
also equipped with enough introspection power to appropriately process the σ-trees, e.g. to decide whether an
incoming condition is a JOIN that should augment the FROM stack, or a simple selection to be appended onto
the WHERE stack. The visitor is similarly stack-based, visiting child nodes and passing the result up to the parent
nodes. The usefulness becomes apparent in the processing of full subqueries of the form
-- ...
WHERE NOT EXISTS (SELECT ...)

In such cases we create another instance of the visitor that processes the inner query and returns the result in
the form of an SQL string, which can then be appended to the top-level WHERE operator.
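The idea can be sketched as follows, with generate_sql standing in for a visitor invocation; the function, its signature, and the example query are illustrative assumptions.

```python
def generate_sql(pi_list, from_clause, where=None):
    """Render one (sub)query; a stand-in for one SQL-visitor instance."""
    sql = "SELECT %s FROM %s" % (", ".join(pi_list), from_clause)
    if where:
        sql += " WHERE %s" % where
    return sql

# The inner query is rendered by its own generator call...
inner = generate_sql(["1"], "project", "project.manager = employee.eid")
# ...and its string is spliced into the outer WHERE as a NOT EXISTS condition
# (here: employees who manage no project).
outer = generate_sql(["employee.eid"], "employee", "NOT EXISTS (%s)" % inner)
print(outer)
```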

3.5 Technical challenges

The biggest hurdle in the project was the implementation of the translator. Although we had prior experience
with a simple compiler project for a toy language, this time we had neither a reference compiler nor a clearly
defined specification of what the translation should produce. We owe the final result to a lot of fundamental work,
pen-and-paper exercises, and discussion with the supervisor.
All our code is stored in two git repositories. Initially this caused some problems: the .gitignore file was
wrong, so temporary files were being committed, and merges occasionally went badly. Once we
fixed the .gitignore file and added pre-commit hooks for testing, these problems mostly resolved themselves.
We needed to be able to connect to a variety of different types of database server: PostgreSQL, MySQL, Microsoft SQL Server, etc. This caused problems initially because the three use different Python modules, need
slightly different connection parameters and have different interfaces. Our initial attempt only supported
Postgres, with a plan of creating a base class from it and extending it for each database server. During development we found a library called sqlalchemy that encapsulated the different interfaces.
Automated testing of the user interface proved difficult because there was no clear way of exercising it programmatically. We tried some UI automation testing libraries such as Selenium, but found that the overhead
of managing the tests, combined with the frequency of changes, meant that this was not a viable option. As the
interface was fairly simple, we decided in the end to test it manually.

3.6 Risk management

The use of Agile methods helped us immensely to contain the risks common to software projects. However,
the tools provided by Agile are focused on scheduling commitments. Shore and Warden [6, p. 232] indicate
that this should not overshadow the main goal of delivering an organizational success, that is, a piece of
software that delivers on the stakeholders' needs and requirements.

From the very beginning of the project, we relied heavily on efficient planning and evaluation. However, we
still faced some anticipated risks, ranging from technical to management risks.
One particular technical risk was the decision to use Python as the main programming language for our project
despite the fact that only two of the six group members had previous experience with it. However, our initial
decision to choose it for its reputed ease of delivering results even for beginners proved correct:
through frequent group discussions and pair programming sessions we eliminated any overhead stemming
from unfamiliarity with language features.
A risk that emerged during the implementation stage was the series of breakdowns of GitLab, the version control
hosting we used. These turned out to be far more frequent than expected, especially given that it was chosen for its
in-house departmental support and history of reliability. Although the issues were quickly mitigated by the responsible staff,
we still found ourselves on occasion having to share code via Google Docs or makeshift projects set up in other
version control systems.
A crucial commitment in terms of deliverables was a functional compiler that supports a wide range of queries.
This objective proved difficult to fulfil; consequently, the suite of compiler tests that we developed in the initial
stage of the project was very helpful in containing this risk and in tracking progress.

4 Project Management

4.1 Design methodology

In our initial meeting we made the decision to follow the Agile methodology. This allowed us to construct
our tool incrementally and respond to the requirements that were clarified over the duration of the project,
especially with regard to the functionality of the compiler.
We recognised that the most efficient way of discussing the project is via face-to-face conversation. The
industry practice of holding a standup every day was not feasible due to the differing schedules of team
members, hence we decided to hold one twice a week. These meetings were used to discuss the achievements
since the last meeting, as well as to define the tasks to be completed before the next iteration and their
estimated manpower. On top of the team meetings, we also met our supervisor once a week to gather
feedback on the state of the project and to clarify requirements.
On occasion the team meetings turned into development sessions, mostly for the benefit of work that
would impact different parts of the project. The benefit was that members assigned to different tasks
were often able to advise each other on possible solutions to related problems, thus making the integration of
those tasks far smoother.
We realized from the beginning that the project, although technically challenging, is rather narrow and precise in
its scope; the bulk of the work would be devoted to creating a functional compiler. As the development process
was very much exploratory in terms of determining the actual workings of the tool, we were constantly updating
a list of possible extensions to our application, such as direct optimization of the resulting SQL queries to
produce more refined queries that are more intuitive to users. It was natural to think of the product utility as
built from the ground up. We decided to use a simple cost-value approach to prioritizing requirements [3].
In the spirit of the agile method, the insights from discussions at team meetings were further granulated
when subteams assigned to particular parts of the project individually decided on the prioritized features and their
estimation, while keeping to the team-wide agreement whenever a wider feature had to be delivered on time.
Although our supervisor left us considerable latitude in terms of implementation, she had some very clear
ideas about what the project should ideally deliver, and we took these into account.

4.2 Tools

There are many applications available that implement Scrum and other agile methods but we decided to use


[Figure: cost-value scatter of candidate features: a functional compiler for predicates, ∧, ∃; supporting multiple relations, nulls, joins; SQL query optimization; real-world database querying; ad-hoc database management; Web UI; support for multiple database standards; offline support for local databases.]

Figure 8: Cost-value diagram used in the initial planning stages
TargetProcess,29 project management software that focuses on the Scrum and Kanban software development
methodologies, to keep track of our work. It provides a powerful interactive scrum board that allows management and planning of iterations, features, user stories, task allocation, bug logging, and more. When a feature
was proposed at a team meeting, it would be divided into tasks and revised as the work progressed. The
persons allocated to a task would use planning poker to decide an estimate of the manpower. As work on
the task was carried out, it would move between the progress lanes named Proposed, Planned, In progress, and
Completed. The time estimates and actual development times were logged to allow us quantitative verification
of the project's progress.
Even though TargetProcess provided excellent project management support, we feel that the tool was much
more involved than our needs dictated; it was designed with far larger projects in mind. Initially we may
have spent too much time trying to understand the details of the system rather than simply hitting the ground
running. In the future we would consider simpler tools, such as the kanban board Trello, for projects of similar
size.

Figure 9: Snapshot of the TargetProcess Scrum board


Our tool of choice for version control was Git, primarily due to the extensive previous experience of all members of the group, as well as the support provided by departmental facilities. Two major branches were created, translator and web_ui, to allow interference-free development of the largely independent parts of the
project. Git provides excellent support for versioning of different implementation ideas, which we made extensive use of during development. It also allowed us to enforce sanity checkpoints during development by
rejecting failing commits; section 5 includes more detail on testing procedures.
29 www.targetprocess.com

For our continuous integration / deployment server we used Jenkins30, installed on our group virtual machine.
The user interface and the translator both had package managers that allowed them to be set up automatically.
This meant that the deployment server could be recreated very quickly if it were ever lost. The continuous deployment system meant that we could quickly preview the global effect of a change on the system, and we could
show it to users and our supervisor.
We used Google Drive for collaborative editing of coursework reports, storing reference articles, etc. Discussions were carried out in a dedicated Facebook group.

4.3 Team organization

We decided on a free-form structure with no formal group leader, where we make decisions
based on everyone's evaluation and consensus. Decisions that directly affect the output of the application
were made after discussions with the supervisor on the topic.
After the first meeting with our supervisor we had a group meeting to decide how to split the project into self
contained tasks that could be assigned to members of the group. For the first week we decided that we needed
to do more in-depth research on predicate-logic, but at the same we wanted to start working on the actual
application. So we delegated the work as follows:
Front-end: William Smith, Ana-Maria Carcalete
Back-end: Yubin Ng, Samuel Ogbeha
Translation: Sebastian Delekta, Arjun Saraf
This work allocation was based on our estimations as to what parts of the project were most important at the
time along with previous experience of the group members. However, over the time, we kept changing the
priorities of most of these tasks, as it turned out that the translation of the implementation was the part of the
project that required most manpower.
Moreover, we scheduled to meet twice a week to discuss our progress, problems and to delegate new work.
Usually the work was allocated such that it is completed within a week. However, we always had an active
Facebook conversation where one could mention the problem one is facing and seek help from everyone.
We made use of the practice of pair-programming in order for everyone to understand different aspects of the
project and advise each other on implementation points. This allowed a free flow of ideas throughout the team
and facilitated collaborative effort, allowing us to make the best use of development time.

5  Evaluation

5.1  Testing and translation correctness

To ensure that we were building the correct product, we decided to begin the project on a theoretical basis,
and to build on the extensive research already done in the area of interfacing relational databases with logic.
This led us to create an extensive theoretical model for the project, including a comprehensive mapping
between the two languages that served as our point of reference during implementation.
Next, we had to ensure our implementation of this framework was correct, so before writing any translation
code we held meetings to discuss various problems and concepts and tried to solve instances of them on paper
using our framework to see how robust it was. This involved pen-and-paper translations of multiple sample
queries, both in team meetings and during the weekly catch-ups with our supervisor.
Because it deals with the very essence of programming languages, a compiler project must be made especially
robust to invalid inputs, and the expected output must be clarified as soon as is feasible. Before attempting
30 http://jenkins-ci.org/


to write any code, we consulted our theory and wrote extensive unit tests to make sure that the developed
system would be robust. To make the best possible use of the Python nosetests testing framework, we added
pre-commit hooks to the codebase to make sure that failing tests prevent a commit to any of the branches.

Figure 10: A commit failing due to not passing all tests
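The hook itself amounts to a short script. The sketch below is an illustrative reconstruction of the idea, not the actual ATLS hook; it assumes nosetests is available on the PATH.

```python
#!/usr/bin/env python
# Illustrative .git/hooks/pre-commit script (a sketch, not the ATLS hook):
# run the test suite and abort the commit when any test fails.
import subprocess
import sys

def tests_pass(command=("nosetests",)):
    """Run the given test command; return True only if it exits with status 0."""
    return subprocess.call(list(command)) == 0

if __name__ == "__main__":
    if not tests_pass():
        sys.stderr.write("Commit rejected: the test suite is failing.\n")
        sys.exit(1)
```

Git invokes this script before each commit; any non-zero exit status cancels the commit.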


The weekly meetings with our supervisor provided us with a great deal of valuable feedback on the implementation of the actual translation. We prepared a separate suite of tests that is not used to validate
day-to-day code edits, but helps us track progress in the functionality of the compiler. It compares the result
of executing the output of the translator with the result of executing the expected query. Note the significance
of comparing the query results: even a non-optimal query can return the correct result, which would not be
the case if we attempted string comparisons of the generated query with the benchmark query. (Such a
comparison would be rather involved anyway, as even queries that are functionally the same can be written in
different ways, by reordering arguments etc.)
import unittest

class CompilerTest(unittest.TestCase):
    # setup...
    def test_two_attributes(self):
        query = u"employee.name(x, n)"
        correct_sql = "SELECT eid, name FROM employee"
        assert self.run(query) == self.run(correct_sql)

Figure 11: Compiler unit test for query: employee.name(x, n)
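As a toy illustration of why result comparison is more robust than string comparison, consider the following self-contained sketch. It uses an in-memory SQLite database and made-up queries, not the ATLS test data.

```python
import sqlite3

# Two queries that differ as strings but are functionally identical; only
# comparing their result sets recognises them as equivalent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

generated = "SELECT eid, name FROM employee WHERE eid > 0 AND name <> 'x'"
benchmark = "SELECT eid, name FROM employee WHERE name <> 'x' AND eid > 0"

def results(sql):
    """Execute a query and return its rows as an order-insensitive set."""
    return set(conn.execute(sql).fetchall())

assert generated != benchmark                     # the strings differ...
assert results(generated) == results(benchmark)   # ...but the results agree
```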

5.2  Usability testing

As soon as the web interface was completed and basic compiler functionality was implemented, we created a
dedicated testing environment and began running ad-hoc usability studies on fellow students in the department.
We were deliberately vague in our introductory explanations, as we wanted to determine whether the tool was intuitive enough for users to simply jump straight to query execution. We ran these brief sessions on volunteers
in the labs as time permitted, asking them to attempt retrieving certain information from a provided database
and to fill out a brief questionnaire afterwards (see Appendix A.1). All queries that the users tried to execute,
even failing ones, were logged to let us see what confused users most about how queries should be
written.
This survey allowed us to discover a few potential issues with the tool. Firstly, it was not at all clear what
exactly it means to translate a predicate logic formula into an SQL query. Inspecting the log showed that users
were often confused about restricting free variables to enforce projection and about giving appropriate arguments to
the predicates. This is something we were aware of from the beginning, as we also had to get used to the idea

of such translations. We thus became reassured that a concise but comprehensive help guide, or at the very
least a set of examples, will be necessary for the best user experience. More importantly, we discovered that
the user interface was not very user friendly. This was a surprise to us: we had become victims of creator's bias.
As the expected functionality was clear cut in our minds, we thought the UI was perfectly intuitive.

Figure 12: A feedback sample from the usability study


The feedback from the users who grasped the basics of the application in the limited time was generally positive.
However, they mostly welcomed the tool as a novelty rather than as a substantial help in database management
or related tasks. Some of the concerns raised were:
• It is not clear how to write the query in logic in the first place. In spite of the introduction to the rules and
the examples we provided, some users still struggled to formulate a query in logic. This is
an important obstacle to users exploiting the tool to their advantage, but we believe the learning curve is
not very steep. In particular, all the queries that we asked the users to translate from English to logic
had direct counterparts in the examples provided.
• Given the exact rules of what constitutes an acceptable translation, the reasoning needed to reach it
is not much easier than formulating the SQL. This point varied between users, with many
mentioning that they would actually welcome logic queries as a viable alternative. Given the proliferation
of SQL as a de facto standard, it was difficult to expect that everyone would welcome such a paradigm
shift, but it was positive to see that at least some people could see the potential of the tool.
• The tool needs to make more use of the richness of SQL to be practically useful. This was a very valid
criticism, which comes as a consequence of the current sophistication of the compiler: it is able to
translate logic only to a very limited subset of SQL.


5.3  Performance Evaluation

One of our priorities was to make sure that our web application was highly responsive at all times. We used
client-side JavaScript and the AngularJS framework for our front-end. To improve perceived performance
and make the web application more responsive, we used techniques such as AJAX, which allowed us to load data
dynamically and asynchronously. Moreover, we used the lightweight JSON format to keep the transfer of data
between front-end and back-end fast and to avoid using server resources redundantly. This also enhanced our
productivity by making deployment faster.
The translator did not have a demanding performance requirement; had it been slow, loading indicators could
have been added to the user interface. The majority of the time spent in the translator is actually during
communication with the database server. This cannot easily be optimised, although by caching the
schema some time could be saved on later queries. This caching strategy was not added because performance
was not an issue.
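The schema-caching idea mentioned above could be sketched as follows. The names and structure here are illustrative, not the actual ATLS code.

```python
# Illustrative sketch of per-connection schema caching: the database server is
# asked for the schema at most once, and later translations reuse the copy.
_schema_cache = {}

def get_schema(connection_id, fetch_schema):
    """Return the schema for a connection, fetching it from the server at most once."""
    if connection_id not in _schema_cache:
        _schema_cache[connection_id] = fetch_schema()
    return _schema_cache[connection_id]

# Demonstration with a fake fetch function that counts server round-trips.
calls = []
def fake_fetch():
    calls.append(1)
    return {"employee": ["eid", "name", "department", "salary"]}

first = get_schema("db1", fake_fetch)
second = get_schema("db1", fake_fetch)
assert first is second and len(calls) == 1  # only one round-trip to the server
```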

5.4  Project management

Along with evaluating our project we also evaluated our team and the progress we made each week. As we
chose to follow parts of the Scrum development method, we used an online platform to keep track of our
progress during development and to assign tasks to each member.
As Agile requires periodic development cycles, we decided that the best time frame for one cycle was one week.
Our initial tasks were based on our project specification and the first meeting with our supervisor. As the
project progressed we added more tasks to our online board according to what our supervisor requested.
Our progress chart shows that initial progress was slow until we had a better idea of what the
project entails. The chart also shows that our initial tasks were rather large; to fix this, we further split each large
task into smaller parts. Some of our development cycles were affected by the amount of outside work we had
in a given week; this can be observed in cycles where most of the work was done at the end of the week.
Most new tasks were added to our planning board after each weekly meeting with our supervisor, although
there were times when we added new tasks after one of our own team meetings.

Figure 13: Development cycle task process

5.5  Evaluating deliverables

A straightforward way of evaluating the success of a project is to revisit the requirements. Here we refer to
Section 1.2 where the key objectives of the project were defined.

• Functionality. We have been successful in developing a translator capable of converting predicate logic
queries to SQL. Support is provided for quantification, negation and binary connectives, resulting in possibly
complex SQL queries that include joins. There are some minor issues that we have not managed to fully
look into due to time constraints, and which are therefore not implemented; as an example, using a variable
from an outer scope inside an inner scope introduced by the universal quantifier (∀) will be ignored. Two
major ideas that were not implemented are the query optimizer and support for write operations. The
latter was not a firm requirement, and we have mostly considered it as a possible extension of the
primary tool that supports querying. As for the optimization, some of our queries are still far from
optimal. As an example, because binary connectives are considered pairwise without regard for their
possible commutativity, a query of the form
r.a1(. . .) ∧ r.a2(. . .) ∧ . . . ∧ r.an(. . .)
will result in n − 2 joins, even though none are required. For the most part, this is not a performance
problem, given that the SQL client is most likely able to apply an optimization of its own. In this light,
our tool is perhaps more clearly useful as a query executor than as a pure translator.
• Ease of use. This was one of our main concerns throughout all stages of the project, from the compiler
to the web front-end. The syntax of the logic accepted by the translator is not far removed from the usual
mathematical syntax, and we tried not to diverge too far from it. The results of the usability study suggest that, even
though the learning curve was somewhat steep for some users, it did not represent enough
of a hindrance that a user would give up on trying to use logic for database querying. The user interface
was designed with clarity in mind. End-users seemed to find their way around it easily, from entering
the database connection settings and establishing a connection to quickly formatting the query thanks to
syntax highlighting, autocompletion and dedicated support for symbol insertion.
• Portability. The tool is fully portable; built on standard web frameworks and technologies, it can run in any
major browser. Offline support is provided, and a locally hosted database can be queried just as well as a
remote one. Using a third-party database library and emitting only ANSI SQL in the translator allowed us to
support most relational database systems.
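The redundancy noted under Functionality can be seen on a toy example: the self-join that pairwise translation produces for two attribute predicates over the same relation returns exactly the same rows as a single scan. The check below uses an in-memory SQLite database with hypothetical data, not actual ATLS output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, department TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Ada", "R&D"), (2, "Bob", "HR")])

# What pairwise translation of employee.name(x, n) ∧ employee.department(x, d)
# would emit: a self-join on the key...
with_join = """SELECT employee.name, employee1.department
               FROM employee JOIN employee AS employee1
               ON employee.eid = employee1.eid"""
# ...which is equivalent to a single scan of the relation.
without_join = "SELECT name, department FROM employee"

assert set(conn.execute(with_join)) == set(conn.execute(without_join))
```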
As a group we are satisfied with the outcome of the project. Given the time constraints, we have managed to
develop a usable tool that satisfies all crucial requirements, and which we are confident the target users will
find at least somewhat useful.

6  Conclusion and Future Extensions

6.1  Conclusion

Overall we are satisfied with the outcome of the project. We managed to achieve our initial target of translating
predicate logic to SQL queries, with a fully functioning web application as our front end. Although not all goals
were fully met, we are confident that our target user groups will find the tool beneficial to their learning and
database management experience.
Developing ATLS was not an easy task. It required us first to strengthen our understanding of both predicate
logic and SQL, and to develop a large theoretical framework before any translation code could be attempted.
We believe this was the right approach; the implementation work would have required much more effort otherwise.

6.2  Front-end

The core requirements of the user interface have been met; however, there is still room for improvement. Two
major extensions were planned had there been extra time. The first is a user account system
that would allow the customized layout and queries to be saved. This would mean that users could resume
working instantly instead of having to set up the editor to their preferred layout. The second extension was code
completion. ACE has good support for code completion, so the major work would be in the translator: it would
require the lexer and parser to support partial parsing. Analysis of the final state of the parse would then need
to be performed to determine which tokens are valid, as well as which variables and predicates might
be valid in the context. Additionally, implementing dockspawn using native Angular concepts would have created
slightly cleaner code in places, removing the need for some tight coupling.
During development of the user interface, two main areas proved difficult: changing requirements and testing.
Whilst the main functional requirements remained constant throughout the project, there were numerous small
additions and changes, which meant that code and layouts had to be rewritten quite often. The
problem was mostly resolved with the final, fully customizable layout, because the layout could then be changed on
the spot without having to change any code. Overall, the front-end and middleware were fully rewritten at some
point during the project's duration.
Automated testing also proved difficult for the user interface, because there was no clear way of exercising it
automatically. We tried user interface automation libraries such as Selenium, but found that the
overhead of managing the tests, combined with the frequency of changes, meant that this was not a viable option.
As the interface was fairly simple, we decided in the end to test it manually.
If this system were to be used in production, an SSL certificate would be required to prevent the database
connection details (and user account details) from being stolen in transit.

6.3  Translation

Though we implemented the translation for all basic SQL queries, there are still a number of improvements and
possible extensions that would make our application even better.
For instance, we currently translate only two kinds of join, the cross join and the equi-join. Given sufficient time, we
could extend the translator to cover the remaining joins. Secondly, our produced queries are correct but not
always optimal, as they follow a rigid formula in translation. Optimization of queries can be done on a case-by-case
basis, but it takes a very long time.
Additionally, and for the same reason, our translator also suffers from producing queries which are not very
readable or intuitive. This is something that could be improved by optimizing the translator. We also have not
included support for some advanced SQL features, such as sorting. Moreover, we currently only support read
operations; given enough time, we would like to explore write operations too.


Appendices

A.1  Hallway testing

Below is the interface presented to users during hallway testing.

A.2  Sample Translations

The following examples demonstrate some of the more involved queries accepted by ATLS. Note that in these
instances the generated code can often be simplified.
Suppose we want to obtain the name and department of the employee who has the highest salary. This
can be written in logic as:

∃e, s(employee.name(e, n) ∧ employee.department(e, d) ∧ employee.salary(e, s)
∧ ¬∃x, y(employee.salary(x, y) ∧ employee.department(x, d) ∧ (s < y)))
An SQL translation generated by ATLS will be:
SELECT employee.name, employee.department
FROM employee JOIN employee AS employee1 ON (employee.eid = employee1.eid)
WHERE (NOT ((EXISTS (SELECT employee.eid, employee.salary, employee.department
FROM employee
WHERE employee1.salary < employee.salary
AND employee.department = employee.department))))

In another instance, we may want to obtain the name of all employees that are project managers of
projects with a cost bigger than 100000. This can be written in logic as:

∃x(employee.name(x, n) ∧ ∀y(project.manager(y, x) → (project.cost(y, c) ∧ c > 100000)))
The SQL generated by ATLS in this instance will be:
SELECT employee1.name
FROM employee AS employee1
WHERE (NOT ((EXISTS (SELECT project1.pid, project1.cost
FROM project AS project1
WHERE project1.manager = employee1.eid
EXCEPT SELECT project1.pid, project1.cost
FROM project AS project1
WHERE (NOT (project1.cost > 100000))
AND project1.manager = employee1.eid))))

Finally, we may want to obtain the name of all employees that manage at least two distinct projects. We
may write this in logic as:

∃x(employee.name(x, n) ∧ ∃y, z(project.manager(y, x) ∧ project.manager(z, x) ∧ z ≠ y))
The SQL generated by ATLS for this formula will be:
SELECT employee1.name
FROM employee AS employee1
WHERE (EXISTS (SELECT project1.pid, project2.pid
FROM project AS project1
JOIN project AS project2 ON (project1.manager = project2.manager)
WHERE (project2.pid <> project1.pid)
AND (project1.manager = employee1.eid
AND project2.manager = employee1.eid)))

