4

I got two queries:

'UPDATE foo SET bar = baz WHERE a = b AND c = d'

and

'UPDATE foo SET bar = baz WHERE c = d AND a = b'

both are semantically equal (they do the same), but a simple compare would state that they are different as the first has a = b AND c = d whereas the second uses c = d AND a = b.

How can I check whether both queries are semantically equal?

This is an obviously easy example that can be solved by simple alphabetic sorting of the syntax tree at the WHERE node. I am interested in a generic approach that can also solve more complex queries - even with subqueries.

A further restriction is that I do not have access to the database and can only use the strings of the queries. Thus RUNNING the queries is out of question as it would NOT reflect on the equality of the queries.

An example for the bold text above:

FooTable:

A |  B |  C
1 | xx | xx
2 | yy | zz

FooTable': (FooTable' is FooTable on a different database)

A |  B |  C
1 | xx | xx
2 | ee | zz
3 | ss | xx

Example why running the queries will NOT yield valid results:

1) Queries on the same database:

UPDATE FooTable SET B = 'rr' WHERE C = 'xx'

AND

UPDATE FooTable SET B = 'rr' WHERE C = 'xx' OR B = 'ss'

Both queries will RESULT in exactly the same but are trivially not equal.

2) Queries when including different databases (same schema but different data):

SELECT A,B,C FROM FooTable where C = 'xx'

AND

SELECT A,B,C FROM FooTable' where C = 'xx'

Those two queries are trivially semantically equal but will NOT yield the same results.

18
  • 1
    When you say "check", do you mean an automatic check, or a manual verification? Commented Mar 30, 2016 at 9:42
  • 1
    So this query using IN: select * from core.[Group] where Id IN(select groupId from core.GroupPermission) should be equal to this one using EXISTS: select * from core.[Group] where exists (select null from core.GroupPermission where core.GroupPermission.GroupId = core.[Group].Id)? Commented Mar 30, 2016 at 9:51
  • 1
    @Sim Then I think it would be impossible to solve by just analyzing the query. Commented Mar 30, 2016 at 9:57
  • 2
    Consider a table, t. Consider a synonym, s, that is applied to t (or, if your database system doesn't support synonyms, a view v which selects all columns from t and applies no filters). There is no way to determine just by inspecting two queries, where one references t and the other references s (or v) that they are identical. Commented Mar 30, 2016 at 11:57
  • 2
    @VladimirBaranov - with a lot of provisos, usually. You'd need to ensure that you're the only user accessing that DB, that your queries don't modify the data, that you're not facing off against a malicious sys-admin, etc. Otherwise, there are plenty of ways for two identical queries to be optimized differently. Commented Mar 30, 2016 at 12:31

2 Answers 2

1

This task is indeed not trivial.

In essence, you'll have to build your own query parser and optimizer. This is the task of optimizer - to transform the query operators in an execution plan in such a way that the final result of the query remains the same for any possible data in the underlying tables (taking all constraints into account). Smart optimizers are able to generate identical plans for queries that look very different (such as IN vs EXISTS), they simplify and unify logic conditions in WHERE clause, can push predicates along the execution tree and do many other things.

Writing such optimizer from scratch would be hard, but you can have a look at existing open source databases (Postgres?) and see if you can borrow something from there.

Another, more practical approach could be utilizing one of the existing databases and instead of running the query, ask the optimizer to return you the generated execution plan. Then, instead of comparing original SQL text you can compare execution plans. If plans are the same, then the original queries are 100% equivalent. If plans are different, it is still possible that optimizer was not smart enough to deduce that the queries are equivalent, but you'll have to accept the possibility of false negatives.

I'd have a look at several different databases and see what kind of information you can get from their optimizers using built-in capabilities. In any case, the generated execution plan should be much more structured that original SQL text and it should be easier to compare them automatically.

Sign up to request clarification or add additional context in comments.

2 Comments

this sounds like a sound option. I'll look into this and if it proofs viable I'll accept this as an answer. The current problem is that it is more an abstract idea than a (yet) practical solution, thus accepting it without verification might be premature.
I opted for a simpler but not 'perfect' method. I now remove all values in comparisons or assignments, remove all whitespaces and then order the syntax tree alphabetically to finally compare the strings. This is by far not what I initially wanted but simpler and currently sufficient for my problem. But as this is the only answer so far and sounds sound I accept this as the answer to the question.
0

This can be achieved using a sqlglot project, for example:

def sql_distance(actual_query: str, expected_query: str) -> int:
    """
    Compute the SQL distance between two queries as the number of edit
    operations needed to convert one to the other.
    """
    actual = optimize(parse_one(actual_query))
    expected = optimize(parse_one(expected_query))
    edit_script = diff(expected, actual)
    return sum(0 if isinstance(e, Keep) else 1 for e in edit_script)

See Semantic Diff document in sqlglot repo.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.