Checking, Mining, and Exploring Functional Dependencies in Python

George Chernishev
Published in Level Up Coding · Apr 4, 2024


Hi everyone! In this article, we will show how you can perform various tasks concerning functional dependencies using Python. For this, you’ll need the Pandas and Desbordante libraries.

Introduction

Functional dependencies are a widely used and well-known concept, taught in almost every database course. In databases, they serve as the basis for normal forms, which, in turn, form the core of database design. This is the most prominent application of functional dependencies. A more recent point of view considers them to be hidden knowledge that can be mined from data and applied in data cleaning, query optimization, schema design, and so on. FDs are also very helpful in verifying hypotheses and drawing conclusions about data.

All of this leads to many tasks involving functional dependencies that frequently arise in the everyday practice of data engineers and data scientists. These tasks are usually solved in an ad-hoc manner, requiring a significant amount of coding and delivering rather limited performance, as the computational core is typically implemented in Python or SQL. In this article we will show how to solve them with minimal coding effort while obtaining maximum performance. For this we will employ Desbordante, a Python library for data profiling implemented in C++, which allows our approach to achieve native-grade run times.

Let’s first recap the basics. Informally, an FD X->Y states that if any two rows of a table have equal values in attributes X, then they will have equal values in attributes Y. X is the left-hand side (LHS) of the dependency, and Y is the right-hand side (RHS). In the example below, we can see that FD A->B holds (left), but A->C does not, as shown to the right — there is a conflict between the first and third record.

Image by author.
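To make this concrete, the same check can be reproduced with a few lines of pandas. The table below is a hypothetical stand-in for the one in the figure: the first and third rows agree on A and on B, but differ on C.

```python
import pandas as pd

# Hypothetical three-column table illustrating the figure:
# rows 1 and 3 agree on A and B, but differ on C.
df = pd.DataFrame({
    'A': [1, 2, 1],
    'B': ['x', 'y', 'x'],
    'C': [10, 20, 30],
})

# An FD A->Y holds iff every group of rows with equal A values
# has a single Y value.
print(df.groupby('A')['B'].nunique().le(1).all())  # True: A->B holds
print(df.groupby('A')['C'].nunique().le(1).all())  # False: A->C is violated
```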

Some additional details: first of all, both sides can contain several attributes, although usually only dependencies with a single attribute on the RHS are considered. Next, there are minimal and non-minimal dependencies: for example, adding any attribute to the LHS of a holding dependency results in another holding dependency. Therefore, we say that a dependency is minimal if we cannot remove any column from its LHS without the dependency ceasing to hold. There are also trivial dependencies, which hold on any dataset; for example, the dependency A->A is trivial. Informally, a dependency is non-trivial if its LHS and RHS do not share any column. Finally, there is a formal theory of functional dependencies that allows various manipulations on dependencies. For example, we can safely decompose any dependency by splitting its RHS: AB -> CD is equivalent to AB -> C and AB -> D. More information, as well as the formal definitions, can be found in any database textbook, for example [1].

Desbordante

Desbordante (Spanish for “boundless”) is a science-intensive, high-performance, open-source data profiling tool implemented in C++. Science-intensive means that, unlike other data profilers (e.g. ydata-profiling), it is focused on the discovery of complex data patterns, such as various kinds of dependencies.

Desbordante comes with web, console, and Python interfaces. In this article we are interested in the third option: it makes it possible to call pattern discovery and validation tasks from Python programs. Desbordante can be seamlessly integrated with other data science libraries, including Pandas. Its primary objective is to empower end users to apply pattern discovery and validation techniques in a range of ad-hoc scenarios, with the ultimate aim of solving real-life problems.

How to check a functional dependency

Suppose that you have to check whether a given functional dependency holds on a given dataset. This task is called functional dependency verification or functional dependency validation. Desbordante allows you to perform it in a simple manner:

First, you create an algorithm object for functional dependency verification. For such tasks, the Desbordante library may offer several algorithms, with the default one usually being the best. Then, the load_data() method is called. It accepts the file location, the separator, and a boolean value indicating whether the file has a header. Next, the execute() method accepts two list parameters (lhs_indices and rhs_indices) containing the attribute numbers of the left- and right-hand sides, respectively. Finally, the fd_holds() method returns the result as a boolean value.
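Putting these steps together, a minimal sketch could look like the following. The file name data.csv and the column indices are placeholders, and the fd_verification module path and the exact form of the load_data() arguments are assumptions that may differ between library versions:

```python
import desbordante

# Create the default FD verification algorithm (module path is assumed).
algo = desbordante.fd_verification.algorithms.Default()

# Load a CSV file: location, separator, and a header flag.
algo.load_data(table=('data.csv', ',', True))

# Check whether column 0 functionally determines column 1.
algo.execute(lhs_indices=[0], rhs_indices=[1])
print(algo.fd_holds())  # True if the exact FD holds on this dataset
```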

Alternatively, you can use a pandas dataframe:
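A sketch of the same check with a pandas dataframe as the data source; passing the dataframe itself to load_data() is assumed here, and the file name and column indices remain placeholders:

```python
import desbordante
import pandas as pd

df = pd.read_csv('data.csv')  # any dataframe will do

algo = desbordante.fd_verification.algorithms.Default()
algo.load_data(table=df)  # pass the dataframe instead of a file description
algo.execute(lhs_indices=[0], rhs_indices=[1])
print(algo.fd_holds())
```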

The program is basically the same, except for the data loading part.

How to check a functional dependency inside dirty data

As we have seen, checking a functional dependency is very easy. However, real data contains all kinds of imperfections: missing values, measurement errors, and typos, among others. At the same time, the definition of a functional dependency is very strict. Even a single typo can break a dependency, making it impossible to locate that FD. What is the proper approach to validating a functional dependency in this case?

To address this, the concept of approximate functional dependency (AFD) was introduced [2]. An AFD is a relaxed form of FD that allows a fraction of records to violate the FD rule: some tuple pairs that match on the attributes X may not match on the attributes Y. The idea is to compute an error metric and compare it against a predetermined threshold. If the metric is less than the threshold, we conclude that an AFD is present in the data. Informally, the error metric is the ratio of the number of violating pairs to the total number of pairs. It is computed as follows:

error(X->Y) = |{ (t1, t2) ∈ r × r : t1[X] = t2[X] and t1[Y] ≠ t2[Y] }| / (|r|² − |r|)

Consider the following example:

Image by author.

Here is an excerpt from a table that lists managers, their phone numbers, and the companies they work for. Consider the AFD Manager->Company, which has an error metric of 0.1. How do we calculate it? The numerator equals two, since there is a violation induced by Robin. According to the definition, we count all possible violating ordered pairs: (Robin@2, Robin@3) and (Robin@3, Robin@2), where @ denotes the row number. The denominator uses the cardinality of the excerpt (5 records) and consequently evaluates to 5² − 5 = 20. Note that, while the resulting error lies in [0, 1], it is not the fraction of violating records. Instead, it is simply a measure that is close to zero when there are almost no violations, and close to one when there are almost no matches on the RHS. Compared to exact FDs, this type of dependency captures the nature of real data, with its inconsistencies, much better.
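To make the arithmetic concrete, here is a small, self-contained computation of this metric in pure pandas. The table contents are hypothetical, reconstructed only so that Robin is the single source of violations, as in the excerpt above:

```python
import itertools
import pandas as pd

# Hypothetical excerpt: Robin appears twice with different companies.
df = pd.DataFrame({
    'Manager': ['Alice', 'Robin', 'Robin', 'Dana', 'Eve'],
    'Company': ['Acme', 'Globex', 'Initech', 'Umbrella', 'Hooli'],
})

def afd_error(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Share of ordered row pairs that agree on lhs but differ on rhs."""
    n = len(df)
    rows = [row for _, row in df.iterrows()]
    violating = sum(
        1
        for a, b in itertools.permutations(rows, 2)
        if a[lhs] == b[lhs] and a[rhs] != b[rhs]
    )
    return violating / (n * n - n)

print(afd_error(df, 'Manager', 'Company'))  # 2 / 20 = 0.1
```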

The code for validating AFDs is as follows:
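Here is a minimal sketch of such a check; as before, the file name, the column indices, and the afd_verification module path are illustrative assumptions:

```python
import desbordante

ERROR = 0.1  # the threshold we are willing to tolerate (illustrative)

algo = desbordante.afd_verification.algorithms.Default()
algo.load_data(table=('data.csv', ',', True))
algo.execute(lhs_indices=[0], rhs_indices=[1])

error = algo.get_error()  # the "true" error of the dependency
print('AFD holds' if error <= ERROR else 'AFD does not hold', error)
```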

The code is almost the same as before, but note several differences. Firstly, we create an object for approximate functional dependency verification. Secondly, we invoke the get_error() method, which returns the “true” error of the specified dependency, i.e. the smallest threshold at which it holds as an AFD. If the exact dependency holds, it returns 0.0. Similarly to the previous section, it is possible to use a pandas dataframe as the source of data.

How do you determine an “acceptable” error threshold? There is no straightforward answer: it depends on the domain, the quality of your data, and the task at hand. Sometimes you expect only a few violations, in which case the threshold should be close to zero. If you work with really dirty data, it should be in the range of 0.1–0.3. It is possible to increase the threshold even further, but in my experience larger threshold values do not reveal anything of value. Finally, there are many other relaxed definitions of functional dependencies, suited for different tasks. We will consider them in the future.

How to find which rows violate a given functional dependency

There are several cases in which it is useful to look at the rows that prevent a given functional dependency from holding. For example, you may be absolutely sure that a specific exact functional dependency should hold, yet only the approximate one holds, with a small threshold. Such a situation may happen due to typos in the data, and it may be necessary to locate and fix them. The Desbordante library can help you with this task, too. Consider the following code:
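The sketch below follows the same assumptions about the input file, column indices, and module path as the previous examples:

```python
import desbordante

algo = desbordante.afd_verification.algorithms.Default()
algo.load_data(table=('data.csv', ',', True))
algo.execute(lhs_indices=[0], rhs_indices=[1])

print('Number of violating clusters:', algo.get_num_error_clusters())
for highlight in algo.get_highlights():
    print('Rows in cluster:', highlight.cluster)
    print('Distinct RHS values:', highlight.num_distinct_rhs_values)
    print('Most frequent RHS value proportion:',
          highlight.most_frequent_rhs_value_proportion)
```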

Our goal in this program is simply to output all clusters present in the data and print their details. Let’s go through the code. The first lines are the same as in our previous examples, and the differences start with the get_num_error_clusters() method. It returns the number of clusters found while checking the specified functional dependency. A cluster (with respect to a fixed FD) is a collection of rows that share the same left-hand side but differ on the right-hand side. Next, the get_highlights() method returns a list of these clusters. Each of them has the following fields:

  • .cluster — a list of row ids that constitute the cluster
  • .num_distinct_rhs_values — the number of distinct RHS values in the cluster; may be useful for filtering out clusters
  • .most_frequent_rhs_value_proportion — the ratio of the most frequent RHS value to all others; may be useful for filtering out clusters

How to discover functional dependencies from data

Finally, with Desbordante it is possible to mine functional dependencies (both exact and approximate) directly from the data. This may come in handy when analyzing a new dataset or building complex pipelines. Discovering functional dependencies is also very simple. Consider the following code:
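A minimal discovery sketch; the input file is a placeholder, and the load_data() form mirrors the verification examples above:

```python
import desbordante

algo = desbordante.fd.algorithms.Default()
algo.load_data(table=('data.csv', ',', True))
algo.execute()

# Print every exact FD discovered in the dataset.
for fd in algo.get_fds():
    print(fd)
```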

The difference from our previous examples is twofold:

  • To discover exact dependencies, we create the desbordante.fd.algorithms.Default() object. In order to mine approximate functional dependencies, we have to use desbordante.afd.algorithms.Default(). Note that this operation is very resource-consuming in both cases: it may require several hours of run time and a lot of memory. Desbordante comes with several different algorithms for this task, each of which excels on different datasets (e.g. on “wide” or “long” tables), with the default one usually being the best. If you encounter problems with run time or memory consumption, try another algorithm that is better suited to your particular dataset. Another option is to switch to an approximate algorithm, which can “lose” some of the functional dependencies or return invalid ones; this type of algorithm is usually an order of magnitude faster than the classic ones. The final option is to reduce the size of the supplied dataset by discarding rows and columns.
  • The get_fds() method returns a list of functional dependencies that hold on the dataset. However, be careful: a functional dependency is a logical notion, so a discovered dependency may not hold in general (for every admissible table with this schema), even if it holds on the given dataset. For example, a discovered VIN -> EngineCapacity is a dependency that holds in general, while Name -> Surname obviously is not. Such situations occur when the dataset happens to contain no counterexample record. As the number of columns increases, you will need more and more data to prevent this.

The following example demonstrates the discovery of approximate dependencies.
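Here is a sketch of approximate discovery; the name of the error parameter passed to execute() is an assumption that may differ between versions:

```python
import desbordante

ERROR = 0.1  # illustrative threshold

algo = desbordante.afd.algorithms.Default()
algo.load_data(table=('data.csv', ',', True))
algo.execute(error=ERROR)  # pass the error threshold to the algorithm

for fd in algo.get_fds():
    print(fd)
```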

The only difference from the previous example is that we have to pass an error threshold. Note that all discovery algorithms in Desbordante (except for the approximate ones) return minimal and non-trivial dependencies.

Conclusion

In this tutorial we have described how to perform various functional dependency–related tasks in Python. We have demonstrated how to check a functional dependency, how to handle dirty data while checking, how to find the rows that violate a functional dependency, and how to discover functional dependencies from data. A comprehensive example, encompassing all these use cases, is located in the Desbordante repository.

In the future we plan to describe more modern data patterns, such as graph functional dependencies, differential dependencies, matching dependencies, and others.

References

[1] Avi Silberschatz, Henry F. Korth, S. Sudarshan. Database System Concepts, Seventh Edition (2019), McGraw-Hill, ISBN 9780078022159

[2] Sebastian Kruse and Felix Naumann. Efficient discovery of approximate dependencies (2018). Proc. VLDB Endow. 11, 7 (March 2018), 759–772.

Writes about Databases & ML; Assistant Professor at Saint-Petersburg University; Researcher with 50+ peer-reviewed publications; ACM Senior Member