FFTED - Fast and Flexible Tree Edit Distance
FFTED - Fast and Flexible Tree Edit Distance
Disciplines
Computer Sciences (100%)
Keywords
-
Tree Edit Distance,
Tree Similarity,
Hierarchical Data,
Similarity Search,
Approximate Matching
Data that are structured in hierarchies are called trees. An example of tree data is the organizational hierarchy of a company with departments, their managers, and subordinate employees. Other examples include enterprise assets, bills of material, natural or computer language syntax trees, directories in file systems, or molecular structures in bioinformatics. When tree data change, new tree versions are generated. An important task is to compute the difference (called diff) between two version of a tree. A good diff is as concise as possible. A common approach to compute diffs for trees is the so-called tree edit distance (TED), which computes the minimal diff. The computation of the minimal diff is challenging. Current solutions suffer from two problems that make them impractical in many scenarios: (1) Efficiency: The computation of the minimal diff is too slow. When the tree size doubles, the runtime increase by a factor of four or more. Current techniques are applicable to hierarchies with about ten thousand elements, but many interesting trees have millions of elements (e.g., bills of material). (2) Flexibility: TED computes the most concise diff and disregards any application semantics, which may lead to non-intuitive results. For example, TED may turn a department into an employee to keep the diff small. Various attempts have been made to overcome these problems, but all solutions suffer from at least one of the following issues: quality is traded off against speed (i.e., the diffs are too verbose), often without any guarantee about the deviation from the optimal result; the application semantics are hard-wired in the solution, thus restricting its scope to very specific applications; or only one of the issues (efficiency or flexibility) is addressed. The goal of this project is to solve the efficiency and flexibility problems of TED without trading off quality. We will allow users to specify constraints that fit their application semantics as part of the input (e.g., departments cannot be turned into employees), and the minimal diff that respects these constraints will be computed. This will make our solution applicable to a wide range of different domains. In order to gain efficiency, we will exploit two observations that have received little attention in the past. First, changes between tree versions typically affect only a small portion of the tree; by focusing the computational effort on the parts that actually changed we hope to achieve considerable runtime improvements. Second, user constraints limit the number of allowable diffs, and fewer options must be considered; we plan to leverage this fact as well to gain performance. If successful, our solution will be the first one to deal with user-defined constraints and compute the minimal diff efficiently.
The goal of this project was to develop efficient techniques for similarity queries over tree data. Data items are represented as trees when the individual data values show hierarchical dependencies, for example, the name of an author belongs to the authored book and can only be interpreted in this context. Tree data are common in various application scenarios and have led to the development of popular hierarchical data formats like XML or JSON. A query of great interest retrieves trees based on their pairwise similarities, expressed as the minimal difference between two trees: the so-called tree edit distance (TED). While queries based on TED provide users with intuitive results, they are computationally challenging, calling for efficient techniques to answer such queries. This project developed a number of novel approaches for common types of tree similarity queries that advance the state of the art and outperform previous approaches in terms of runtime or memory consumption. (1) TopDiff: Computing the minimal difference between a pair of trees is a useful but expensive operation. The TopDiff algorithm developed in this project leverages the similarity of the input trees to substantially speed up the distance computation. For similar trees, the runtime of TopDiff grows only linearly in the tree size in contrast to the cubic growth of the TED algorithm. (2) SlimCone: The top-k subtree similarity query identifies the k subtrees of a document tree that are closest to a given query tree. Previous solutions were based on an indexing technique that required a large amount of memory, which limited the applicability of this solution to small instances. The SlimCone index developed in this project is based on a new idea that avoids the memory issue, scales to large instances, and returns the top-k result faster than all previous techniques. (3) TJoin: The tree similarity join identifies all tree pairs in a collection of trees that are within a user-defined TED threshold. This query is typically used to identify duplicates in a dataset. Comparing all pairs of trees does not scale to large tree collections. The TJoin algorithm introduces an index and inexpensive yet effective filters that utilize the structural information of the trees to avoid most of the pairwise TED computations. For the tree pairs that pass all filters, TJoin uses a novel verification algorithm that leverages the user-defined threshold to avoids the expensive TED computation. All techniques developed in this project have been experimentally compared to previous solutions and often outperform their competitors by several orders of magnitude in runtime for large input instances.
- Universität Salzburg - 100%
- Helene Touzet, Cité Scientifique - France
- Alfons Kemper, TU München - Germany
- Thomas Neumann, TU München - Germany
Research Output
- 133 Citations
- 20 Publications
- 4 Datasets & models
- 3 Scientific Awards
- 2 Fundings
-
2021
Title Scaling Density-Based Clustering to Large Collections of Sets Type Conference Proceeding Abstract Author Augsten N Conference International Conference on Extending Database Technology (EDBT) Pages 109-120 Link Publication -
2021
Title Scalable Similarity Queries over Complex Data Type PhD Thesis Author Daniel Kocher Link Publication -
2021
Title Similarity Queries over Hierarchical Data Type PhD Thesis Author Thomas Hütter -
2019
Title A Scalable Index for Top-k Subtree Similarity Queries Type Conference Proceeding Abstract Author Augsten N Conference International Conference on Management of Data (SIGMOD) Pages 1624-1641 Link Publication -
2019
Title Effective Filters and Linear Time Verification for Tree Similarity Joins Type Conference Proceeding Abstract Author Hütter T Conference International Conference on Data Engineering (ICDE) Pages 854-865 -
2019
Title Effective Filters and Linear Time Verification for Tree Similarity Joins Type Conference Proceeding Abstract Author Hütter T Conference International Conference on Data Engineering (ICDE) Pages 854-865 Link Publication -
2019
Title SWOOP: Top-k Similarity Joins over Set Streams Type Other Author Augsten N Link Publication -
2019
Title A Scalable Index for Top-k Subtree Similarity Queries DOI 10.1145/3299869.3319892 Type Conference Proceeding Abstract Author Kocher D Pages 1624-1641 Link Publication -
2022
Title JEDI: These aren't the JSON documents you're looking for... Type Conference Proceeding Abstract Author Augsten N Conference International Conference on Management of Data (SIGMOD) Pages 1584-1597 Link Publication -
2020
Title DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses DOI 10.1186/s12859-020-3498-6 Type Journal Article Author Hütter T Journal BMC Bioinformatics Pages 151 Link Publication -
2019
Title A Link is not Enough – Reproducibility of Data DOI 10.1007/s13222-019-00317-8 Type Journal Article Author Pawlik M Journal Datenbank-Spektrum Pages 107-115 Link Publication -
2022
Title JEDI: These aren't the JSON documents you're looking for... (Extended Version*) DOI 10.48550/arxiv.2201.08099 Type Preprint Author Hütter T -
2022
Title JEDI: These aren't the JSON documents you're looking for... DOI 10.1145/3514221.3517850 Type Conference Proceeding Abstract Author Hütter T Pages 1584-1597 Link Publication -
2020
Title Minimal Edit-Based Diffs for Large Trees DOI 10.1145/3340531.3412026 Type Conference Proceeding Abstract Author Pawlik M Pages 1225-1234 Link Publication -
2020
Title Minimal Edit-Based Diffs for Large Trees Type Conference Proceeding Abstract Author Augsten N Conference International Conference on Information and Knowledge Management Pages 1225-1234 Link Publication -
2017
Title A New Perspective on the Tree Edit Distance DOI 10.1007/978-3-319-68474-1_11 Type Book Chapter Author Schwarz S Publisher Springer Nature Pages 156-170 -
2018
Title A roadmap towards declarative similarity queries Type Conference Proceeding Abstract Author Augsten N Conference 21st International Conference on Extending Database Technology Link Publication -
2017
Title A new perspective on the tree edit distance Type Conference Proceeding Abstract Author Pawlik M Conference International Conference on Similarity Search and Applications Pages 156-170 Link Publication -
2019
Title Effective Filters and Linear Time Verification for Tree Similarity Joins Thomas DOI 10.1109/icde.2019.00081 Type Conference Proceeding Abstract Author Hütter T Pages 854-865 -
2018
Title Set similarity joins on mapreduce DOI 10.14778/3231751.3231760 Type Journal Article Author Fier F Journal Proceedings of the VLDB Endowment Pages 1110-1122
-
2020
Link
Title Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses DOI 10.6084/m9.figshare.12262484 Type Database/Collection of data Public Access Link Link -
2020
Link
Title Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses DOI 10.6084/m9.figshare.12160239 Type Database/Collection of data Public Access Link Link -
2020
Link
Title DeSignate Type Computer model/algorithm Public Access Link Link -
2019
Link
Title Tree Edit Distance Library Type Computer model/algorithm Public Access Link Link
-
2020
Title Marshall Plan Scholarship Type Research prize Level of Recognition Continental/International -
2019
Title Young Investigators Award Type Research prize Level of Recognition Regional (any country) -
2019
Title ACM SIGMOD 2019 Reproducibility Package Type Research prize Level of Recognition Continental/International
-
2020
Title Austrian Marshall Plan Scholarship Type Fellowship Start of Funding 2020 Funder Austrian Marshall Plan Foundation -
2021
Title DESQ - Declarative and Efficient Similarity Queries Type Research grant (including intramural programme) Start of Funding 2021