Projectdetail

Disciplines

Computer Sciences (100%)

Keywords

Tree Edit Distance, Tree Similarity, Hierarchical Data, Similarity Search, Approximate Matching

Abstract

Final report

Data that are structured in hierarchies are called trees. An example of tree data is the organizational hierarchy of a company with departments, their managers, and subordinate employees. Other examples include enterprise assets, bills of material, natural or computer language syntax trees, directories in file systems, or molecular structures in bioinformatics. When tree data change, new tree versions are generated. An important task is to compute the difference (called diff) between two version of a tree. A good diff is as concise as possible. A common approach to compute diffs for trees is the so-called tree edit distance (TED), which computes the minimal diff. The computation of the minimal diff is challenging. Current solutions suffer from two problems that make them impractical in many scenarios: (1) Efficiency: The computation of the minimal diff is too slow. When the tree size doubles, the runtime increase by a factor of four or more. Current techniques are applicable to hierarchies with about ten thousand elements, but many interesting trees have millions of elements (e.g., bills of material). (2) Flexibility: TED computes the most concise diff and disregards any application semantics, which may lead to non-intuitive results. For example, TED may turn a department into an employee to keep the diff small. Various attempts have been made to overcome these problems, but all solutions suffer from at least one of the following issues: quality is traded off against speed (i.e., the diffs are too verbose), often without any guarantee about the deviation from the optimal result; the application semantics are hard-wired in the solution, thus restricting its scope to very specific applications; or only one of the issues (efficiency or flexibility) is addressed. The goal of this project is to solve the efficiency and flexibility problems of TED without trading off quality. We will allow users to specify constraints that fit their application semantics as part of the input (e.g., departments cannot be turned into employees), and the minimal diff that respects these constraints will be computed. This will make our solution applicable to a wide range of different domains. In order to gain efficiency, we will exploit two observations that have received little attention in the past. First, changes between tree versions typically affect only a small portion of the tree; by focusing the computational effort on the parts that actually changed we hope to achieve considerable runtime improvements. Second, user constraints limit the number of allowable diffs, and fewer options must be considered; we plan to leverage this fact as well to gain performance. If successful, our solution will be the first one to deal with user-defined constraints and compute the minimal diff efficiently.

The goal of this project was to develop efficient techniques for similarity queries over tree data. Data items are represented as trees when the individual data values show hierarchical dependencies, for example, the name of an author belongs to the authored book and can only be interpreted in this context. Tree data are common in various application scenarios and have led to the development of popular hierarchical data formats like XML or JSON. A query of great interest retrieves trees based on their pairwise similarities, expressed as the minimal difference between two trees: the so-called tree edit distance (TED). While queries based on TED provide users with intuitive results, they are computationally challenging, calling for efficient techniques to answer such queries. This project developed a number of novel approaches for common types of tree similarity queries that advance the state of the art and outperform previous approaches in terms of runtime or memory consumption. (1) TopDiff: Computing the minimal difference between a pair of trees is a useful but expensive operation. The TopDiff algorithm developed in this project leverages the similarity of the input trees to substantially speed up the distance computation. For similar trees, the runtime of TopDiff grows only linearly in the tree size in contrast to the cubic growth of the TED algorithm. (2) SlimCone: The top-k subtree similarity query identifies the k subtrees of a document tree that are closest to a given query tree. Previous solutions were based on an indexing technique that required a large amount of memory, which limited the applicability of this solution to small instances. The SlimCone index developed in this project is based on a new idea that avoids the memory issue, scales to large instances, and returns the top-k result faster than all previous techniques. (3) TJoin: The tree similarity join identifies all tree pairs in a collection of trees that are within a user-defined TED threshold. This query is typically used to identify duplicates in a dataset. Comparing all pairs of trees does not scale to large tree collections. The TJoin algorithm introduces an index and inexpensive yet effective filters that utilize the structural information of the trees to avoid most of the pairwise TED computations. For the tree pairs that pass all filters, TJoin uses a novel verification algorithm that leverages the user-defined threshold to avoids the expensive TED computation. All techniques developed in this project have been experimentally compared to previous solutions and often outperform their competitors by several orders of magnitude in runtime for large input instances.

Research institution(s)

Universität Salzburg - 100%

International project participants

Helene Touzet, Cité Scientifique - France
Alfons Kemper, TU München - Germany
Thomas Neumann, TU München - Germany

Research Output

133 Citations
20 Publications
4 Datasets & models
3 Scientific Awards
2 Fundings

Publications

Title	Scaling Density-Based Clustering to Large Collections of Sets
Type	Conference Proceeding Abstract
Author	Augsten N
Conference	International Conference on Extending Database Technology (EDBT)
Pages	109-120
Link	Publication

Title	Scalable Similarity Queries over Complex Data
Type	PhD Thesis
Author	Daniel Kocher
Link	Publication

Title	Similarity Queries over Hierarchical Data
Type	PhD Thesis
Author	Thomas Hütter

Title	A Scalable Index for Top-k Subtree Similarity Queries
Type	Conference Proceeding Abstract
Author	Augsten N
Conference	International Conference on Management of Data (SIGMOD)
Pages	1624-1641
Link	Publication

Title	Effective Filters and Linear Time Verification for Tree Similarity Joins
Type	Conference Proceeding Abstract
Author	Hütter T
Conference	International Conference on Data Engineering (ICDE)
Pages	854-865

Title	Effective Filters and Linear Time Verification for Tree Similarity Joins
Type	Conference Proceeding Abstract
Author	Hütter T
Conference	International Conference on Data Engineering (ICDE)
Pages	854-865
Link	Publication

Title	SWOOP: Top-k Similarity Joins over Set Streams
Type	Other
Author	Augsten N
Link	Publication

Title	A Scalable Index for Top-k Subtree Similarity Queries
DOI	10.1145/3299869.3319892
Type	Conference Proceeding Abstract
Author	Kocher D
Pages	1624-1641
Link	Publication

Title	JEDI: These aren't the JSON documents you're looking for...
Type	Conference Proceeding Abstract
Author	Augsten N
Conference	International Conference on Management of Data (SIGMOD)
Pages	1584-1597
Link	Publication

Title	DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
DOI	10.1186/s12859-020-3498-6
Type	Journal Article
Author	Hütter T
Journal	BMC Bioinformatics
Pages	151
Link	Publication

Title	A Link is not Enough – Reproducibility of Data
DOI	10.1007/s13222-019-00317-8
Type	Journal Article
Author	Pawlik M
Journal	Datenbank-Spektrum
Pages	107-115
Link	Publication

Title	JEDI: These aren't the JSON documents you're looking for... (Extended Version*)
DOI	10.48550/arxiv.2201.08099
Type	Preprint
Author	Hütter T

Title	JEDI: These aren't the JSON documents you're looking for...
DOI	10.1145/3514221.3517850
Type	Conference Proceeding Abstract
Author	Hütter T
Pages	1584-1597
Link	Publication

Title	Minimal Edit-Based Diffs for Large Trees
DOI	10.1145/3340531.3412026
Type	Conference Proceeding Abstract
Author	Pawlik M
Pages	1225-1234
Link	Publication

Title	Minimal Edit-Based Diffs for Large Trees
Type	Conference Proceeding Abstract
Author	Augsten N
Conference	International Conference on Information and Knowledge Management
Pages	1225-1234
Link	Publication

Title	A New Perspective on the Tree Edit Distance
DOI	10.1007/978-3-319-68474-1_11
Type	Book Chapter
Author	Schwarz S
Publisher	Springer Nature
Pages	156-170

Title	A roadmap towards declarative similarity queries
Type	Conference Proceeding Abstract
Author	Augsten N
Conference	21st International Conference on Extending Database Technology
Link	Publication

Title	A new perspective on the tree edit distance
Type	Conference Proceeding Abstract
Author	Pawlik M
Conference	International Conference on Similarity Search and Applications
Pages	156-170
Link	Publication

Title	Effective Filters and Linear Time Verification for Tree Similarity Joins Thomas
DOI	10.1109/icde.2019.00081
Type	Conference Proceeding Abstract
Author	Hütter T
Pages	854-865

Title	Set similarity joins on mapreduce
DOI	10.14778/3231751.3231760
Type	Journal Article
Author	Fier F
Journal	Proceedings of the VLDB Endowment
Pages	1110-1122

Datasets & models

Public Access
Title	Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
DOI	10.6084/m9.figshare.12262484
Type	Database/Collection of data
Link	Link

Public Access
Title	Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
DOI	10.6084/m9.figshare.12160239
Type	Database/Collection of data
Link	Link

Public Access
Title	DeSignate
Type	Computer model/algorithm
Link	Link

Public Access
Title	Tree Edit Distance Library
Type	Computer model/algorithm
Link	Link

Scientific Awards

Title	Marshall Plan Scholarship
Type	Research prize
Level of Recognition	Continental/International

Title	Young Investigators Award
Type	Research prize
Level of Recognition	Regional (any country)

Title	ACM SIGMOD 2019 Reproducibility Package
Type	Research prize
Level of Recognition	Continental/International

Fundings

Title	Austrian Marshall Plan Scholarship
Type	Fellowship
Start of Funding	2020
Funder	Austrian Marshall Plan Foundation

Title	DESQ - Declarative and Efficient Similarity Queries
Type	Research grant (including intramural programme)
Start of Funding	2021

Go to overview page Discover

Go to overview page Funding

Go to overview page About Us

Go to overview page News

FFTED - Fast and Flexible Tree Edit Distance

FFTED - Fast and Flexible Tree Edit Distance

Disciplines

Keywords

Research Output

Contact

General information

Go to overview page Discover

Go to overview page Funding

Go to overview page About Us

Go to overview page News

SOCIAL MEDIA

SCILOG

FFTED - Fast and Flexible Tree Edit Distance

FFTED - Fast and Flexible Tree Edit Distance

Disciplines

Keywords

Research Output