• Skip to content (access key 1)
  • Skip to search (access key 7)
FWF — Austrian Science Fund
  • Go to overview page Discover

    • Research Radar
      • Research Radar Archives 1974–1994
    • Discoveries
      • Emmanuelle Charpentier
      • Adrian Constantin
      • Monika Henzinger
      • Ferenc Krausz
      • Wolfgang Lutz
      • Walter Pohl
      • Christa Schleper
      • Elly Tanaka
      • Anton Zeilinger
    • Impact Stories
      • Verena Gassner
      • Wolfgang Lechner
      • Georg Winter
    • scilog Magazine
    • Austrian Science Awards
      • FWF Wittgenstein Awards
      • FWF ASTRA Awards
      • FWF START Awards
      • Award Ceremony
    • excellent=austria
      • Clusters of Excellence
      • Emerging Fields
    • In the Spotlight
      • 40 Years of Erwin Schrödinger Fellowships
      • Quantum Austria
    • Dialogs and Talks
      • think.beyond Summit
    • Knowledge Transfer Events
    • E-Book Library
  • Go to overview page Funding

    • Portfolio
      • excellent=austria
        • Clusters of Excellence
        • Emerging Fields
      • Projects
        • Principal Investigator Projects
        • Principal Investigator Projects International
        • Clinical Research
        • 1000 Ideas
        • Arts-Based Research
        • FWF Wittgenstein Award
      • Careers
        • ESPRIT
        • FWF ASTRA Awards
        • Erwin Schrödinger
        • doc.funds
        • doc.funds.connect
      • Collaborations
        • Specialized Research Groups
        • Special Research Areas
        • Research Groups
        • International – Multilateral Initiatives
        • #ConnectingMinds
      • Communication
        • Top Citizen Science
        • Science Communication
        • Book Publications
        • Digital Publications
        • Open-Access Block Grant
      • Subject-Specific Funding
        • AI Mission Austria
        • Belmont Forum
        • ERA-NET HERA
        • ERA-NET NORFACE
        • ERA-NET QuantERA
        • ERA-NET TRANSCAN
        • Alternative Methods to Animal Testing
        • European Partnership Biodiversa+
        • European Partnership BrainHealth
        • European Partnership ERA4Health
        • European Partnership ERDERA
        • European Partnership EUPAHW
        • European Partnership FutureFoodS
        • European Partnership OHAMR
        • European Partnership PerMed
        • European Partnership Water4All
        • Gottfried and Vera Weiss Award
        • netidee SCIENCE
        • Herzfelder Foundation Projects
        • Quantum Austria
        • Rückenwind Funding Bonus
        • WE&ME Award
        • Zero Emissions Award
      • International Collaborations
        • Belgium/Flanders
        • Germany
        • France
        • Italy/South Tyrol
        • Japan
        • Luxembourg
        • Poland
        • Switzerland
        • Slovenia
        • Taiwan
        • Tyrol–South Tyrol–Trentino
        • Czech Republic
        • Hungary
    • Step by Step
      • Find Funding
      • Submitting Your Application
      • International Peer Review
      • Funding Decisions
      • Carrying out Your Project
      • Closing Your Project
      • Further Information
        • Integrity and Ethics
        • Inclusion
        • Applying from Abroad
        • Personnel Costs
        • PROFI
        • Final Project Reports
        • Final Project Report Survey
    • FAQ
      • Project Phase PROFI
      • Project Phase Ad Personam
      • Expiring Programs
        • Elise Richter and Elise Richter PEEK
        • FWF START Awards
  • Go to overview page About Us

    • Mission Statement
    • FWF Video
    • Values
    • Facts and Figures
    • Annual Report
    • What We Do
      • Research Funding
        • Matching Funds Initiative
      • International Collaborations
      • Studies and Publications
      • Equal Opportunities and Diversity
        • Objectives and Principles
        • Measures
        • Creating Awareness of Bias in the Review Process
        • Terms and Definitions
        • Your Career in Cutting-Edge Research
      • Open Science
        • Open-Access Policy
          • Open-Access Policy for Peer-Reviewed Publications
          • Open-Access Policy for Peer-Reviewed Book Publications
          • Open-Access Policy for Research Data
        • Research Data Management
        • Citizen Science
        • Open Science Infrastructures
        • Open Science Funding
      • Evaluations and Quality Assurance
      • Academic Integrity
      • Science Communication
      • Philanthropy
      • Sustainability
    • History
    • Legal Basis
    • Organization
      • Executive Bodies
        • Executive Board
        • Supervisory Board
        • Assembly of Delegates
        • Scientific Board
        • Juries
      • FWF Office
    • Jobs at FWF
  • Go to overview page News

    • News
    • Press
      • Logos
    • Calendar
      • Post an Event
      • FWF Informational Events
    • Job Openings
      • Enter Job Opening
    • Newsletter
  • Discovering
    what
    matters.

    FWF-Newsletter Press-Newsletter Calendar-Newsletter Job-Newsletter scilog-Newsletter

    SOCIAL MEDIA

    • LinkedIn, external URL, opens in a new window
    • , external URL, opens in a new window
    • Facebook, external URL, opens in a new window
    • Instagram, external URL, opens in a new window
    • YouTube, external URL, opens in a new window

    SCILOG

    • Scilog — The science magazine of the Austrian Science Fund (FWF)
  • elane login, external URL, opens in a new window
  • Scilog external URL, opens in a new window
  • de Wechsle zu Deutsch

  

FFTED - Fast and Flexible Tree Edit Distance

FFTED - Fast and Flexible Tree Edit Distance

Nikolaus Augsten (ORCID: 0000-0002-3036-6201)
  • Grant DOI 10.55776/P29859
  • Funding program Principal Investigator Projects
  • Status ended
  • Start March 1, 2017
  • End August 31, 2022
  • Funding amount € 388,836

Disciplines

Computer Sciences (100%)

Keywords

    Tree Edit Distance, Tree Similarity, Hierarchical Data, Similarity Search, Approximate Matching

Abstract Final report

Data that are structured in hierarchies are called trees. An example of tree data is the organizational hierarchy of a company with departments, their managers, and subordinate employees. Other examples include enterprise assets, bills of material, natural or computer language syntax trees, directories in file systems, or molecular structures in bioinformatics. When tree data change, new tree versions are generated. An important task is to compute the difference (called diff) between two version of a tree. A good diff is as concise as possible. A common approach to compute diffs for trees is the so-called tree edit distance (TED), which computes the minimal diff. The computation of the minimal diff is challenging. Current solutions suffer from two problems that make them impractical in many scenarios: (1) Efficiency: The computation of the minimal diff is too slow. When the tree size doubles, the runtime increase by a factor of four or more. Current techniques are applicable to hierarchies with about ten thousand elements, but many interesting trees have millions of elements (e.g., bills of material). (2) Flexibility: TED computes the most concise diff and disregards any application semantics, which may lead to non-intuitive results. For example, TED may turn a department into an employee to keep the diff small. Various attempts have been made to overcome these problems, but all solutions suffer from at least one of the following issues: quality is traded off against speed (i.e., the diffs are too verbose), often without any guarantee about the deviation from the optimal result; the application semantics are hard-wired in the solution, thus restricting its scope to very specific applications; or only one of the issues (efficiency or flexibility) is addressed. The goal of this project is to solve the efficiency and flexibility problems of TED without trading off quality. We will allow users to specify constraints that fit their application semantics as part of the input (e.g., departments cannot be turned into employees), and the minimal diff that respects these constraints will be computed. This will make our solution applicable to a wide range of different domains. In order to gain efficiency, we will exploit two observations that have received little attention in the past. First, changes between tree versions typically affect only a small portion of the tree; by focusing the computational effort on the parts that actually changed we hope to achieve considerable runtime improvements. Second, user constraints limit the number of allowable diffs, and fewer options must be considered; we plan to leverage this fact as well to gain performance. If successful, our solution will be the first one to deal with user-defined constraints and compute the minimal diff efficiently.

The goal of this project was to develop efficient techniques for similarity queries over tree data. Data items are represented as trees when the individual data values show hierarchical dependencies, for example, the name of an author belongs to the authored book and can only be interpreted in this context. Tree data are common in various application scenarios and have led to the development of popular hierarchical data formats like XML or JSON. A query of great interest retrieves trees based on their pairwise similarities, expressed as the minimal difference between two trees: the so-called tree edit distance (TED). While queries based on TED provide users with intuitive results, they are computationally challenging, calling for efficient techniques to answer such queries. This project developed a number of novel approaches for common types of tree similarity queries that advance the state of the art and outperform previous approaches in terms of runtime or memory consumption. (1) TopDiff: Computing the minimal difference between a pair of trees is a useful but expensive operation. The TopDiff algorithm developed in this project leverages the similarity of the input trees to substantially speed up the distance computation. For similar trees, the runtime of TopDiff grows only linearly in the tree size in contrast to the cubic growth of the TED algorithm. (2) SlimCone: The top-k subtree similarity query identifies the k subtrees of a document tree that are closest to a given query tree. Previous solutions were based on an indexing technique that required a large amount of memory, which limited the applicability of this solution to small instances. The SlimCone index developed in this project is based on a new idea that avoids the memory issue, scales to large instances, and returns the top-k result faster than all previous techniques. (3) TJoin: The tree similarity join identifies all tree pairs in a collection of trees that are within a user-defined TED threshold. This query is typically used to identify duplicates in a dataset. Comparing all pairs of trees does not scale to large tree collections. The TJoin algorithm introduces an index and inexpensive yet effective filters that utilize the structural information of the trees to avoid most of the pairwise TED computations. For the tree pairs that pass all filters, TJoin uses a novel verification algorithm that leverages the user-defined threshold to avoids the expensive TED computation. All techniques developed in this project have been experimentally compared to previous solutions and often outperform their competitors by several orders of magnitude in runtime for large input instances.

Research institution(s)
  • Universität Salzburg - 100%
International project participants
  • Helene Touzet, Cité Scientifique - France
  • Alfons Kemper, TU München - Germany
  • Thomas Neumann, TU München - Germany

Research Output

  • 133 Citations
  • 20 Publications
  • 4 Datasets & models
  • 3 Scientific Awards
  • 2 Fundings
Publications
  • 2021
    Title Scaling Density-Based Clustering to Large Collections of Sets
    Type Conference Proceeding Abstract
    Author Augsten N
    Conference International Conference on Extending Database Technology (EDBT)
    Pages 109-120
    Link Publication
  • 2021
    Title Scalable Similarity Queries over Complex Data
    Type PhD Thesis
    Author Daniel Kocher
    Link Publication
  • 2021
    Title Similarity Queries over Hierarchical Data
    Type PhD Thesis
    Author Thomas Hütter
  • 2019
    Title A Scalable Index for Top-k Subtree Similarity Queries
    Type Conference Proceeding Abstract
    Author Augsten N
    Conference International Conference on Management of Data (SIGMOD)
    Pages 1624-1641
    Link Publication
  • 2019
    Title Effective Filters and Linear Time Verification for Tree Similarity Joins
    Type Conference Proceeding Abstract
    Author Hütter T
    Conference International Conference on Data Engineering (ICDE)
    Pages 854-865
  • 2019
    Title Effective Filters and Linear Time Verification for Tree Similarity Joins
    Type Conference Proceeding Abstract
    Author Hütter T
    Conference International Conference on Data Engineering (ICDE)
    Pages 854-865
    Link Publication
  • 2019
    Title SWOOP: Top-k Similarity Joins over Set Streams
    Type Other
    Author Augsten N
    Link Publication
  • 2019
    Title A Scalable Index for Top-k Subtree Similarity Queries
    DOI 10.1145/3299869.3319892
    Type Conference Proceeding Abstract
    Author Kocher D
    Pages 1624-1641
    Link Publication
  • 2022
    Title JEDI: These aren't the JSON documents you're looking for...
    Type Conference Proceeding Abstract
    Author Augsten N
    Conference International Conference on Management of Data (SIGMOD)
    Pages 1584-1597
    Link Publication
  • 2020
    Title DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
    DOI 10.1186/s12859-020-3498-6
    Type Journal Article
    Author Hütter T
    Journal BMC Bioinformatics
    Pages 151
    Link Publication
  • 2019
    Title A Link is not Enough – Reproducibility of Data
    DOI 10.1007/s13222-019-00317-8
    Type Journal Article
    Author Pawlik M
    Journal Datenbank-Spektrum
    Pages 107-115
    Link Publication
  • 2022
    Title JEDI: These aren't the JSON documents you're looking for... (Extended Version*)
    DOI 10.48550/arxiv.2201.08099
    Type Preprint
    Author Hütter T
  • 2022
    Title JEDI: These aren't the JSON documents you're looking for...
    DOI 10.1145/3514221.3517850
    Type Conference Proceeding Abstract
    Author Hütter T
    Pages 1584-1597
    Link Publication
  • 2020
    Title Minimal Edit-Based Diffs for Large Trees
    DOI 10.1145/3340531.3412026
    Type Conference Proceeding Abstract
    Author Pawlik M
    Pages 1225-1234
    Link Publication
  • 2020
    Title Minimal Edit-Based Diffs for Large Trees
    Type Conference Proceeding Abstract
    Author Augsten N
    Conference International Conference on Information and Knowledge Management
    Pages 1225-1234
    Link Publication
  • 2017
    Title A New Perspective on the Tree Edit Distance
    DOI 10.1007/978-3-319-68474-1_11
    Type Book Chapter
    Author Schwarz S
    Publisher Springer Nature
    Pages 156-170
  • 2018
    Title A roadmap towards declarative similarity queries
    Type Conference Proceeding Abstract
    Author Augsten N
    Conference 21st International Conference on Extending Database Technology
    Link Publication
  • 2017
    Title A new perspective on the tree edit distance
    Type Conference Proceeding Abstract
    Author Pawlik M
    Conference International Conference on Similarity Search and Applications
    Pages 156-170
    Link Publication
  • 2019
    Title Effective Filters and Linear Time Verification for Tree Similarity Joins Thomas
    DOI 10.1109/icde.2019.00081
    Type Conference Proceeding Abstract
    Author Hütter T
    Pages 854-865
  • 2018
    Title Set similarity joins on mapreduce
    DOI 10.14778/3231751.3231760
    Type Journal Article
    Author Fier F
    Journal Proceedings of the VLDB Endowment
    Pages 1110-1122
Datasets & models
  • 2020 Link
    Title Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
    DOI 10.6084/m9.figshare.12262484
    Type Database/Collection of data
    Public Access
    Link Link
  • 2020 Link
    Title Additional file 1 of DeSignate: detecting signature characters in gene sequence alignments for taxon diagnoses
    DOI 10.6084/m9.figshare.12160239
    Type Database/Collection of data
    Public Access
    Link Link
  • 2020 Link
    Title DeSignate
    Type Computer model/algorithm
    Public Access
    Link Link
  • 2019 Link
    Title Tree Edit Distance Library
    Type Computer model/algorithm
    Public Access
    Link Link
Scientific Awards
  • 2020
    Title Marshall Plan Scholarship
    Type Research prize
    Level of Recognition Continental/International
  • 2019
    Title Young Investigators Award
    Type Research prize
    Level of Recognition Regional (any country)
  • 2019
    Title ACM SIGMOD 2019 Reproducibility Package
    Type Research prize
    Level of Recognition Continental/International
Fundings
  • 2020
    Title Austrian Marshall Plan Scholarship
    Type Fellowship
    Start of Funding 2020
    Funder Austrian Marshall Plan Foundation
  • 2021
    Title DESQ - Declarative and Efficient Similarity Queries
    Type Research grant (including intramural programme)
    Start of Funding 2021

Discovering
what
matters.

Newsletter

FWF-Newsletter Press-Newsletter Calendar-Newsletter Job-Newsletter scilog-Newsletter

Contact

Austrian Science Fund (FWF)
Georg-Coch-Platz 2
(Entrance Wiesingerstraße 4)
1010 Vienna

office(at)fwf.ac.at
+43 1 505 67 40

General information

  • Job Openings
  • Jobs at FWF
  • Press
  • Philanthropy
  • scilog
  • FWF Office
  • Social Media Directory
  • LinkedIn, external URL, opens in a new window
  • , external URL, opens in a new window
  • Facebook, external URL, opens in a new window
  • Instagram, external URL, opens in a new window
  • YouTube, external URL, opens in a new window
  • Cookies
  • Whistleblowing/Complaints Management
  • Accessibility Statement
  • Data Protection
  • Acknowledgements
  • IFG-Form
  • Social Media Directory
  • © Österreichischer Wissenschaftsfonds FWF
© Österreichischer Wissenschaftsfonds FWF