Entity Resolution is a Python package that provides fast, extensible methods for merging and transitively linking records across disparate datasets using complex matching logic.
- Create proto-entities ("entlets") representing entities from singular data sources
- Define complex "Strategies" to determine rulesets by which records should merge
- Create an entity resolution pipeline
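To make "transitively link" concrete: if record A matches B and B matches C, all three belong to one entity even though A and C were never compared directly. The sketch below illustrates that idea with a small union-find structure in plain Python. It is not this package's implementation or API; the function name `link_transitively` and the record ids are hypothetical.

```python
def link_transitively(pairs):
    """Group record ids into entities given pairwise matches.

    Illustrative union-find sketch only; not the package's own code.
    """
    parent = {}

    def find(record):
        # Register unseen records as their own root, then walk to the root.
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    # Merge the two groups behind every matched pair.
    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Collect records under their final root.
    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return list(groups.values())


# Hypothetical ids: "dblp:1" and "acm:9" are never matched directly,
# but both match "acm:7", so they end up in the same group.
matches = [("dblp:1", "acm:7"), ("acm:7", "acm:9"), ("dblp:2", "acm:3")]
entities = link_transitively(matches)
```

Here `entities` contains two groups: one with three records and one with two.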
The University of Leipzig's DBLP-ACM dataset provides two files, both describing published papers, with similar columns:
- A unique id (scoped to the file)
- A title
- A list of authors
- A venue
- A year
The titles vary slightly between files, and different authors may be listed for a given paper. There's no common unique id to execute a JOIN on, and there may even be duplicates that are only slightly different.
Entity Resolution lets you specify a set of complex comparison metrics, called a "strategy", for any combination of features present in the data. For example, you may want to use the following rules:
- title using Levenshtein distance
- authors using Jaccard ratio of tokens
- year using exact match
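The three comparisons named above can be sketched in plain Python to show what each one measures. This is a minimal, standard-library illustration and not the package's Strategy API: the function names, the thresholds `max_title_distance` and `min_author_jaccard`, and the sample records are all assumptions made up for this example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum inserts, deletes, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard ratio of word tokens: size of intersection over size of union."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def records_match(left, right, max_title_distance=3, min_author_jaccard=0.5):
    """Combine the three rules; the thresholds are arbitrary illustrative values."""
    title_ok = levenshtein(left["title"].lower(), right["title"].lower()) <= max_title_distance
    authors_ok = jaccard_tokens(left["authors"], right["authors"]) >= min_author_jaccard
    year_ok = left["year"] == right["year"]
    return title_ok and authors_ok and year_ok


# Made-up records, not rows from the DBLP-ACM files.
paper_a = {"title": "Entity Resolution at Scale", "authors": "A. Smith, B. Jones", "year": 2003}
paper_b = {"title": "Entity resolution at scale.", "authors": "B. Jones, A. Smith", "year": 2003}
is_same_paper = records_match(paper_a, paper_b)
```

Requiring all three rules to pass is only one way to combine them; a real strategy could weight or mix the metrics differently.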
For more in-depth workflows and explanations of the methodology, see the notebooks folder.
Install the latest version of Entity Resolution:
$ pip install entity-resolution
Entity Resolution is inspired most notably by the following publications:
- Collective Entity Resolution in Relational Data
- Comparative Analysis of Approximate Blocking Techniques for Entity Resolution
Released under the standard MIT license (see LICENSE.txt):
Copyright (c) 2021 entity-resolution Developers
Carl Best
Jessica Moore