ccbest/Entity-Resolution

Entity Resolution

What is it?

Entity Resolution is a Python package that provides fast, extensible methods for applying complex logic to merge and transitively link records across disparate datasets.

Main Features

  • Create proto-entities ("entlets") that represent entities from a single data source
  • Define complex "Strategies": rulesets that determine which records should merge
  • Create an entity resolution pipeline
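
To illustrate what "transitively link" means in practice, here is a minimal sketch that is independent of this package's API. It assumes pairwise matches have already been found and groups them with a union-find structure, so that if A matches B and B matches C, all three land in one entity. The function name `link_transitively` and the record ids are hypothetical, chosen only for illustration.

```python
def link_transitively(pairs):
    """Group record ids into entities given pairwise matches (union-find sketch)."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())


# Hypothetical ids: "dblp:1" never matched "acm:9" directly,
# but both matched "acm:7", so all three end up in one group.
matches = [("dblp:1", "acm:7"), ("acm:7", "acm:9"), ("dblp:2", "acm:3")]
entities = link_transitively(matches)
```

The package itself may resolve transitive links differently; this only shows the general idea.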

Simple Example

The University of Leipzig's DBLP-ACM dataset provides two files, both describing published papers, with similar columns:

  • A unique id (scoped to the file)
  • A title
  • A list of authors
  • A venue
  • A year

The titles vary slightly between files and different authors may be listed for a given paper. There's no common unique id to execute a JOIN on, and there may even be duplicates that are only slightly different.

Entity Resolution lets you specify a set of complex comparison metrics, called a "strategy", for any combination of features present in the data. For example, you may want to compare:

  • title using Levenshtein distance
  • authors using Jaccard ratio of tokens
  • year using exact match
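
The three rules above can be sketched in plain Python, without using this package's API. The function names, the thresholds (`max_title_distance`, `min_author_jaccard`), the choice to require all three rules to pass, and the sample records are all assumptions made for illustration; the package's own Strategy objects may combine metrics differently.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard ratio of lowercase word tokens: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def records_match(left, right, max_title_distance=3, min_author_jaccard=0.5):
    """Hypothetical strategy: all three rules must pass (an assumption)."""
    return (
        levenshtein(left["title"].lower(), right["title"].lower()) <= max_title_distance
        and jaccard(left["authors"], right["authors"]) >= min_author_jaccard
        and left["year"] == right["year"]
    )


# Made-up records, not taken from the DBLP-ACM files.
dblp = {"title": "Data Integration", "authors": "Ann Lee, Bob Ray", "year": 2001}
acm = {"title": "Data Integraton", "authors": "Ann Lee", "year": 2001}
is_same_paper = records_match(dblp, acm)
```

Thresholds like these usually need tuning per dataset: a looser title distance catches more typos but also merges more distinct papers.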

In-Depth Examples

For more in-depth workflows and explanations of the methodology, see the notebooks folder.

Install

Install the latest version of Entity Resolution:

$ pip install entity-resolution

Research

Entity Resolution is (most notably) inspired by the below publications:

License

Released under the standard MIT license (see LICENSE.txt):

Copyright (c) 2021 entity-resolution Developers
Carl Best
Jessica Moore

About

Small Batch ER