Docker container for Python dedupe
This will be the first post in a series I will be doing on an Entity Resolution solution I am putting together for Clarity Insights.
I will give a full write-up of Entity Resolution in another post. For now, I am going to describe the platform so we can easily get to the MySQL dedupe example!
This container is built around Dedupe, the open-source Python library from DataMade that is also available as an API at Dedupe.io.
Requirements - Cutting-edge Machine Learning Tools
The container has been built with capabilities beyond those minimum necessary for Dedupe to run. This is to allow a complete package that may be expanded.
- Linux base installation (minimal Debian for Docker)
- Python - Anaconda for Python 3 was used as the core install. This package offers a wide variety of Data Science-related packages, libraries, and tools.
- Dedupe - Core application for deduplication. Read the dedupe documentation for detailed information. This is a simple install with
pip install dedupe
- Libpostal - Address parser application. This is an additional feature that pairs perfectly with Dedupe by splitting single address fields into component parts. Open source, free, and accurate! Read the Libpostal write-up and view the Libpostal Python GitHub. This is not a Python package, but a stand-alone C-based application with Python bindings.
- Jupyter allows you to execute code through a web interface. This is enabled by default, installed through the Conda package manager.
- MySQL connectors - I used PyMySQL for the connection, which is a drop-in replacement for the outdated MySQLdb.
pip install PyMySQL
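To make the Libpostal pairing a bit more concrete, here is a minimal sketch of turning one address string into separate fields that a deduplication step could compare. It assumes the Libpostal Python bindings are importable as postal.parser inside the container; the helper names and the sample pairs are hypothetical and written by hand for illustration, not real parser output.

```python
from collections import defaultdict


def parse_with_libpostal(address):
    """Return libpostal's (value, label) pairs for one address string."""
    # Imported lazily so the helper below can be tried without libpostal installed.
    from postal.parser import parse_address
    return parse_address(address)


def components_to_fields(components):
    """Collapse (value, label) pairs into a {label: value} dict.

    Repeated labels are joined with a space so no token is lost.
    """
    grouped = defaultdict(list)
    for value, label in components:
        grouped[label].append(value)
    return {label: " ".join(values) for label, values in grouped.items()}


# Hand-written pairs in the shape parse_address returns (illustrative only).
sample = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("brooklyn", "city"),
    ("ny", "state"),
    ("11216", "postcode"),
]
fields = components_to_fields(sample)
print(fields)
```

Inside the container you would pass the result of parse_with_libpostal(...) to components_to_fields instead of the hand-written sample.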
How to use this container
If you haven't already, you will need to install Docker.
Grab the image from Docker Hub (https://hub.docker.com/r/mattguide/python-dedupe/) with
docker pull mattguide/python-dedupe
This is a large download (over 3.2GB uncompressed on my system), so grab a coffee while you wait!
Run the bare-bones container
Once you have the download, it is a quick path to running it. If you just
docker run this package, you will find it immediately exits without any message. To avoid this, you will want to run it in interactive and tty mode (-i and -t parameters) so you can log in or just keep it alive:
docker run -i -t --name="anaconda" mattguide/python-dedupe
Run with a Jupyter web front-end
The bigger bang for your buck will be to start Jupyter immediately so you can access via a web interface. This will make it easy to hit the ground running:
docker run -d --name="anaconda" --net="bridge" -p 8888:8888/tcp -v "/myvolume/":"/opt/notebooks/":rw -i -t mattguide/python-dedupe /bin/bash -c "/opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser --allow-root"
There are a lot of arguments there. You can read about them in the Docker run docs, but let's break them down:
- -d - to run in detached mode
- --name= - to give a friendly name to the container for easier reference
- --net= - the type of network to use; with bridge, the container sits on Docker's default bridge network and its published ports are reached through the IP of the Docker host (probably your machine)
- -p - port to publish, in external:internal order (host port first, container port second)
- -v - volume; use this to save internal container files somewhere permanent, outside the container on your system (this is important: if your container is removed, any data not in this location is destroyed!)
- -i -t - run in interactive mode (stdin) with a tty so you can attach a terminal to the container
- /bin/bash -c - this is a command internal to the container. It tells the container to run the bash command in quotes (starts Jupyter). Read the Jupyter docs for the commands relevant there.
Additional requirement - MariaDB (MySQL) container
Are we there yet?? The python-dedupe container will give you everything you need to run dedupe. However, to run the large-ish MySQL dedupe example you will need a MySQL instance. For this, I use MariaDB since it is the open source, community-driven drop-in replacement for MySQL.
Side note: Did you know MySQL is now owned by Oracle (it came along with the Sun Microsystems acquisition)?? I did not until a year ago! MySQL still has an open source edition, but MariaDB is a fork started by many of the original MySQL developers to keep the project free and community-driven.
Run the MariaDB container
Go to the Docker Store MariaDB page for detailed information.
docker pull mariadb
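Now, you will need only a few arguments to get it up and running. Similar to the dedupe container, you will want to map a volume so your database survives if the container is ever removed. The official image keeps its data in /var/lib/mysql, so that is the path to map (/my/datadir below is a placeholder for a directory on your system); /etc/mysql/conf.d only holds custom configuration files:
docker run --name="mariadb-dedupe" -v /my/custom:/etc/mysql/conf.d -v /my/datadir:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d -i -t mariadb
You will be able to connect to the mariadb instance through the standard MySQL port (3306). Note that the command above does not publish that port with -p, so add -p 3306:3306 if you need to reach it from outside Docker.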
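As a quick connectivity check from the python-dedupe container, something like the following PyMySQL sketch should work. It is a sketch under assumptions: the host default is a placeholder for wherever your MariaDB container is reachable, the password is the example value from the run command above, and both function names are hypothetical.

```python
def mariadb_settings(host="127.0.0.1", port=3306, user="root",
                     password="my-secret-pw", database=None):
    """Collect keyword arguments for pymysql.connect.

    The host is an assumption: replace it with the address where your
    MariaDB container can actually be reached.
    """
    settings = {
        "host": host,
        "port": port,          # standard MySQL/MariaDB port
        "user": user,
        "password": password,  # value of MYSQL_ROOT_PASSWORD
        "charset": "utf8mb4",
    }
    if database is not None:
        settings["database"] = database
    return settings


def server_version(settings):
    """Connect, ask the server for its version string, and disconnect."""
    # Imported here so the settings helper works without PyMySQL installed.
    import pymysql

    connection = pymysql.connect(**settings)
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT VERSION()")
            return cursor.fetchone()[0]
    finally:
        connection.close()


settings = mariadb_settings()
print({key: value for key, value in settings.items() if key != "password"})
```

Calling server_version(settings) from a Jupyter notebook is a simple way to confirm that the two containers can talk to each other before you load any data.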
Bonus: To work with MySQL, you can open a shell in the instance directly. Alternatively, you can install ANOTHER CONTAINER! This one, Adminer, works natively with MySQL (it used to be called phpMinAdmin, as an alternative to phpMyAdmin, do you remember that??):
docker pull adminer
docker run --name="adminer" -p 8080:8080 adminer
Once it completely initializes, you will be able to access the MySQL admin screen from
That should give you plenty to chew on for a while. If you want to get down to it, you should check out the MySQL dedupe example. The dedupe developers, DataMade, have provided demo data and code for you to use and enjoy!
Note: I had to modify several parts of both the init and example program code. For example, MariaDB uses strict settings, so some non-standard formats in the data will cause it to error out. I will leave that for you to investigate, but I will upload my code here at some point.
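If you hit those strict-mode errors, one possible workaround (not necessarily the change I made, and cleaning the data is the safer route) is to relax sql_mode for the current session only. The sketch below assumes a PyMySQL-style cursor with the default tuple rows; the function names are hypothetical.

```python
# Modes that make MariaDB reject invalid or out-of-range values.
# TRADITIONAL is a combination mode that implies the strict flags.
STRICT_FLAGS = {"STRICT_TRANS_TABLES", "STRICT_ALL_TABLES", "TRADITIONAL"}


def without_strict_flags(sql_mode):
    """Return the sql_mode string with the strict flags removed."""
    flags = [flag.strip() for flag in sql_mode.split(",") if flag.strip()]
    return ",".join(flag for flag in flags if flag.upper() not in STRICT_FLAGS)


def relax_session(cursor):
    """Drop strict flags for this session only; the server default is untouched."""
    cursor.execute("SELECT @@SESSION.sql_mode")
    current = cursor.fetchone()[0]
    cursor.execute("SET SESSION sql_mode = %s", (without_strict_flags(current),))


example_mode = "STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
relaxed = without_strict_flags(example_mode)
print(relaxed)
```

Keep in mind that with strict mode off, MariaDB will silently truncate or adjust bad values instead of raising an error, so check the loaded data afterwards.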
Featured image credits
Featured image is from Unsplash.com by frank mckenna