In my previous post I wrote on how we scaled the analytics database using PostgreSQL.
One of the key requirements was the possibility to replicate and obfuscate the data in real time from our main MySQL database.
The tool which is managing this particular task is pg_ninja, now available under the terms of the Apache 2.0 license.
In our previous implementation we operated a MySQL replica for our analytics database, with some views exposing obfuscated data to the analysts.
With PostgreSQL I needed a dedicated tool for the job, as that functionality was not present.
Luckily at that time I was playing with a MySQL-to-PostgreSQL migrator tool called pg_chameleon, which was just a proof of concept.
In my spare time I converted this little experiment into a tool capable of replicating data from MySQL to PostgreSQL.
I forked pg_chameleon, as a starting point for building pg_ninja for Transferwise.
After all, pg_chameleon was not fit for production and the obfuscation requirements were out of pg_chameleon's scope.
During the development I struggled a lot to find a robust way to convert the DDL from the MySQL dialect to PostgreSQL's. I decided first to not reinvent the wheel. However, after several failures in using the python sqlparse library, I decided to write my own implementation using regular expressions, taking the occasion to learn how to use them.
Learning regex required time, and so I wrote pg_ninja's DDL support in a very tricky way. This worked quite well for first few months.
However, this method showed some serious limitations and I decided to use pg_chameleon's regular expression tokenisation, which is now very efficient.
The project status
pg_ninja is compatible with Cpython 2.7.
At moment is not present a daemonisation process but is very simple to automate the start using a cron job.
Multiple replica sources are possible with separate configuration files.
The obfuscation strategy is managed in a separate yaml file with four different strategies.
- normal: the value si converted in a sha256 hash. It's possible to specify the start and the length of a not ashed value.
- date: the date value is converted to the 1st of January preserving the year only.
- setnull: the value is set to null
- numeric: the value is set to 0
pg_ninja will be distributed soon via pypi.
The code change to make pg_ninja compatible with python 3.3+ is not complex and will happen soon in the future.
In the future I'll write the docstrings on the libraries in order to get a proper API documentation.
In my free time I'm currently exploring a complete rewrite of pg_chameleon with full daemonisation support, parallel copy when initialising and separate processes for read and replay.
When this project will be stable enough I'll build the same for pg_ninja.
Contributing to the project
PR on github are absolutely welcome, you are encouraged to fork the repository https://github.com/transferwise/pg_ninja