Scaling our analytics database

Business intelligence is at the core of any great company, and TransferWise is no exception.
When I started my job as a data engineer in July 2016, my initial task was to solve a long-running issue with the database used for analytics queries.

The Gordian knot of the analytics database

The original configuration was MySQL Community Edition, version 5.6, with an InnoDB buffer pool of 40 GB. The virtual machine had 70 GB of memory and 18 CPUs assigned. The total database size was about 600 GB.

The analysts ran their queries using SQL, Looker and Tableau. To get data in almost real time, our live database was replicated into a dedicated schema. To protect our customers' personal data, a separate schema with a set of views was used to obfuscate the personal information; the same schema was also used for pre-aggregating some heavy queries. Other schemas were copied from the microservice database on a regular basis.
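To give an idea of what such an obfuscation view looks like, here is a minimal sketch in MySQL syntax; the schema, table and column names are hypothetical, not our actual layout.

```sql
-- Minimal sketch of a PII-obfuscating view (hypothetical names).
-- PII columns are replaced by their SHA-256 hash, non-PII columns
-- are passed through unchanged.
CREATE VIEW obfuscated.users AS
SELECT
    id,
    SHA2(email, 256)     AS email,
    SHA2(full_name, 256) AS full_name,
    country,
    created_at
FROM live.users;
```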

The frog effect

If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn up the heat it will be slowly boiled to death.

The performance issues worsened slowly over time. One of the reasons was the constantly increasing size of the database, combined with the overhead of the personal data obfuscation.
When selecting from a view, if the dataset returned is large enough, the MySQL optimiser materialises the view on disk and executes the query. The temporary files are removed when the query ends.
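One way to see this happening on MySQL 5.6 is to inspect the query plan: a view that cannot be merged into the outer query shows up as a materialised derived table. The view name below is hypothetical.

```sql
-- If the optimiser cannot merge the view (e.g. it aggregates), MySQL 5.6
-- materialises it into a temporary table, which spills to disk once it
-- outgrows tmp_table_size / max_heap_table_size.
EXPLAIN
SELECT country, total_transfers
FROM obfuscated.transfer_totals   -- hypothetical pre-aggregating view
WHERE total_transfers > 100;
-- A plan row with select_type = DERIVED and a table named <derived2>
-- indicates the view was materialised rather than merged.
```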

As a result, the analytics tools were slow under normal load, and in busy periods the database became almost unusable. The analysts had to spend a lot of time tuning existing queries rather than writing new ones.

The general thinking was that MySQL was no longer a good fit. However, the new solution had to satisfy requirements that were quite difficult to achieve with a single product change.

  • The data for analytics should be available in almost real time from the live database
  • PII (personally identifiable information) should be obfuscated for general access
  • PII should be available in the clear for a restricted set of users
  • The system should be able to scale for several years
  • The system should offer modern SQL for better analytics queries

The eye of the storm

The analyst team shortlisted a few solutions covering the requirements: Google BigQuery, Amazon Redshift, Snowflake and PostgreSQL.

Google BigQuery did not have the flexibility required for the new analytics DB. Redshift had more capabilities but was years behind Snowflake and pure PostgreSQL in terms of modern SQL. So both were removed from the list.

Both PostgreSQL and Snowflake offered very good performance and modern SQL, but neither of them was able to replicate data from a MySQL database.

Snowflake

Snowflake is a cloud based data warehouse service. It's built on top of Amazon S3 and comes in different warehouse sizes. Its pricing system is very appealing, and the preliminary tests showed Snowflake outperforming PostgreSQL.

The replication between our systems and Snowflake would have happened using Fivetran, an impressive multi-technology data pipeline. Unfortunately there was just one little catch:
Fivetran doesn't have native support for obfuscation.

Customer data security is of the highest priority at TransferWise: if for any reason customer data needs to move outside our perimeter, it must always be obfuscated.

PostgreSQL

Foreseeing this issue, I decided to spend time building a proof of concept based on the replication tool pg_chameleon. The tool is written in Python and uses the python-mysql-replication library to read the MySQL replication protocol and replay the changes into a PostgreSQL database.

The initial tests on a reduced dataset were successful, and adding support for real-time obfuscation required minimal changes.

The initial idea was to use PostgreSQL to obfuscate the data before feeding it into Fivetran.

However, because PostgreSQL's performance was good, with margins for scaling as our data grows, we decided to use just PostgreSQL for our data analytics and keep our customers' data behind our perimeter.

A ninja elephant

PostgreSQL offers better performance, a stronger security model and improved resource usage.

The issues with view validity and speed are now just a bad memory.

Analysts can now use the complex analytics functions offered by PostgreSQL 9.5.
Large tables, previously unusable because of their size, are now partitioned with pg_pathman and their data is usable again.
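As a sketch of what this looks like (table, column and value names are hypothetical), pg_pathman splits a big table into monthly range partitions, and the new grouping features of 9.5 can then be used directly on top of it:

```sql
-- Hypothetical example: monthly range partitioning with pg_pathman.
CREATE EXTENSION IF NOT EXISTS pg_pathman;

-- Split analytics.transfers into monthly partitions starting from
-- 2015-01-01; the existing rows are moved into the new partitions.
SELECT create_range_partitions(
    'analytics.transfers'::regclass,
    'created_at',
    '2015-01-01'::date,
    '1 month'::interval
);

-- PostgreSQL 9.5 grouping sets over the partitioned table.
SELECT source_currency, target_currency, count(*) AS transfers
FROM analytics.transfers
WHERE created_at >= date '2016-01-01'
GROUP BY GROUPING SETS ((source_currency, target_currency), (source_currency), ());
```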

"Some code was optimised inside, but actually very little - maybe 10-20% was improved. We'll do more of that in the future, but not yet. The good thing is that the performance gains we have can mostly be attributed just to PG vs MySQL. So there's a lot of scope to improve further."
Jeff McClelland - Growth Analyst, data guru

Timing

Procedure                                          | MySQL                   | PgSQL     | PgSQL cached
Daily ETL script                                   | 20 hours                | 4 hours   | N/A
Select from small table with complex aggregations  | Killed after 20 minutes | 3 minutes | 1 minute
Large table scan with simple filters               | 6 minutes               | 2 minutes | 6 seconds

Resources

Resource       | MySQL  | PostgreSQL
Storage        | 940 GB | 670 GB
CPU            | 18     | 8
RAM            | 68 GB  | 48 GB
Shared Memory  | 40 GB  | 5 GB

Lessons learned

Never underestimate the resource consumption

During the development of the replication tool, the initialisation process required several improvements.

Resources are always finite, and the out-of-memory killer is always happy to remind us of this simple but hard-to-grasp concept. Some tables required a custom slice size because their row length triggered the OOM killer when the data was pulled out.

However, even after fixing the memory issues, the initial copy took 6 days.

Tuning the copy with unbuffered cursors and row number estimates improved the speed of the initial copy, which now completes in 30 hours, including the time required for the index build.

Strictness is an illusion. MySQL doubly so

MySQL's lack of strictness is not a mystery.

The replica stopped because of the funny way NOT NULL is managed by MySQL.

To prevent any further replica breakdowns, fields added as NOT NULL with ALTER TABLE after the initialisation are created in PostgreSQL as nullable fields.
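A minimal illustration of the difference, assuming a populated table with hypothetical names:

```sql
-- MySQL 5.6: adding a NOT NULL column with no default to a populated
-- table succeeds; existing rows are silently filled with the implicit
-- default ('' for strings, 0 for numbers).
ALTER TABLE users ADD COLUMN nickname VARCHAR(30) NOT NULL;

-- PostgreSQL: the same statement on a populated table fails with an
-- error along the lines of: column "nickname" contains null values,
-- unless a DEFAULT is supplied. Hence the nullable columns on the
-- replica side.
ALTER TABLE users ADD COLUMN nickname VARCHAR(30) NOT NULL;
```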

MySQL automatically truncates character strings at the varchar size. This is a problem when the field is obfuscated on PostgreSQL, because the hashed string might not fit into the corresponding varchar field. Therefore, all the character varying fields in the obfuscated schema are created as text.
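For example, assuming a SHA-256 based obfuscation (an assumption for illustration) and the pgcrypto extension on the PostgreSQL side, the hex digest is always 64 characters and would overflow something like a varchar(30) column:

```sql
-- PostgreSQL with pgcrypto: a SHA-256 hex digest is always 64 characters,
-- longer than many of the original varchar columns it replaces.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT length(encode(digest('person@example.com', 'sha256'), 'hex'));
-- returns 64, which would not fit into e.g. varchar(30);
-- hence text columns in the obfuscated schema.
```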

Idle in transaction can kill your database

Over time I saw the PostgreSQL tables used for storing MySQL's row images grow to an unacceptable size (tens of GB). This was caused by misbehaving sessions left idle in transaction.

A session left idle in transaction holds a database snapshot until the transaction is committed or rolled back. This is bad because normal vacuuming cannot reclaim dead rows that are still visible to that snapshot.

The quick fix was a cron job that terminates those sessions. The long-term fix was to understand why those sessions appeared and fix the code causing the issue.
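The cron job boils down to a query like the one below; the one-hour threshold is illustrative, not the exact value we use.

```sql
-- Terminate sessions that have been idle in transaction for more than
-- one hour (threshold and filtering are illustrative).
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '1 hour';
```

From PostgreSQL 9.6 onwards the idle_in_transaction_session_timeout setting can enforce this automatically, but on 9.5 a scheduled query does the job.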