Postgres to Elasticsearch/OpenSearch sync
MIT License
PGSync is a middleware for syncing data from Postgres to Elasticsearch/OpenSearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch/OpenSearch.
Changes to nested entities are propagated to Elasticsearch/OpenSearch. PGSync's advanced query builder then generates optimized SQL queries on the fly based on your schema. PGSync's advisory model allows you to quickly move and transform large volumes of data quickly whilst maintaining relational integrity.
Simply describe your document structure or schema in JSON and PGSync will continuously capture changes in your data and load it into Elasticsearch/OpenSearch without writing any code. PGSync transforms your relational data into a structured document format.
It allows you to take advantage of the expressive power and scalability of Elasticsearch/OpenSearch directly from Postgres. You don't have to write complex queries and transformation pipelines. PGSync is lightweight, flexible and fast.
Elasticsearch/OpenSearch is more suited as as secondary denormalised search engine to accompany a more traditional normalized datastore. Moreover, you shouldn't store your primary data in Elasticsearch/OpenSearch.
So how do you then get your data into Elasticsearch/OpenSearch in the first place? Tools like Logstash and Kafka can aid this task but they still require a bit of engineering and development.
Extract Transform Load and Change data capture tools can be complex and require expensive engineering effort.
Other benefits of PGSync include:
At a high level, you have data in a Postgres database and you want to mirror it in Elasticsearch/OpenSearch. This means every change to your data (Insert, Update, Delete and Truncate statements) needs to be replicated to Elasticsearch/OpenSearch. At first, this seems easy and then it's not. Simply add some code to copy the data to Elasticsearch/OpenSearch after updating the database (or so called dual writes). Writing SQL queries spanning multiple tables and involving multiple relationships are hard to write. Detecting changes within a nested document can also be quite hard. Of course, if your data never changed, then you could just take a snapshot in time and load it into Elasticsearch/OpenSearch as a one-off operation.
PGSync is appropriate for you if:
PGSync is written in Python (supporting version 3.8 onwards) and the stack is composed of: Redis, Elasticsearch/OpenSearch, Postgres, and SQlAlchemy.
PGSync leverages the logical decoding feature of Postgres (introduced in PostgreSQL 9.4) to capture a continuous stream of change events. This feature needs to be enabled in your Postgres configuration file by setting in the postgresql.conf file:
wal_level = logical
You can select any pivot table to be the root of your document.
PGSync's query builder builds advanced queries dynamically against your schema.
PGSync operates in an event-driven model by creating triggers for tables in your database to handle notification events.
This is the only time PGSync will ever make any changes to your database.
NOTE: If you change the structure of your PGSync's schema config, you would need to rebuild your Elasticsearch/OpenSearch indices. There are plans to support zero-downtime migrations to streamline this process.
There are several ways of installing and trying PGSync
To startup all services with docker. Run:
$ docker-compose up
Show the content in Elasticsearch/OpenSearch
$ curl -X GET http://[Elasticsearch/OpenSearch host]:9201/reservations/_search?pretty=true
Setup
Ensure the database user is a superuser
Enable logical decoding. You would also need to set up at least two parameters at postgresql.conf
wal_level = logical
max_replication_slots = 1
To prevent your server logs from growing too large e.g when running on cloud infrastructure where there is a cost implication. You can optionally impose a ceiling on the replication slot size using max_slot_wal_keep_size
max_slot_wal_keep_size = 100GB
Installation
$ pip install pgsync
bootstrap --config schema.json
pgsync --config schema.json
pgsync --config schema.json -d
Key features of PGSync are:
Consider this example of a Book library database.
Book
isbn (PK) | title | description |
---|---|---|
9785811243570 | Charlie and the chocolate factory | Willy Wonkas famous chocolate factory is opening at last! |
9788374950978 | Kafka on the Shore | Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami. |
9781471331435 | 1984 | 1984 was George Orwells chilling prophecy about the dystopian future. |
Author
id (PK) | name |
---|---|
1 | Roald Dahl |
2 | Haruki Murakami |
3 | Philip Gabriel |
4 | George Orwell |
BookAuthor
id (PK) | book_isbn | author_id |
---|---|---|
1 | 9785811243570 | 1 |
2 | 9788374950978 | 2 |
3 | 9788374950978 | 3 |
4 | 9781471331435 | 4 |
With PGSync, we can simply define this JSON schema where the book table is the pivot. A pivot table indicates the root of your document.
{
"table": "book",
"columns": [
"isbn",
"title",
"description"
],
"children": [
{
"table": "author",
"columns": [
"name"
]
}
]
}
To get this document structure in Elasticsearch/OpenSearch
[
{
"isbn": "9785811243570",
"title": "Charlie and the chocolate factory",
"description": "Willy Wonkas famous chocolate factory is opening at last!",
"authors": ["Roald Dahl"]
},
{
"isbn": "9788374950978",
"title": "Kafka on the Shore",
"description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami",
"authors": ["Haruki Murakami", "Philip Gabriel"]
},
{
"isbn": "9781471331435",
"title": "1984",
"description": "1984 was George Orwells chilling prophecy about the dystopian future",
"authors": ["George Orwell"]
}
]
Behind the scenes, PGSync is generating advanced queries for you such as.
SELECT
JSON_BUILD_OBJECT(
'isbn', book_1.isbn,
'title', book_1.title,
'description', book_1.description,
'authors', anon_1.authors
) AS "JSON_BUILD_OBJECT_1",
book_1.id
FROM book AS book_1
LEFT OUTER JOIN
(SELECT
JSON_AGG(anon_2.anon) AS authors,
book_author_1.book_isbn AS book_isbn
FROM book_author AS book_author_1
LEFT OUTER JOIN
(SELECT
author_1.name AS anon,
author_1.id AS id
FROM author AS author_1) AS anon_2 ON anon_2.id = book_author_1.author_id
GROUP BY book_author_1.book_isbn) AS anon_1 ON anon_1.book_isbn = book_1.isbn
You can also configure PGSync to rename attributes via the schema config e.g
{
"isbn": "9781471331435",
"this_is_a_custom_title": "1984",
"desc": "1984 was George Orwells chilling prophecy about the dystopian future",
"contributors": ["George Orwell"]
}
PGSync addresses the following challenges:
Contributions are very welcome! Check out the Contribution Guidelines for instructions.
This project is licensed under the terms of the MIT license. Please see LICENSE for more details.
You should have received a copy of the MIT License along with PGSync. If not, see https://opensource.org/license/mit/.