Intro

Within the scope of my current contract, I am part of a team that has to develop and deliver a sales forecasting system for an airline company.

The project has many challenges, and one of them is working with a large amount of data. It is not quite Big Data yet, but it is close to it.

This publication describes the challenges discovered so far at the database level and the ways to overcome them.

Speed up development

During development you often have to prepare SQL queries to extract the necessary data from a database. For sophisticated queries this requires several run-and-check loops.

Such queries should run relatively fast, but when a table with over a billion records has to be joined with another table of nearly one hundred million records, with extra operations on top, a single run-and-check loop takes more than an hour.
The rest of the code built on top of these queries might take more than a business day to execute. Every mistake made while preparing the right SQL query delays feature completion by hours or even days! We had to look for options to mitigate this issue.

The solution I have found is to prepare a smaller subset of the dataset to test and tune your queries. It gives you fast feedback, and after that you can switch to the real dataset.

For my specific use case I could build a dev table in either of these two ways:

                CREATE TABLE dev_inventory AS
                SELECT * FROM inventory2223_sab
                WHERE fltdatetime >= NOW() - INTERVAL '1 month';
            


                CREATE TABLE dev_inventory_sample AS
                SELECT * FROM inventory2223_sab
                TABLESAMPLE SYSTEM (10); -- 10% sample
            


The case shown in the last code snippet should give you a roughly random sample of about the requested percentage of your original data. Note that SYSTEM sampling picks whole disk pages, so it is fast but slightly less uniform than row-level sampling.
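
If a more uniform sample matters more than speed, a row-level alternative might look like the sketch below (the table name dev_inventory_sample_b is just an illustrative choice):

                CREATE TABLE dev_inventory_sample_b AS
                SELECT * FROM inventory2223_sab
                TABLESAMPLE BERNOULLI (10); -- ~10% of rows, sampled row by row
            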

Optimize queries

When I had finished preparing a reliable dev environment and a draft of my SQL query, it was time to optimize it further.

PostgreSQL has a built-in tool, EXPLAIN ANALYZE, which helps you figure out the performance of your queries.

                EXPLAIN ANALYZE SELECT * 
                FROM inventory2223_sab 
                WHERE fltdatetime >= NOW() - INTERVAL '1 year';
            


It shows a per-stage time breakdown, which helps you find bottlenecks and see what impact your recent tuning had on performance.
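
If I/O is a suspected bottleneck, the BUFFERS option can be added to report block-level statistics per stage as well; a minimal sketch:

                EXPLAIN (ANALYZE, BUFFERS) SELECT * 
                FROM inventory2223_sab 
                WHERE fltdatetime >= NOW() - INTERVAL '1 year';
            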

The second step was preparing indexes on the most frequently used columns.

Obviously you have to add indexes on the columns that participate in JOINs, but indexes have their own drawbacks -- while they speed up searches over your dataset, they also slow down inserts and updates. With a data stream as intensive as a global airline booking system, adding indexes could bring more harm than good. In any case, indexes on a few of the most important fields have been added and are waiting to be checked against the real message queue to verify their impact on performance.

                CREATE INDEX CONCURRENTLY idx_inventory_fltdatetime 
                ON inventory2223_sab (fltdatetime);
            

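Following the same logic, a composite index on the tariffs join keys might look like this (the index name is just an illustrative choice):

                CREATE INDEX CONCURRENTLY idx_tariffs_fltnum_fltdat 
                ON tariffs2223 (fltnum, fltdat); -- the columns used to join with the inventory table
            
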

Another interesting option is a partial index based on a subset of the data range. Only a limited period of time is used to build the forecast, so this direction might be worth investigating.

                -- The predicate of a partial index must be immutable, so the cutoff
                -- has to be a constant (shown here as an example date), not NOW().
                CREATE INDEX CONCURRENTLY idx_inventory_recent_fltdatetime 
                ON inventory2223_sab (fltdatetime)
                WHERE fltdatetime >= DATE '2022-06-01';
            


The next step would be to use parallelization at the database level. Note that the CONCURRENTLY keyword used above does not itself parallelize the build; it lets the index be created without blocking writes to the table, which matters on a live system. The build itself can be sped up with PostgreSQL's parallel maintenance settings.
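
On a reasonably recent PostgreSQL (11+), the degree of parallelism for an index build is controlled by session-level settings such as the ones below; the values are illustrative and depend on the hardware:

                SET max_parallel_maintenance_workers = 4; -- workers allowed for CREATE INDEX
                SET maintenance_work_mem = '1GB';         -- memory budget for the build
            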

A further step would be partitioning the tables by date. Due to the business-specific nature of the SQL requests to the database, I expect a huge performance gain from this technique, but it has not been applied yet.

                -- The parent tables must first be declared as partitioned tables,
                -- e.g. PARTITION BY RANGE (fltdatetime). The upper bound is exclusive.
                CREATE TABLE inventory2022 
                PARTITION OF inventory2223_sab 
                FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
                
                CREATE TABLE tariffs2022 
                PARTITION OF tariffs2223 
                FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
            

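Since the existing tables are regular (non-partitioned) tables, they would first have to be replaced by partitioned parents and the data moved over; a rough sketch, where the _part suffix is just an illustrative name:

                CREATE TABLE inventory2223_sab_part (LIKE inventory2223_sab)
                PARTITION BY RANGE (fltdatetime);
                -- Existing rows would then be moved into the yearly partitions,
                -- e.g. with INSERT ... SELECT, once those partitions exist.
            
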

The next step would be to try materialized views or caching systems like Redis or Memcached to speed up the queries. It is an arguable solution, because new data arrives at a rate of millions of records per day, which forces constant refreshes and may make such caching useless; still, it might help for some of our selects, and we have to gather some data to check whether it gives us a performance gain or not.

                CREATE MATERIALIZED VIEW mv_inventory_tariffs AS
                SELECT i.*, t.*
                FROM inventory2223_sab i
                JOIN tariffs2223 t ON i.sfltno = t.fltnum AND i.fltdatetime = t.fltdat
                WHERE i.fltdatetime >= NOW() - INTERVAL '1 year';

                REFRESH MATERIALIZED VIEW mv_inventory_tariffs;
            
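
Because new data arrives constantly, refreshes would have to be non-blocking. Assuming (sfltno, fltdatetime) uniquely identifies a row in the view -- something that would have to be verified against the real data -- a sketch of a concurrent refresh setup (the index name is illustrative):

                -- REFRESH ... CONCURRENTLY requires a unique index on the view.
                CREATE UNIQUE INDEX idx_mv_inventory_tariffs_key 
                ON mv_inventory_tariffs (sfltno, fltdatetime);

                -- Readers can keep querying the view while it is being refreshed.
                REFRESH MATERIALIZED VIEW CONCURRENTLY mv_inventory_tariffs;
            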

Robust data migration on a remote server

Finally, I had to migrate the dataset from one database to another. The task takes ~30 hours, and we had to make sure it would not be interrupted.

When you disconnect from an SSH session, all processes associated with that session are terminated as well. To make sure a long-running process continues executing in the background, you have to take a few extra steps.

Tools like 'nohup', 'tmux' and 'systemd' services can help here; 'systemd' was the best fit for this task.

You have to create a systemd unit file under /etc/systemd/system which would look like this:

                [Unit]
                Description=My Long Running Python Script
                After=network.target

                [Service]
                Type=simple
                User=your-username
                WorkingDirectory=/home/your-username/path-to-your-script
                ExecStart=/usr/bin/python3 /home/your-username/path-to-your-script/your_script.py
                Restart=on-failure
                StandardOutput=append:/home/your-username/path-to-your-script/logs/output.log
                StandardError=append:/home/your-username/path-to-your-script/logs/error.log

                [Install]
                WantedBy=multi-user.target
            


The file is self-explanatory, but I want to draw your attention to several things.

The first one is the user you want this service to run as. Without the User directive the service runs as root, which might not be what you are looking for.
The second is the Restart property, which restarts the service on failure and can be convenient in some scenarios.
The third is the After dependency on system targets, which helps you order this work within a pipeline.

To enable it you have to execute these commands:

                $ sudo systemctl daemon-reload
                $ sudo systemctl enable my_python_script.service   # optional: start the service on OS launch
                $ sudo systemctl start my_python_script.service
            


To check the service status you can run this command:

                $ sudo systemctl status my_python_script.service
            

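To follow the migration while it runs, you can watch either the systemd journal for the unit or the log files configured in the unit above, for example:

                $ sudo journalctl -u my_python_script.service -f
                $ tail -f /home/your-username/path-to-your-script/logs/output.log
            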