Worker

📁 See the Worker Folder →

The Worker must perform three steps:

  • Retrieving the data from the Product Hunt API, using a Python script;
  • Processing the data by converting it into SQL, using shell commands and a Python script;
  • Storing the SQL data inside the Postgres DB.

Let’s explore each of these steps.

Creating the DB

Even before retrieving the data, we first need to create the database that will store it. To do this, we build a Docker image from the official Postgres base image and set the environment variables needed to create the DB (database name and credentials). 📃 See Dockerfile
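
As a minimal sketch, this is what such a Dockerfile might look like. The official Postgres image reads these environment variables at first startup; the database name and credentials below are placeholders, not the project's actual values:

```dockerfile
# Build on the official Postgres base image.
FROM postgres:14

# POSTGRES_DB, POSTGRES_USER and POSTGRES_PASSWORD are read by the
# image's entrypoint to create the database on first startup.
# Placeholder values only; real credentials should come from CI variables.
ENV POSTGRES_DB=posts
ENV POSTGRES_USER=worker
ENV POSTGRES_PASSWORD=changeme
```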

Retrieving the data

To query the Product Hunt API and retrieve the data we are interested in, we must make a GET request with the Client Token set in the request header. Moreover, the request must be performed 25 times to retrieve 500 posts, since each request returns at most 20 posts (20 × 25 = 500).

A Python script is the natural fit for this task. 📃 See python script getTopPosts.py

The script outputs the retrieved data in JSON format.
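
A minimal sketch of such a script, assuming a REST endpoint and a Bearer-token header; the endpoint path and the pagination parameter are assumptions, and getTopPosts.py remains the authoritative version:

```python
import json
import os

import requests

# Sketch only: the endpoint path and the "days_ago" pagination
# parameter are assumptions; see getTopPosts.py for the real call.
API_URL = "https://api.producthunt.com/v1/posts"
HEADERS = {"Authorization": f"Bearer {os.environ['PH_CLIENT_TOKEN']}"}

posts = []
for i in range(25):  # 25 requests x 20 posts each = 500 posts
    resp = requests.get(API_URL, headers=HEADERS, params={"days_ago": i})
    resp.raise_for_status()
    posts.extend(resp.json()["posts"])

with open("posts.json", "w") as f:
    json.dump(posts, f)
```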

Processing the data

Now that the data has been retrieved, it must be processed so that it can be stored in a Postgres database.

The data is cleaned up using sed commands, which delete emojis, reformat field names, and add an id field. These shell commands are declared inside the GitLab CI configuration file.
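
For illustration only (the actual commands live in the GitLab CI configuration file), stripping non-ASCII bytes is one common sed technique for removing emojis, and field renames are plain substitutions; the field names below are made up:

```sh
# Remove all non-ASCII bytes, which strips emojis from the raw JSON.
LC_ALL=C sed -i 's/[^\x00-\x7F]//g' posts.json

# Rename an API field to the column name expected by the SQL schema
# ("votes_count" -> "votes" is a hypothetical example).
sed -i 's/"votes_count"/"votes"/g' posts.json
```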

To migrate the data from JSON to Postgres SQL format, we use the API of the well-known SQLizer converter, via another Python script. 📃 See python script convertJsonToSql.py
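
A rough sketch of that conversion flow follows; the endpoint paths and field names are assumptions recalled from SQLizer's public API documentation, so convertJsonToSql.py is the authoritative version:

```python
import time

import requests

SQLIZER = "https://sqlizer.io/api/files"  # assumed base URL

# 1. Declare a conversion job: JSON in, PostgreSQL out.
#    (Field names are assumptions from SQLizer's docs.)
job = requests.post(SQLIZER, data={
    "DatabaseType": "PostgreSQL",
    "FileType": "json",
    "FileName": "posts.json",
    "TableName": "posts",
}).json()

# 2. Upload the JSON data for that job.
with open("posts.json", "rb") as f:
    requests.post(f"{SQLIZER}/{job['ID']}/data", files={"file": f})

# 3. Poll until the conversion is complete, then download the SQL.
while requests.get(f"{SQLIZER}/{job['ID']}").json()["Status"] != "Complete":
    time.sleep(1)

result_url = requests.get(f"{SQLIZER}/{job['ID']}").json()["ResultUrl"]
with open("posts.sql", "w") as f:
    f.write(requests.get(result_url).text)
```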

Storing the data

The data is now stored in a file in Postgres SQL format; all that remains is to import it into the Postgres database.

There are two approaches:

  • The file can be imported via a psql command after the DB has started;
  • The file is imported directly into the DB when the Postgres Docker container is started.

The first approach is the simplest but the least sophisticated. It makes it possible to import the data into any Postgres DB built from the base Postgres Docker image, but if the container is destroyed and then rebuilt, the data is lost and must be reinserted.
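
For example, the import is a single psql invocation (host, user and database name are placeholders):

```sh
# Import the generated SQL file into the running Postgres instance.
psql -h localhost -U worker -d posts -f posts.sql
```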

The second approach is more complex but much more sophisticated. It requires modifying the Postgres Docker image to place the SQL file in the docker-entrypoint-initdb.d folder at the root of the container; each time the container is rebuilt, the data is then inserted at startup. The Dockerfile must therefore be edited to COPY the file in question. 📃 See Dockerfile
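
The official Postgres image executes any .sql file found in that folder during first-time initialization, so the Dockerfile addition can be as small as the sketch below (image tag and file name are placeholders):

```dockerfile
FROM postgres:14

# Executed automatically by the image's entrypoint when the
# database is initialized for the first time.
COPY posts.sql /docker-entrypoint-initdb.d/
```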

The first approach will be used for tests in the Dev environment (merge requests), while the second approach will be preferred when deploying the application to the Review and Production environments.