Continuous integration (CI) is the practice of automating routine checks on a software project. For instance, after you push code changes to your project repository, CI can trigger build, test, and lint steps to verify the quality of your commit. This makes contributing to a project easier by removing the need to run these stages manually.
CI for a data project
As with software projects, a CI process is useful for data projects, especially now that data projects contain the code that builds the data pipeline and defines the model transformations.
When used in a data project, a CI process can run the model transformations, modify databases after a pull request is merged, and deploy the pipeline, among other actions.
The benefit of using PipeRider with a CI process
When integrated with a CI process, PipeRider can add a data profiling step after new models are transformed. This data profile can then be compared with the data profile of the production environment, showing how the data has changed between environments.
As part of the CI run, PipeRider can:
Provide full data profiling reports in the form of artifacts
Generate a data profile comparison summary that compares the production and modified environments
The comparison summary is generated in Markdown and is designed to be posted as a pull request comment for easy review. The summary contains useful metrics about the table schema and data, such as:
New or deleted columns
Percentage of valid rows
Percentage of distinct rows
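To make these metrics concrete, here is a minimal Python sketch of how such a comparison could be computed and rendered as Markdown. It does not use PipeRider's API or report format: the profile layout (a dict mapping each column name to `total`, `valid`, and `distinct` counts) and the names `pct`, `compare_profiles`, `base`, and `target` are all assumptions made for illustration.

```python
def pct(part, total):
    """Return part/total as a percentage, or 0.0 when total is 0."""
    return 100.0 * part / total if total else 0.0


def compare_profiles(base, target):
    """Build a Markdown summary comparing two hypothetical table profiles.

    Each profile maps column name -> {"total": int, "valid": int, "distinct": int}.
    This layout is an assumption for illustration, not PipeRider's report format.
    """
    added = sorted(set(target) - set(base))
    deleted = sorted(set(base) - set(target))
    lines = [
        "| Column | Valid % (base) | Valid % (target) | Distinct % (base) | Distinct % (target) |",
        "| --- | --- | --- | --- | --- |",
    ]
    # Only columns present in both environments can be compared metric by metric.
    for col in sorted(set(base) & set(target)):
        b, t = base[col], target[col]
        lines.append(
            f"| {col} | {pct(b['valid'], b['total']):.1f} | {pct(t['valid'], t['total']):.1f} "
            f"| {pct(b['distinct'], b['total']):.1f} | {pct(t['distinct'], t['total']):.1f} |"
        )
    lines.append("")
    lines.append(f"New columns: {', '.join(added) or 'none'}")
    lines.append(f"Deleted columns: {', '.join(deleted) or 'none'}")
    return "\n".join(lines)


# Made-up counts standing in for a production (base) and a modified (target) profile.
base = {
    "id": {"total": 100, "valid": 100, "distinct": 100},
    "email": {"total": 100, "valid": 90, "distinct": 80},
}
target = {
    "id": {"total": 200, "valid": 200, "distinct": 200},
    "email": {"total": 200, "valid": 150, "distinct": 120},
    "signup_date": {"total": 200, "valid": 200, "distinct": 50},
}

summary = compare_profiles(base, target)
print(summary)
```

In this made-up data, the drop in valid `email` values between environments is the kind of change a reviewer would want surfaced in the pull request before merging.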
The following image shows an example of a PipeRider comparison summary taken from our article on detecting schema change:
Example of PipeRider data profile comparison added to pull request comment
The following section presents suggested guidelines for integrating PipeRider into your data project's CI process.