PipeRider

AWS S3 + GitHub CI

This how-to shows how to generate PipeRider reports, save them in AWS S3, and compare the latest report with a previously saved report in S3 using GitHub CI.
The PipeRider profiling report gives you an overview of your data over time. Integrating PipeRider into your CI workflow gives you an automatically generated data profiling report on every run. You can also save the report of each run in AWS S3 and produce a comparison report of the latest two runs. Adding a Slack incoming webhook to the workflow lets you receive these reports in Slack in real time.
In this how-to, we walk through the following scenario:
  1. Generate the latest PipeRider run report and upload it to S3
  2. Download a previously saved report from S3
  3. Compare the two reports and upload the comparison report to S3
  4. Push a notification to Slack with links to the latest run report and the comparison report

Prerequisites

S3 Bucket

Prepare an S3 bucket. Enable ACLs under the Permissions tab and Static website hosting under the Properties tab.
Enable ACL
Enable Static website hosting

Access key ID and secret access key

Prepare a user account for the aws-cli.
Create a key pair (Access key ID and Secret access key) for that account and save it for later use.

IAM Policy

Prepare an IAM policy dedicated to the S3 bucket, and grant the account used by the aws-cli the required permissions by assigning it the policy.
You can create your own policy, or replace DOC-EXAMPLE-BUCKET1 with your bucket name in the following policy and import it.
IAM Policy
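As an illustration of what such a bucket-scoped policy might look like, the sketch below writes a policy document to a local file. The exact set of actions is an assumption based on what the scenario needs (listing the bucket, reading and writing objects, and setting a public-read ACL on upload); the file name piperider-s3-policy.json and the BUCKET variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a bucket-scoped IAM policy to a local JSON file.
# Assumption: the workflow only needs to list the bucket, get/put objects,
# and set object ACLs (for --acl public-read). Adjust to your own needs.
BUCKET="${BUCKET:-DOC-EXAMPLE-BUCKET1}"

cat > piperider-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::$BUCKET"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:PutObjectAcl"],
      "Resource": "arn:aws:s3:::$BUCKET/*"
    }
  ]
}
EOF
```

The resulting JSON can be pasted into the policy editor of the IAM console, or passed to `aws iam create-policy` with `--policy-document file://piperider-s3-policy.json`.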

GitHub Repository

Prepare your repository with the following configuration and files.

Environment secrets

Create the following environment secrets in your repository:
  • AWS_ACCESS_KEY_ID The Access key ID for the aws-cli
  • AWS_SECRET_ACCESS_KEY The Secret access key for the aws-cli
  • AWS_DEFAULT_REGION The default AWS region
  • PIPERIDER_BUCKET_NAME The S3 bucket name
  • SLACK_INCOMING_WEBHOOK The webhook URL. To obtain it, install the Incoming WebHooks integration in Slack and create a configuration that specifies the channel where notifications are posted.
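Because a missing secret usually surfaces only as an obscure failure deep inside the workflow, it can help to check the variables up front. The following is a minimal sketch, assuming the secrets above are exposed to the script as environment variables of the same names; the function name missing_vars is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed variable that is
# unset or empty, one per line. Prints nothing when all are set.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are fixed strings here).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

For example, a script could start with `missing="$(missing_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION PIPERIDER_BUCKET_NAME SLACK_INCOMING_WEBHOOK)"` and exit early if `$missing` is not empty.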

Workflow YAML

A workflow YAML file is required for GitHub CI. In your repo, create the necessary directories and the file .github/workflows/piperider.yml. In this file, we define that a push to the main branch triggers the workflow.
piperider.yml
There are two major steps in the workflow:
Step: Install PipeRider and check tools
We need PipeRider, aws-cli, and curl. The Ubuntu runner provided by GitHub already has aws-cli and curl installed, so installing PipeRider is the only thing required in this step.
Step: get-started project
We use another repo, PipeRider Getting Started, as our data project. daily.sh is a script that carries out the whole scenario.
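To make the two steps concrete, here is a rough sketch that creates the directories and writes a workflow file with that shape. It is not the original piperider.yml: the job name, the `actions/checkout@v3` version, and the way secrets are mapped to environment variables are assumptions, and fetching the Getting Started project is left out because its location is not given here.

```shell
#!/bin/sh
# Create the workflow directory and a minimal workflow file (sketch only).
mkdir -p .github/workflows

# The quoted 'EOF' keeps the shell from touching the ${{ ... }} expressions.
cat > .github/workflows/piperider.yml <<'EOF'
name: PipeRider
on:
  push:
    branches: [main]
jobs:
  piperider:
    runs-on: ubuntu-latest
    # Assumption: the secrets are repository secrets; environment-scoped
    # secrets would also need an `environment:` key on this job.
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
      PIPERIDER_BUCKET_NAME: ${{ secrets.PIPERIDER_BUCKET_NAME }}
      SLACK_INCOMING_WEBHOOK: ${{ secrets.SLACK_INCOMING_WEBHOOK }}
    steps:
      - uses: actions/checkout@v3
      - name: Install PipeRider and check tools
        run: |
          pip install piperider
          aws --version
          curl --version
      - name: get-started project
        run: bash daily.sh
EOF
```

The tool version checks are optional; they only make the job log show which aws-cli and curl the runner provides.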

daily.sh

Create the file with the following script and put it at the root of the repo. It will run through the scenario.
daily.sh
Scenario
  • preparation
    • PipeRider project
    • decide the output path and create the necessary directory/sub-directory
      • decide the name of the output directory by the current datetime $(date +"%Y%m%d%H%M")
      • create the directory/sub-directory
      • fetch the name of the previous report from S3
  • execution
    • generate the latest report with piperider run, using -o to specify where to save a copy of the generated report
    • upload the latest report to S3
    • download the previous report from S3 to the default .piperider/outputs path
    • make a comparison with piperider compare-reports, using -o and --last
      • --last compares the latest two reports: the one we just generated and the one we downloaded from S3
      • -o saves a copy of the comparison report to the path you specify
    • upload the comparison report to S3
  • publish and notify
    • form two URLs for the latest run report and the comparison report
    • upload reports to S3 by aws s3 sync with --acl public-read for public accessibility
    • push a notification containing links of two reports hosted in S3 to Slack
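The scenario above can be sketched as a script along the following lines. This is not the original daily.sh: the S3 layout (`<timestamp>/run` and `<timestamp>/comparison`), the local reports/ directory, the index.html file name, the object URL form, and all function names are assumptions. Only the flags named in the list (`piperider run -o`, `piperider compare-reports --last -o`, `aws s3 sync --acl public-read`) are taken from the text.

```shell
#!/bin/sh
# Sketch of daily.sh under the assumptions stated above.
set -eu

# Name of this run's output directory, from the current datetime.
output_name() {
  date +"%Y%m%d%H%M"
}

# Read `aws s3 ls s3://<bucket>/` output on stdin and print the newest
# timestamped prefix (prefix lines look like "PRE 202301021000/").
latest_prefix() {
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}

# Virtual-hosted-style object URL: report_url <bucket> <region> <key>.
# Assumption: the static website endpoint could be used instead.
report_url() {
  printf 'https://%s.s3.%s.amazonaws.com/%s\n' "$1" "$2" "$3"
}

# JSON body for the Slack incoming webhook, linking both reports.
slack_payload() {
  printf '{"text":"PipeRider reports: <%s|latest run> and <%s|comparison>"}\n' "$1" "$2"
}

main() {
  bucket="$PIPERIDER_BUCKET_NAME"
  region="$AWS_DEFAULT_REGION"
  stamp="$(output_name)"
  out="reports/$stamp"
  mkdir -p "$out/run" "$out/comparison"

  # Preparation: look up the previous report before uploading the new one.
  # An empty bucket simply yields an empty name.
  previous="$(aws s3 ls "s3://$bucket/" | latest_prefix || true)"

  # Execution: generate and upload the latest report.
  piperider run -o "$out/run"
  aws s3 sync "$out/run" "s3://$bucket/$stamp/run" --acl public-read

  # Download the previous report and compare the latest two.
  if [ -n "$previous" ]; then
    aws s3 sync "s3://$bucket/$previous/run" ".piperider/outputs/$previous"
    piperider compare-reports --last -o "$out/comparison"
    aws s3 sync "$out/comparison" "s3://$bucket/$stamp/comparison" --acl public-read
  fi

  # Publish and notify: form both URLs and post them to Slack.
  run_url="$(report_url "$bucket" "$region" "$stamp/run/index.html")"
  cmp_url="$(report_url "$bucket" "$region" "$stamp/comparison/index.html")"
  curl -X POST -H 'Content-type: application/json' \
    --data "$(slack_payload "$run_url" "$cmp_url")" "$SLACK_INCOMING_WEBHOOK"
}
```

The block only defines functions; adding `main` as the last line makes the script run the scenario. On the very first run there is no previous report, so the comparison step is skipped and the comparison link in the Slack message will not resolve.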
Push a commit to the main branch of your repository to trigger the workflow, then check the S3 bucket.
S3 bucket