Profiling
Profile statistics
PipeRider helps you understand of your data by providing profile statistics and data distribution information about the table and columns in your data source.
Enable Profiling
Profiling is supported for dbt models, seeds, and sources. To enable the profiling, please add piperider
tag on the corresponding resources.
and run the command to check if it is configured correectly.
Table statistics
Profile Field | Description |
---|---|
| The number of rows in the table |
| The number of columns in the table |
| The number of rows profiled |
| The percentage of rows profiled |
| The volume size of the table in bytes |
| The time that the table was created, including time zone, in ISO 8601 format |
| The last time the table was modified, including time zone, in ISO 8601 format |
| The time differentiation between the current time and table's last altered time |
| The number of duplicate rows in the table |
| The percentage of duplicate rows in the table |
* These statistics are only available for certain data sources. Please refer to the platform dependent statistics table below for availability information. ** Table-level duplicate row are not enabled by default. To enable this settings please refer to the profiler settings.
Profile Field | Snowflake | BigQuery | Redshift |
---|---|---|---|
| ✔ | ✔ | ✔ |
| ✔ | ✔ | |
| ✔ | ✔ | |
| ✔ | ✔ |
Column statistics
Column statistics are profiling statistics of a column. Some statistics are only avaialble on certain generic type. There are six generic types
string
integer
numeric
datetime
boolean
other
Schema
In addition to logging the schema type of a column as defined in the data source, PipeRider will also apply a generic type to a column that will determine how this column is treated by the PipeRider profiler.
Profile Field | Description | Column Type | PipeRider Version |
---|---|---|---|
| The column type defined in the data source | All | All |
| A generic schema type of | All | All |
The following statistics are produced based on the generic type that has been applied to the column.
Data composition
The composition of the data contained within a column.
Profile Field | Description | Column Type | Assertion Available |
---|---|---|---|
| The number of rows in the table | All | ✔ |
| The number of rows profiled | All | ✔ |
| The percentage of rows profiled | All | ✔ |
| The number of null values | All | ✔ |
| The percentage of null values | All | ✔ |
| The number of non-null values | All | ✔ |
| The percentage of non-null values | All | ✔ |
| The number of values that do not match the column's schema type. E.g. A string in a numeric column (SQLite only) | All | ✔ |
| The percentage of invalid values (SQLite only) | All | ✔ |
| The count of non-null values minus invalid values | All | ✔ |
| The percentage of non-null values, minus invalid values | All | ✔ |
| The number of zeros | integer, numeric | ✔ |
| The percentage of zeros | integer, numeric | ✔ |
| The number of negative values | integer, numeric | ✔ |
| The percentage of negative values | integer, numeric | ✔ |
| The number of positive values | integer, numeric | ✔ |
| The percentage of positive values | integer, numeric | ✔ |
| The number of empty strings | string | ✔ |
| The percentage of empty strings | string | ✔ |
| The number of non-empty strings | string | ✔ |
| The percentage of non-empty strings | string | ✔ |
| The number of true values | boolean | ✔ |
| The percentage of true values | boolean | ✔ |
| The number of false values | boolean | ✔ |
| The percentage of false values | boolean | ✔ |
General statistics
The general statistical information of a column.
Profile Field | Description | Column Type | Assertion Available | PipeRider Version |
---|---|---|---|---|
| The minimum value | integer, numeric, datetime | ✔ | All |
| The maximum value | integer, numeric, datetime | ✔ | All |
| The column average | integer, numeric | ✔ | All |
| The column sum | integer, numeric | ✔ | All |
| The standard deviation of values | integer, numeric, | ✔ | 0.4.0 |
Text length statistics
The text length statistics of a column.
Profile Field | Description | Column Type | Assertion Available |
---|---|---|---|
| The minimum string length | string | ✔ |
| The maximum string length | string | ✔ |
| The average string length | string | ✔ |
| The standard deviation of string length | string | ✔ |
Uniqueness
The uniqueness of a column.
Profile Field | Description | Column Type | Assertion Available |
---|---|---|---|
| The number of distinct items | integer, string, datetime | ✔ |
| The percentage of distinct items | integer, string, datetime | ✔ |
| The number of recurring items | integer, numeric, string, datetime | ✔ |
| The percentage of duplicate items | integer, numeric, string, datetime | ✔ |
| The number of non-recurring items | integer, numeric, string, datetime | ✔ |
| The percentage of non-duplicate items | integer, numeric, string, datetime | ✔ |
For example, the following dataset (NULL, a, a, b, b, c, d, e)
would be categorized as so:
Distinct count = 5,
(a, b, c, d, e)
Duplicate count = 4,
(a, a, b, b)
Non-duplicate count = 3,
(c, d, e)
Missing values (nulls) = 1
Therefore, the total number of rows for a table = missing (nulls) + duplicates + non-duplicates.
Quantiles
The calculated quantiles of a numeric or integer column.
Profile Field | Description | Column Type | Assertion Available | PipeRider Version |
---|---|---|---|---|
| 0th percentile | integer, numeric | ✔ | All |
| 5th percentile | integer, numeric | ✔ | 0.4.0 |
| 25th percentile | integer, numeric | ✔ | 0.4.0 |
| 50th percentile | integer, numeric | ✔ | 0.4.0 |
| 75th percentile | integer, numeric | ✔ | 0.4.0 |
| 95th percentile | integer, numeric | ✔ | 0.4.0 |
| 100th percentile | integer, numeric | ✔ | All |
Distribution
Profile Field | Description | Column Type |
---|---|---|
| The most frequently occurring n items and and counts | integer, string |
| Evenly-split bins for numerical columns and counts for each bin | integer, numeric |
| Evenly-split bins for text length and counts for each bin | string |
| Histogram of date, month, or year. Bin split depends on the min/max range | datetime |
Last updated