PandasSchema¶
Introduction¶
PandasSchema is a module for validating tabulated data, such as CSVs (Comma Separated Value files) and TSVs (Tab Separated Value files). It uses the data analysis library Pandas to do so quickly and efficiently.
For example, say your code expects a CSV that looks a bit like this:
Given Name,Family Name,Age,Sex,Customer ID
Gerald,Hampton,82,Male,2582GABK
Yuuwa,Miyake,27,Male,7951WVLW
Edyta,Majewska,50,Female,7758NSID
Now you want to be able to ensure that the data in your CSV is in the correct format:
import pandas as pd
from io import StringIO
from pandas_schema import Column, Schema
from pandas_schema.validation import (
    LeadingWhitespaceValidation, TrailingWhitespaceValidation, CanConvertValidation,
    MatchesPatternValidation, InRangeValidation, InListValidation
)

# One Column per CSV header, each with the validations its cells must pass
schema = Schema([
    Column('Given Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Family Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Age', [InRangeValidation(0, 120)]),
    Column('Sex', [InListValidation(['Male', 'Female', 'Other'])]),
    Column('Customer ID', [MatchesPatternValidation(r'\d{4}[A-Z]{4}')])
])

# Test data that deliberately breaks several of the rules above
test_data = pd.read_csv(StringIO('''Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
'''))

errors = schema.validate(test_data)

for error in errors:
    print(error)
PandasSchema would then output:
{row: 0, column: "Given Name"}: "Gerald " contains trailing whitespace
{row: 1, column: "Age"}: "270" was not in the range [0, 120)
{row: 1, column: "Sex"}: "male" is not in the list of legal options (Male, Female, Other)
{row: 2, column: "Family Name"}: "Majewska " contains trailing whitespace
{row: 2, column: "Customer ID"}: "775ANSID" does not match the pattern "\d{4}[A-Z]{4}"
Installation¶
Install PandasSchema using pip:
pip install pandas_schema
Module Summary¶
As you can probably see from the example above, the main classes you need to interact with to perform a validation are Schema, Column, the Validation classes, and ValidationWarning. A Schema contains many Columns, and a Column contains many Validations. To run a validation, you simply call schema.validate() on a DataFrame, which will produce a list of ValidationWarnings. The public interface of these classes is documented here. Validations are covered in the next section.
Schema¶
- class pandas_schema.schema.Schema(columns: Iterable[pandas_schema.column.Column], ordered: bool = False)[source]¶
A schema that defines the columns required in the target DataFrame
- Parameters
columns – A list of column objects
ordered – True if the Schema should associate its Columns with DataFrame columns by position only, ignoring the header names. False if the columns should be associated by column header names only. Defaults to False
- validate(df: pandas.core.frame.DataFrame, columns: Optional[List[str]] = None) List[pandas_schema.validation_warning.ValidationWarning] [source]¶
Runs a full validation of the target DataFrame using the internal columns list
- Parameters
df – A pandas DataFrame to validate
columns – A list of columns indicating a subset of the schema that we want to validate
- Returns
A list of ValidationWarning objects that list the ways in which the DataFrame was invalid
Column¶
- class pandas_schema.column.Column(name: str, validations: Iterable[pandas_schema.validation._BaseValidation] = [], allow_empty=False)[source]¶
Creates a new Column object
- Parameters
name – The column header that defines this column. This must be identical to the header used in the CSV/DataFrame you are validating.
validations – An iterable of objects implementing _BaseValidation that will generate ValidationWarnings
allow_empty – True if an empty column is considered valid. False if we leave that logic up to the Validation
ValidationWarning¶
- class pandas_schema.validation_warning.ValidationWarning(message: str, value: Optional[str] = None, row: int = -1, column: Optional[str] = None)[source]¶
Represents a difference between the schema and data frame, found during the validation of the data frame
- column¶
The column name of the cell that failed the validation
- row¶
The row index (usually an integer starting from 0) of the cell that failed the validation
- value¶
The value of the failing cell in the DataFrame
Validators¶
Built-in Validators¶
- class pandas_schema.validation.CanCallValidation(func: Callable, **kwargs)[source]¶
Validates if a given function can be called on each element in a column without raising an exception
- Parameters
func – A python function that will be called with the value of each cell in the DataFrame. If this function throws an error, this cell is considered to have failed the validation. Otherwise it has passed.
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.CanConvertValidation(_type: type, **kwargs)[source]¶
Checks if each element in a column can be converted to a Python object type
- Parameters
_type – Any python type. Its constructor will be called with the value of the individual cell as its only argument. If it throws an exception, the value is considered to fail the validation, otherwise it has passed
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.DateFormatValidation(date_format: str, **kwargs)[source]¶
Checks that each element in this column is a valid date according to a provided format string
- Parameters
date_format – The date format string to validate the column against. Refer to the date format code documentation at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior for a full list of format codes
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.InListValidation(options: Iterable, case_sensitive: bool = True, **kwargs)[source]¶
Checks that each element in this column is contained within a list of possibilities
- Parameters
options – A list of values to check. If the value of a cell is in this list, it is considered to pass the validation
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.InRangeValidation(min: float = -inf, max: float = inf, **kwargs)[source]¶
Checks that each element in the series is within a given numerical range
- Parameters
min – The minimum (inclusive) value to accept
max – The maximum (exclusive) value to accept
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.IsDistinctValidation(**kwargs)[source]¶
Checks that every element of this column is different from each other element
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.IsDtypeValidation(dtype: numpy.dtype, **kwargs)[source]¶
Checks that a series has a certain numpy dtype
- Parameters
dtype – The numpy dtype to check the column against
- get_errors(series: pandas.core.series.Series, column: Optional[pandas_schema.column.Column] = None)[source]¶
Return a list of errors in the given series
- class pandas_schema.validation.LeadingWhitespaceValidation(**kwargs)[source]¶
Checks that there is no leading whitespace in this column
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.MatchesPatternValidation(pattern, options={}, **kwargs)[source]¶
Validates that a string or regular expression can match somewhere in each element in this column
- Parameters
kwargs – Arguments to pass to Series.str.contains (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html). pat is the only required argument
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
- class pandas_schema.validation.TrailingWhitespaceValidation(**kwargs)[source]¶
Checks that there is no trailing whitespace in this column
- property default_message¶
Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but it can be overridden if the user provides a message kwarg.
Custom Validators¶
Simple Validators¶
The easiest way to add your own Validator is to use the CustomSeriesValidation or CustomElementValidation class.
For example, if you wanted a validation that checked if each cell in a column contained the word ‘fail’, and failed if it did, you’d do one of the following:
CustomSeriesValidation(lambda s: ~s.str.contains('fail'), 'contained the word fail')
CustomElementValidation(lambda s: 'fail' not in s, 'contained the word fail')
The difference between these two classes is that CustomSeriesValidation uses Pandas Series methods to operate on the entire series using fast, natively implemented functions, while CustomElementValidation operates on each element using ordinary Python code.
Consequently, if the validation you want to create is easy to express using Pandas Series methods (http://pandas.pydata.org/pandas-docs/stable/api.html#series), we recommend you use a CustomSeriesValidation since it will likely perform better. Otherwise, feel free to use a CustomElementValidation. Of course, if a built-in Validation class fits your use case, such as MatchesPatternValidation, prefer it: the built-in validators are implemented to be as fast as possible.
The arguments to these classes' constructors are listed here:
- class pandas_schema.validation.CustomElementValidation(validation: Callable[[Any], Any], message: str)[source]¶
Validates using a user-provided function that operates on each element
- Parameters
message –
The error message to provide to the user if this validation fails. The row and column and failing value will automatically be prepended to this message, so you only have to provide a message that describes what went wrong, for example ‘failed my validation’ will become
{row: 1, column: “Column Name”}: “Value” failed my validation
validation – A function that takes the value of a data frame cell and returns True if it passes the validation, and False if it doesn’t
- class pandas_schema.validation.CustomSeriesValidation(validation: Callable[[pandas.core.series.Series], pandas.core.series.Series], message: str)[source]¶
Validates using a user-provided function that operates on an entire series (for example by using one of the pandas Series methods: http://pandas.pydata.org/pandas-docs/stable/api.html#series)
- Parameters
message –
The error message to provide to the user if this validation fails. The row and column and failing value will automatically be prepended to this message, so you only have to provide a message that describes what went wrong, for example ‘failed my validation’ will become
{row: 1, column: “Column Name”}: “Value” failed my validation
validation – A function that takes a pandas Series and returns a boolean Series, where each cell is equal to True if the object passed validation, and False if it failed
Inheriting From _SeriesValidation¶
If you want to implement more complicated logic that doesn’t fit in a lambda, or you want to parameterize your Validator and re-use it in different parts of your application, you can instead make a class that inherits from _SeriesValidation.
All this class needs is:
- An __init__ constructor that calls super().__init__(**kwargs)
- A default_message property
- A validate method
For reference on how these fields should look, have a look at the source code for the Built-in Validators (click the [source] button next to any of them).
Boolean Logic on Validators¶
You can also combine validators with the Boolean operators and, or, and not. These are implemented using the following Python operators:
| Boolean Operation | Operator |
|---|---|
| and | & |
| or | \| |
| not | ~ |
For example, if we wanted a validation that checks if the cell either contains a number, or is a word with more than 1 character that also contains an ‘a’, we could do the following:
from pandas_schema import Column, Schema
from pandas_schema.validation import MatchesPatternValidation, CanConvertValidation, CustomSeriesValidation
import pandas as pd

schema = Schema([
    Column('col1', [
        # Passes if the cell is a number, or a multi-character word containing 'a'
        CanConvertValidation(int) |
        (
            CustomSeriesValidation(lambda x: x.str.len() > 1, 'Doesn\'t have more than 1 character') &
            MatchesPatternValidation('a')
        )
    ])
])

test_data = pd.DataFrame({
    'col1': [
        'an',
        '13',
        'a',
        '8',
        'the'
    ]
})

errors = schema.validate(test_data)

for error in errors:
    print('"{}" failed!'.format(error.value))
This would produce the following result, because ‘a’ is a word, but isn’t more than one character, and because ‘the’ is a word, but it doesn’t contain the letter ‘a’:
"a" failed!
"the" failed!
Note that these operators do not short-circuit, so all validations will be applied to all rows, regardless of whether that row has already failed a validation.
Changelog¶
0.3.6¶
- Include the column name in the ValidationWarning when a column listed in the schema is not present in the data frame (#65)
- schema.validate() now no longer immediately returns when a column is missing. Instead it adds a ValidationWarning and continues validation
0.3.5¶
Development¶
To install PandasSchema’s development requirements, run
pip install -r requirements.txt
The setup.py can be run as an executable, and it provides the following extra commands:
- ./setup.py test: runs the tests
- ./setup.py build_readme: rebuilds the README.rst from doc/readme/README.rst
- ./setup.py build_site --dir=<dir>: builds the documentation website from doc/site/index.rst into <dir>