PandasSchema

Introduction

PandasSchema is a module for validating tabular data, such as CSV (comma-separated values) and TSV (tab-separated values) files. It uses the data analysis library Pandas to do so quickly and efficiently.

For example, say your code expects a CSV that looks a bit like this:

Given Name,Family Name,Age,Sex,Customer ID
Gerald,Hampton,82,Male,2582GABK
Yuuwa,Miyake,27,Male,7951WVLW
Edyta,Majewska,50,Female,7758NSID

Now you want to be able to ensure that the data in your CSV is in the correct format:

import pandas as pd
from io import StringIO
from pandas_schema import Column, Schema
from pandas_schema.validation import (
    InListValidation,
    InRangeValidation,
    LeadingWhitespaceValidation,
    MatchesPatternValidation,
    TrailingWhitespaceValidation,
)

schema = Schema([
    Column('Given Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Family Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Age', [InRangeValidation(0, 120)]),
    Column('Sex', [InListValidation(['Male', 'Female', 'Other'])]),
    Column('Customer ID', [MatchesPatternValidation(r'\d{4}[A-Z]{4}')])
])

test_data = pd.read_csv(StringIO('''Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
'''))

errors = schema.validate(test_data)

for error in errors:
    print(error)

PandasSchema would then output:

{row: 0, column: "Given Name"}: "Gerald " contains trailing whitespace
{row: 1, column: "Age"}: "270" was not in the range [0, 120)
{row: 1, column: "Sex"}: "male" is not in the list of legal options (Male, Female, Other)
{row: 2, column: "Family Name"}: "Majewska " contains trailing whitespace
{row: 2, column: "Customer ID"}: "775ANSID" does not match the pattern "\d{4}[A-Z]{4}"

Installation

Install PandasSchema using pip:

pip install pandas_schema

Module Summary

As the example above shows, the main classes you need to interact with to perform a validation are Schema, Column, the Validation classes, and ValidationWarning. A Schema contains many Columns, and a Column contains many Validations. To run a validation, call schema.validate() on a DataFrame, which returns a list of ValidationWarnings. The public interface of these classes is documented here. Validations are covered in the next section.

Schema

class pandas_schema.schema.Schema(columns: Iterable[pandas_schema.column.Column], ordered: bool = False)[source]

A schema that defines the columns required in the target DataFrame

Parameters
  • columns – A list of column objects

  • ordered – True if the Schema should associate its Columns with DataFrame columns by position only, ignoring the header names. False if the columns should be associated by column header names only. Defaults to False

get_column_names()[source]

Returns the column names contained in the schema

validate(df: pandas.core.frame.DataFrame, columns: Optional[List[str]] = None) -> List[pandas_schema.validation_warning.ValidationWarning][source]

Runs a full validation of the target DataFrame using the internal columns list

Parameters
  • df – A pandas DataFrame to validate

  • columns – A list of columns indicating a subset of the schema that we want to validate

Returns

A list of ValidationWarning objects that list the ways in which the DataFrame was invalid

Column

class pandas_schema.column.Column(name: str, validations: Iterable[pandas_schema.validation._BaseValidation] = [], allow_empty=False)[source]

Creates a new Column object

Parameters
  • name – The column header that defines this column. This must be identical to the header used in the CSV/Data Frame you are validating.

  • validations – An iterable of objects implementing _BaseValidation that will generate ValidationWarnings

  • allow_empty – True if an empty column is considered valid. False if we leave that logic up to the Validation

ValidationWarning

class pandas_schema.validation_warning.ValidationWarning(message: str, value: Optional[str] = None, row: int = -1, column: Optional[str] = None)[source]

Represents a difference between the schema and data frame, found during the validation of the data frame

__str__() -> str[source]

The entire warning message as a string

column

The column name of the cell that failed the validation

row

The row index (usually an integer starting from 0) of the cell that failed the validation

value

The value of the failing cell in the DataFrame

Validators

Built-in Validators

class pandas_schema.validation.CanCallValidation(func: Callable, **kwargs)[source]

Validates if a given function can be called on each element in a column without raising an exception

Parameters

func – A python function that will be called with the value of each cell in the DataFrame. If this function throws an error, this cell is considered to have failed the validation. Otherwise it has passed.

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

class pandas_schema.validation.CanConvertValidation(_type: type, **kwargs)[source]

Checks if each element in a column can be converted to a Python object type

Parameters

_type – Any python type. Its constructor will be called with the value of the individual cell as its only argument. If it throws an exception, the value is considered to fail the validation, otherwise it has passed

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.
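As a rough illustration of the rule described above (the type's constructor is called with the cell value, and an exception means failure), the sketch below mimics it in plain Python. The helper can_convert is hypothetical and is not part of PandasSchema; it only shows why a cell such as "1.5" would fail a conversion to int but pass a conversion to float.

```python
def can_convert(value, _type):
    """Return True if _type(value) succeeds, False if it raises (hypothetical helper)."""
    try:
        _type(value)
        return True
    except Exception:
        return False


cells = ["82", "1.5", "abc", ""]

# int() accepts only integer literals, so "1.5" fails for int but passes for float
int_results = [can_convert(cell, int) for cell in cells]
float_results = [can_convert(cell, float) for cell in cells]

print(int_results)    # [True, False, False, False]
print(float_results)  # [True, True, False, False]
```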

class pandas_schema.validation.DateFormatValidation(date_format: str, **kwargs)[source]

Checks that each element in this column is a valid date according to a provided format string

Parameters

date_format – The date format string to validate the column against. Refer to the date format code documentation at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior for a full list of format codes

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.
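To get a feel for which values a given format string accepts, the format codes can be tried out with the standard library directly. This is a minimal sketch that assumes datetime.strptime semantics, since the parameter description points to the strptime format codes; the helper is_valid_date is hypothetical and is not part of PandasSchema.

```python
from datetime import datetime


def is_valid_date(value, date_format):
    """Return True if value parses with date_format via datetime.strptime (hypothetical helper)."""
    try:
        datetime.strptime(value, date_format)
        return True
    except ValueError:
        return False


date_format = "%Y-%m-%d"
cells = ["2017-03-21", "21/03/2017", "2017-02-30"]

# The second cell uses a different layout; the third has the right layout but is not a real date
date_results = [is_valid_date(cell, date_format) for cell in cells]

print(date_results)  # [True, False, False]
```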

class pandas_schema.validation.InListValidation(options: Iterable, case_sensitive: bool = True, **kwargs)[source]

Checks that each element in this column is contained within a list of possibilities

Parameters

options – A list of values to check. If the value of a cell is in this list, it is considered to pass the validation

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

class pandas_schema.validation.InRangeValidation(min: float = -inf, max: float = inf, **kwargs)[source]

Checks that each element in the series is within a given numerical range

Parameters
  • min – The minimum (inclusive) value to accept

  • max – The maximum (exclusive) value to accept

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.
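Because min is inclusive and max is exclusive, the accepted range is the half-open interval [min, max). The sketch below shows that boundary behaviour in plain Pandas; it illustrates the semantics and is not PandasSchema's own implementation. The use of pd.to_numeric to turn non-numeric text into a failing value is an assumption made for this example.

```python
import pandas as pd

ages = pd.Series(["0", "82", "120", "270", "abc"], name="Age")

minimum, maximum = 0, 120

# Non-numeric text becomes NaN, and NaN compares False against both bounds
numeric = pd.to_numeric(ages, errors="coerce")

# Half-open interval [minimum, maximum): the lower bound passes, the upper bound does not
in_range = (numeric >= minimum) & (numeric < maximum)

print(in_range.tolist())  # [True, True, False, False, False]
```

If the upper bound itself should be accepted, one option is to pass a max that is one step above the largest legal value.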

class pandas_schema.validation.IsDistinctValidation(**kwargs)[source]

Checks that every element of this column is different from each other element

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.
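The description above does not say which occurrences of a repeated value are reported. The sketch below uses plain Pandas to show two possible conventions for building such a Boolean series; it is an illustration only, and which convention PandasSchema follows should be checked against its source.

```python
import pandas as pd

ids = pd.Series(["2582GABK", "7951WVLW", "2582GABK"], name="Customer ID")

# keep="first": only the later repeats are marked as failing
later_repeats_fail = ~ids.duplicated(keep="first")

# keep=False: every member of a duplicated group is marked as failing
all_repeats_fail = ~ids.duplicated(keep=False)

print(later_repeats_fail.tolist())  # [True, True, False]
print(all_repeats_fail.tolist())    # [False, True, False]
```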

class pandas_schema.validation.IsDtypeValidation(dtype: numpy.dtype, **kwargs)[source]

Checks that a series has a certain numpy dtype

Parameters

dtype – The numpy dtype to check the column against

get_errors(series: pandas.core.series.Series, column: Optional[pandas_schema.column.Column] = None)[source]

Returns a list of errors in the given series.
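A dtype belongs to a whole Series rather than to individual cells, so a dtype check describes the column as a unit. The sketch below shows, in plain Pandas, how a single unexpected cell changes the dtype that read_csv infers for the entire column. It uses pandas.api.types.is_integer_dtype as a stand-in check and is not PandasSchema's own implementation.

```python
from io import StringIO

import pandas as pd

clean = pd.read_csv(StringIO("Name,Age\nGerald,82\nYuuwa,27\n"))
dirty = pd.read_csv(StringIO("Name,Age\nGerald,82\nYuuwa,unknown\n"))

# Every Age value parses as an integer, so the column gets an integer dtype
clean_is_integer = pd.api.types.is_integer_dtype(clean["Age"])

# One non-numeric cell makes Pandas fall back to a non-integer dtype for the whole column
dirty_is_integer = pd.api.types.is_integer_dtype(dirty["Age"])

print(clean_is_integer, dirty_is_integer)  # True False
```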

class pandas_schema.validation.LeadingWhitespaceValidation(**kwargs)[source]

Checks that there is no leading whitespace in this column

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

class pandas_schema.validation.MatchesPatternValidation(pattern, options={}, **kwargs)[source]

Validates that a string or regular expression can match somewhere in each element in this column

Parameters

kwargs – Arguments to pass to Series.str.contains (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html). pat is the only required argument.

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.
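Since the pattern only has to match somewhere in each element, a cell with extra characters around a valid value still passes unless the pattern is anchored. The sketch below demonstrates the difference using Series.str.contains directly, on the assumption (taken from the parameter description above) that this is the matching behaviour involved. The sample values are made up for illustration.

```python
import pandas as pd

ids = pd.Series(["2582GABK", "x2582GABKy", "775ANSID"], name="Customer ID")

pattern = r"\d{4}[A-Z]{4}"

# Unanchored: passes if the pattern occurs anywhere in the cell
matches_somewhere = ids.str.contains(pattern)

# Anchored with ^ and $: the whole cell must match the pattern
matches_whole_cell = ids.str.contains("^" + pattern + "$")

print(matches_somewhere.tolist())   # [True, True, False]
print(matches_whole_cell.tolist())  # [True, False, False]
```

If a column must contain nothing but the pattern, anchoring it with ^ and $ is the usual way to express that.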

class pandas_schema.validation.TrailingWhitespaceValidation(**kwargs)[source]

Checks that there is no trailing whitespace in this column

property default_message

Create a message to be displayed whenever this validation fails. This should be a generic message for the validation type, but can be overwritten if the user provides a message kwarg.

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

Custom Validators

Simple Validators

The easiest way to add your own Validator is to use the CustomSeriesValidation or CustomElementValidation class.

For example if you wanted a validation that checked if each cell in a column contained the word ‘fail’, and failed if it did, you’d do one of the following:

CustomSeriesValidation(lambda s: ~s.str.contains('fail'), 'contained the word fail')

CustomElementValidation(lambda s: 'fail' not in s, 'contained the word fail')

The difference between these two classes is that CustomSeriesValidation uses Pandas Series methods to operate on the entire series using fast, natively implemented functions, while CustomElementValidation operates on each element using ordinary Python code.

Consequently, if the validation you want to create is easy to express using Pandas Series methods (http://pandas.pydata.org/pandas-docs/stable/api.html#series), we recommend using a CustomSeriesValidation, since it will likely perform better. Otherwise, use a CustomElementValidation. Of course, if a built-in Validation class fits your use case, such as MatchesPatternValidation, prefer it over either custom class, since the built-in validators are implemented to run as fast as possible.
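The two styles of callable can be compared outside PandasSchema. In this sketch the series-level function receives the whole column, while the element-level function receives one cell at a time; Series.map stands in for however the library applies the element function, which is an assumption of this example. Both produce the same Boolean result.

```python
import pandas as pd

cells = pd.Series(["pass", "this will fail", "ok"])

# Series-level check: one vectorised call over the whole column
series_result = ~cells.str.contains("fail")

# Element-level check: an ordinary Python function applied to each cell in turn
element_result = cells.map(lambda cell: "fail" not in cell)

print(series_result.tolist())   # [True, False, True]
print(element_result.tolist())  # [True, False, True]
```

Note that ~ negates a Boolean Series, whereas an element-level function deals with plain Python values and should use not or not in instead.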

The arguments to these classes' constructors are listed here:

class pandas_schema.validation.CustomElementValidation(validation: Callable[[Any], Any], message: str)[source]

Validates using a user-provided function that operates on each element

Parameters
  • message

    The error message to provide to the user if this validation fails. The row and column and failing value will automatically be prepended to this message, so you only have to provide a message that describes what went wrong, for example ‘failed my validation’ will become

    {row: 1, column: "Column Name"}: "Value" failed my validation

  • validation – A function that takes the value of a data frame cell and returns True if it passes the validation, and False if it doesn’t

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

class pandas_schema.validation.CustomSeriesValidation(validation: Callable[[pandas.core.series.Series], pandas.core.series.Series], message: str)[source]

Validates using a user-provided function that operates on an entire series (for example by using one of the pandas Series methods: http://pandas.pydata.org/pandas-docs/stable/api.html#series)

Parameters
  • message

    The error message to provide to the user if this validation fails. The row and column and failing value will automatically be prepended to this message, so you only have to provide a message that describes what went wrong, for example ‘failed my validation’ will become

    {row: 1, column: "Column Name"}: "Value" failed my validation

  • validation – A function that takes a pandas Series and returns a boolean Series, where each cell is equal to True if the object passed validation, and False if it failed

validate(series: pandas.core.series.Series) -> pandas.core.series.Series[source]

Returns a Boolean series in which each False value marks an element of the Series that has failed the validation.

Inheriting From _SeriesValidation

If you want to implement more complicated logic that doesn’t fit in a lambda, or you want to parameterize your Validator and re-use it in different parts of your application, you can instead make a class that inherits from _SeriesValidation.

All this class needs is:

  • An __init__ constructor that calls super().__init__(**kwargs)

  • A default_message property

  • A validate method

For reference on how these members should look, have a look at the source code for the Built-in Validators (click the [source] button next to any of them).

Boolean Logic on Validators

You can also combine validators with the Boolean operators and, or and not. These are implemented using the following Python operators:

Boolean Operation    Operator
not                  ~
and                  &
or                   |

For example, if we wanted a validation that checks if the cell either contains a number, or is a word with more than 1 character that also contains an ‘a’, we could do the following:

from pandas_schema import Column, Schema
from pandas_schema.validation import MatchesPatternValidation, CanConvertValidation, CustomSeriesValidation
import pandas as pd

schema = Schema([
    Column('col1', [
        CanConvertValidation(int) |
        (
            CustomSeriesValidation(lambda x: x.str.len() > 1, 'Doesn\'t have more than 1 character') &
            MatchesPatternValidation('a')
        )
    ])
])

test_data = pd.DataFrame({
    'col1': [
        'an',
        '13',
        'a',
        '8',
        'the'
    ]
})

errors = schema.validate(test_data)

for error in errors:
    print('"{}" failed!'.format(error.value))

This would produce the following result, because ‘a’ is a word, but isn’t more than one character, and because ‘the’ is a word, but it doesn’t contain the letter ‘a’:

"a" failed!
"the" failed!

Note that these operators do not short-circuit, so all validations will be applied to all rows, regardless of whether a row has already failed a validation.
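The same element-wise behaviour can be seen with plain Boolean Series, which is a convenient way to reason about a combined validator. The sketch below rebuilds the example's logic by hand. The digits-only regular expression is a simplified stand-in for CanConvertValidation(int), not the library's own check.

```python
import pandas as pd

col1 = pd.Series(["an", "13", "a", "8", "the"], name="col1")

# Each mask is computed for every row, whether or not another mask already decides that row
is_number = col1.str.fullmatch(r"\d+")
longer_than_one = col1.str.len() > 1
contains_a = col1.str.contains("a")

# | and & combine the masks element-wise; unlike `or` and `and`, they never skip an operand
passed = is_number | (longer_than_one & contains_a)

print(col1[~passed].tolist())  # ['a', 'the']
```

One practical consequence is that every component validator must be able to handle every value in the column, including values that another component would already reject.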

Changelog

0.3.6

  • Include the column name in the ValidationWarning when a column listed in the schema is not present in the data frame (#65)

  • schema.validate() no longer returns immediately when a column is missing. Instead, it adds a ValidationWarning and continues validation

0.3.5

  • Add version to a separate file, so that pandas_schema.__version__ now works (see #11)

  • Make the InRangeValidation correctly report a validation failure when it validates non-numeric text, instead of crashing (see #30)

Development

To install PandasSchema’s development requirements, run

pip install -r requirements.txt

The setup.py can be run as an executable, and it provides the following extra commands:

  • ./setup.py test: runs the tests

  • ./setup.py build_readme: rebuilds the README.rst from doc/readme/README.rst

  • ./setup.py build_site --dir=<dir>: builds the documentation website from doc/site/index.rst into <dir>