
Dataset

A dataset is a grouping of data collections. A dataset could be a database, a storage bucket, or a BigQuery dataset.

A dataset is defined in the catalogue/<organization name>/<system name>/<resource name>/<resource name>/dataset.yml file. The dataset.yml file accepts all arguments defined in the Dataset model, with the exception of:

  1. key: the key is not set in the file; it is created from the catalogue folder structure.

Example

# dataset.yaml

dataset:
  key: dataset_key
  tags:
  - foo
  - bar
  name: user
  description: <resource description>
  meta:
    last_updated: '2021-01-01'
    version: 1.0.0
  data_categories:
  - user.contact
  data_qualifier: identified
  joint_controller:
    name: Dave
    address: Museumplein 10, 1071 DJ Amsterdam, Netherlands
    email: dave@organization.com
    phone: 020 573 2911
  third_country_transfers:
  - USA
  - CAN
  collections:
    user:
      name: user
      description: user data
      data_categories:
      - user.contact
      data_qualifier: identified
      fields:
      - name: email
        description: user email
        data_categories:
        - user.contact.email
        data_qualifier: identified
        deidentifier:
          type: replace
          value: fake@email.com
        period: P365D
      - name: name
        description: user name
        data_categories:
        - user.name
        data_qualifier: identified
        deidentifier:
          type: redact
        period: P365D
      datetime_field:
        name: created_at
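
The same definition can also be built programmatically. Below is a minimal, illustrative sketch (not part of the generated reference) using pydantic's parse_obj; the key "acme.crm.postgres.user_db" is a made-up value standing in for the key the catalogue folder structure would normally produce.

from blackline.models.catalogue import Dataset

# Mirrors the dataset.yaml example above; the key is hypothetical.
dataset = Dataset.parse_obj(
    {
        "key": "acme.crm.postgres.user_db",
        "name": "user",
        "data_categories": ["user.contact"],
        "data_qualifier": "identified",
        "third_country_transfers": ["USA", "CAN"],
        "collections": {
            "user": {
                "name": "user",
                "description": "user data",
                "datetime_field": {"name": "created_at"},
                "fields": [
                    {
                        "name": "email",
                        "description": "user email",
                        "data_categories": ["user.contact.email"],
                        "data_qualifier": "identified",
                        "deidentifier": {"type": "replace", "value": "fake@email.com"},
                        "period": "P365D",
                    }
                ],
            }
        },
    }
)

print(dataset.collections["user"].fields[0].name)  # email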

Models

blackline.models.catalogue.Dataset

Bases: BlacklineModel

The Dataset resource model.

Todo: This breaks the Liskov substitution principle because it restricts the BlacklineModel rather than expanding it. This model has no children.

Parameters:

Name Type Description Default
meta Optional[dict[str, str]] An optional object that provides additional information about the Dataset. It can be a simple set of `key: value` properties or a deeply nested hierarchy of objects. None
data_categories Optional[list[Key]] Array of Data Category resources identified by `key`, that apply to all collections in the Dataset. None
data_qualifier Key A Data Qualifier resource identified by `key`, that applies to all collections in the Dataset. 'aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified'
joint_controller Optional[ContactDetails] The contact details of the joint controller (see ContactDetails). None
third_country_transfers Optional[list[str]] An optional array to identify any third countries where data is transited to. For consistency purposes, these fields are required to follow the Alpha-3 code set in [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3). None
_check_valid_country_code classmethod required
_alias required
children Optional[dict[str, DatasetCollection]] Collection dict. Aliased to `collections`. None
stem required
children_stem required
children_cls required
collections Optional[dict[str, DatasetCollection]] The Dataset's collections (an alias for `children`). required
Source code in blackline/models/catalogue.py
class Dataset(BlacklineModel):
    """The Dataset resource model.

    Todo: This breaks the Liskov substitution principle because it restricts the BlacklineModel,
    rather than expanding it. This model has no children.
    """

    meta: Optional[dict[str, str]] = Field(
        description=Key(
            "An optional object that provides additional information about the Dataset. You can structure the object however you like. It can be a simple set of `key: value` properties or a deeply nested hierarchy of objects."
        ),
    )
    data_categories: Optional[list[Key]] = Field(
        description="Array of Data Category resources identified by `key`, that apply to all collections in the Dataset.",
    )
    data_qualifier: Key = Field(
        default=Key(
            "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
        ),
        description="Array of Data Qualifier resources identified by `key`, that apply to all collections in the Dataset.",
    )
    joint_controller: Optional[ContactDetails] = Field(
        description=ContactDetails.__doc__,
    )
    third_country_transfers: Optional[list[str]] = Field(
        description="An optional array to identify any third countries where data is transited to. For consistency purposes, these fields are required to follow the Alpha-3 code set in [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3).",
    )
    _check_valid_country_code: classmethod = country_code_validator
    _alias = "collections"
    children: Optional[dict[str, DatasetCollection]] = Field(description=f"Collection dict. Aliased to {_alias}", alias=_alias)  # type: ignore[assignment]
    stem = "dataset"
    children_stem = "collections"
    children_cls = DatasetCollection

    @root_validator(pre=True)
    def add_key_to_collection(cls, values):
        for key, collection in (
            values["collections"].items() if values.get("collections") else []
        ):
            collection["key"] = values["key"] + "." + key
        return values

    @property
    def collections(self) -> Optional[dict[str, DatasetCollection]]:
        return self.children
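
A short sketch of the behaviour defined above: the pre-validator prefixes each collection key with the dataset key, and the collections property is simply an alias for children. The key "analytics.core.users" is hypothetical.

from blackline.models.catalogue import Dataset

ds = Dataset.parse_obj(
    {
        "key": "analytics.core.users",              # hypothetical dataset key
        "collections": {"user": {"name": "user"}},
    }
)

print(ds.collections["user"].key)  # analytics.core.users.user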

blackline.models.catalogue.BlacklineModel

Bases: BaseModel

The base model for all Resources.

Parameters:

Name Type Description Default
key Key A unique key used to identify this resource. required
tags Optional[list[str]] A list of tags for this resource. None
name Optional[str] None
description Optional[str] None
children Optional[dict[str, Type[BlacklineModel]]] The children resources. None
stem str The stem of the resource. required
children_stem Optional[str] required
children_cls Optional[type[BlacklineModel]] required
Source code in blackline/models/catalogue.py
class BlacklineModel(BaseModel):
    """The base model for all Resources."""

    key: Key = Field(description="A unique key used to identify this resource.")
    tags: Optional[list[str]] = Field(description="A list of tags for this resource.")
    name: Optional[str] = name_field
    description: Optional[str] = description_field
    children: Optional[dict[str, Type[BlacklineModel]]] = Field(
        None, description="The children resources."
    )
    stem: ClassVar[str] = Field(description="The stem of the resource.")
    children_stem: ClassVar[Optional[str]] = None
    children_cls: ClassVar[Optional[type[BlacklineModel]]] = None

    class Config:
        "Config for the BlacklineModel"
        extra = "forbid"
        orm_mode = True

    def __getitem__(self, key: str) -> Type[BlacklineModel]:
        parts = key.split(".")
        key = ".".join([self.key, parts[0]])
        if self.children is None:
            raise KeyError(f"No children for {self.key}")
        model = self.children[key]

        for part in parts[1:]:
            model = model[part]  # type: ignore[index]
        return model

    @classmethod
    def parse_dir(cls, path: Path, key_prefix: Optional[str] = None):
        """
        Parse a directory of YAML files into a dictionary of Dataset objects.

        Args:
            path: The path to the directory of YAML files.

        Returns:
            A dictionary of Dataset objects.
        """
        key = ".".join([key_prefix, path.name]) if key_prefix is not None else path.name
        children = cls.parse_children(path=path, key_prefix=key)
        filepath = cls.find_definition_file(path=path)
        return cls.parse_yaml(path=filepath, key=key, children=children)

    @classmethod
    def parse_children(
        cls, path: Path, key_prefix: Optional[str] = None
    ) -> dict[str, Type[BlacklineModel]]:
        """
        Parse a directory of YAML files into a dictionary of Dataset objects.

        Args:
            path: The path to the directory of YAML files.

        Returns:
            A dictionary of Dataset objects.
        """
        children: dict[str, Type[BlacklineModel]] = {}
        if cls.children_cls is None:
            return children
        for child_path in path.iterdir():
            if child_path.is_dir():
                child = cls.children_cls.parse_dir(
                    path=child_path, key_prefix=key_prefix
                )
                children[child.key] = child
        return children

    @classmethod
    def find_definition_file(cls, path: Path) -> Path:
        file = list(path.glob(f"{cls.stem}.yml")) + list(path.glob(f"{cls.stem}.yaml"))
        file_len = len(list(file))
        if file_len == 0:
            raise FileNotFoundError(
                f"No {cls.stem} file found in directory: {path.absolute()}"
            )
        if file_len > 1:
            raise ValueError(
                f"Multiple {cls.stem} files found in directory: {path.absolute()}, only include one of resource.yaml or resource.yml"
            )
        return file[0]

    @classmethod
    def parse_yaml(
        cls,
        path: Path,
        key: str,
        children: Optional[dict[str, Type[BlacklineModel]]] = {},
    ):
        """
        Parse a yaml file into the children_cls object.

        Args:
            path: Path location of the yaml file.
            key: Key to identify the dataset.

        Returns:
            Dataset object.
        """
        with open(path, "r") as f:
            info = yaml.safe_load(f)[cls.stem][0]
            info["key"] = key
            if cls.stem == "dataset":
                return cls.parse_obj(info)
            info[cls.children_stem] = children
            return cls.parse_obj(info)
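
For example, find_definition_file enforces the naming convention used throughout the catalogue: each resource folder must contain exactly one definition file named <stem>.yml or <stem>.yaml. A small sketch (the folder path is hypothetical):

from pathlib import Path

from blackline.models.catalogue import Dataset

folder = Path("catalogue/acme/crm/postgres/user_db")  # hypothetical dataset folder
try:
    definition = Dataset.find_definition_file(path=folder)  # looks for dataset.yml / dataset.yaml
except FileNotFoundError:
    definition = None  # neither file exists in the folder
# A ValueError is raised instead if both dataset.yml and dataset.yaml are present.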

Config

Config for the BlacklineModel

Source code in blackline/models/catalogue.py
class Config:
    "Config for the BlacklineModel"
    extra = "forbid"
    orm_mode = True
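
Because extra = "forbid", unknown keys in a definition file fail validation instead of being silently ignored. A minimal sketch (the key and the misspelled field are hypothetical):

from pydantic import ValidationError

from blackline.models.catalogue import Dataset

try:
    Dataset.parse_obj({"key": "demo.dataset", "colections": {}})  # note the typo in "collections"
except ValidationError as err:
    print(err)  # pydantic reports the unexpected field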

parse_children(path, key_prefix=None) classmethod

Parse a directory of YAML files into a dictionary of Dataset objects.

Parameters:

Name Type Description Default
path Path The path to the directory of YAML files. required

Returns:

Type Description
dict[str, Type[BlacklineModel]] A dictionary of Dataset objects.

Source code in blackline/models/catalogue.py
@classmethod
def parse_children(
    cls, path: Path, key_prefix: Optional[str] = None
) -> dict[str, Type[BlacklineModel]]:
    """
    Parse a directory of YAML files into a dictionary of Dataset objects.

    Args:
        path: The path to the directory of YAML files.

    Returns:
        A dictionary of Dataset objects.
    """
    children: dict[str, Type[BlacklineModel]] = {}
    if cls.children_cls is None:
        return children
    for child_path in path.iterdir():
        if child_path.is_dir():
            child = cls.children_cls.parse_dir(
                path=child_path, key_prefix=key_prefix
            )
            children[child.key] = child
    return children

parse_dir(path, key_prefix=None) classmethod

Parse a directory of YAML files into a dictionary of Dataset objects.

Parameters:

Name Type Description Default
path Path The path to the directory of YAML files. required

Returns:

Type Description
A dictionary of Dataset objects.

Source code in blackline/models/catalogue.py
@classmethod
def parse_dir(cls, path: Path, key_prefix: Optional[str] = None):
    """
    Parse a directory of YAML files into a dictionary of Dataset objects.

    Args:
        path: The path to the directory of YAML files.

    Returns:
        A dictionary of Dataset objects.
    """
    key = ".".join([key_prefix, path.name]) if key_prefix is not None else path.name
    children = cls.parse_children(path=path, key_prefix=key)
    filepath = cls.find_definition_file(path=path)
    return cls.parse_yaml(path=filepath, key=key, children=children)
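
An illustrative call (folder names invented): the key is composed from key_prefix and the folder name before the children and the definition file are parsed.

from pathlib import Path

from blackline.models.catalogue import Dataset

# Assumes catalogue/acme/crm/postgres/user_db/dataset.yaml exists and parses.
dataset = Dataset.parse_dir(
    path=Path("catalogue/acme/crm/postgres/user_db"),
    key_prefix="acme.crm.postgres",
)
# dataset.key == "acme.crm.postgres.user_db"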

parse_yaml(path, key, children={}) classmethod

Parse a yaml file into the children_cls object.

Parameters:

Name Type Description Default
path Path Path location of the yaml file. required
key str Key to identify the dataset. required

Returns:

Type Description
Dataset object.

Source code in blackline/models/catalogue.py
@classmethod
def parse_yaml(
    cls,
    path: Path,
    key: str,
    children: Optional[dict[str, Type[BlacklineModel]]] = {},
):
    """
    Parse a yaml file into the children_cls object.

    Args:
        path: Path location of the yaml file.
        key: Key to identify the dataset.

    Returns:
        Dataset object.
    """
    with open(path, "r") as f:
        info = yaml.safe_load(f)[cls.stem][0]
        info["key"] = key
        if cls.stem == "dataset":
            return cls.parse_obj(info)
        info[cls.children_stem] = children
        return cls.parse_obj(info)
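
Note that the loader above reads yaml.safe_load(f)[cls.stem][0], i.e. the first entry under the stem key; it then injects the key and, for non-dataset stems, attaches the pre-parsed children. A hedged sketch of a direct call with a hypothetical path and key:

from pathlib import Path

from blackline.models.catalogue import Dataset

dataset = Dataset.parse_yaml(
    path=Path("catalogue/acme/crm/postgres/user_db/dataset.yaml"),  # hypothetical path
    key="acme.crm.postgres.user_db",                                # hypothetical key
)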

blackline.models.catalogue.ContactDetails

Bases: BaseModel

The contact details information model.

Used to capture contact information for controllers, used as part of exporting a data map / ROPA.

This model is nested under an Organization and potentially under a system/dataset.

Parameters:

Name Type Description Default
name Optional[str] An individual name used as part of publishing contact information. None
address Optional[str] An individual address used as part of publishing contact information. None
email Optional[str] An individual email used as part of publishing contact information. None
phone Optional[str] An individual phone number used as part of publishing contact information. None
Source code in blackline/models/catalogue.py
class ContactDetails(BaseModel):
    """
    The contact details information model.

    Used to capture contact information for controllers, used
    as part of exporting a data map / ROPA.

    This model is nested under an Organization and
    potentially under a system/dataset.
    """

    name: Optional[str] = Field(
        description="An individual name used as part of publishing contact information.",
    )
    address: Optional[str] = Field(
        description="An individual address used as part of publishing contact information.",
    )
    email: Optional[str] = Field(
        description="An individual email used as part of publishing contact information.",
    )
    phone: Optional[str] = Field(
        description="An individual phone number used as part of publishing contact information.",
    )
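
A small constructed example, reusing the joint_controller values from the dataset.yaml example at the top of this page:

from blackline.models.catalogue import ContactDetails

controller = ContactDetails(
    name="Dave",
    address="Museumplein 10, 1071 DJ Amsterdam, Netherlands",
    email="dave@organization.com",
    phone="020 573 2911",
)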

blackline.models.catalogue.DatasetCollection

Bases: BlacklineModel

The DatasetCollection resource model.

This resource is nested within a Dataset.

Parameters:

Name Type Description Default
name str The name of the collection. required
datetime_field Optional[DatetimeField] The datetime field to use for the retention limit calculations. None
where Optional[str] An additional where clause to append to the existing: 'WHERE {{ datetime_column }} < %(cutoff)s'. None
fields Optional[list[DatasetField]] An array of objects that describe the collection's fields. None
data_categories Optional[list[Key]] Array of Data Category resources identified by `key`, that apply to all fields in the collection. None
data_qualifier Key A Data Qualifier resource identified by `key`, that applies to all fields in the collection. 'aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified'
_sort_fields classmethod required
dependencies Optional[list[str]] The collection dependencies. None
Source code in blackline/models/catalogue.py
class DatasetCollection(BlacklineModel):
    """
    The DatasetCollection resource model.

    This resource is nested within a Dataset.
    """

    name: str = Field(..., description="The name of the collection.")

    datetime_field: Optional[DatetimeField] = Field(
        description="The datetime field to use for the retention limit calculations."
    )
    where: Optional[str] = Field(
        None,
        description="An addional where clause to append to the exeisting: 'WHERE {{ datetime_column }} < %(cutoff)s'.",  # noqa: E501
    )
    fields: Optional[list[DatasetField]] = Field(
        description="An array of objects that describe the collection's fields.",
    )

    data_categories: Optional[list[Key]] = Field(
        description="Array of Data Category resources identified by `key`, that apply to all fields in the collection.",
    )
    data_qualifier: Key = Field(
        default=Key(
            "aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified"
        ),
        description="Array of Data Qualifier resources identified by `key`, that apply to all fields in the collection.",
    )

    _sort_fields: classmethod = validator("fields", allow_reuse=True)(
        sort_list_objects_by_name
    )
    dependencies: Optional[list[str]] = Field(
        None, description="The collection dependencies."
    )
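
A sketch of building a collection on its own. When loaded from dataset.yaml the key is injected by the Dataset validator, so here it is supplied by hand; all values are illustrative.

from blackline.models.catalogue import DatasetCollection

collection = DatasetCollection.parse_obj(
    {
        "key": "acme.crm.postgres.user_db.user",   # hypothetical key
        "name": "user",
        "datetime_field": {"name": "created_at"},
        "where": "active = false",                 # hypothetical clause appended to the cutoff WHERE
        "fields": [
            {
                "name": "email",
                "description": "user email",
                "data_categories": ["user.contact.email"],
                "data_qualifier": "identified",
                "deidentifier": {"type": "redact"},
                "period": "P365D",
            }
        ],
    }
)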

blackline.models.catalogue.DatasetField

Bases: DatasetFieldBase

The DatasetField resource model.

This resource is nested within a DatasetCollection.

Parameters:

Name Type Description Default
fields Optional[list[DatasetField]] An optional array of objects that describe hierarchical/nested fields (typically found in NoSQL databases). None
Source code in blackline/models/catalogue.py
class DatasetField(DatasetFieldBase):
    """
    The DatasetField resource model.

    This resource is nested within a DatasetCollection.
    """

    fields: Optional[list[DatasetField]] = Field(
        description="An optional array of objects that describe hierarchical/nested fields (typically found in NoSQL databases).",
    )
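
Fields can nest to describe hierarchical documents. The sketch below assumes the per-field arguments mirror those in the dataset.yaml example above (name, description, data categories, deidentifier, period); the address/city names are invented.

from blackline.models.catalogue import DatasetField

field = DatasetField.parse_obj(
    {
        "name": "address",
        "description": "user address document",
        "data_categories": ["user.contact"],
        "data_qualifier": "identified",
        "deidentifier": {"type": "redact"},
        "period": "P365D",
        "fields": [
            {
                "name": "city",
                "description": "city within the address document",
                "data_categories": ["user.contact"],
                "data_qualifier": "identified",
                "deidentifier": {"type": "redact"},
                "period": "P365D",
            }
        ],
    }
)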