Skip to content

Scan Databases with the CLI

Overview

Fides ships with several CLI-based scanners to connect to your databases, services, or applications, and create Fides-compatible resource templates for use in your CI/CD pipelines and data maps.

The Fides CLI provides a generate command to connect to a database or service, and automatically generate a YAML-based resource file based on the database schema. The scan command is available to compare your existing resources against what is defined in your Fides server, or against your resource manifest files, to ensure any updates remain compliant. These commands are available for a variety of databases and services.

The scan and generate commands work best when used in tandem. The Fides format must be followed to ensure coverage.

Scan databases

Fides works out-of-the-box with most databases, including:

  • PostgreSQL
  • MySQL and MariaDB
  • SQLite
  • Oracle
  • Microsoft SQL Server

For a list of supported options, see the available SQLAlchemy dialects.

Additionally, Fides includes commands to connect to BigQuery and Snowflake data warehouses.

Provide connection credentials

Database credentials are provided as part of the connection string supplied. The connection string can be supplied as a command option, or using the Fides configuration. The appropriate connection-string format for your database connector can be found in the SQLAlchemy Documentation.

Command line

A connection string can be supplied using the connection-string option:

fides generate dataset db \
--connection-string <my_connection_string>

Configuration files

A connection string can also be defined within your Fides configuration under the credentials section.

[credentials]
my_database_credentials = {connection_string="<my_connection_string>"}

Your command can then reference the key defined in your config:

...
--credentials-id "my_database_credentials"
...

Environment variables

It is possible to use an environment variable to set credentials config values if persisting your connection string to a file is problematic. To set a connection string you can set the environment variable with a prefix of FIDES__CREDENTIALS__ and __ as the nested key delimiter:

export FIDES__CREDENTIALS__MY_DATABASE_CREDENTIALS__CONNECTION_STRING="<my_database_credentials>"

Example

The following example will generate a new dataset from the below database schema:

 id |     created_at      |       email       |              password              | first_name | last_name
----+---------------------+-------------------+------------------------------------+------------+-----------
  1 | 2020-01-01 00:00:00 | admin@example.com | pbkdf2:sha256:260000$O87nanbSkl... | Admin      | User
  2 | 2020-01-03 00:00:00 | user@example.com  | pbkdf2:sha256:260000$PGcBy5NzZe... | Example    | User

Invoke the generate command by providing a connection url for this database:

./venv/bin/fides generate dataset db \
  fides_resources/flaskr_postgres_dataset.yml \
  --connection-string postgresql://postgres:postgres@localhost:5432/flaskr

The second line represents an output file location. In this case, the generated dataset will be placed in the /fides_resources/ directory, as a file named flaskr_postgres_dataset.yml.

The result is a resource file with a dataset with collections and fields to represent the above schema:

dataset:
- fides_key: public
  organization_fides_key: default_organization
  name: public
  description: 'Fides Generated Description for Schema: public'
  meta: null
  data_categories: []
  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
  collections:
  - name: public.users
    description: 'Fides Generated Description for Table: public.users'
    data_categories: []
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    fields:
    - name: created_at
      description: 'Fides Generated Description for Column: created_at'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: email
      description: 'Fides Generated Description for Column: email'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: first_name
      description: 'Fides Generated Description for Column: first_name'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: id
      description: 'Fides Generated Description for Column: id'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: last_name
      description: 'Fides Generated Description for Column: last_name'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    - name: password
      description: 'Fides Generated Description for Column: password'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified

From here, the data_categories can be edited to include the types of data stored by each field. For information on how Fides can do this automatically, see the Fides data classifier.

💡
To use this Dataset in a Privacy Request, you must add required meta information.

Scan the dataset

The scan command can then connect to your database and compare its schema to your defined datasets, helping you track changes between your live tables and your stored resources.

./venv/bin/fides scan dataset db \
  fides_resources/flaskr_postgres_dataset.yml \
  --connection-string postgresql://postgres:postgres@localhost:5432/flaskr

The command output confirms our database resource is covered fully:

Loading resource manifests from: dataset.yml
Taxonomy successfully created.
Successfully scanned the following datasets:
	public

Annotation coverage: 100%

Scan a Google Cloud Platform account

The generate command can connect to a GCP account and automatically generate resource files based on the contents. Currently, generating datasets from BigQuery is supported.

Providing Credentials

GCP credentials can be generated via a service account keyfile which can be passed as a command option or via Fides configuration. You will need to set project specific credentials for access rights, but datasets can be passed explicitly at runtime.

Command line

The path to the keyfile can be directly supplied in your command using the keyfile_path option.

...
--keyfile-path "/path/to/keyfile.json"
...

Configuration files

Credentials can be defined within your Fides config under the credentials section.

[credentials.bigquery_1]
type = "service_account"
project_id = "<my_project_id>"
private_key_id = "<my_private_key_id>"
private_key = "<my_private_key>"
client_email = "<my_client_email>"
client_id = "<my_client_id>"
auth_uri = "https://accounts.google.com/o/oauth2/auth"
token_uri = "https://oauth2.googleapis.com/token"
auth_provider_x509_cert_url = "https://www.googleapis.com/oauth2/v1/certs"
client_x509_cert_url = "<my_cert_url>"

Your command can then reference the key defined in your config.

...
--credentials-id "my_gcp_credentials"
...

If persisting your keys to a config file is problematic, it is possible to use an environment variable to set credentials config values. To set a secret access key and id, you can set the environment variable with a prefix of FIDES__CREDENTIALS__ and __ as the nested key delimiter:

export FIDES__CREDENTIALS__BIGQUERY_1__PRIVATE_KEY="<my_private_key>"
export FIDES__CREDENTIALS__BIGQUERY_1__CLIENT_ID="<my_client_id>"

Generate a Dataset

Once credentials have been configured, the generate dataset gcp bigquery command can take both a configuration option and a dataset name to create the resource file:

./venv/bin/fides generate dataset gcp bigquery \
  <dataset_name> --keyfile-path ".fides/creds/bigquery.json" \
  <output_file_name>

The result is a resource file with a dataset that represents the bigquery dataset defined in your account.

dataset:
- fides_key: my_bigquery_dataset
  organization_fides_key: default_organization
  name: bigquery dataset
  description: 'Fides Generated Description for Schema: BigQuery'
  data_categories: []
  data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
  retention: No retention or erasure policy
  collections:
  - name: table
    description: 'Fides Generated Description for Table: table'
    data_categories: []
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
    fields:
    - name: column
      description: 'Fides Generated Description for Column: column'
      data_categories: []
      data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
...