Baffle Advanced Data Protection
Baffle Advanced Data Protection provides a range of data encryption, tokenization, and de-identification methods to protect data in data stores and cloud storage environments. Common methods that Baffle employs include column- or field-level encryption, tokenization, format-preserving encryption (FPE), dynamic data masking, and record-level encryption.
Baffle integrates with key management stores via a key virtualization layer. It also provides a local key store so you can use your own keys for data protection in the cloud. This page covers the following topics. Click a link to jump to a topic:
- Baffle Advanced Data Protection for Snowflake
- Baffle Shield setup
- Snowflake setup
- AWS configuration
- Snowflake configuration
Baffle Advanced Data Protection for Snowflake
This document provides a high-level overview of how to set up Baffle Advanced Data Protection for decrypting data (re-identification) on AWS, and how to configure Snowflake to use it.
Baffle Advanced Data Protection enables column-level encryption for data in Snowflake (de-identification) and allows decryption (re-identification) of the data based on policy. Snowflake calls this feature External Tokenization. The data de-identification process requires that Baffle is used to encrypt (de-identify) data as it gets staged before ingestion into Snowflake, utilizing Baffle’s API or proxy. These methods can encrypt data as it is moved to a stage environment.
When executing a query on a Snowflake data warehouse, Snowflake is configured to automatically rewrite your queries to include calls out to Baffle DPS and enable decryption. The Baffle DPS decryption service runs in your own VPC and is based on a policy you create in Snowflake.
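In practice, this policy-based rewrite is expressed as a Snowflake masking policy that calls out to a Baffle external function. The sketch below is illustrative only: the function name baffle_decrypt matches the external function example later on this page, while the policy name, role, table, and column are assumptions.

```sql
-- Authorized roles see re-identified values via the Baffle external
-- function; all other roles see the stored ciphertext unchanged.
CREATE OR REPLACE MASKING POLICY ssn_deprotect AS (val VARCHAR)
  RETURNS VARCHAR ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST_ROLE') THEN baffle_decrypt(val)::VARCHAR
    ELSE val
  END;

ALTER TABLE customers MODIFY COLUMN ssn
  SET MASKING POLICY ssn_deprotect;
```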
- For cost effectiveness, host your Snowflake and customer account with AWS and make sure they are in the same region. This is the assumed architecture.
- Install the Snowflake command line tools. For more information, see the Snowflake Installing SnowSQL documentation.
- Install the AWS Command Line tools. For more information, see the Amazon AWS Command Line Interface documentation.
- Install Java version 1.8 or later.
- Build a set of Baffle config files, including CLE rules in the Baffle Privacy Schema (TOML only). See Baffle Config files for Snowflake AWS deployments for examples, or see the following documentation to modify existing Baffle config files:
– BaffleCommonConfig: Modify and Deploy the Baffle Common Config File
– KmsConfig.properties: Modify the KMS Config File
– BafflePrivacySchema: Modify and Deploy the Baffle Privacy Schema (TOML format only)
- Download the baffle-snowflake-service.
- Copy the baffle-sources baffle-client-local-shaded-INTERNAL.jar file into both the service/libs and configtool/libs directories.
- Build the service and the Snowflake configuration tool by running gradle build.
- Copy the Snowflake service zip file (service/build/distributions/service-1.0-SNAPSHOT.zip) to the S3 bucket of your choice. The key must be named snowflakeservice.zip.
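The build-and-upload steps above can be sketched as follows; the bucket name my-baffle-artifacts is a placeholder for your own bucket.

```shell
# Build the service and configuration tool, then upload the distribution
# under the key the setup script expects ("my-baffle-artifacts" is a placeholder).
gradle build
aws s3 cp service/build/distributions/service-1.0-SNAPSHOT.zip \
    s3://my-baffle-artifacts/snowflakeservice.zip
```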
Baffle Shield setup
- Set up Baffle Shield to write its output to an Amazon S3 bucket. For more information, see Configure a Baffle Shield – AWS AMI and the Amazon S3 documentation.
- Create and run an Amazon Database Migration Service (DMS). For more information, see the Amazon DMS documentation.
- Navigate to the populated S3 bucket and verify that the CSV file columns are encrypted.
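One quick way to verify is to stream a file from the bucket and inspect the first rows; the bucket name and object key below are placeholders.

```shell
# Print the first rows of an exported CSV; protected columns should
# contain ciphertext, not plaintext (bucket and key are placeholders).
aws s3 cp s3://my-baffle-shield-output/public/customers/LOAD00000001.csv - | head -n 5
```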
Snowflake setup
- Create a role called baffleadmin on Snowflake to manage the Baffle integration. All of the integration objects will be owned and managed by this role.
- Grant further rights and usages from the baffleadmin role, similar to the following example.
CREATE ROLE baffleadmin;
GRANT USAGE ON WAREHOUSE <yourwarehouse> TO ROLE baffleadmin;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE baffleadmin;
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE baffleadmin WITH GRANT OPTION;
GRANT APPLY MASKING POLICY ON ACCOUNT TO ROLE baffleadmin WITH GRANT OPTION;
- Create a database/schema/tables/columns for your data that match the BafflePrivacySchema (BPS). NOTE: The BPS has three-level database column definitions for MySQL, but Snowflake is four-level. The included scripts assume that the table names and column names match – the database/schema does not need to match.
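For illustration, assuming the BPS protects a column named ssn in a table named customers (both hypothetical), the matching four-level Snowflake objects could be created as follows:

```sql
-- Hypothetical objects matching a BPS entry for customers.ssn.
-- The database and schema names may differ from the MySQL source;
-- the table and column names must match.
CREATE DATABASE baffle_demo;
CREATE SCHEMA baffle_demo.protected;
CREATE TABLE baffle_demo.protected.customers (
    id  NUMBER,
    ssn VARCHAR   -- holds the ciphertext produced by Baffle Shield
);
```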
- Grant the baffleadmin role to the users who will administer your Baffle/Snowflake integration.
- Run the setup.sh script and specify the following when prompted:
– CloudFormation stack name. Choose a name that you will be able to find in the AWS console.
– Name of an S3 bucket containing your Baffle configurations.
– Name of an S3 bucket containing the snowflakeservice.zip file.
– Name of an S3 bucket containing the CSV files generated by the Baffle Shield.
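A run might look like the following; every answer shown is a placeholder for your environment.

```shell
# Example interactive run of setup.sh (all values are placeholders).
./setup.sh
#   CloudFormation stack name:    baffle-snowflake-stack
#   Baffle config bucket:         my-baffle-configs
#   snowflakeservice.zip bucket:  my-baffle-artifacts
#   CSV data bucket:              my-baffle-shield-output
```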
The setup.sh script performs the following tasks:
- Sets up an AWS CloudFormation stack, using the setup-sf.yml script. The stack includes:
– An AWS Lambda function containing the snowflake-service
– An AWS API Gateway API that Snowflake can use to invoke the snowflake-service
– IAM roles for both the Lambda function and Snowflake to assume
- Creates a Snowflake storage integration object that refers to the S3 bucket containing the CSV data files.
- Creates a Snowflake API integration object that refers to the snowflake-service Lambda function.
- Updates the IAM role created by CloudFormation to trust only the newly created Snowflake integration objects.
- Fetches the BPS from the specified configuration bucket.
- Runs the Snowflake configuration tool on the BPS. This generates a Snowflake script that creates the per-column objects in Snowflake.
- Creates the Snowflake objects based on the BPS and attaches them to the applicable tables/columns.
AWS configuration
In the customer AWS account where Baffle is hosted, the encryption API endpoints need to be exposed to the specific Snowflake instance. The easiest way to do this is with the AWS API Gateway service, which provides a convenient, flexible way to expose APIs to external entities, along with a security and versioning layer.
Snowflake configuration
A set of external functions representing the Baffle API should be created in the Snowflake instance, and permission to use them granted to the Snowflake users who need access.
To do this, you first create an API and specify the customer-account IAM role for Snowflake to assume during invocation, as shown in the following example.
CREATE OR REPLACE API INTEGRATION baffle_api
api_provider = aws_api_gateway
api_aws_role_arn = 'arn:aws:iam::EXAMPLE:role/snowflake_role'
enabled = true
api_allowed_prefixes = ('https://EXAMPLE.execute-api.us-west-2.amazonaws.com/Snowflake_Baffle');
You can query the newly-created API to find out the identity ARN and external ID that Snowflake will use when assuming the specified cross-account role. The role’s trust policy on the customer account side should be restricted to only trust this specific ARN and ID. This binds the role and this specific Snowflake/API tuple to each other.
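Assuming the baffle_api integration created above, the identity ARN and external ID can be read with DESCRIBE INTEGRATION:

```sql
-- Look for API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID in the output,
-- then restrict the customer-side role's trust policy to those values.
DESCRIBE INTEGRATION baffle_api;
```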
Then, each of the endpoints of the API is created as an external function. The following examples create the encrypt and decrypt functions (the /encrypt and /decrypt resource paths are placeholders for your API's endpoints):
CREATE EXTERNAL FUNCTION baffle_encrypt(v VARCHAR)
    RETURNS VARIANT
    API_INTEGRATION = baffle_api
    AS 'https://EXAMPLE.execute-api.us-west-2.amazonaws.com/Snowflake_Baffle/encrypt';
CREATE EXTERNAL FUNCTION baffle_decrypt(v VARCHAR)
    RETURNS VARIANT
    API_INTEGRATION = baffle_api
    AS 'https://EXAMPLE.execute-api.us-west-2.amazonaws.com/Snowflake_Baffle/decrypt';
NOTE: Snowflake invokes the API in batches, with multiple row values in a single invocation. If the API has a maximum number of rows that it can handle at once, this can be specified as a configuration on the Snowflake instance. Also, the API should consider returning a 429 (Too Many Requests) error when overloaded. In this case, Snowflake automatically retries with smaller batch sizes.
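If the Baffle API has such a per-request row limit, it can be applied to the external function with the MAX_BATCH_ROWS property; the function name and the value of 500 below are illustrative.

```sql
-- Cap each external-function invocation at 500 rows (illustrative value).
ALTER FUNCTION baffle_decrypt(VARCHAR) SET MAX_BATCH_ROWS = 500;
```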