[Data Flow] [Customer] Guide and Documentation V1
October 15, 2025

Objective

The main objective of this document is to enable the customer to use the Data Flow data service efficiently, providing a detailed overview of its technical aspects. The focus is to ensure that the customer fully understands the solution, enabling its safe and optimized use.

Definitions

Data Flow

In the dynamic data engineering ecosystem, effective data sharing across platforms is a critical pillar for advanced analytics and informed decision-making. Our platform, integrated with leading solutions such as Databricks, Azure Data Factory, and Soda, is at the forefront of simplifying data sharing. At the heart of this initiative is Data Flow, a vital tool that leverages the Delta Sharing protocol to facilitate data sharing.

At the core of Data Flow is a Python library specifically designed to streamline and secure data sharing across multiple platforms. By adopting the Delta Sharing protocol, this library offers an efficient solution for data engineers while adhering to strict governance policies. From facilitating data export and import to maintaining data integrity during sharing, Data Flow becomes an indispensable partner.

In summary, Data Flow is an abstraction layer created by the Data team at Blip to facilitate the configuration and management of data sharing between parties. With it, it is possible to consume Near Real Time data, in batch or streaming mode, in a secure and simple way. Through the sharing of Blip contracts (tenant ID), it is possible to ensure that the data is consistent and reliable. Finally, it is possible to manage exceptions in sharing, with error handling and permission strategies.

We will explore some key features that Data Flow enables:

Secure Data Sharing
Data Flow simplifies the process of sharing data between systems, ensuring that data movement is not only efficient but also aligned with governance standards. Whether moving data in batches or streaming it in real time, Data Flow equips you with the tools you need to share data securely and efficiently.

Simplified Interfaces for Access
Through the protocol used, Data Flow offers a simplified flow for interacting with data systems: an interoperable solution that supports any use case, any platform, and any tool.

Implementation of Sharing Contracts
Maintaining data integrity is essential, and Data Flow employs advanced features to ensure that shared data sets are consistent and reliable. By defining and implementing sharing agreements, the tool ensures that data conforms to predefined formats and standards, minimizing errors and inconsistencies.

Effective Exception Management
Data sharing can be subject to unexpected events and exceptions. Data Flow provides a robust solution for managing these exceptions, enabling the implementation of effective error handling strategies and ensuring that data sharing processes remain continuous and reliable.

Product/Service Vision and Roadmap

As a data service, Data Flow is the best way for clients to consume raw conversational data from Blip, with near real-time latency (up to 120s). The available data comes from the following tables: messages, eventtracks, notifications, and tickets. The customer can consume the data through Databricks or any other data consumption solution that supports the Delta Sharing protocol.

Architecture

Delta Sharing

What is Delta Sharing?
Delta Sharing is an open protocol developed by Databricks that changes the way organizations share and exchange data. It provides a simple, secure, and open method for data providers and consumers to share information in real time, regardless of the computing platforms they use.

Fundamental Concepts

Provider / Data Provider
Entities that make data available for sharing. In our case, Blip.

Share
A share is a logical grouping of data assets made available to recipients with read-only permissions. A share can be shared with one or multiple recipients, and a recipient can access all resources in a share. A share can contain multiple schemas, tables, notebooks, volumes, ML models, or other data assets that the provider wants to share.

Recipient / Data Recipient
A client that has a token to access shared objects.

Schema
A schema is a logical grouping of tables. A schema can contain multiple tables.

Table
A table is either a Delta Lake table or a view over a Delta Lake table.

Sharing Server
A server that implements the protocol.

Delta Sharing protocol operation diagram

Delta Sharing: Sharing Methods

D2D: Sharing between Databricks environments, with access via Catalog Explorer, the Databricks CLI, or SQL.
D2O: Sharing from Databricks to open-source platforms, using credentials and activation links.
O2O: Sharing between open-source platforms, using a reference server and Delta Sharing clients.
O2D: Sharing from open-source platforms to Databricks, using a token-based credential system.

Sharing Between Databricks Environments (Databricks to Databricks - D2D)

The recipient provides a unique identifier tied to its Databricks workspace.
The data provider creates a "share" in its own workspace, which includes tables, views, and notebooks.
A "recipient" object is created to represent the user or group that will access the data.
The provider grants access to the share, which appears in the recipient's workspace.
Users can access the share through various means, such as Catalog Explorer, the Databricks CLI, or SQL commands (a sketch of this flow appears at the end of this section).

Databricks to Open Source Sharing (Databricks to Open - D2O)

The data provider creates "recipient" and "share" objects, just like in the previous method.
A token and an activation link are generated for the recipient.
The provider sends the activation link to the recipient securely.
The recipient uses this link to download a credential file, which is used to establish a secure connection with the provider and access the shared data.
This method allows data to be read on any platform or tool.

Open to Open (O2O) Sharing

Allows data sharing between any open-source platforms or tools, without the need for Databricks.
The data provider can use an open-source reference server to create and manage shares and recipients.
The recipient can use any Delta Sharing client to access the shared data using a credential file.
This method enables data sharing across different clouds and regions with minimal configuration and maintenance.

Open to Databricks (O2D) Sharing

Enables sharing of data and AI models beyond the Databricks ecosystem.
Uses a token-based credential system, allowing data providers to share assets with any user, regardless of access to Databricks.
Examples include sharing data from Oracle to Databricks.
Despite its openness, Delta Sharing ensures robust security and governance.
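To make the D2D flow above more concrete, here is a minimal sketch of the SQL commands involved, run from Databricks notebooks via spark.sql. This is an illustration only: the object names (example_share, example_recipient, example_provider, example_shared_catalog), the table path, and the sharing identifier are hypothetical placeholders and not the actual objects shared by Blip, which are covered in the connection guide below.

# Minimal sketch of the D2D flow, assuming Databricks notebooks with Unity
# Catalog on both sides, where `spark` is the notebook's built-in SparkSession.
# All names and the sharing identifier are hypothetical placeholders.

# Provider side: create a share, add a table, register the recipient by the
# sharing identifier of its workspace, and grant read access to the share.
spark.sql("CREATE SHARE IF NOT EXISTS example_share")
spark.sql("ALTER SHARE example_share ADD TABLE example_catalog.example_schema.messages")
spark.sql(
    "CREATE RECIPIENT IF NOT EXISTS example_recipient "
    "USING ID 'azure:westeurope:12345678-aaaa-bbbb-cccc-1234567890ab'"
)
spark.sql("GRANT SELECT ON SHARE example_share TO RECIPIENT example_recipient")

# Recipient side: mount the share as a catalog and read the shared table like
# any other Unity Catalog table.
spark.sql("CREATE CATALOG IF NOT EXISTS example_shared_catalog USING SHARE example_provider.example_share")
spark.read.table("example_shared_catalog.example_schema.messages").show(5)

In the D2O, O2O, and O2D methods, the recipient side is replaced by the token-based credential file described in the connection guide below.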
Sharing possibilities via Delta Sharing

CONNECTION GUIDE

Usage in Databricks

By default, the customer must locate the “deltashare_core” share (or another name, if it is a custom share) through the catalog, in the “Delta Sharing” tab, within the “Shared with me” section, as shown in the images below. It is necessary to create a catalog from the share. It is then possible to read the data using the various tools available to Databricks users.

Download Activation Link

The recipient who receives the activation link must download the credentials file locally in JSON format. Note that, for security reasons, the credentials file can only be downloaded once, and the activation link expires after the first download. For certain technologies, such as Tableau, in addition to the URL link, you may need to upload this credentials file. For other technologies, you may need a bearer token or other credentials contained in this file.

Using the Credentials File in Python/Notebooks

Once the credentials file has been downloaded, it can be used across multiple notebook platforms, such as Jupyter and Databricks, to access shared data as data frames. To enable this functionality in your notebook, run the following commands to install and import the Delta Sharing client. Alternatively, you can install it from PyPI by searching for and installing the "delta-sharing" package. After installation, you can use the previously downloaded credentials profile file to enumerate and access all shared tables in your notebook environment:

## Install the Delta Sharing Python package
!pip install delta-sharing

## Import Delta Sharing libraries
import delta_sharing

## Client configuration pointing to the credentials file
## This part can be done locally or stored in an external environment
config_path = "C:/Users/dummy_user/shares/config.share"
client = delta_sharing.SharingClient(config_path)

Listing available datasets

Now that the client has been configured, you can query the available datasets within your Data Flow. You can do this simply by calling the list_all_tables() method of the SharingClient object you just created. In the example below, a dataset called incomings_schema.records, located inside incomings_share, is available. Each Table object that appears in this list is a different table/dataset that you have access to.

## Display available datasets
print(client.list_all_tables())

## Result
[Table(name='records', share='incomings_share', schema='incomings_schema')]
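The open-source client also lets you walk the hierarchy level by level, which can be convenient when a Data Flow contains more than one share or schema. The sketch below is a minimal example, assuming the same config.share path used above.

## Walk the hierarchy level by level (share -> schema -> table)
## and print each full table name
import delta_sharing

config_path = "C:/Users/dummy_user/shares/config.share"
client = delta_sharing.SharingClient(config_path)

for share in client.list_shares():
    for schema in client.list_schemas(share):
        for table in client.list_tables(schema):
            print(f"{share.name}.{schema.name}.{table.name}")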
Accessing the data

To access and use data from a dataset via Data Flow, you need to load the data into the Python session, either as a pandas DataFrame or as an Apache Spark DataFrame, depending on your convenience. The complete address of a dataset is made up of three distinct parts. The first is the path to your credentials file (i.e., the config.share file). The second is a hash character (#), which acts as a separator between the first and third parts of the full address. The third is the full name of the dataset, composed of three parts separated by periods: the share name, the schema name, and the dataset name (or the "table name", if you prefer to call it that). Therefore, the full name of a dataset takes the form <share-name>.<schema-name>.<table-name>.

Returning to the previous example, we have access to only one dataset, called records, which is within the schema incomings_schema and the share incomings_share. Therefore, the full name of this dataset is incomings_share.incomings_schema.records. Since we already have the path to our credentials file, the full address for this dataset is:

## Data access path configuration
config_path = "C:/Users/dummy_user/shares/config.share"
table_name = "incomings_share.incomings_schema.records"
table_address = config_path + "#" + table_name
print(table_address)

## Result
C:/Users/dummy_user/shares/config.share#incomings_share.incomings_schema.records

Loading data into pandas

To do this, you can use the load_as_pandas() method from the delta_sharing library. All you need to do is provide the full address of the dataset to the method.

## Import using pandas
import delta_sharing

config_path = "C:/Users/dummy_user/shares/config.share"
table_name = "incomings_share.incomings_schema.records"
table_address = config_path + "#" + table_name
table = delta_sharing.load_as_pandas(table_address)
print(table)

## Result
         date                   datetime  id  value
0  2024-05-15 2024-05-15 15:12:20.680756   1   3200
1  2024-05-15 2024-05-15 15:12:54.680769   2   1550
2  2024-05-15 2024-05-15 15:06:14.680772   3   8700
3  2024-05-15 2024-05-15 15:13:08.680774   4   5800

Loading data into Apache Spark

If you prefer, you can load the dataset into Apache Spark by changing the method to load_as_spark(). The input to this method is the same as for load_as_pandas(), namely the full address of the dataset you are trying to access.

## Import using Apache Spark
import delta_sharing

config_path = "C:/Users/dummy_user/shares/config.share"
table_address = config_path + "#incomings_share.incomings_schema.records"
table = delta_sharing.load_as_spark(table_address)
table.show()

## Result
+----------+--------------------+---+-----+
|      date|            datetime| id|value|
+----------+--------------------+---+-----+
|2024-05-15|2024-05-15 15:12:...|  2| 1550|
|2024-05-15|2024-05-15 15:06:...|  3| 8700|
|2024-05-15|2024-05-15 15:13:...|  4| 5800|
|2024-05-15|2024-05-15 15:12:...|  1| 3200|
+----------+--------------------+---+-----+

Use in Reporting Platforms

Additionally, you have the option of using well-known reporting platforms, such as Power BI, to access your shared tables. In the context of Power BI, connecting to a Delta Sharing source is a very straightforward process: select “Delta Sharing” from the available data source options in Power BI and click the Connect button.

Support links

For more details on how to connect and consume data, you can consult the project documentation on GitHub through the link below:
GitHub - Delta Sharing - Accessing Shared Data

For more details on the protocol used in Data Flow:
Delta Sharing - An open standard for secure data sharing
Introducing Delta Sharing: an Open Protocol for Secure Data Sharing - The Databricks Blog
Data Sharing | Databricks
Databricks Use Cases
Data Sharing - What's New?

For more information, visit the discussion on the subject in our community or the videos on our channel. 😊

Related articles

Conversational Data Messages
How to build bots using SDKs or HTTP API
Tickets - New
How to check if an attendant is available in Builder