API Reference#

This page provides an overview of all public objects, functions, and methods included in the scystream-sdk.

Core (scystream.sdk.core)#

Entrypoints are the core of the SDK.

A decorator is provided to configure entrypoints. Use it to define entrypoints and, if necessary, pass a scystream.sdk.env.settings.EnvSettings subclass (see “ENVs and Settings” below).
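A minimal sketch of a decorated entrypoint, assuming the decorator is exposed as scystream.sdk.core.entrypoint and can be applied without settings:

from scystream.sdk.core import entrypoint

@entrypoint()
def example_task():
    # Registered as an entrypoint; the scheduler can list and execute it by name.
    print("Executing example_task...")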

Config (scystream.sdk.config)#

The scystream-sdk consists of two main configuration objects.

  1. SDKConfig (scystream.sdk.config.SDKConfig)

    This object contains all global configuration the SDK needs to work. This could be, for example, the app name, which is used to identify the compute block on your spark-master.

  2. ComputeBlockConfig (scystream.sdk.config.models.ComputeBlock)

    The ComputeBlockConfig is a file that configures the ComputeBlock’s inputs and outputs. It also contains some metadata configuration (e.g. author, Docker image URL, …).
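As a rough sketch of setting the global configuration; the constructor argument shown here is an assumption, not the definitive API:

from scystream.sdk.config import SDKConfig

# Hypothetical sketch: argument names may differ in your SDK version.
config = SDKConfig(
    app_name="my-compute-block"  # identifies the compute block on the spark-master
)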

ENVs and Settings (scystream.sdk.env)#

When using the scystream-sdk and defining entrypoints, it is important to give the user (via the scheduler) the possibility to define settings for each entrypoint.

These settings are set using environment variables.

There are three main types of settings (see the sketch after this list):

  1. EnvSettings (scystream.sdk.env.settings.EnvSettings)

    The EnvSettings class inherits from the pydantic BaseSettings class. It can be used to parse env-variables from the .env file. You should use this class when defining the Settings for an entrypoint.

    However, you can also use this class to parse your own environment variables, which might not be user-defined.

  2. InputSettings (scystream.sdk.env.settings.InputSettings)

    Use this when defining settings for your inputs. Under the hood, this works exactly the same as EnvSettings.

  3. OutputSettings (scystream.sdk.env.settings.OutputSettings)

    Use this when defining settings for your outputs. Under the hood, this works exactly the same as EnvSettings.
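A sketch of how these classes work together. Nesting the input/output settings inside an EnvSettings subclass and passing that class to the entrypoint decorator (whose decorated function is assumed to receive the parsed settings as its argument) follows the description above; all class, field, and default names are illustrative:

from scystream.sdk.core import entrypoint
from scystream.sdk.env.settings import EnvSettings, InputSettings, OutputSettings

class TextInputSettings(InputSettings):
    TXT_SRC_PATH: str  # parsed from the environment (or a .env file)

class ResultOutputSettings(OutputSettings):
    RESULT_TABLE_NAME: str = "results"

class MyEntrypointSettings(EnvSettings):
    LANGUAGE: str = "en"  # a plain, entrypoint-level setting
    text_input: TextInputSettings
    result_output: ResultOutputSettings

@entrypoint(MyEntrypointSettings)
def my_task(settings):
    print(f"Reading from {settings.text_input.TXT_SRC_PATH}")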

The SDK also provides more specific types of inputs and outputs. These offer predefined config keys (a brief sketch follows this list):

  1. FileSettings (scystream.sdk.env.settings.FileSettings)

  2. DatabaseSettings (scystream.sdk.env.settings.DatabaseSettings)
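These specific types are used exactly like InputSettings and OutputSettings; a brief sketch, where the predefined keys contributed by DatabaseSettings are not spelled out and TABLE_NAME is an illustrative extra field:

from scystream.sdk.env.settings import DatabaseSettings

class UsersDatabaseInput(DatabaseSettings):
    # Connection-related keys are predefined by DatabaseSettings;
    # block-specific keys can be added alongside them.
    TABLE_NAME: str = "users"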

Spark Manager (scystream.sdk.spark_manager)#

We aim to handle all our data exchange & data usage using Apache Spark.

To use Spark you need to configure the scystream.sdk.spark_manager.SparkManager, which connects to a spark-master and gives you access to the session.

Bear in mind that currently only the database connection is handled using Spark. When using a database, please make sure to set up the connection first.
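For example (this mirrors the Spark Integration snippet later on this page; the DSN is a placeholder):

from scystream.sdk.spark_manager import SparkManager

manager = SparkManager()
db = manager.setup_pg("postgresql://user:pass@host:5432/db")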

Database Handling (scystream.sdk.database_handling)#

The database handling package provides utilities to connect to and interact with databases using a unified interface.

It supports both Pandas-based and Apache Spark-based workflows.

Supported Databases#

The SDK supports all databases compatible with SQLAlchemy via a DSN (Data Source Name), including:

  • PostgreSQL

  • MySQL

  • SQLite

  • Snowflake

  • and others

Core Functionality#

The scystream.sdk.database_handling.database_manager module provides the core database-operations classes, such as the Pandas-based PandasDatabaseOperations used below (a Spark-based counterpart is obtained through the SparkManager, see Spark Integration).

Their read and write methods allow reading from and writing to a database using a consistent API.

Schema Support (PostgreSQL)#

The SDK supports optional database schemas to logically separate tables within the same database.

Schemas are configured at initialization time of the database operations class and are applied automatically to all read and write operations.

from scystream.sdk.database_handling.database_manager import PandasDatabaseOperations

db = PandasDatabaseOperations(
    dsn="postgresql://user:pass@host:5432/db",
    schema="my_project"
)

df = db.read(table="users")
db.write(table="users", data=df)

This will result in queries being executed against:

SELECT * FROM "my_project"."users"

If no schema is provided, the default database schema (e.g. public in PostgreSQL) is used.

Schema Creation#

When using schemas, the schema must exist in the target database before performing read or write operations.

Important

Automatic schema creation is currently only supported for PostgreSQL databases.

For other database systems, schema creation (or equivalent namespace setup) must be handled externally.

The read and write methods may also fail or behave unexpectedly if the schema does not already exist.
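Since automatic creation is PostgreSQL-only, one way to set up the schema (or equivalent namespace) externally is with plain SQLAlchemy. This is a sketch outside the SDK; the DSN and schema name are placeholders:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host:5432/db")
with engine.begin() as conn:
    # Create the schema once, before any SDK read/write operations run.
    conn.execute(text('CREATE SCHEMA IF NOT EXISTS "my_project"'))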

Pandas Integration#

For most use cases, the SDK provides the Pandas-based implementation, PandasDatabaseOperations (used in the schema example above).

This implementation uses SQLAlchemy and supports any DSN-compatible database.
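Since any SQLAlchemy-compatible DSN is accepted, the same API works against, say, a local SQLite file. A small round-trip sketch, with illustrative table and column names:

import pandas as pd

from scystream.sdk.database_handling.database_manager import PandasDatabaseOperations

db = PandasDatabaseOperations(dsn="sqlite:///example.db")
db.write(table="users", data=pd.DataFrame({"id": [1, 2], "name": ["ada", "bob"]}))
df = db.read(table="users")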

Spark Integration#

For distributed workloads, the SDK provides a Spark-based implementation.

Note

Currently, Spark integration only supports PostgreSQL.

To initialize Spark database access, use:

from scystream.sdk.spark_manager import SparkManager

manager = SparkManager()
dsn = "postgresql://user:pass@host:5432/db"  # PostgreSQL only, see note above
db = manager.setup_pg(dsn)

File Handling (scystream.sdk.file_handling)#

The file handling package contains all the utilities required to connect to a file storage and to read from or write to it. Currently, the file handling package does NOT make use of Apache Spark.

Currently, the scystream-sdk supports the following file storages:

  1. S3 Buckets (scystream.sdk.file_handling.s3_manager)

    The s3_manager module contains the S3-specific utilities.

Scheduler (scystream.sdk.scheduler)#

The scheduler module can be used to list & execute entrypoints.
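A sketch of both operations; the method names list_entrypoints and execute_function are assumptions if they differ in your SDK version:

from scystream.sdk.core import entrypoint
from scystream.sdk.scheduler import Scheduler

@entrypoint()
def example_task():
    print("Executing example_task...")

# List all registered entrypoints, then execute one by name.
Scheduler.list_entrypoints()
Scheduler.execute_function("example_task")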

Modules#

The following modules are part of the scystream-sdk:

  • scystream.sdk.core

  • scystream.sdk.config

  • scystream.sdk.env

  • scystream.sdk.spark_manager

  • scystream.sdk.database_handling

  • scystream.sdk.file_handling

  • scystream.sdk.scheduler