Remote Admin Training


Day 1
• Module 1: Dataiku Overview + Architecture
• Lab: Installing DSS
• Module 2: DSS Integrations
• Lab: DSS Integrations
• Module 3: Security
• Lab: User and Group security
• Module 4: Automation and API Nodes
• Lab: Installing Automation and API Nodes

Day 2
• Module 5: Code Environments
• Lab: Maintaining Code Envs
• Module 6: DSS Maintenance
• Lab: Logs + Troubleshooting
• Module 7: Resource Management
• Lab: Cgroups + DSS Processes


Prerequisites

• Understanding of basic Linux commands

• DSS Basic Training, or equivalent
• SSH client set up on your personal machine

Module 1:

DSS Overview, Architecture, +


Dataiku Overview


Data Engineer Business Analyst

Analytics Leader Data Scientist

©2018 dataiku, Inc.


Inclusive Data Science
Comprehensive Model Operationalization
Open Adaptation To Your Needs


[Figure: how each persona (Business Analyst, Data Scientist, Data Engineer, Analytics Leader) uses DSS: find, understand and prepare data; model and prototype together; build plugins and business dashboards; integrate, monitor and optimize results.]

To Enable Comprehensive Operationalization…

● Prototype: enable fast prototyping (incl. data integration) for detection of dead-ends
● Reuse: augment or replace manual process thanks to AI
● Time to Deploy: augment or replace manual process thanks to AI
● Cost to Maintain: augment or replace manual process thanks to AI

Dataiku Architecture

Dataiku DSS Architecture, Ready For Production

[Figure: three zones. Development Zone (data users; dev DWH, dev Hadoop / data lake), then "Deploy Workflow" to the Data Production Zone (production DWH / Hadoop; database administrator, system administrator), then "Deploy Model" to the Web Production Zone (production databases; web developer).]
Leverage your infrastructure

By default, DSS automatically chooses the most effective execution engine amongst the engines available next to the input data of each computation.

● Run in Memory: Python, R, …
● Run in Database: Enterprise SQL, Analytic SQL
● Run in Cluster: Spark, Impala, Hive, …
● Distributed ML: MLlib, H2O, …
● ML in Memory: Python Scikit-Learn, R, …

Data sources:
● Data Lake: S3
● Database Data: Vertica, Redshift, PostgreSQL, …
● File System Data: Host File System, Remote File System, …

The Dataiku DSS Architecture (simplified)

[Figure: users connect through a web browser to the DSS server (Design Node, hosting Project A, Project B, etc.). The node connects to external data sources (Hadoop cluster, FS, SQL DB, remote FS, cloud storage) and runs computation either in memory or on external compute.]
Example of Full Life Cycle of a Project

[Figure: a project moves across Design, Automation and API nodes, through the stages project design, project testing, project release, project validation and project production. Environments shown: Development, AWS, Production.]
Enterprise Scale Sizing Recommendation

● Design node (128-256 GB): design nodes generally consume more memory than other nodes because the design node is the collaborative environment for design, prototyping and experiments.
● Automation node (64-128 GB, + 64 GB in preprod): the automation node runs, maintains and monitors project workflows and models in production. Since the majority of actions are batches, you can spread the activity over the 24 hours and optimize resource consumption. You can also use a non-production automation node to validate your project before going to production.
● Scoring node (4+ GB per node, fleet of n nodes): scoring nodes are real-time production nodes for scoring or labeling with prediction models. A single node doesn’t require a lot of memory, but these nodes are generally deployed on dedicated clusters of containers or virtual machines.

Memory usage on the DSS server side can be controlled at the Unix level when DSS impersonation is activated.
Database resource management can be done on the DB side at the user level when per-user credentials mode is activated.
DSS Components and Processes

Starting the DSS Design/Automation Node.

● 4 processes are spawned
○ Supervisor: process manager
○ Nginx server listening to installation port
○ Backend server listening to installation port + 1
○ IPython (Jupyter) server listening to installation port + 2
The next slides detail the role of each server and where they sit in the overall DSS architecture.
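As a rough illustration of the port offsets listed above, the following minimal sketch maps each process to its port. The function name `dss_process_ports` is hypothetical and only encodes the offsets stated on the slide; it is not part of DSS.

```python
from typing import Dict, Optional


def dss_process_ports(base_port: int) -> Dict[str, Optional[int]]:
    """Map each process spawned at startup to the port it listens on.

    Hypothetical helper: it only restates the offsets given in the slide.
    """
    return {
        "supervisor": None,        # process manager; the slide lists no port for it
        "nginx": base_port,        # installation port
        "backend": base_port + 1,  # installation port + 1
        "ipython": base_port + 2,  # Jupyter server, installation port + 2
    }


ports = dss_process_ports(11200)
for name, port in ports.items():
    print(name, port)
```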
DSS Components and processes

Nginx: handles all interactions with the end user through their web browser. It acts as an HTTP proxy, forwarding requests to all other DSS components. It binds to the DSS port number specified at install.

Protocol: HTTP(S) and websockets.

DSS Components and processes

Backend: the metadata server of DSS.

● Interacts with the config folder
● Prepare preview
● Explore (e.g. chart aggregations)
● Git
● Public API
● Schedule
● Scenarios

It binds to the DSS port number specified at install + 1.

The backend is a single point of failure: if it goes down, it takes the rest of DSS with it. Hence it is supposed to handle as little actual processing as possible. The backend can spawn child processes: custom scenario steps/triggers, Scala validation, API node DevServer, macros, etc.
DSS Components and processes

IPython (Jupyter) server: handles interactions with R, Python and Scala notebook kernels using the ZMQ protocol.

It binds to the DSS port number specified at install + 2.

DSS Components and processes

Handles dependencies computation and recipes running on the DSS engine. For other engines and code recipes, it launches child processes: Python, R, Spark, SQL, etc.

DSS Components and processes

Handles non-job-related background tasks that may be dangerous, such as:

● Metrics computation. It can launch child Python processes for custom Python metrics.
● Sample building for machine learning and charts.
● Machine learning preparation.

DSS Components and processes

Handles Python-based machine learning training, as well as data

Handles current user-created webapp backends (Python Flask backend, Python Bokeh and R Shiny).

Open Ports
Base Installations
● Design: User’s choice of base TCP port (default 11200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
● Automation: User’s choice of base TCP port (default 12200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
● API: User’s choice of base TCP port (default 13200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
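The port rule above (a base port plus the next 9 consecutive ports, with only the first one opened out of the machine) can be sketched as follows. The names `DEFAULT_BASE_PORTS` and `port_plan` are illustrative assumptions, not DSS settings; the default values are the ones given on the slide.

```python
# Default base ports per node type, as listed on the slide.
DEFAULT_BASE_PORTS = {"design": 11200, "automation": 12200, "api": 13200}


def port_plan(base_port: int) -> dict:
    """Split the 10 reserved ports into the one to open and the ones to firewall."""
    reserved = list(range(base_port, base_port + 10))  # base port + next 9
    return {"open": reserved[0], "firewalled": reserved[1:]}


plans = {node: port_plan(port) for node, port in DEFAULT_BASE_PORTS.items()}
for node, plan in plans.items():
    print(node, "open:", plan["open"],
          "firewall:", plan["firewalled"][0], "-", plan["firewalled"][-1])
```

A plan like this could feed whatever firewall tooling your organization already uses; the sketch deliberately stops short of generating firewall rules.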

Supporting Installations
● Data sources: JDBC entry point; network connectivity
● Hadoop: ports + workers required by specific distribution; network connectivity
● Spark: executor + callback (two way connection) to DSS

Privileged Ports
● DSS itself cannot run on ports 80 or 443 because it does not run as root and cannot bind to these privileged ports.
● The recommended setup to have DSS available on ports 80 or 443 is to run a reverse proxy (nginx or Apache) on the same machine, forwarding traffic from ports 80 / 443 to the DSS port.

Installing DSS

Command Line Installation
(the easy part)

The Data Science Studio Installation process is fairly straightforward. Due to the number of options available, we
do have several commands to issue for a full installation. There are a couple of important terms to understand
before we start.
● DSSUSER -- This is a Linux User ID that will run DSS. It does not require elevated privileges.
● DATADIR -- This is the directory where DSS will install binaries, libraries, configurations and store all data.
● INSTALLDIR -- This is the directory created when you extract the DSS tar file.
● DSSPORT -- This is the first port that DSS Web Server opens to present the Web UI. We request 9 additional
ports, in sequence, for interprocess communications.

● Hadoop Proxy User -- If you are connecting to a Hadoop cluster with Multi-User Security, the Proxy User
configuration must be enabled. Additional details are contained in our reference documentation.
● Kerberos Keytab -- If your Hadoop cluster uses Kerberos, we will need a keytab file for the DSSUSER.

Key integration points

• HTTPS easily configurable for every access to DSS

• Support LDAP/LDAPS
• Support SSO (SAMLv2 and SPNEGO)

• Relies on impersonation where applicable

○ sudo on unix
○ proxy user on hadoop / Oracle
○ constrained delegation for SQL server
• Otherwise personal credential for other DBs

• Complete audit trail exportable to external system

• Permissions and multi level authorization dashboard

Example Install Commands

As root, install dependencies:

INSTALL_DIR/scripts/install/ -with-r
INSTALL_DIR/ -d /home/dataiku/dssdata -p 2600 -l /tmp/dsslicense.json
DATA_DIR/bin/dssadmin install-hadoop-integration
DATA_DIR/bin/dssadmin install-spark-integration
DATA_DIR/bin/dssadmin install-R-integration

As root:
/home/dataiku/dssdata/bin/dssadmin install-impersonation DSS_USER


Upgrade Options:
1. In place (recommended)
a. ./install_dir/ -d <path to data_dir> -u
2. Project Export/Import
a. tedious
3. Cloning
a. be careful of installing on the same machine (port conflicts, overwriting directories, etc)

Post Upgrade Tasks:

1. Rerun: R Integration (if enabled), Graphics Exports (if enabled), MUS Integration (if enabled)
2. Recommended to rebuild code envs
3. Recommended to rebuild ML models

Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Installing DSS

Lab 2: Validating Installation
Lab 3: Upgrading DSS
Lab 4: Installing R integration (Optional)

Module 2:

DSS Integrations

SQL Integrations

DSS and SQL - Supported flavors

Supported
● MySQL
● PostgreSQL 9.x
● HP Vertica
● Amazon Redshift
● EMC Greenplum
● Teradata
● Oracle
● Microsoft SQL Server
● Google BigQuery

Experimental Support
● IBM DB2
● SAP HANA
● IBM Netezza
● Snowflake
● Exasol

Warning: Support for these databases is provided as a “best-effort”, we make no guarantees as to which features precisely work.

Other Support
In addition, DSS can connect to any database that provides a JDBC driver.

Warning: For databases not listed previously, we cannot guarantee that anything will work. Reading datasets often works, but it is rare that writing works out of the box.

DSS and SQL - Installing the Database Driver

1) Download the JDBC driver of the database.

2) Stop DSS: ./DATA_DIR/bin/dss stop
3) Copy the driver’s JAR file (and its dependencies, if any) to the
DATA_DIR/lib/jdbc directory
4) Start DSS: ./DATA_DIR/bin/dss start

We already have a PostgreSQL database connected to our platform.

DSS and SQL - Defining a connection through the UI.

We already have a PostgreSQL connection on DSS, but these would be the steps to follow to create your own:

● Go to the Administration > Connection page.
● Click “New connection” and select your database type.
● Enter a name for your connection.
● Enter the requested connection parameters. See the page of your database for more information, if needed.
● Click on Test. DSS attempts to connect to the database and gives you feedback on whether the attempt was successful.
● Save your connection.
DSS and SQL - Connection parameters.
Advanced JDBC properties

For all databases, you can pass arbitrary key/value properties that are passed as-is to the database’s JDBC driver. The possible properties depend on each JDBC driver. Please refer to the documentation of your JDBC driver for more information.

Fetch size

When DSS reads records from the database, it fetches them in batches for improved performance. The “fetch size” parameter lets you select the size of this batch. If you leave this parameter blank, DSS uses a reasonable default.
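To make the idea of a fetch size concrete, here is a minimal analogy using Python's standard DB-API with an in-memory SQLite database. This is not how DSS talks to JDBC drivers; the helper `read_in_batches` and the table `t` are made up for the illustration.

```python
import sqlite3


def read_in_batches(cursor, fetch_size):
    """Yield lists of rows from an executed cursor, `fetch_size` rows at a time."""
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            break
        yield batch


# Build a tiny throwaway table with 10 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

cur = conn.execute("SELECT id FROM t ORDER BY id")
batch_sizes = [len(batch) for batch in read_in_batches(cur, fetch_size=4)]
print(batch_sizes)
conn.close()
```

A larger batch means fewer round trips to the database at the cost of more memory held per batch, which is the trade-off a fetch size setting controls.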

DSS and SQL - Connection parameters.
Relocation of SQL datasets

For SQL datasets, in the settings of the connection, you can configure (with variables):

● For the table name, a prefix and a suffix to the dataset name
● The database schema name

For example, with:

● Schema: ${projectKey}
● Table name prefix: ${myvar1}_
● Table name suffix: _dss

If you go to project P1 (where myvar1 = a2) and create a managed dataset called ds1 in this connection, it will be stored in schema P1 and the table will be called a2_ds1_dss.
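The naming example above can be reproduced with a small sketch. It uses Python's `string.Template`, which happens to share the `${name}` syntax; the function `relocate` is hypothetical and only mimics the expansion described on the slide, not the DSS implementation.

```python
from string import Template


def relocate(dataset_name, variables, schema, prefix, suffix):
    """Expand ${...} variables and build (schema, table name) for a managed dataset."""
    def expand(text):
        return Template(text).substitute(variables)

    return expand(schema), expand(prefix) + dataset_name + expand(suffix)


schema, table = relocate(
    dataset_name="ds1",
    variables={"projectKey": "P1", "myvar1": "a2"},
    schema="${projectKey}",
    prefix="${myvar1}_",
    suffix="_dss",
)
print(schema, table)
```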

Hadoop Integrations

DSS and Hadoop - Supported flavors

Supported Distros
● Cloudera
● Hortonworks
● MapR

Supported FS
● HDFS
● S3

Read documentation for instructions on setting up connections.

Experimental Support
● Dataproc
● HDInsight

Support for these distros is provided as a “best-effort”, we make no guarantees as to which features precisely work.

MUS Support
Supported:
● Cloudera
● Hortonworks
● MapR
● EMR

Installing HDFS Integration
● DSS node should be set up as an edge node to the cluster.
○ i.e. common client tools should function, such as “hdfs dfs”, “hive” / “beeline”, “spark-shell” / “pyspark” / “spark-submit”

● Run integration script
○ ./bin/dssadmin install-hadoop-integration
○ ./bin/dssadmin install-hadoop-integration -principal <principal> -keytab <keytab>

● Modify configuration settings in ADMIN >


HDFS connection

● Root path: data will be stored at this location.
● Fallback Hive database: when an HDFS dataset is built, a Hive table is created in the dedicated database (Hive table = metadata, no data duplication).
● Additional naming settings: prefix/suffix Hive tables or HDFS paths, create one Hive DB per project.

Note: By default DSS will save data to root_uri/path_prefix+path_suffix/dataset_name. You can overwrite HDFS paths, but when rebuilding a dataset DSS assumes it is contained in a dedicated folder and will delete all files in it. In short, don’t share subdirectories between different datasets!
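The default path layout and the "one dedicated folder per dataset" rule can be checked with a short sketch. Both helpers (`dataset_path`, `overlapping`) and the example paths are assumptions made for illustration; the path formula follows the note literally.

```python
import posixpath


def dataset_path(root_uri, path_prefix, path_suffix, dataset_name):
    """Build root_uri/path_prefix+path_suffix/dataset_name."""
    return posixpath.join(root_uri, path_prefix + path_suffix, dataset_name)


def overlapping(paths):
    """Return pairs of dataset names whose folders are identical or nested."""
    norm = {name: path.rstrip("/") + "/" for name, path in paths.items()}
    names = sorted(norm)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if norm[a].startswith(norm[b]) or norm[b].startswith(norm[a])
    ]


root = "/user/dataiku/dss_managed"  # hypothetical root path
paths = {
    "ds1": dataset_path(root, "P1", "_prod", "ds1"),
    "ds2": dataset_path(root, "P1", "_prod", "ds2"),
    "bad": dataset_path(root, "P1", "_prod", "ds1/sub"),  # nested inside ds1
}
conflicts = overlapping(paths)
print(conflicts)
```

Any pair reported here would be at risk: rebuilding the outer dataset would delete the files of the nested one.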
DSS schema vs Hive schema
• DSS Schema : DSS object
• Metastore : where the physical schema is stored
• If physical and DSS schema mismatch

Metastore DSS

• Welcome to the danger zone : clean way to rename a dataset without removing the data ?

Managed (Internal) vs External Hive tables
• Managed tables
• Managed by the hive user
• Location: /user/hive/warehouse
• DROP TABLE: also removes the data
• Security managed with the Hive schema and a service like Sentry/Ranger, etc.: GRANT ROLE …

• External tables
• CREATE EXTERNAL TABLE ( … ) LOCATION '/path/to/folder'
• DROP TABLE: removes the Hive table but not the data
• Security: filesystem permissions of the folder

• DSS can read both external and managed tables

• DSS creates (i.e. writes) only EXTERNAL tables
• In Hive, locations must be folders!!!

Exposing HDFS data to end-users
• Depends on your data ingestion process: raw data are put on HDFS

• From files
○ Create an HDFS dataset and specify the HDFS path of your files
○ dss_service_account needs to have access to those files

• Create a dedicated Hive table
○ If you need to query the data with a SQL language
○ “Synchronize metastore” will create the Hive table according to the naming policy (can be overridden at the dataset level)

Exposing Hive data to end-users
• Depends on your data ingestion process: Hive tables/views exist

• Hive tables: import from the HDFS connection
• A Hive table is just metadata around an hdfs/s3/etc. dataset. You can access the data via the HDFS connection and grab the metadata.
• Hive views
• Connect directly to the Hive table through a Hive dataset
• Exposing views
• Permissions handled by Sentry
• Read access to the database is enough
• No way to overwrite the metastore

• Use Hive datasets only for views!

• If you run a Spark recipe on top of a Hive dataset, data will be streamed into DSS backend, not loaded from

Hive config
• HiveServer2:
• Recommended mode (others may be deprecated in the future)
• Mandatory for MUS
• Mandatory for notebooks and metrics
• Targets the global metastore

• Hive CLI (global metastore)
• When MUS is not activated, you can have access to every Hive table created by DSS, even if your user doesn’t have access to the related HDFS connections

• Hive CLI (isolated metastore)
• Creates a specific metastore for each job that includes only the tables in input of the recipe: improves security
• No access to dataset stats, which are used to optimize the execution plan (Tez)

Multiple Clusters in DSS
• DSS can create compute clusters in
• Clusters can be created manually or via a plugin
• EMR and Dataproc plugins already exist
• Customers can extend cluster functionality by creating custom plugins
• Clusters use global Hadoop binaries, but overwrite client configs
• Leverage transient or persistent clusters. Ideal for scenarios
Multiple Clusters Limitations/Warnings

● For DSS to work with a cluster, it needs to have the necessary binaries and
client configurations available.

● DSS can only work with one set of binaries, meaning that a single DSS
instance can only work with one Hadoop vendor/version.
○ DSS “cluster” definitions override global cluster client configs.

● For secure clusters, DSS is only configured to use one keytab, so all clusters
must accept that keytab (same realm or cross-realm trust)

● User Mappings must be valid in all clusters

Spark Integrations

Spark Supported Flavors + Usage

● Supported Spark is the same as supported Hadoop, with a few additions:

○ Databricks support is experimental
○ Spark on Kubernetes support is experimental

● Spark can be used in a variety of places in DSS

○ Scala/pyspark/sparkR recipes
○ Scala/python/R notebooks
○ Compute engine for visual recipes
○ SparkSQL recipes and Notebooks
○ Spark ML Algorithms available in Visual Analysis
○ H2O Sparkling Water integration

Installing Spark

● DSS node should be set up as edge node to spark cluster

○ i.e. spark-shell, spark-submit, pyspark, etc all function on the CLI

● Run spark integration script

○ ./bin/dssadmin install-spark-integration -sparkHome <path/to/spark>

○ Note that DSS can only work with one spark version.

● Configure spark in ADMIN > SETTINGS > SPARK

Spark Configuration

● Global Settings:
○ Admins can create Spark configurations in ADMIN > SETTINGS > SPARK. These define Spark settings that users can leverage.
○ It’s good to have a good default for users and also some different options per workload.
○ You can also set default confs here for recipes, notebooks, SparkSQL, etc.
○ Note: all notebooks use the same Spark conf. Restart DSS after changing the default.
Spark Configuration

● Project/Local Settings:
○ Project admins may also set Spark conf at the project level: SETTINGS > ENGINES & CONNECTIONS
○ Users may also set Spark conf at the recipe/VA level
○ Users may also set some Spark conf directly in code.

Notes on Spark Usage

● It is highly advisable to have spark read from an HDFS connection (even if it’s
on cloud storage, set up a HDFS connection w/ the proper scheme).

○ Spark is able to properly read dataset from HDFS connection and parallelize it accordingly.
○ Spark is also able to read optimized formats with the HDFS connector (parquet/ORC/etc),
whereas more native connectors don’t understand these formats
○ For non-HDFS/non-S3 datasets, Spark reads the dataset in a single thread and creates 1
partition. This is likely to be non-optimal, so users will need to repartition the dataset before
any serious work on large datasets.
○ For HDFS datasets, Groups using it should be able to read details of the dataset.

Spark Multi-cluster

● Spark multi-cluster is akin to Hadoop Multi-cluster with the same


● Databricks integration is another experimental option.

○ Databricks integration is available on AWS and Azure.
○ Clusters are transient. They are spun up when users try to run a spark job.
○ Clusters can be per-project or per-user, to enforce stricter security.
○ Databricks cluster definition is contained within the spark configuration. Configurable so you
can leverage many settings in databricks cluster.

● EMR and Dataproc (experimental) plugins are also options, outside of normal
hadoop distributions (CDH/HDP).

Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Set up Integration to Postgres

Lab 2: Set up Integration to Spark standalone
Lab 3: Set up Integration to Cloud Storage (Optional)

Module 3:


DSS Security

User Identity
● Users come from one of two locations:
○ Local DB
○ LDAP

User Authentication
● Users are authenticated via:
○ Local password
○ LDAP
○ SSO
Users can be one of three types:
● Local (local acct/local pass)
● LDAP (ldap acct/ldap pass or SSO)
● Local No Auth (local acct/sso)

LDAP(S) Integration
4 main pieces of information to provide:
● LDAP Connection: obtain from LDAP admin
● User Mapping: Filter corresponding to users
in DSS.
○ specify which attributes are display name and
○ toggle whether users are automatically imported
or not
● Group Mapping: Filter defining to which
groups a user belongs
○ specify attribute for group name
○ optionally white list groups that can connect to
● Profile Mapping: Define what profile a
group is assigned to

SSO Integration
● Users can be from local DB or LDAP
● Supports SAMLv2 (recommended) and SPNEGO
● For SAML need:
○ IdP Metadata (provided by SSO admin)
■ Will likely need a callback url:
○ SP Metadata (generate)
■ If there’s no internal process, you can do this online. Will need at least the entityID (from IdP Metadata) and the Assertion Consumer Service endpoint (callback url). X.509 certs are also not uncommon; get them from the IdP.
○ Login Attribute
■ Attribute in the assertion sent by IdP that
contains the DSS login.
○ Login Remapping Rules
■ Rules to map login attribute to user login.
■ I.E. → first.last via
([^@]*) -> $1
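As a rough sketch of how a login remapping rule of this shape behaves, the snippet below applies a regular expression to an email-style login attribute and keeps the part before the "@". The example address, the `@.*` tail of the pattern and the function `remap_login` are assumptions added for illustration (the slide's own example is cut off), and Python writes the backreference as `\1` where the slide writes `$1`.

```python
import re


def remap_login(login_attribute, rules):
    """Apply the first rule whose pattern matches the whole attribute."""
    for pattern, replacement in rules:
        if re.fullmatch(pattern, login_attribute):
            return re.sub(pattern, replacement, login_attribute, count=1)
    return login_attribute  # no rule matched: keep the attribute as-is


# Assumed rule: capture everything before "@", drop the rest.
rules = [(r"([^@]*)@.*", r"\1")]

mapped = remap_login("first.last@example.com", rules)
unchanged = remap_login("first.last", rules)
print(mapped, unchanged)
```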
Permission Model
Multi-faceted tools to control security in the system:

● Users:
○ Must exist to log in to DSS
○ Belong to a GROUP
○ Have a PROFILE

● User Profile:
○ Mainly a licensing mechanism
○ Designer: R/W access
■ aka Data Scientist/Data Analyst
○ Explorer: R access only
■ aka Reader

● Group:
○ Collection of users
○ Defines Global Permissions (i.e. are you an admin? Can you create connections? etc)

● Projects:
○ Determines privilege of each GROUP
○ Can enforce project-level settings (lock code env, etc)

● Data Connections:
○ Grant access to GROUPS
○ Some connections allow per-user credentials
Permission Model
● Users get assigned profile + group.
○ Can determine this automatically via mapping rules, as discussed previously

● Auth Matrix shows all projects that a user has access to and privileges granted. Ditto for
Permission Model
User Profiles
Each user in DSS has a single “user profile” assigned to it.
The three possible profiles are:
- Reader: users with this profile only have access to the shared dashboards in each DSS project.
- Data Analyst: data analysts can create datasets, perform visualizations, use all visual processing recipes, and more
generally perform most of the actions in the DSS interface.
- Data Scientist: in addition, data scientists can use code-based recipes (Python, R, …) and the machine learning
components of DSS.
User profile is not a security feature, but a licensing-related concept. DSS licenses restrict the number of users with each
profile. To enforce security, use the regular groups authorization model described later.

Note that in new licenses, the Data Analyst does not exist anymore:

- Data Scientist and Data Analyst -> Designer

- Reader -> Explorer

Permission Model
Global Group Permissions

Users can be assigned to one or more groups. Groups are defined by the permissions their members are granted (e.g. write code, create projects, access to projects, etc.)

Do not rely on user profiles to enforce permissions. We do not provide any guarantee that the user profile is strictly applied. For real security, use the groups authorization model.

We will also see that per-project permissions can be defined to curb permissions of the users that have access to the project (except for members of an "Administrator" group).
Permission Model
Per-Project Group Permissions

- On each project, you can configure an arbitrary number of groups who have access to this project. Adding
permissions to projects is done in the Project Settings, in the Security section.
- Each group can have one or several of the following permissions. By default, groups don’t have any kind of access to
a project.
- Being the owner of a project does not grant any additional permissions compared to being in a group who has
Administrator access to this project.
- This owner status is used mainly to actually grant access to a project to the user who just created it.
Permission Model
Additional Project Security

PROJECT > SECURITY can manage other aspects of security:

● Exposed Elements
○ High level view of which elements are exposed to other projects.
Project admins can modify.

● Dashboard Authorizations
○ Which objects can be accessed by dashboard-only users

● Dashboard Users
○ Add external users who are able to access Dashboards
Permission Model
Additional Project Settings

PROJECT > SETTINGS can manage other aspects of configuration:

● Code Envs
○ Set default code env and prevent modification

● Cluster Selection
○ Select default Cluster to use

● Container Exec
○ Specify default container env

● Engines & Connections

○ Restrict Engines for use in Recipes
○ Change default Spark/Hive config

Permission Model
Data Connections

● Data Connections should be restricted to only the groups that should have access.
● You can create many connections and limit use + details readability group by group. Details include file path, connection params, credentials, etc.
● Connections can be made read-only.
● Some connections support per-user credentials (DB, etc). Users can then specify in their User
HTTPS/Reverse Proxy
● You can set up DSS to work with HTTPS by specifying the SSL certs in
data_dir/install.ini. In particular, fill out the following section:

This provides access on https://DSS_HOST:DSS_PORT

● If you want to use the default port of 443, a reverse proxy is needed. Follow
your org's best practices in setting this up. Our docs have a few examples for
setting up nginx and Apache servers as reverse proxies.
○ This allows you to access DSS over:
■ https://<VANITY_URL>
■ http://<VANITY_URL>

In order to help our customers better comply with GDPR (General Data Protection
Regulation), DSS provides a GDPR plugin which enables additional security
● Configure GDPR admins and documentation groups
● Document datasets as having personal data
● Project level settings to control specific functionality:
○ Forbid Dataset Sharing
○ Forbid Dataset/Project Export
○ Forbid Model Creation
○ Forbid uploaded Datasets
○ Blacklist Connections
● Easily filter to find sensitive datasets

Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Validate User/Group Security

Module 4

DSS Automation and API Nodes

DSS Automation Node Overview

Production in DSS - O16n
Deploying a Data Science project to production
Project in production

Sandbox project


Real time scoring API End users

Deployment to production - Motivation
Why do we need a separate environment for our Project ?

We want to have a safe environment where our prediction project is not at risk of being altered by
modifications in the flow. We also want to be connected to our production databases.

We want to be able to have health checks on our data, monitor failures in building our flow and be
able to roll-back to previous versions of the flow if needed.

To do that we will need the Automation Node

Installing/Configuring an Automation Node
Once the design node is set up, the automation node is straightforward to set up.
● Install Automation Node via:
dataiku-dss-VERSION/ -t automation -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the automation node, i.e. do NOT use the same
ones used for the Design Node.
● Once installed, configure it exactly like we did the design node, i.e.:
○ R integration
○ Hadoop Integration
○ Spark Integration
○ Set up dataset connections
○ Users/Group setup
○ Multi-user Security, etc.

DSS Design to Automation Workflow

From Design Node to Automation Node
Moving a project from the Design Node to the Automation Node takes a few straightforward steps:
1) “Bundle” your project in the Design Node: this will create a zip file with all your project’s config.
2) Download the bundle to your local machine.
3) Upload the bundle to the Automation Node to create a new project or update an existing one. This step may require dataset connection remapping.
4) Activate the bundle on the Automation Node.

Note that all those steps can be automated using our Public API, either within a DSS instance (a
Macro) or in another application.

From Design Node to Automation Node
[Figure: the Design Node (Project A, Project B, etc.) reads from design data sources (Hadoop cluster, remote FS, cloud storage, etc.) and sends bundles to the Automation Node, which runs the same projects against production data sources. Legible captions: monitor projects; version control; consume deliverables / consumption analytics (dashboards).]
Creating a Bundle
On the Design node, go to Bundles > Create your first Bundle

By default, only the project metadata are bundled. As a result, all datasets will come empty and models will
come untrained by default.

A good practice is to have the Automation Node connected to separate Production data sources. Dataset
connections can be remapped after uploading the bundle.

The Design node tracks all bundles. You can think of these as versions of your project.

Download the bundle Hands-on

On the Design Node: Select the Bundle and download it

Upload the bundle to the Automation Node Hands-on

Click Import Your First Project Bundle, choose the bundle file on your computer
and click Import

When importing the project, you may be prompted to remap connections and/or
Code Envs.
Activate the bundle Hands-on

From the bundle list, click on your bundle > Activate

Finally, activate your Scenarios
After activating your first bundle, you need to go to the Scenario tab and activate the three
scenarios. You can trigger them to test that everything is OK.
You won't need to activate them again when updating the bundle, as we will see in the next
Project versioning

As new bundles are produced for a project, DSS will track them separately. Although DSS does not provide automatic version numbering, customers are encouraged to utilize a naming schema that is conducive to

Similarly, the automation node will track all the versions that it has received. This makes it easy to understand what has gone on in the project and what is currently

Rolling back to a previous version
From the bundle list, you can always select an older version and click “Activate” to roll back to that version.

Or… use the macro
DSS has a macro for automating pushing a bundle from a design node to an automation node.

For complicated workflows, you can also work directly w/ the DSS APIs and implement whatever logic
is needed.
DSS API Deployer/Node Overview

What is an API ?
An API is a software intermediary that allows two applications to talk to each other and
exchange data over the HTTP protocol.
ex : Getting weather data from Google API

An endpoint is a single path on the API and is contained within an API Service. Each endpoint fulfils a single

The DSS API Node
We can design different kinds of REST Web Services in the Design Node. Those web services can receive ad-hoc
requests from clients and return "real time" responses over the HTTP protocol. Those REST services can be hosted on
a separate DSS instance: the API Node

[Diagram: the Client Application sends a request with features to the API Node, which returns the model prediction.]

DSS lets you create different types of API endpoints.
API Services - Prediction Model Example
In this example, we expose a visual model as an API endpoint.

[Diagram: the Client Application sends a query such as {‘feature1’: 1, ‘feature2’: 2} to the API Node. An optional query enrichment step, backed by a managed or referenced SQL db, adds features (‘feature3’: 3). An optional data transformation (prep script) runs before scoring (Java).]

HTTP(S) Response: {‘prediction’: 42, …. }

DSS API Nodes - Concepts
● Flow
○ Place in Design/Automation node where model is deployed for batch workloads
● API Designer
○ Part of the project where API Services and Endpoints are created/managed.
● API Deployer
○ Central UI to manage all API Nodes and model deployments.
● API Node
○ Server that hosts endpoint as API and responds to REST API calls.
● API Service
○ Unit of deployment on API Node. Can contain many endpoints.
● API Endpoint
○ A single url path on the API node. Can be one of many types (model, python/R function, sql recipe, etc).
● API Service Version
○ A particular version of the API service.
● API Infrastructure
○ Infrastructure that API nodes run on. Can be Static or K8s.
● Model Deployment
○ Main object on the API Deployer. Corresponds to a single API Service Version running on a particular
API Services - Prediction Model Example

[Diagram: service versions (Service A v1, Service A v2, Service B v1) with their endpoints (pred_endpoint, pred_endpoint_2), shown across the Flow, the API Designer, and the Model API.]

API Services - The Model API Deployer
The model API Deployer is a visual interface to centralize the management of your APIs deployed on one or several
Dataiku API Nodes.
It can be used locally (on the same node as the Design or Automation node, with no setup required) or installed as a standalone node.
A local API Deployer can be accessed from the menu.

Installing/Configuring an API Deployer Node
● Design/Automation nodes have an API Deployer built in. The local API Deployer can be used, or a
separate deployer can be set up. A separate deployer is typically recommended when many
Design/Automation nodes will be flowing into the same deployer, or when there are many API nodes or
deployments to manage.
● Install API Deployer Node via:
dataiku-dss-VERSION/ -t apideployer -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the apideployer node, i.e. do NOT use the same ones used for
the Design Node.
● Generate a new API key on the API Deployer (ADMIN > Security > GLOBAL API KEYS). Must have admin
● On every Design/Automation node that will connect to the deployer:
○ Go to Administration > Settings > API Designer & Deployer
○ Set the API Deployer mode to “Remote” to indicate that we’ll connect to another node
○ Enter the base URL of the API Deployer node that you installed
○ Enter the secret of the API key
● The API deployer doesn’t directly access data so we don’t need to set up all the integration steps we did
on the design/automation node.
Installing/Configuring an API Node
● Install API Node via:
dataiku-dss-VERSION/ -t api -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the api node, i.e. do NOT use the same ones used for the Design Node.

● The API Node doesn’t directly access data so we don’t need to set up all the integration steps we did on
the design/automation node.

Setting up Static Infrastructure on API Deployer

● For each API Node, generate an API key

○ ./bin/apinode-admin admin-key-create
● On the API Deployer, go to API Deployer > Infrastructures
○ Create a new infrastructure with “static” type
○ Go to the “API Nodes” settings page
○ For each API node, enter its base URL and the API key
created above
● Then, go to the “Permissions” tab and grant to some user groups the right to deploy models to this

Using K8s for API Node Infra
API Deployer Node must be set up to work with K8s. Requirements are the same as having Design/Automation
node work with K8s. Details will be covered in a later section. Once Configured:

● Go to API Deployer > Infrastructures

● Create a new infra with type Kubernetes
● Go to Settings > Kubernetes cluster

The elements you may need to customize are:

● Kubectl context: if your kubectl configuration file has several contexts, you need to indicate which one DSS
will target. This allows you to target multiple Kubernetes clusters from a single API Deployer by using
several kubectl contexts
● Kubernetes namespace: all elements created by DSS in your Kubernetes cluster will be created in that namespace
● Registry host: registry where images are stored.

Grant permissions to the Infra to the group as needed.

DSS API Deployer Workflow

Deploying our prediction model

The workflow for deploying the prediction model from your Automation node to an API
Node is as follows:

1) Create a new API Service and an API endpoint from your flow model
2) (Optional) Add a data enrichment to the model endpoint
3) Test the endpoint and push a new version to the API deployer
4) (Optional) Deploy our version to our Dev infrastructure
5) Test our version and push it to Production infrastructure
6) (As needed) Deploy a new version of the service with an updated model
7) (As needed) A/B test our two service versions inside a single endpoint
8) Integrate it in our real time prediction App.

Creating an Endpoint in a new Service

● API Services and endpoints can be
created from the flow in the design or
automation node and pushed to the
API Deployer
● If no API Deployer is used, you can
download models from
Design/Automation and upload to the
API Node directly via the CLI.
● Using an API Deployer has many
advantages and is highly
recommended for customers.

Push to API Deployer

- Push to the API Deployer: by doing
so, you create a new version of the
service and ship it to the API Deployer.
- Every Deployment is a new version.

Deploying your API service version to an infrastructure

Once a model is in the API Deployer, it is easy to deploy it to a target infrastructure.

Having multiple infrastructures enables customers to have dedicated dev, test and production API
Nodes. You can connect a single API Deployer to many infrastructures in order to easily manage your envs.

Go to the API Deployer, select your API Service, and start a deployment to infra_dev.

Switching our deployment from dev to prod

- In your dev Deployment, go to
Actions > Copy this deployment
- Select the copy target as the
PRODUCTION stage infrastructure
- Click on “Start now”
- Once the prod deployment is
done, check the Deployments

Switching our deployment from dev to prod

We now have two deployments running, one on our Dev infrastructure and the other in Production

We have a real time prediction API !!
Go to Deployment > Summary > Endpoint URL
This URL is the path to our API endpoint → this is what we will use in our third party apps to get model
predictions.
You will get a different URL for each API node in your infrastructure. You can set up a load balancer to
round-robin across the different endpoints.
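
When no load balancer is available, a client can rotate over the endpoint URLs itself. The sketch below only builds the request and does not send it. The host names, service id and URL layout are hypothetical: copy the real values from the Endpoint URL field. The `{"features": ...}` request body is also an assumption to check against your endpoint's test queries.

```python
import itertools
import json
import urllib.request


def make_round_robin(endpoint_urls):
    """Return a function that yields the next endpoint URL on each call."""
    pool = itertools.cycle(endpoint_urls)
    return lambda: next(pool)


def build_predict_request(endpoint_url, features):
    """Build (but do not send) a POST request carrying the features as JSON."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint URLs, one per API node of the infrastructure
next_url = make_round_robin([
    "http://apinode1:12000/public/api/v1/my_service/pred_endpoint/predict",
    "http://apinode2:12000/public/api/v1/my_service/pred_endpoint/predict",
])
request = build_predict_request(next_url(), {"feature1": 1, "feature2": 2})
# urllib.request.urlopen(request) would send the query to the selected node
```

A real load balancer is still preferable in production, since it can also detect and skip nodes that are down.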

Calling our real time prediction API from the

Deploying a new version of the service
You can deploy a new version of your service at any time in the API Designer.
Click on your service and push a new version (‘v2’, etc) to the API Deployer.

Deploying a new version of the service
Go to your API Deployer, deploy the new version of your deployment to your dev infrastructure, select
“Update an existing deployment”

A/B testing service versions
In order to A/B test our 2 service versions, we will have to randomly dispatch the queries between
version 1 and version 2 :
1. Click on your Deployment > Settings
2. Set Active version mode to “Multiple Generations”
3. Set Strategy to “Random”
4. Set Entries to :
{"generation": "v2",
"proba": 0.6},
{"generation": "v1",
"proba": 0.4}
5. Save and update deployment
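
To build intuition for the “Random” strategy, the sketch below simulates a weighted random dispatch using the same entries. It illustrates the idea only and is not how the API node is implemented. The function name and the fixed seed are choices made for the example.

```python
import random
from collections import Counter

ENTRIES = [
    {"generation": "v2", "proba": 0.6},
    {"generation": "v1", "proba": 0.4},
]


def pick_generation(entries, rng=random):
    """Pick one generation at random, weighted by its 'proba' value."""
    generations = [entry["generation"] for entry in entries]
    weights = [entry["proba"] for entry in entries]
    return rng.choices(generations, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
counts = Counter(pick_generation(ENTRIES, rng) for _ in range(10_000))
share_v2 = counts["v2"] / 10_000  # expected to be close to 0.6
```

Over many queries the observed share of each version approaches its configured probability, while any single query can land on either version.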

A/B testing service versions
Go back to the predictions webapp, run the same query several times, and see how it is
dispatched between version 1 and 2 !

DSS Automating the API Deployer

Create new API Service Version in Scenario
Go to your scenario’s steps
Add a step Create API Service Version → This will create a new API service version with the model

Create new API Service Version in Scenario

1. Choose your API Service
2. Add an id to that version
3. Check the box to make the version id unique for future runs
4. Publish to the API Deployer
5. Add a variable name in Target variable → this will save the version id to a variable that we
will be able to use in later steps of the scenario

Update API deployment in Scenario
Adding a step to Update our deployment in the API Deployer


Id of your deployment on the API

New service version id → this uses the variable we just created before

Save and run the scenario, Go to the API Deployer and check that your new version
is deployed on dev infrastructure
Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Install Automation and API Nodes

Lab 2: Test Automation Node
Lab 3: Install API node
Lab 4: Test the API node

Module 5

Code Environments

DSS Code Environments

Code Environments in DSS
Customize your environment: code env !

DSS allows you to create an arbitrary number of code environments !

→ A code environment is a standalone and self-contained environment to run Python or R code
→ Each code environment has its own set of packages
→ In the case of Python environments, each environment may also use its own version of Python
→ You can set a default code env per project
→ You can choose a specific code env for any Python/R recipe
→ You can choose a specific code env for the visual ML
Code Environments in DSS

➢ DSS allows Data Scientists to create and manage their own Python and R code
environments, if given permission to do so by an Admin (Group Permissions)
➢ These Envs can be activated and deactivated for different pieces of code/levels in
DSS including
○ Projects, web apps, notebooks, and plugins
➢ To create/manage Code Envs: Click the Gear -> Administration -> Code Envs

Code Environments in DSS

➢ When creating a New Code ENV in DSS, it is best practice to:
○ Keep it Managed by DSS
○ Install Mandatory Packages for DSS
○ Install Jupyter Support
➢ Options for
○ Using Conda
■ Conda must be on PATH
○ Python Version (2 and 3 supported)
■ Python version must be on PATH
○ Importing your own ENV
➢ Base Packages:
○ Mandatory: must be included to work in DSS
○ Jupyter: must be included to use in Notebook
➢ Non-Managed Code Env:
○ Point to path of python/R environment on the DSS host. DSS will not modify this environment.
Code Environments in DSS
Uploading a Pre-built ENV

➢ You can upload your own pre-built environment by selecting a file on
your computer
○ Make sure it has these mandatory
Dataiku Packages for core feature
functionality of the Internal Dataiku
○ Essentially, pass in a

Code Environments in DSS
Installing Packages to your Env

➢ To Install Packages to your ENV

○ Click on your ENV in the list of
○ Go to ‘Packages to Install’ section
○ Type in the packages you wish to
install line by line, like how you
would for a requirements.txt file
○ Click Save and Update

➢ Standard pip syntax applies
○ e.g. -e /local/path/to/lib will
install a local python package not
available on PyPI
➢ Review installed packages in
“Installed Packages”
Code Environments in DSS
Other Options

➢ Permissions
○ Allow groups to use the code env
and define their level of use: i.e.
use only, can manage/update
➢ Container Exec
○ Build docker images that include
the libraries of your code env
○ Build for specific container
configs or all configs
➢ Logs
○ Review any errors in install code

Code Environments in DSS
Activating Code Envs

➢ To activate an ENV for all code recipes in a project
○ Go to Project Settings
○ Settings Tab
○ Code Recipes
○ Select the ENV you want to use
➢ You can set the ENV to use for
a notebook and other
applications separately

Using Non-standard Repositories

● By default, DSS will connect to public repositories
(PyPI/Conda/CRAN) in order to download libraries
for code envs.
● This is undesirable for some customers:
○ air-gapped installs
○ customers with restrictions on library use
● Admins can set up specific mirrors for use in code envs
○ ADMIN > SETTINGS > MISC > Code env extra
● Set CRAN mirror URL, extra options for pip/conda as
needed. Follow standard documentation.
○ example: --index-url for pip
R Studio Integration

RStudio Integration - Overview
● DSS comes with Jupyter pre-installed for Notebooks use. This enables use of coding in:
○ Python
○ R
○ Scala
● Some Data Scientists prefer using different editors. Options are available for non-Jupyter use:
○ Embedded in DSS:
■ RStudio Server on DSS Host
■ RStudio Server External to DSS Host
○ Other External Coding:
■ RStudio Desktop
■ PyCharm
■ Sublime
● Note, execution is always done via DSS. External coding allows connecting to DSS via API to edit code and push
it back into DSS.

RStudio Integration - Desktop

● Install Dataiku Package:

○ install.packages("http(s)://DSS_HOST:DSS_PORT/public/packages/dataiku_current.tar.gz", repos=NULL)

● Set up connection to DSS:

○ In code:

○ In Env Variables:

○ In ~/.dataiku/config.json
● Addins menu now has options for
interacting with dataiku
● Docs have a user tutorial for
working with these commands

RStudio Integration - External Server

● RStudio on an external host can be set up exactly like RStudio Desktop to remotely work with DSS
● Additionally, you can embed RStudio Server in the DSS UI:
○ Edit /etc/rstudio/rserver.conf and add a line www-frame-origin = BASE_URL_OF_DSS
○ Restart RStudio Server
○ Edit DSS_DATA_DIR/config/ and add a line
○ Restart DSS
● RStudio can now be accessed via the UI.
● Log in to RStudio Server as usual
● Interact w/ DSS as described with Desktop Integration.

RStudio Integration - Shared Server

● If
○ RStudio Server is on the same host as DSS
○ MUS is enabled
○ the same unix account is used for DSS and RStudio, then
● An enhanced integration is available:
○ DSS will automatically install the dataiku package in the user’s R library
○ DSS will automatically connect RStudio to DSS, so that you don’t have to declare the URL and API token
○ DSS can create RStudio projects corresponding to the DSS project
● Embed RStudio as described for the external host. DSS then has an “RStudio Actions” page where you can:
○ Install R Package
○ Setup Connection
○ Create Project Folder

Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Creating a Managed Code Environment

Lab 2: Creating a Python 3 Code Environment
Lab 3: Create an Unmanaged Code Environment
Lab 4: Create Local Python Mirror (Optional)

Module 6

DSS Maintenance

DSS Logs

DSS Logs

There are many types of logs in DSS:

- Main DSS Processes logs

- Jobs logs
- Scenario Logs
- Analysis Logs
- Audit logs

Main DSS Process Log Files

Main DSS Processes log files

Those logs are located in the DATA_DIR/run directory and are also accessible through the UI
(Administration > Maintenance > Log files)

Main DSS Processes log files
By default, the “main” log files are rotated when they reach a given size, and purged after a given number of
rotations. By default, rotation happens every 50 MB and 10 files are kept.

Those default values can be changed in the DATA_DIR/install.ini file (the installation configuration file)

Job Logs
Every time you run a recipe, a log file is generated. To see it, go to the project's Jobs page: click on the triangle ("play") sign
or type the “gj” keyboard shortcut.

The last 100 job log files can be seen through the UI (see picture above). All the job logs files are stored in the
DATA_DIR/jobs/PROJECT_KEY/ directory.
Job Logs
When you click on a job log, you can view the full log or download a job diagnosis.

When interacting with Dataiku support about a job, it is good practice to send us a Job diagnosis.

The DATA_DIR/jobs/PROJECT_KEY log files are not automatically purged, so the directory can quickly become big.

You need to clean old job log files once in a while. A good way to do this is through the use of Macros, which we will
discuss later.
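
Before deleting anything by hand, it helps to list what would be removed. The sketch below is a dry run only and is not the built-in macro. It assumes each job has its own subdirectory under DATA_DIR/jobs/PROJECT_KEY/ and uses the directory modification time as the age of the job. The function name and the 30-day threshold are hypothetical.

```python
import time
from pathlib import Path


def old_job_dirs(data_dir, project_key, max_age_days, now=None):
    """Return job directories of a project not modified for more than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    project_jobs = Path(data_dir) / "jobs" / project_key
    if not project_jobs.is_dir():
        return []
    return sorted(
        path for path in project_jobs.iterdir()
        if path.is_dir() and path.stat().st_mtime < cutoff
    )
```

For example, `old_job_dirs("/path/to/DATA_DIR", "MYPROJECT", 30)` returns the candidates for cleanup, which an administrator can review before removing them.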
Scenario Logs
- Each time a scenario is run in DSS, DSS makes a snapshot of the project configuration/flow/code, runs the
scenario (which, in turn, generally runs one or several jobs), and keeps various logs and diagnostic
information for this scenario run.
- The log files are located in the scenario section, in the tab last run

- on the DATA_DIR, scenario logs are located at

Visual Analysis Logs
- Amongst a lot of other info, Visual Analysis creates a log for each model trained. This log file can be
accessed via the Visual Analysis component in Model Information> Training Information.
- Additionally, this gets saved in the directory:


- These logs are not rotated, along w/ the other data in Visual
- You can manually remove files or delete analysis data
via a macro.

Audit Trail Logs
- DSS includes an audit trail that logs all actions performed by the users, with details about user id,
timestamp, IP address, authentication method, …
- You can view the latest audit events directly in the DSS UI: Administration > Security > Audit trail.

- Note that this live view only includes the last 1000 events logged by DSS, and it is reset each time
DSS is restarted. You should use log files (in DATA_DIR/run/audit) or external systems for real
auditing purposes.
Audit Trail Logs

- The audit trail is logged in the DATA_DIR/run/audit directory

- This folder is made of several log files, rotated automatically. Each file is rotated when it reaches 100 MB,
and up to 20 history files are kept.

Modifying Log Levels
● Log levels can be modified by changing parameters in:
○ install_dir/resources/logging/
● Configure by logger + by process.
○ Logger is typically 4th component you see in a log file, i.e.:
○ [2017/02/13-09:01:01.421] [DefaultQuartzScheduler_Worker -1] [INFO]
[dku.projects.stats.git] - [ct: 365] Analyzing 17 commits

○ Processes are what we discussed in DSS architecture, jek, fek, etc. dku applies to all processes.

○ You can split processes out to their own log file as well, i.e.
○ install_dir/resources/logging/dku-
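
To find which logger to tune, it can help to pull the fourth bracketed component out of a log line programmatically. A minimal sketch, assuming every line follows the layout of the sample above (timestamp, thread, level, logger, then the message). The regular expression and the field names are choices made for this example.

```python
import re

LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"\[(?P<level>[^\]]+)\]\s+"
    r"\[(?P<logger>[^\]]+)\]\s+-\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Split a log line into its bracketed components, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "[2017/02/13-09:01:01.421] [DefaultQuartzScheduler_Worker -1] [INFO] "
    "[dku.projects.stats.git] - [ct: 365] Analyzing 17 commits"
)
parsed = parse_log_line(sample)
logger_name = parsed["logger"]
```

Lines that do not match, such as the continuation lines of a stacktrace, return `None` and can be skipped or attached to the previous entry.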

DSS Diagnostic Tool
You may have noticed the Diagnostic tool in the maintenance tab. When interacting with the DSS support
about an issue that is not related to a specific job, they may request this information.

This creates a single file that gives DSS support a good understanding of the configuration of your system, for
aiding in resolving issues.

You’ll be able to configure options for inclusion.


Troubleshooting Backend Issues

UI Down

• Check process status

• Check the backend.log in $DIP_HOME/run/ (prefer tail over other tools)
• Search for *Exceptions [ERROR] and stacktraces
• If dataset related, test the connection

UI accessible

• Check the backend.log via the UI (admin>maintenance>backend.log)

• Search for *Exceptions [ERROR] and stacktraces
• Test the same action on other projects or items
• If dataset related, test the connection
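
The search for errors in backend.log can also be scripted. The sketch below scans the tail of a log file for lines containing [ERROR] or the word Exception. The function name and the default of 2000 lines are hypothetical, and the file is read fully into memory, which is acceptable for rotated logs of bounded size but not for arbitrarily large files.

```python
import re
from pathlib import Path

ERROR_PATTERN = re.compile(r"\[ERROR\]|Exception")


def find_error_lines(log_path, last_n_lines=2000):
    """Return (line_number, text) pairs from the tail of a log that look like errors."""
    lines = Path(log_path).read_text(errors="replace").splitlines()
    start = max(0, len(lines) - last_n_lines)
    return [
        (number, text)
        for number, text in enumerate(lines[start:], start=start + 1)
        if ERROR_PATTERN.search(text)
    ]
```

The returned line numbers refer to the full file, so they can be used to open the log at the right place and read the surrounding stacktrace.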
Troubleshooting Job Issues
• Read the exception stacktrace and focus on the ‘Caused by’ section
when it exists
• Test every underlying connection
○ Test outside DSS as well to exclude underlying data platform
• Try to test it from a notebook if possible
• Try to retrieve the command launched from the backend.log

Troubleshooting UI Issues

Browser dev tools


Troubleshooting Notebook Issues

• Read the Notebook stacktrace. Differentiate between coding
errors and system errors
• Inspect ipython.log for more details
• Ensure the correct code env is used
• Ensure the correct kernel is used. Try restarting the kernel
• For Hadoop connections, ensure they are working
properly outside of the notebook.

Troubleshooting Hadoop/Spark Issues

• Read the DSS message to understand the underlying problem. Check the backend log to see if
more info is provided.
• Double-check logs on hadoop/spark to better understand the issue
• For connection issues, try running on the DSS host, outside of DSS (i.e. spark-shell,
beeline, etc)
• For Spark/Yarn issues, get the yarn application_ID in the DSS log and check the logs.
• Performance issues: often a result of poor configuration or a sub-optimal flow in
DSS (i.e. running a spark job on a sql dataset instead of an hdfs dataset, etc).

Working with DSS support

Forward to support:

• Get the DSS diagnostic: ./bin/dssadmin run-diagnosis -i /tmp/mydiag.zip

• Get the job diagnostic
• Get the system info

Working with DSS support
For customers only, open a ticket on our support portal: Or send an email to

Another channel for support is the Intercom chat that you can reach anywhere on

At times, logs or diagnosis might be too big to be attached to your request. You may want to use to transfer files

Try to internally manage your questions to the Dataiku support to avoid duplicates and to make sure
everybody on your team benefits from the answers.
Working with DSS support - Intercom
Intercom is the place to visit for usage questions. See example below. (Also, check the documentation :D )
Refrain from using any support channels for code review or administrating task over which we have no

Usage / Feature capabilities:
✓ How can I change the sample of data shown in my prepare recipe?
✓ How can I modify the size of the bins on the chart?
✓ For my flow, where would be the best place to filter my data? I am doing it through the join recipe but is that efficient?

Debugging code / Performance tuning / Administrative requests / Advanced data science consulting:
✘ My code is not working. Can you please review my code?
✘ Can you grant me access to an additional database?
✘ Can you tell me what algorithm will provide the best performance for my

DSS Data Directory, Disk Space, +

Dataiku Data Directory - DATA_DIR

The data directory is the unique location on the DSS server where DSS stores all its
configuration and data files.
Notably, you will find here:
- Startup scripts to start and stop DSS.
- Settings and definitions of your datasets, recipes, projects, …
- The actual data of your machine learning models.
- Some of the data for your datasets (those stored in dss managed local connections).
- Logs.
- Temporary files
- Caches
The data directory is the directory which you set during the installation of DSS on your server
(the -d option).
It is highly recommended that you reserve at least 100 GB of space for the data directory
Dataiku Data Directory - DATA_DIR
DATA_DIR
├── R.lib  R libraries installed by calling install.packages() from an R notebook.
├── analysis-data  data for the models trained in the Lab part of DSS.
├── apinode-packages  code and config related to api deployments
├── bin  various programs and scripts to manage DSS.
├── bundle_activation_backups
├── caches  various precomputed information (avatars, samples, etc)
├── code-envs  definitions of all code environments, as well as the actual packages.
├── code-reports
├── config  all user configuration and data. license.json, etc
├── data-catalog  data used for data catalog, table indices, etc
├── databases  several internal databases used for operation of DSS
├── dss-version.json  version of dss you’re running
├── exports  used to generate exports (notebooks, datasets, rmarkdown, etc)
├── html-apps
├── install-support  internal files
├── install.ini  file to customize the installation of DSS
├── instance-id.txt  uid of installed dss
├── jobs  job logs and support files for all flow build jobs in DSS
├── jupyter-run  internal runtime support files for the Jupyter notebook. cwd resides in here for all notebooks
├── lib  administrator-installed global custom libraries (Python and R), as well as JDBC drivers.
├── local  administrator-installed files for serving in web applications
├── managed_datasets  location of the “filesystem_managed” connection
├── managed_folders  location of the “filesystem_folders” connection
├── notebook_results  query results for SQL / Hive / Impala notebooks
├── plugins  plugins (both installed in DSS, and developed directly in DSS)
├── prepared_bundles  bundles
├── privtmp  temp files, don’t modify
├── pyenv  builtin Python environment of DSS
├── run  all core log files of DSS
├── saved_models  data for the models trained in the Flow
├── scenarios  scenario configs and logs
├── timelines  databases containing timeline info of dss objects
├── tmp  tmp files
└── uploads  files that have been uploaded to DSS to use as datasets.
For more info:
Managing DSS Disk Usage

- Various subsystems of DSS consume disk space in the DSS data directory.
- Some of this disk space is automatically managed and reclaimed by DSS (like
temporary files), but some needs some administrator decision and
- For example, job logs are not automatically garbage collected, because a user or
administrator may want to access it an arbitrary amount of time later.

There are two ways to delete those files:

1) Manually delete them on the DATA_DIR (cron task)
2) or use DSS Macros in a scenario.

We will cover Macros in a bit but first let's see what other files we can delete in the

Managing DSS Disk Usage

- Some logs are not rotated (Jobs and Scenarios). It is then crucial to clean those
once in a while.
- In addition to those files, there are some other types of files that can be deleted
to regain some disk space.
1) Analysis Data. analysis-data/ANALYSIS_ID/MLTASK_ID/
2) Saved Models.
3) Exports Files exports/
4) Temporary Files (manual deletion only) tmp/
5) Caches (manual deletion only) caches/
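
To decide where cleanup will pay off, measure which of these directories are actually large. A minimal sketch, assuming it runs as a user that can read the data directory. The function names are hypothetical, and the default directory list is taken from the listing above.

```python
import os
from pathlib import Path


def dir_size_bytes(path):
    """Total size of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def usage_report(data_dir, subdirs=("jobs", "scenarios", "analysis-data",
                                    "saved_models", "exports", "tmp", "caches")):
    """Map each existing subdirectory of data_dir to its size in bytes, largest first."""
    sizes = {
        name: dir_size_bytes(Path(data_dir) / name)
        for name in subdirs
        if (Path(data_dir) / name).is_dir()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

Running the report on a schedule and keeping the results makes it easier to spot which subsystem is growing before the disk fills up.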

DSS Macros

Macros are predefined actions that allow you to automate a variety of tasks, like:

● Maintenance and diagnostic tasks

● Specific connectivity tasks for import of data
● Generation of various reports, either about your data or DSS

Macros can either be:

● Run manually, from a Project’s “Macros” screen.

● Run automatically from a scenario step
● Made available for running to dashboard users by adding them on a dashboard.
Macros can be:

● Provided as part of DSS

● In a plugin
● Developed by you
Macros Provided by DSS

- Go to any project and click on Macros on the navigation bar

- Fill out macro settings and run!

Backup/Disaster Recovery
• Periodic backup of DATA_DIR (contains all config/DSS state)
• Consistent live backup requires snapshots (disk-level for cloud and NAS/SAN, or OS-level with LVM)
• Industry standard backup procedure applies

Dataiku Data Directory - DATA_DIR
Dataiku recommends backing up the entire data directory. If, for whatever reason,
that is not possible, the following are essential to backup:

Include in Backups:
R.lib
analysis-data
plugins
config
databases
lib
local
managed_folders
managed_results
jobs
pyenv
privtmp
saved_models
timelines
uploads
HA and Scalability

DSS Design and DSS Automation support active/passive high
availability. This requires the use of a shared file system (must
support setfacl for MUS. SAN is recommended) between the
different nodes (or a file system replicated w/ sync).

[Diagram: an Active DSS and a Passive DSS on top of a shared File System.]

The scoring nodes are all stateless, thus they support
active/active high availability.

The number of API nodes required depends on the target QPS (Queries Per Second):
● Optimized models (java, spark, or SQL engines; see documentation) can lead
to 100 to 2000 QPS
● For non-optimized models, expect 5-50 QPS per node
● If using an external RDBMS, it has to be HA itself
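
These ranges translate into a simple sizing rule: divide the target QPS by what one node sustains and round up. The sketch below adds one spare node for failover, which is an assumption of this example and not a Dataiku recommendation. The per-node figures should come from your own load tests.

```python
import math


def api_nodes_needed(target_qps, qps_per_node, spare_nodes=1):
    """Nodes required to serve target_qps, plus spare nodes for failover."""
    if target_qps < 0 or qps_per_node <= 0:
        raise ValueError("target_qps must be >= 0 and qps_per_node must be > 0")
    return math.ceil(target_qps / qps_per_node) + spare_nodes


# Optimized model at the low end of its range: 100 QPS per node
optimized = api_nodes_needed(1500, 100)
# Non-optimized model at the low end of its range: 5 QPS per node
non_optimized = api_nodes_needed(200, 5)
```

With these inputs, serving 1500 QPS with an optimized model takes 16 nodes, while serving only 200 QPS with a non-optimized model takes 41, which shows why the choice of scoring engine matters for sizing.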
DSS Public API

The DSS Public API
The DSS public API allows you to interact with DSS from any external system. It allows you to perform a large
variety of administration and maintenance operations, in addition to access to datasets and other data
managed by DSS.

The DSS public API is available:

• As an HTTP REST API. This lets you interact with DSS from any program that can send an HTTP request.
• As a Python API client. This allows you to easily send commands to the public API from a Python program.

The public API Python client is preinstalled in DSS. If you plan on using it from within DSS (in a recipe,
notebook, macro, scenario, ...), you don’t need to do anything specific.
● To use the Python client from outside DSS, simply install it from pip.
○ pip install dataiku-api-client

The DSS Public API - Internal Use

When in DSS, you will inherit the credentials of the user writing the python code. Hence you don’t need an
API key: calling dataiku.api_client() returns a client that is already authenticated.

The DSS Public API - External Use

By contrast, when accessing DSS from the outside, you need credentials to connect: an API key.
You can define an API key in the settings of a project, then pass it to the client along with the URL of the DSS instance.
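
Both connection styles can be wrapped in one helper. This is a sketch under assumptions: the `dataiku` package is only importable inside DSS, `dataikuapi` is the package installed by `pip install dataiku-api-client`, and the helper name `get_client` and the example URL are hypothetical.

```python
def get_client(host=None, api_key=None):
    """Return a DSS API client, using an API key only when called from outside DSS."""
    if host is None:
        # Inside DSS (recipe, notebook, scenario): credentials are inherited
        import dataiku
        return dataiku.api_client()
    # Outside DSS: an API key and the base URL of the instance are required
    import dataikuapi
    return dataikuapi.DSSClient(host, api_key)
```

From an external script, the call would look like `client = get_client("http://dss.example.com:11200", "<key>")`. Inside DSS, `get_client()` with no arguments is enough.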

The DSS Public API- Generating API Keys.
There are three kinds of API keys for the DSS REST API:

● Project-level API keys: privileges on the content of the project only. They cannot give access to
anything which is not in their project. http://YOUR_INSTANCE/projects/YOUR_PROJECT/security/api

● Global API keys: encompass several projects. Global API keys can only be created and modified by DSS
administrators. http://YOUR_INSTANCE/admin/security/apikeys/

● Personal API keys : created by each user independently. They can be listed and deleted by admin, but
can only be created by the end user. A personal API key gives exactly the same permissions as the user who
created it. http://YOUR_INSTANCE/profile/apikeys/

DSS Public API- Generating Global API Keys

To create a global API key:

1) Either through the UI: go to Administration > Security > Global API keys > add a new key.
Specify the permissions desired for the key, which DSS user to impersonate, etc.

2) or with the command line tool:

./DATA_DIR/bin/dsscli api-key-create
The DSS Public API - Documentation
➢ The Dataiku Public API is capable of a lot!
○ Use it to fully customize/automate processes inside DSS from external and internal systems

The DSS Public API - Python Examples
The Public API can help you interact with several parts of DSS:
✓ Managing users:

✓List users:

✓Create user:

✓Change user parameters:

✓Drop user:
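
The user operations listed above can be sketched with the Python client as follows. `client` is assumed to be a `dataikuapi.DSSClient` (or the object returned by `dataiku.api_client()`) with admin rights. The wrapper function names are hypothetical, and the client method names should be checked against the API documentation for your DSS version.

```python
def list_logins(client):
    """Return the login of every user known to DSS."""
    return [user["login"] for user in client.list_users()]


def create_local_user(client, login, password, display_name, groups, profile):
    """Create a local user and return its handle."""
    return client.create_user(
        login, password, display_name=display_name,
        source_type="LOCAL", groups=groups, profile=profile,
    )


def set_user_groups(client, login, groups):
    """Replace the groups of an existing user."""
    user = client.get_user(login)
    definition = user.get_definition()
    definition["groups"] = list(groups)
    user.set_definition(definition)


def drop_user(client, login):
    """Delete a user."""
    client.get_user(login).delete()
```

Groups, connections and projects follow the same pattern: a `list_*` or `create_*` call on the client, then a handle object whose definition can be read, modified and saved.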

The DSS Public API - Python Examples

The Public API can help you interact with several parts of DSS:
✓ Managing groups:

✓List groups:

✓Create group:

✓Drop group:

The DSS Public API - Python Examples

The Public API can help you interact with several parts of DSS:
✓ Managing connections:

✓List connections:

✓Create connection:

✓Drop connection:

The DSS Public API - Python Examples

The Public API can help you interact with several parts of DSS:
✓ Managing projects:
✓Create new project:

✓Change project metadata

✓Handle permissions

✓Drop the project:

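import json

import requests

# Create a user through the REST API
HOST = "http://<host>:<port>/public/api/admin/users/"
API_KEY = "<key>"
HEADERS = {"Content-Type": "application/json"}
DATA = {
    "login": "user_x",
    "sourceType": "LOCAL",
    "displayName": "USER_X",
    "groups": [],  # names of the groups the user belongs to
    "userProfile": "DATA_SCIENTIST",
}
# The API key is sent as the HTTP basic-auth user name, with an empty password
r = requests.post(HOST, auth=(API_KEY, ""), headers=HEADERS, data=json.dumps(DATA))
r.raise_for_status()  # fail loudly if DSS rejected the request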
Dataiku Command Line Tool - dsscli
dsscli is a command-line tool that can perform a variety of runtime administration tasks on DSS. It can be
used directly by a DSS administrator, or incorporated into automation scripts.

dsscli is made of a large number of commands. Each command performs a single administration task.

From the DSS data directory, run ./bin/dsscli <command> <arguments>

● Running ./bin/dsscli -h will list the available commands.

● Running ./bin/dsscli <command> -h will show the detailed help of the selected command.

For example, to list jobs history in project MYPROJECT, use ./bin/dsscli jobs-list MYPROJECT

Time for the Lab!

Refer to the Lab Manual for exercise


Lab 1: Troubleshooting via Logs

Lab 2: Disk Space Maintenance
Lab 3: Flow Limits
Lab 4: Using the DSS APIs (Optional)

Module 7

Resource Management in DSS

CGroups in DSS

DSS 5.0 introduces new options for resource management:

● Resource control: full integration with Linux cgroups to restrict resource usage
per project, user, category, … and protect DSS against memory overruns

● Docker: Python, R and in-memory visual ML recipes can be run in Docker containers:
○ Ability to push computation to a specific remote host
○ Ability to leverage hosts with different computing capabilities, such as GPUs
○ Ability to restrict the resources used (CPU, memory, …) per container
○ But no global resource control, and the user has to decide which host to use (no magic distribution)

● Kubernetes: ability to push DSS in-memory computation to a cluster of machines

○ Native ability to run on a cluster of machines. Kubernetes automatically places containers on machines depending on resources availability.
○ Ability to globally control resource usage.
○ Managed cloud Kubernetes services can have auto-scaling capabilities.


Using cgroup for resource control
Feature description

● This feature allows control over usage of memory, CPU (+ other resources) by most processes.
● The cgroups integration in DSS is very flexible and allows you to devise multiple resource allocation
strategies:
○ Limiting resources for all processes from all users
○ Limiting resources by process type (i.e. a resource limit for notebooks, another one for webapps, …)
○ Limiting resources by user
○ Limiting resources by project key


Using cgroup for resource control

● cgroups must be enabled on the Linux DSS server (this is the default on all recent DSS-supported
distributions)
● The DSS service account needs to have write access to one or several cgroups
● This normally requires some action to be performed at system boot before DSS startup, and can be
handled by the DSS-provided service startup script
● This feature works with both regular and multi-user security


Using cgroup for resource control
Processes that can be controlled by cgroups

● Python and R recipes

● PySpark, SparkR and sparklyr recipes (only applies to the driver part, executors are covered by the
cluster manager and Spark-level configuration keys)
● Python and R recipes from plugins
● Python, R and Scala notebooks (not differentiated, same limits for all 3 types)
● In-memory visual machine learning and deep learning (for scikit-learn and Keras backends. For MLlib
backend, this is covered by the cluster manager and Spark-level configuration keys)
● Webapps (Shiny, Bokeh and Python backend of HTML webapps, not differentiated, same limits for all
3 types)


Using cgroup for resource control
Processes that CANNOT be controlled by cgroups

● The DSS backend itself

● Execution of jobs with the DSS engine (prepare recipe and others)
● The DSS public API, which runs as part of the backend
● Custom Python steps and triggers in scenarios


Using cgroup for resource control
Configuration in Administration > Settings > Resource control - General


Using cgroup for resource control
Definition of Target Cgroups

● A process can be placed into multiple cgroup targets

● Cgroup target definitions can use variables for dynamic placement strategies
○ memory/DSS/${user} => will place the process in a dedicated cgroup for each user
○ memory/DSS/${projectKey} => will place the process in a dedicated cgroup for each project

● The applicable limits are the ones made available by Linux cgroups (check the Linux documentation for more information):
○ memory.limit_in_bytes : sets the maximum amount of user memory (including file cache). If no units are specified, the
value is interpreted as bytes. However, it is possible to use suffixes to represent larger units — k or K for kilobytes, m or
M for megabytes, and g or G for gigabytes
○ cpu.cfs_quota_us and cpu.cfs_period_us : cpu.cfs_quota_us specifies the total amount of time in microseconds for
which all tasks in a cgroup can run during one period as defined by cpu.cfs_period_us.
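
To make the target variables and limit values above concrete, here is a small sketch in plain Python. It does not reproduce how DSS expands targets internally; `expand_target`, `parse_memory_limit` and `cpu_share` are hypothetical helpers. The 1024-based multipliers reflect how the Linux kernel reads the k/m/g suffixes for cgroup v1 memory limits.

```python
from string import Template

# Binary multipliers used by the kernel for memory.limit_in_bytes suffixes
_UNIT_FACTORS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def expand_target(template, **variables):
    """Expand ${user} / ${projectKey} style variables in a cgroup target."""
    return Template(template).substitute(**variables)


def parse_memory_limit(value):
    """Convert a limit such as '512m' or '2G' to a number of bytes."""
    text = value.strip()
    suffix = text[-1].lower()
    if suffix in _UNIT_FACTORS:
        return int(text[:-1]) * _UNIT_FACTORS[suffix]
    return int(text)  # no suffix: the value is already in bytes


def cpu_share(quota_us, period_us):
    """CPUs' worth of run time allowed per period (quota divided by period)."""
    return quota_us / period_us


print(expand_target("memory/DSS/${user}", user="alice"))
print(parse_memory_limit("2g"))
print(cpu_share(200000, 100000))
```

With cpu.cfs_period_us at 100000 and cpu.cfs_quota_us at 200000, the tasks in the cgroup can together use up to two CPUs' worth of time in each period.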
Using cgroup for resource control
Server side setup preparation
● On most Linux distributions, the “cpu” and “memory” controllers are mounted in different hierarchies, generally:
○ /sys/fs/cgroup/memory
○ /sys/fs/cgroup/cpu
● You will first need to make sure that you have write access to a cgroup within each of these
hierarchies.
● To avoid conflicts with other parts of the system which manage cgroups, it is advised to configure
dedicated subdirectories within the cgroup hierarchies for DSS, e.g.:
○ /sys/fs/cgroup/memory/DSS
○ /sys/fs/cgroup/cpu/DSS
● Note that these directories will not persist over a reboot. You can modify the DSS startup script
(/etc/sysconfig/dataiku[.INSTANCE_NAME]) to create these.
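
The directory creation step can be sketched as follows. This is an illustration in Python rather than the startup script itself, and `ensure_dss_cgroup_dirs` is a hypothetical helper. On a real server it would need to run as root at boot, and the resulting directories would still have to be made writable by the DSS service account.

```python
import os

CONTROLLERS = ("memory", "cpu")


def ensure_dss_cgroup_dirs(cgroup_root="/sys/fs/cgroup", name="DSS"):
    """Create <cgroup_root>/<controller>/<name> for each controller if missing."""
    created = []
    for controller in CONTROLLERS:
        path = os.path.join(cgroup_root, controller, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

Passing a scratch directory as `cgroup_root` lets you try the function without touching the real cgroup hierarchy.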


Managing Memory for DSS Processes

JVM Memory Model
➢ You need to tell Java how much memory it can allocate
➢ -Xms => Minimum amount of memory allocated for the heap
(Your Java process will never consume less memory than this limit + a fixed overhead)
➢ -Xmx => Maximum amount of memory allocated for the heap
(Your Java process will never consume more memory than this limit + a fixed overhead)
➢ Java allocates memory when it needs it, and deallocates memory it has not used for
a while.
○ For that, Java uses a Garbage Collector, which periodically scans the Java
program to find unused memory blocks and reclaim them.
➢ If your program requires more memory than the authorized maximum (Xmx), the
program will throw an OutOfMemory exception, but before that the Garbage
Collector will do its best to find the memory your program is asking for
○ More often than not, the Java process seems stuck before it throws an
OutOfMemory exception because all CPU cycles of the Java process are
burned by the GC (which tries to find memory for you) rather than by the actual
program.
Java Memory Settings
If you experience OOM issues, you may want to modify the memory settings in the data_dir/install.ini file:
● Stop DSS
● [javaopts]
● backend.xmx = Xg
○ Default of 2g, global
○ For large production instances, may need to be as high as 20g
○ Look for “OutOfMemoryError: Java heap space” or “OutOfMemoryError: GC overhead limit
exceeded” before “DSS Startup: backend version” in backend.log
● jek.xmx = Xg
○ Default of 2g, multiplied by the number of JEKs
○ Increase incrementally by 1g
○ Look for “OutOfMemoryError: Java heap space” or “OutOfMemoryError: GC overhead limit
exceeded” in the job log
● fek.xmx = Xg
○ Default of 2g, multiplied by the number of FEKs
○ Increase incrementally by 1g
● Restart DSS
● Note: You should typically only increase these per the instructions of Dataiku.
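
The log checks described above can be scripted. The sketch below is a plain-Python illustration with hypothetical helper names (`find_oom_lines`, `backend_restarted_after_oom`). The sample text is made up for the example and is not real DSS log output.

```python
import re

# Matches the two heap-related messages, whatever their letter case
OOM_PATTERN = re.compile(
    r"OutOfMemoryError: (Java heap space|GC overhead limit exceeded)",
    re.IGNORECASE,
)
STARTUP_MARKER = "DSS Startup: backend version"


def find_oom_lines(log_text):
    """Return (line_number, line) pairs for lines reporting a heap-related OOM."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if OOM_PATTERN.search(line)
    ]


def backend_restarted_after_oom(log_text):
    """True if an OOM line is followed later by a backend startup line."""
    seen_oom = False
    for line in log_text.splitlines():
        if OOM_PATTERN.search(line):
            seen_oom = True
        elif seen_oom and STARTUP_MARKER in line:
            return True
    return False


# Made-up sample, for illustration only
sample = "\n".join([
    "INFO some earlier activity",
    "java.lang.OutOfMemoryError: Java heap space",
    "INFO DSS Startup: backend version <version>",
])
print(find_oom_lines(sample))
print(backend_restarted_after_oom(sample))
```

For backend.log, `backend_restarted_after_oom` corresponds to the "OOM before DSS Startup" check; for job logs, `find_oom_lines` alone is enough.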
Other Processes
● Spark drivers
○ Configure Spark driver memory with spark.driver.memory
○ Or use cgroups
● Notebooks
○ Unload notebooks
○ Admins can force shutdown
○ Use cgroups
○ Or run them in Kubernetes
● In-memory ML
○ Use cgroups
● Webapps
○ Use cgroups
Time for the Lab!

Refer to the Lab Manual for the exercises


Lab 1: Fixing FEK OOM Issues

Lab 2: Setting Up Cgroups
Lab 3: Validating Cgroups
Lab 4: Fixing Backend OOM Issues (Optional)

The End!
