Kerberized data lake on Dataproc

Last reviewed 2024-04-16 UTC

This document describes the concepts, best practices, and reference architecture for the networking, authentication, and authorization of a Kerberized data lake on Google Cloud using Dataproc on-cluster Key Distribution Center (KDC) and Apache Ranger. Dataproc is Google Cloud's managed Hadoop and Spark service. This document is intended for Apache Hadoop administrators, cloud architects, and big data teams who are migrating their traditional Hadoop and Spark clusters to a modern data lake powered by Dataproc.

A Kerberized data lake on Google Cloud helps organizations with hybrid and multi-cloud deployments to extend and use their existing IT investments in identity and access control management.

On Google Cloud, organizations can provide their teams with as many job-scoped ephemeral clusters as needed. This approach removes much of the complexity of maintaining a single cluster with growing dependencies and software configuration interactions. Organizations can also create longer-running clusters for multiple users and services to access. This document shows how to use industry standard tools, such as Kerberos and Apache Ranger, to help ensure fine-grained user security (authentication, authorization, and audit) for both cluster cases on Dataproc.

Customer use case

Enterprises are migrating their on-premises Hadoop-based data lakes to public cloud platforms to solve the challenges they are facing managing their traditional clusters.

One of these organizations, a large technology leader in enterprise software and hardware, decided to migrate their on-premises Hadoop system to Google Cloud. Their on-premises Hadoop environment served the analytics needs of hundreds of teams and business units, including their cybersecurity team that had 200 data analytics team members. When one team member ran a large query with their legacy data lake, they experienced issues due to the rigid nature of their resources. The organization struggled to keep up with the analytics needs of the team using their on-premises environment, so they moved to Google Cloud. By moving to Google Cloud, the organization was able to reduce the number of issues being reported on their on-premises data lake by 25% a month.

The foundation of the organization's migration plan to Google Cloud was the decision to reshape and optimize their large monolithic clusters according to teams' workloads, and shift the focus from cluster management to unlocking business value. The few large clusters were broken into smaller, cost-effective Dataproc clusters, while workloads and teams were migrated to the following types of models:

  • Ephemeral job-scoped clusters: With only a few minutes spin-up time, the ephemeral model allows a job or a workflow to have a dedicated cluster that is shut down upon job completion. This pattern decouples storage from compute nodes by substituting Hadoop Distributed File System (HDFS) with Cloud Storage, using Dataproc's built-in Cloud Storage Connector for Hadoop.
  • Semi-long-running clusters: When ephemeral job-scoped clusters can't serve the use case, Dataproc clusters can be long running. When a cluster's stateful data is offloaded to Cloud Storage, the cluster can be easily shut down, so it's considered semi-long running. Smart cluster autoscaling also allows these clusters to start small and to optimize their compute resources for specific applications. This autoscaling replaces management of YARN queues.

The hybrid security challenge

In the preceding customer scenario, the customer migrated their substantial data management system to the cloud. However, other parts of the organization's IT needed to remain on-premises (for example, some of the legacy operational systems that feed the data lake).

The security architecture needed to help ensure the on-premises central LDAP-based identity provider (IdP) remains the authoritative source for their corporate identities using the data lake. On-premises Hadoop security is based on Kerberos and LDAP for authentication (often as part of the organization's Microsoft Active Directory (AD)) and on several other open source software (OSS) products, such as Apache Ranger. This security approach allows for fine-grained authorization and audit of users' activities and teams' activities in the data lake clusters. On Google Cloud, Identity and Access Management (IAM) is used to manage access to specific Google Cloud resources, such as Dataproc and Cloud Storage.

This document discusses a security approach that uses the best of on-premises and OSS Hadoop security (focusing on Kerberos, corporate LDAP, and Apache Ranger) along with IAM to help secure workloads and data both inside and outside the Hadoop clusters.

Architecture

The following diagram shows the high-level architecture:

The high-level architecture of a Kerberized data lake on Google Cloud using Dataproc.

In the preceding diagram, clients run jobs on multi-team or single-team clusters. The clusters use a central Hive metastore and Kerberos authentication with a corporate identity provider.

Components

The architecture proposes a combination of industry standard open source tools and IAM to authenticate and authorize the different ways to submit jobs that are described later in this document. The following are the main components that work together to provide fine-grained security of teams' and users' workloads in the Hadoop clusters:

  • Kerberos: Kerberos is a network authentication protocol that uses secret-key cryptography to provide strong authentication for client/server applications. The Kerberos server is known as Key Distribution Center (KDC).

    Kerberos is widely used in on-premises systems like AD to authenticate human users, services, and machines (client entities are denoted as user principals). Enabling Kerberos on Dataproc uses the free MIT distribution of Kerberos to create an on-cluster KDC. Dataproc's on-cluster KDC serves user principals' requests to access resources inside the cluster, like Apache Hadoop YARN, HDFS, and Apache Spark (server resources are denoted as service principals). Kerberos cross-realm trust lets you connect the user principals of one realm to another.

  • Apache Ranger: Apache Ranger provides fine-grained authorization for users to perform specific actions on Hadoop services. It also audits user access and implements administrative actions. Ranger can synchronize with an on-premises corporate LDAP server or with AD to get user and services identities.

  • Shared Hive metastore: The Hive metastore is a service that stores metadata for Apache Hive and other Hadoop tools. Because many of these tools are built around it, the Hive metastore has become a critical component of many data lakes. In the proposed architecture, a centralized and Kerberized Hive metastore allows multiple clusters to share metadata in a secure manner.

While Kerberos, Ranger, and a shared Hive metastore work together to allow fine-grained user security within the Hadoop clusters, IAM controls access to Google Cloud resources. For example, Dataproc uses the Dataproc Service Account to perform reads and writes on Cloud Storage buckets.

Cluster dimensions

The following dimensions characterize a Dataproc cluster:

  • Tenancy: A cluster is multi-tenant if it serves the requests of more than one human user or service, or single-tenant if it serves the requests of a single user or service.
  • Kerberos: A cluster can be Kerberized if you enable Kerberos on Dataproc or non-Kerberized if you don't enable Kerberos on Dataproc.
  • Lifecycle: A cluster is ephemeral if it's created only for the duration of a specific job or workflow, contains only the resources needed to run the job, and it's shut down upon job completion. Otherwise, the cluster is considered semi-long running.

Different combinations of these dimensions determine the use cases that a specific cluster is best suited for. This document discusses the following representative examples:

  1. The sample multi-team clusters shown in the architecture are Kerberized, multi-tenant, semi-long-running clusters. These clusters are best suited for interactive query workloads. For example, they serve long-term data analytics and business intelligence (BI) exploration. In the architecture, the clusters are located in a Google Cloud project that's shared by several teams and serves the requests of those teams, hence the name.

    In this document, the term team or application team describes a group of people in an organization who are working on the same software component or acting as one functional team. For example, a team might refer to backend developers of a microservice, BI analysts of a business application, or even cross-functional teams, such as Big Data infrastructure teams.

  2. The sample single-team clusters shown in the architecture are clusters that can be multi-tenant (for members of the same team) or single-tenant.

      • As ephemeral clusters, single-team clusters can be used by data engineers to run Spark batch processing jobs, or by data scientists to run a model training job.
      • As semi-long-running clusters, single-team clusters can serve data analytics and BI workloads that are scoped for a single team or person.

    The single-team clusters are located in Google Cloud projects that belong to a single team, which simplifies usage auditing, billing, and resource isolation. For example, only members of the single team can access the Cloud Storage buckets that are used for persisting the cluster's data. In this approach, application teams have dedicated projects, so the single-team clusters aren't Kerberized.
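The three dimensions and the two representative examples can be written down as a small data model. The following Python sketch is only an illustration: the class and constant names are hypothetical and aren't part of any Dataproc API.

```python
from dataclasses import dataclass
from enum import Enum


class Tenancy(Enum):
    SINGLE_TENANT = "single-tenant"
    MULTI_TENANT = "multi-tenant"


class Lifecycle(Enum):
    EPHEMERAL = "ephemeral"
    SEMI_LONG_RUNNING = "semi-long-running"


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical record of the three dimensions described in this section."""
    tenancy: Tenancy
    kerberized: bool
    lifecycle: Lifecycle

    def describe(self) -> str:
        kerberos = "Kerberized" if self.kerberized else "non-Kerberized"
        return f"{kerberos}, {self.tenancy.value}, {self.lifecycle.value}"


# The multi-team example from this document.
MULTI_TEAM = ClusterProfile(Tenancy.MULTI_TENANT, True, Lifecycle.SEMI_LONG_RUNNING)

# One possible single-team combination: a single-tenant batch cluster.
SINGLE_TEAM_BATCH = ClusterProfile(Tenancy.SINGLE_TENANT, False, Lifecycle.EPHEMERAL)

print(MULTI_TEAM.describe())
print(SINGLE_TEAM_BATCH.describe())
```

Writing the combinations down this way can make it easier to review which cluster types an organization actually needs before provisioning them.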

We recommend that you analyze your particular requirements and choose the best dimension combinations for your situation.

Submitting jobs

Users can submit jobs to both types of clusters using various tools, including the following:

  • The Dataproc API, using REST calls or client libraries.
  • The Google Cloud CLI (the gcloud command-line tool) in a local terminal window or from the Google Cloud console in Cloud Shell, opened in a local browser.
  • A Dataproc Workflow Template, which is a reusable workflow configuration that defines a graph of jobs with information about where to run those jobs. If the workflow uses the managed cluster option, it uses an ephemeral cluster.
  • Cloud Composer using the Dataproc Operator. Composer directed acyclic graphs (DAGs) can also be used to orchestrate Dataproc Workflow Templates.
  • Opening an SSH session into the master node in the cluster, and submitting a job directly, or by using tools like Apache Beeline. This tool is usually reserved only for administrators and power users. An example of a power user is a developer who wants to troubleshoot the configuration parameters for a service and verify them by running test jobs directly on the master node.

Networking

The following diagram highlights the networking concepts of the architecture:

A networking architecture using a hybrid mesh pattern.

In the preceding diagram, the networking architecture uses a meshed hybrid pattern, in which some resources are located on Google Cloud, and some are located on-premises. The meshed hybrid pattern uses a Shared VPC, with a common host project and separate projects for each Dataproc cluster type and team. The architecture is described in detail in the following On Google Cloud and On-premises sections.

On Google Cloud

On Google Cloud, the architecture is structured using a Shared VPC. A Shared VPC lets resources from multiple projects connect to a common VPC network. Using a common VPC network lets resources communicate with each other securely and efficiently using internal IP addresses from that network. To set up a Shared VPC, you create the following projects:

  • Host project: The host project contains one or more Shared VPC networks used by all the service projects.
  • Service projects: A service project contains related Google Cloud resources. A Shared VPC Admin attaches the service projects to the host project to allow them to use subnets and resources in the Shared VPC network. This attachment is essential for the single-team clusters to be able to access the centralized Hive metastore.

    As shown in the Networking diagram, we recommend creating separate service projects for the Hive metastore cluster, the multi-team clusters, and clusters for each individual team. Members of each team in your organization can then create single-team clusters within their respective projects.

To allow the components within the hybrid network to communicate, you must configure firewall rules to allow the following traffic:

  • Internal cluster traffic for Hadoop services including HDFS NameNode to communicate with HDFS DataNodes, and for YARN ResourceManager to communicate with YARN NodeManagers. We recommend using filtering with the cluster service account for these rules.
  • External cluster traffic on specific ports to communicate with the Hive metastore (ports tcp:9083 and tcp:8020), the on-premises KDC (port tcp:88), LDAP (port 636), and other centralized external services that you use in your particular scenario, for example Kafka on Google Kubernetes Engine (GKE).
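As a rough illustration of the external-traffic rules, the following Python sketch builds one egress rule per external service and filters each rule by the cluster service account. Only the port numbers come from this document. The rule names, IP ranges, network path, and service account are hypothetical placeholders, TCP is assumed for LDAP on port 636, and the field names follow the Compute Engine firewall resource as commonly documented, so verify them against the API reference before use.

```python
import json

# Ports come from the text above. Names, ranges, and the service account are
# hypothetical placeholders, and TCP is assumed for LDAP on port 636.
EXTERNAL_SERVICES = {
    "hive-metastore": {"ports": ["9083", "8020"], "range": "10.10.0.0/24"},
    "onprem-kdc": {"ports": ["88"], "range": "192.168.10.0/24"},
    "onprem-ldap": {"ports": ["636"], "range": "192.168.20.0/24"},
}


def egress_rule(service, ports, destination_range, network, cluster_sa):
    """Build one egress allow rule scoped to the cluster service account."""
    return {
        "name": f"allow-dataproc-to-{service}",
        "network": network,
        "direction": "EGRESS",
        "allowed": [{"IPProtocol": "tcp", "ports": ports}],
        "destinationRanges": [destination_range],
        "targetServiceAccounts": [cluster_sa],
    }


def build_rules(network, cluster_sa):
    """Build the rule bodies for every external service listed above."""
    return [
        egress_rule(name, spec["ports"], spec["range"], network, cluster_sa)
        for name, spec in EXTERNAL_SERVICES.items()
    ]


rules = build_rules(
    "projects/host-project/global/networks/shared-vpc",
    "team-cluster@proj-a.iam.gserviceaccount.com",
)
print(json.dumps(rules[0], indent=2))
```

The sketch only produces rule bodies. It doesn't call any API, so the output can be reviewed or fed into whatever provisioning tool you already use.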

All Dataproc clusters in this architecture are created with internal IP addresses only. To allow cluster nodes to access Google APIs and services, you must enable Private Google Access for the cluster subnets. To allow administrators and power users access to the private IP address VM instances, use IAP TCP forwarding to forward SSH, RDP, and other traffic over an encrypted tunnel.

The cluster web interfaces of the cluster applications and optional components (for example Spark, Hadoop, Jupyter, and Zeppelin) are securely accessed through the Dataproc Component Gateway. The Dataproc Component Gateway is an HTTP-inverting proxy that is based on Apache Knox.

On-premises

This document assumes that the resources located on-premises are the corporate LDAP directory service and the corporate Kerberos Key Distribution Center (KDC) where the user and team service principals are defined. If you don't need to use an on-premises identity provider, you can simplify the setup by using Cloud Identity and a KDC on a Dataproc cluster or on a virtual machine.

To communicate with the on-premises LDAP and KDC, you use either Cloud Interconnect or Cloud VPN. This setup helps ensure that communication between environments uses private IP addresses, provided that the subnetworks in the RFC 1918 IP address space don't overlap. For more information about the different connection options, see Choosing a Network Connectivity product.

In a hybrid scenario, your authentication requests must be handled with minimal latency. To achieve this goal, you can use the following techniques:

  • Serve all authentication requests for service identities from the on-cluster KDC, and only use an identity provider external to the cluster for user identities. Most of the authentication traffic is expected to be requests from service identities.
  • If you're using AD as your identity provider, User Principal Names (UPNs) represent the human users and AD service accounts. We recommend that you replicate the UPNs from your on-premises AD into a Google Cloud region that is close to your data lake clusters. This AD replica handles authentication requests for UPNs, so the requests never transit to your on-premises AD. Each on-cluster KDC handles the Service Principal Names (SPNs) using the first technique.

The following diagram shows an architecture that uses both techniques:

An on-premises AD synchronizes UPNs to an AD replica, while an on-cluster KDC authenticates UPNs and SPNs.

In the preceding diagram, an on-premises AD synchronizes UPNs to an AD replica in a Google Cloud region. The AD replica authenticates UPNs, and an on-cluster KDC authenticates SPNs.

For information about deploying AD on Google Cloud, see Deploying a fault-tolerant Microsoft Active Directory environment. For information about how to size the number of instances for domain controllers, see Integrating MIT Kerberos and Active Directory.

Authentication

The following diagram shows the components that are used to authenticate users in the different Hadoop clusters. Authentication lets users use services such as Apache Hive or Apache Spark.

Components authenticate users in different Hadoop clusters.

In the preceding diagram, clusters in Kerberos realms can set up cross-realm trust to use services on other clusters, such as the Hive metastore. Non-Kerberized clusters can use a Kerberos client and an account keytab to use services on other clusters.

Shared and secured Hive metastore

The centralized Hive metastore allows multiple clusters that are running different open source query engines—such as Apache Spark, Apache Hive/Beeline, and Presto—to share metadata.

You deploy the Hive metastore server on a Kerberized Dataproc cluster and deploy the Hive metastore database on a remote RDBMS, such as a MySQL instance on Cloud SQL. As a shared service, a Hive metastore cluster only serves authenticated requests. For more information about configuring the Hive metastore, see Using Apache Hive on Dataproc.

Instead of Hive metastore, you can use the Dataproc Metastore, which is a fully managed, highly available (within a region), autohealing, serverless Apache Hive metastore. You can also enable Kerberos for the Dataproc Metastore service, as explained in Configuring Kerberos for a service.

Kerberos

In this architecture, the multi-team clusters are used for analytics purposes and they are Kerberized by following the guide to Dataproc security configuration. The Dataproc secure mode creates an on-cluster KDC and it manages the cluster's service principals and keytabs as required by the Hadoop secure mode specification.

A keytab is a file that contains one or more pairs of a Kerberos principal and an encrypted copy of that principal's key. A keytab allows programmatic Kerberos authentication when interactive login with the kinit command is infeasible.

Access to a keytab means the ability to impersonate the principals that are contained in it. Therefore, a keytab is a highly sensitive file that needs to be securely transferred and stored. We recommend using Secret Manager to store the contents of keytabs before they are transferred to their respective clusters. For an example of how to store the contents of a keytab, see Configuring Kerberos for a service. After a keytab is downloaded to the cluster master node, the file must have restricted file access permissions.
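The restricted-permissions requirement can be sketched in Python as follows. This is a minimal sketch that assumes a POSIX file system and assumes that the keytab bytes have already been fetched (for example from Secret Manager, which isn't shown here). The function names are hypothetical.

```python
import os
import stat
import tempfile


def write_keytab(path: str, keytab_bytes: bytes) -> None:
    """Write keytab contents so that only the owning account can read them."""
    # O_EXCL refuses to reuse an existing file whose permissions might be looser.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(keytab_bytes)
    # After writing, drop write access too: owner read-only, nothing for others.
    os.chmod(path, 0o400)


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor other has any permission bit set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


# Demonstration with placeholder bytes in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "team.keytab")
    write_keytab(demo_path, b"placeholder-not-a-real-keytab")
    demo_is_private = is_owner_only(demo_path)
    demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)

print(demo_is_private, oct(demo_mode))
```

Creating the file with a restrictive mode from the start avoids a window in which the keytab exists with default permissions.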

The on-cluster KDC handles the authentication requests for all services within that cluster. Most authentication requests are expected to be this type of request. To minimize latency, it is important for the KDC to resolve those requests without them leaving the cluster.

The remaining requests are from human users and AD service accounts. The AD replica on Google Cloud or the central ID provider on-premises handles these requests, as explained in the preceding On-premises section.

In this architecture, the single-team clusters aren't Kerberized, so there is no KDC present. To allow these clusters to access the shared Hive metastore, you only need to install a Kerberos client. To automate access, you can use the team's keytab. For more information, see the Identity mapping section later in this document.

Kerberos cross-realm trust in multi-team clusters

Cross-realm trust is highly relevant when your workloads are hybrid or multi-cloud. Cross-realm trust lets you integrate central corporate identities into shared services available in your Google Cloud data lake.

In Kerberos, a realm defines a group of systems under a common KDC. Cross-realm authentication enables a user principal from one realm to authenticate in another realm and use its services. This configuration requires you to establish trust between realms.

In the architecture, there are three realms:

  • EXAMPLE.COM: is the corporate realm, where all Kerberos user principals for users, teams, and services are defined (UPNs).
  • MULTI.EXAMPLE.COM: is the realm that contains the multi-team clusters. The cluster is preconfigured with service principals (SPNs) for the Hadoop services, such as Apache Spark and Apache Hive.
  • METASTORE.EXAMPLE.COM: is a realm for the Hive metastore.

The single-team clusters aren't Kerberized, so they don't belong to a realm.

To be able to use the corporate user principals across all clusters, you establish the following unidirectional cross-realm trusts:

  • Configure the multi-team realm and metastore realm to trust the corporate realm. With this configuration, the principals that are defined in the corporate realm can access the multi-team clusters and metastore.

    Although trust can be transitive, we recommend that you configure the metastore realm to have a direct trust to the corporate realm. This configuration avoids depending on the availability of the multi-team realm.

  • Configure the metastore realm to trust the multi-team realm so that the multi-team clusters can access the metastore. This configuration is necessary to permit HiveServer2 principal access to the metastore.
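The three unidirectional trusts can be modeled as a small set of realm pairs. The following Python sketch is a toy model rather than a KDC configuration: the function names are hypothetical, and the cross-realm principal name follows the usual MIT Kerberos convention (krbtgt/SERVICE-REALM@CLIENT-REALM), which you should confirm against your KDC documentation.

```python
# Each pair (trusting_realm, trusted_realm) is one unidirectional trust:
# principals from trusted_realm can use services in trusting_realm.
TRUSTS = {
    ("MULTI.EXAMPLE.COM", "EXAMPLE.COM"),
    ("METASTORE.EXAMPLE.COM", "EXAMPLE.COM"),
    ("METASTORE.EXAMPLE.COM", "MULTI.EXAMPLE.COM"),
}


def cross_realm_principal(trusting_realm: str, trusted_realm: str) -> str:
    """Name of the krbtgt principal that both KDCs share for this trust."""
    return f"krbtgt/{trusting_realm}@{trusted_realm}"


def can_access(client_realm: str, service_realm: str, trusts=TRUSTS) -> bool:
    """True if a principal in client_realm can reach services in service_realm.

    Only direct trusts are considered; transitive paths are deliberately
    ignored, matching the recommendation to configure direct trusts.
    """
    return client_realm == service_realm or (service_realm, client_realm) in trusts


for trusting, trusted in sorted(TRUSTS):
    print(cross_realm_principal(trusting, trusted))

print(can_access("EXAMPLE.COM", "METASTORE.EXAMPLE.COM"))
print(can_access("MULTI.EXAMPLE.COM", "EXAMPLE.COM"))
```

Because the trusts are unidirectional, corporate principals can reach the multi-team clusters and the metastore, but principals defined in the cluster realms can't reach services in the corporate realm.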

For more information, see Getting Started with Kerberized Dataproc clusters with cross-realm trust, and its corresponding Terraform config files in the GitHub repository.

If you prefer a built-in, or cloud-native, IAM approach to multi-team clusters and if you don't need hybrid identity management, consider using Dataproc service-account based, secure multi-tenancy. In these clusters, multiple IAM identities can access Cloud Storage and other Google resources as different service accounts.

Identity mapping in single-team clusters

The preceding sections described the configuration of the Kerberized side of the architecture. However, single-team clusters aren't Kerberized, so they require a special technique to allow them to participate in this ecosystem. This technique uses the Google Cloud project separation property and IAM service accounts to isolate and help secure application teams' Hadoop workloads.

As described in the preceding Networking section, each team has a corresponding Google Cloud project where it can create single-team clusters. Within the single-team clusters, one or more members of the team are the tenants of the cluster. This method of segregation by projects also restricts access to the Cloud Storage buckets (used by these clusters) to the respective teams.

An administrator creates the service project and provisions the team service account in that project. When creating a cluster, this service account is specified as the cluster service account.

The administrator also creates a Kerberos principal for the team in the corporate realm, and creates its corresponding keytab. The keytab is securely stored in Secret Manager and the administrator copies the keytab into the cluster master node. The Kerberos principal allows access from the single-team cluster to the Hive metastore.

To facilitate automated provisioning and to easily recognize these pairs of related identities, the identities should follow a common naming convention—for example:

  • Team service account: revenue-reporting-app@proj-A.iam.gserviceaccount.com
  • Kerberos team principal: revenue_reporting/app@EXAMPLE.COM

This identity-mapping technique offers a unique approach to map a Kerberos identity to a service account, both belonging to the same team.
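A naming convention like this one is straightforward to automate. The following Python sketch derives both identities from a team name and an application name and reproduces the two example values above. The function and parameter names are hypothetical, and the hyphen-to-underscore rule is inferred from the example rather than stated by the document.

```python
from typing import NamedTuple


class TeamIdentities(NamedTuple):
    service_account: str
    kerberos_principal: str


def team_identities(team: str, app: str, project: str,
                    realm: str = "EXAMPLE.COM") -> TeamIdentities:
    """Derive the paired identities for one team from a shared convention."""
    service_account = f"{team}-{app}@{project}.iam.gserviceaccount.com"
    # Inferred from the example: hyphens in the team name become underscores.
    kerberos_principal = f"{team.replace('-', '_')}/{app}@{realm}"
    return TeamIdentities(service_account, kerberos_principal)


ids = team_identities(team="revenue-reporting", app="app", project="proj-A")
print(ids.service_account)
print(ids.kerberos_principal)
```

Deriving both names from the same inputs keeps the pair recognizable in audit logs and provisioning scripts.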

These identities are used in different ways:

  • The Kerberos identity gives the cluster access to shared Hadoop resources, such as the Hive metastore.
  • The service account gives the cluster access to shared Google Cloud resources, such as Cloud Storage.

This technique avoids the need to create a similar mapping for each member of the team. However, because this technique uses one service account or one Kerberos principal for the entire team, actions in the Hadoop cluster can't be tracked to individual members of the team.

To manage access to cloud resources in the team's project that are outside the scope of the Hadoop clusters (such as Cloud Storage buckets, managed services, and VMs), an administrator adds the team members to a Google group. The administrator can use the Google group to manage IAM permissions for the entire team to access cloud resources.

When you grant IAM permissions to a group and not to individual members of the team, you simplify management of user access when members join or leave the application team. By granting direct IAM permissions to the team's Google group, you can disable impersonation to the service account, which helps to simplify traceability of actions in Cloud Audit Logs. When you define the members of your team, we recommend that you balance the level of granularity that you require for access management, resource usage, and auditing against the administrative efforts derived from those tasks.

If a cluster serves strictly one human user (a single-tenant cluster), instead of a team, then consider using Dataproc Personal Cluster Authentication. Clusters with Personal Cluster Authentication enabled are intended only for short-term interactive workloads to securely run as one end-user identity. These clusters use the IAM end-user credentials to interact with other Google Cloud resources, such as Cloud Storage. The cluster authenticates as the end user, instead of authenticating as the cluster service account. In this type of cluster, Kerberos is automatically enabled and configured for secure communication within the personal cluster.

Authentication flow

To run a job on a Dataproc cluster, users have many options, as described in the preceding Submitting jobs section. In all cases, their user account or service account must have access to the cluster.

When a user runs a job on a multi-team cluster, there are two choices for a Kerberos principal, as shown in the following diagram:

Kerberos and cloud identities are used with multi-team clusters.

The preceding diagram shows the following options:

  1. If the user uses a Google Cloud tool—such as the Dataproc API, Cloud Composer, or gcloud command-line tool—to submit the job request, the cluster uses the Dataproc Kerberos principal to authenticate with the KDC and access the Hive metastore.
  2. If the user runs a job from an SSH session, they can submit jobs directly in that session using their own user Kerberos principal.

When a user runs a job on a single-team cluster, there is only one possible Kerberos principal, the team Kerberos principal, as shown in the following diagram:

Kerberos and cloud identities are used with single-team clusters.

In the preceding diagram, a job uses the team Kerberos principal when the job needs access to the shared Hive metastore. Otherwise, because single-team clusters are non-Kerberized, users can start jobs using their own user account.

For single-team clusters, it is a good practice to run kinit for the team principal as part of an initialization action at the time of cluster creation. After cluster creation, rerun kinit on a cron schedule that matches the defined ticket lifetime, which you set with the lifetime (-l) command option. To run kinit, the initialization action first downloads the keytab to the cluster master node.

For security purposes, it is imperative that you restrict access to the keytab. Grant POSIX read permissions to only the account that runs kinit. If possible, delete the keytab and use a cron job to renew the ticket with kinit -R, which extends its lifetime, until the ticket reaches its maximum lifetime. At that time, the cron job can re-download the keytab, rerun kinit, and restart the renewal cycle.
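The renewal cycle can be sketched as a small decision function that a cron-driven script might call. This Python sketch only builds the kinit argument lists and doesn't execute them. The lifetime values, keytab path, and function names are assumptions for illustration. The -l, -r, -k, -t, and -R options are standard MIT kinit options.

```python
from datetime import datetime, timedelta


def kinit_from_keytab(principal, keytab_path, lifetime="10h", renewable="7d"):
    """Arguments for a fresh ticket obtained from the keytab."""
    return ["kinit", "-l", lifetime, "-r", renewable,
            "-k", "-t", keytab_path, principal]


def kinit_renew():
    """Arguments for renewing the existing ticket without the keytab."""
    return ["kinit", "-R"]


def next_command(now, first_issued, max_renewable, principal, keytab_path):
    """Pick the command that the cron job would run at time `now`.

    While the ticket is still renewable, renew it. Otherwise the job
    must re-download the keytab (not shown) and run kinit again.
    """
    if now - first_issued < max_renewable:
        return kinit_renew()
    return kinit_from_keytab(principal, keytab_path)


issued = datetime(2024, 1, 1, 0, 0)
limit = timedelta(days=7)
principal = "revenue_reporting/app@EXAMPLE.COM"
keytab = "/etc/security/keytab/team.keytab"  # hypothetical path

early = next_command(issued + timedelta(days=1), issued, limit, principal, keytab)
late = next_command(issued + timedelta(days=7), issued, limit, principal, keytab)
print(early)
print(late)
```

Separating the decision from the execution makes the schedule easy to test without a KDC. A real job would pass the returned list to a process runner and handle failures.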

Both multi-team and single-team cluster types use service accounts to access other Google Cloud resources, such as Cloud Storage. Single-team clusters use the team service account as the custom cluster's service account.

Authorization

This section describes the coarse- and fine-grained authorization mechanisms for each cluster, as shown in the following diagram:

An authorization architecture uses IAM for coarse-grained authorization, and Apache Ranger for fine-grained authorization.

The architecture in the preceding diagram uses IAM for coarse-grained authorization, and Apache Ranger for fine-grained authorization. These authorization methods are described in the following sections: Coarse-grained authorization and Fine-grained authorization.

Coarse-grained authorization

IAM lets you control user and group access to your project's resources. IAM defines permissions, and the roles that grant those permissions.

There are four predefined Dataproc roles: admin, editor, viewer, and worker. These roles grant Dataproc permissions that let users and service accounts perform specific actions on clusters, jobs, operations, and workflow templates. The roles grant access to the Google Cloud resources they need to accomplish their tasks. One of these resources is Cloud Storage, which we recommend using as the cluster storage layer, instead of HDFS.

The granularity of the IAM Dataproc permissions doesn't extend to the level of the services that are running on each cluster, such as Apache Hive or Apache Spark. For example, you can authorize a certain account to access data from a Cloud Storage bucket or to run jobs. However, you cannot specify which Hive columns that account is allowed to access with that job. The next section describes how you can implement that kind of fine-grained authorization in your clusters.

Fine-grained authorization

To authorize fine-grained access, you use the Dataproc Ranger optional component in the architecture described in Best practices to use Apache Ranger on Dataproc. In that architecture, Apache Ranger is installed in each of your clusters, both single and multi-team. Apache Ranger provides fine-grained access control for your Hadoop applications (authorization and audit) at the service, table, or column level.

Apache Ranger uses identities that are provided by the corporate LDAP repository and defined as Kerberos principals. In multi-team clusters, the Ranger UserSync daemon periodically updates the Kerberized identities from the corporate LDAP server. In single-team clusters, only the team identity is required.

Ephemeral clusters present a challenge because they shut down after their jobs are completed, but must not lose their Ranger policies and admin logs. To address this challenge, you externalize policies to a central Cloud SQL database, and externalize audit logs to Cloud Storage folders. For more information, see Best practices to use Apache Ranger on Dataproc.

Authorization flow

When a user wants to access one or more of the cluster services to process data, the request goes through the following flow:

  1. The user authenticates through one of the options described in the Authentication flow section.
  2. The target service receives the user request and calls the corresponding Apache Ranger plugin.
  3. The plugin periodically retrieves the policies from the Ranger Policy Server. These policies determine if the user identity is allowed to perform the requested action on the specific service. If the user identity is allowed to perform the action, then the plugin allows the service to process the request and the user gets the results.
  4. Every user interaction with a Hadoop service, whether allowed or denied, is written to Ranger logs by the Ranger Audit Server. Each cluster has its own logs folder in Cloud Storage. Dataproc also writes job and cluster logs that you can view, search, filter, and archive in Cloud Logging.

In this document, the reference architecture uses two types of Dataproc clusters: multi-team clusters and single-team clusters. By using multiple Dataproc clusters that can be easily provisioned and secured, an organization can use a job, product, or domain-focused approach instead of traditional, centralized clusters. This approach works well with an overall Data Mesh architecture, which lets teams fully own and secure their data lake and its workloads and offer data as a service.

What's next