HashiCorp Vault
Preface
Day Zero
Vault Internals and Key Cryptography Principles
Vault Solution Architecture
Single Site
Multiple sites
Day One
Deployment Patterns
Monitoring
Telemetry
Audit
Backup and Restore
Failure Scenarios
Initialization Ceremony
Initialization
Basic Configuration
Root Token Revocation
Organizational Roles
Day Two
Namespacing
Secure Introduction
Storing and Retrieving Secrets
API
Non-invasive patterns
Response wrapping
Encryption as a Service
Day N
Storage Key Rotation
Master Key Rotation
Seal Key Rotation
DR Promotion
Policy Maintenance Patterns
Application onboarding
__
Parts of this document are intended to be foundations for an organizational runbook, describing certain
procedures commonly run in Vault.
__
§ Vault’s primary interface is a RESTful HTTP API. Both the CLI and the web UI interact with Vault
through this same API, and a developer would use it for programmatic access. There is no way to
expose functionality in Vault other than through this API.
§ In order to consume secrets, clients (either users or applications) need to establish their identity. While
Vault supports the common authentication platforms for users, such as LDAP or Active Directory, it
also adds methods of programmatically establishing an application identity based on platform
trust, leveraging AWS IAM, Google IAM, Azure Application Groups, TLS certificates, and Kubernetes
namespaces, among others. Upon authentication, and based on identity attributes like group
membership or project name, Vault grants a short-lived token aligned with a specific set of policies.
§ Policies in Vault are decoupled from identities and define what set of secrets a particular identity can
access. They are defined as code in HashiCorp Configuration Language (HCL). Rich policy can be
defined using Sentinel rules1, which are designed to answer under what condition an entity gets
access to a secret, rather than the traditional who-gets-access-to-what-secret defined in regular
ACLs.
1 https://www.vaultproject.io/docs/enterprise/sentinel/index.html
__
From a data organization perspective, Vault has a pseudo-hierarchical API path, in which top-level engines can
be mounted to store or generate certain secrets, providing either an arbitrary path (e.g.
/secret/sales/password) or a predefined path for dynamic secrets (e.g. /pki/issue/internal).
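As an illustrative sketch of these two path styles (the secret values and the PKI role name are examples, assuming a KV engine at secret/ and a PKI engine with a configured role):

```shell
# Store and read a static secret under an arbitrary path (KV engine):
vault kv put secret/sales/password value="example-password"
vault kv get secret/sales/password

# Request a dynamic secret from a predefined path (PKI engine,
# assuming a role named "internal" has been configured):
vault write pki/issue/internal common_name="app.example.com"
```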
No secret leaves the cryptographic barrier (which only exists in volatile memory on the Vault host) in
unencrypted form. This barrier is protected by a set of keys. Information served through the HTTP API is
encrypted using TLS. Information stored in the storage backend is encrypted using a Storage Key, which is
in turn encrypted by a Master Key, which is optionally encrypted by an external Seal that eases operations
and automates the unsealing procedure2, as shown in the diagram below. The same seal can also be used to
wrap individual values for additional protection (Seal Wrap3).
2 https://www.vaultproject.io/docs/internals/architecture.html
__
Single Site
When it comes to a single site, a production deployment of Vault is a clustered unit that generally consists
of three nodes. In its open source variant, one node is automatically designated as "Active", while the
remaining two are in "Standby" mode (not actively providing service). When using HashiCorp Consul, the
recommended storage backend, these "Standby" nodes take the "Active" role automatically upon orderly
shutdown of the "Active" node (in which case failover is immediate) or upon failure (failover under an
unclean shutdown takes up to 45 seconds). When Vault Enterprise Premium is being used, the standby nodes
can be set up as "Performance Standbys", which enables them to scale the cluster horizontally by
answering certain queries directly, while transparently forwarding others to the Active node in the
cluster. As mentioned before, there is a hard limitation in terms of latency between Active and Standby
nodes.
3 https://www.vaultproject.io/docs/enterprise/sealwrap/index.html
__
HashiCorp Consul must maintain a quorum at all times to provide services (a majority of nodes/votes,
i.e. (N/2)+1), and quorum loss is not an acceptable operational scenario; as such, there are a number of
strategies and Autopilot4 features available in the product to ensure service continuity and scaling, such as
Non-Voting Members, Redundancy Zones and Upgrade Migrations.
The most common single site Vault deployment would look as follows:
4 https://www.consul.io/docs/guides/autopilot.htm
__
The minimum requirement from a resiliency perspective is to provision a Disaster Recovery (DR) Replica,
which is a warm standby holding a complete copy of everything. A DR Replica is not able to answer
requests until promoted. From an operational perspective, a DR Replica would provide service continuity in
case of:
§ Data tampering: unintended or intentional manipulation of the storage backend, which holds binary
encrypted data. This risk can be somewhat mitigated through backup / restore techniques, at a
considerable cost to RPO and RTO to which a DR Replica is not subject.
§ Seal tampering: the seal key, being integral to Vault's cryptographic assurances, is not recoverable. If the
key were to be deleted or tampered with, Vault would not be able to recover. A DR Replica has a
separate seal, to which access should be segregated.
§ Infrastructure failure: If the Primary cluster were to become unavailable from a Network or Compute
perspective, a DR Replica can be promoted to provide service continuity with no impact.
More commonly, organizations with a presence in multiple geographical locations will create a set of
Performance Replicas, spread geographically. While this setup provides replication of secrets and
configuration across locations, most importantly it unifies a single Keyring (Master Key & Storage Key) across
locations, which is paramount to reducing operational overhead when using Vault at scale.
In Performance Replication mode5, selected paths are replicated to different clusters, and operations are
carried out in Secondary clusters as much as possible to minimize latency. Both replication modes can be
combined, allowing a broad mix of operational efficiency and resiliency, as illustrated in the
attached diagram.
5 https://www.vaultproject.io/docs/enterprise/replication/index.html
__
Deployment Patterns
The preferred deployment pattern for the recommended architecture is to perform immutable builds of
HashiCorp Vault and Consul, and perform a Blue/Green deployment6. As such, tools like HashiCorp Packer7
are recommended to build immutable images for different platforms, and HashiCorp provides a number of
examples8 in regards to how to build these elements through existing CI/CD orchestration.
HashiCorp recognizes that some organizations have not adopted those patterns, and as such there
are a number of configuration management patterns to install, upgrade and configure Vault, available in
different repositories.
It is recommended to restrict SSH access to Vault servers, as there are a number of sensitive items stored in
volatile memory on a system.
6 Blue/Green deployment refers to the pattern where a straight copy of the infrastructure gets deployed ("Blue") and the previous production infrastructure ("Green") gets decommissioned when the service is failed over to the new copy.
7 https://www.packer.io/
8 https://github.com/hashicorp/guides-configuration/tree/master/vault
__
The individual seal status of a node can also be queried, as shown below:
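One way to do so (the node address is an example; /sys/seal-status is an unauthenticated endpoint):

```shell
# CLI, pointing at a specific node rather than the load balancer:
VAULT_ADDR="https://vault-node-1.example.com:8200" vault status

# Equivalent unauthenticated API call:
curl -s https://vault-node-1.example.com:8200/v1/sys/seal-status
```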
__
Telemetry
When running Vault at large scale, it is recommended to profile the performance of the system by using the
Telemetry output of Vault in conjunction with a telemetry analysis solution such as StatsD, Circonus or
Datadog.
Audit
From a Vault perspective, the use of a SIEM system is a requirement in order to keep track of access
brokering. Vault provides audit output through either Syslog, a file, or a Unix socket. Most organizations
parse and evaluate this information through Splunk or tooling from the ELK stack. The audit output is
JSON data, enabling organizations to parse and analyze information with ease.
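Since the output is JSON, filtering it requires no special tooling. The sketch below parses a sample entry with Python's standard library; the field names follow the documented audit format, but the entry is abridged and the values are made up:

```shell
# A sample (abridged, illustrative) audit entry of type "request";
# real entries carry many more fields:
entry='{"type":"request","auth":{"display_name":"ldap-jdoe"},"request":{"operation":"read","path":"secret/sales/password"}}'

# Pull out who performed which operation on which path:
echo "$entry" | python3 -c 'import json,sys; e=json.load(sys.stdin); print(e["auth"]["display_name"], e["request"]["operation"], e["request"]["path"])'
```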
Failure Scenarios
§ In case of an individual node failure, or up to two node failures, the solution will continue to run
without operator intervention.
§ If the Vault cluster fails, it can be re-provisioned using the same Storage Backend configuration.
§ If the Consul cluster were to lose quorum, there are alternatives to regain service availability, although
the recommended approach from an RTO/RPO perspective is to fail over to a DR Cluster or promote a
Performance Replica.
§ If the Seal Key were to be deleted or unavailable, the only supported scenario is failing over to a DR
Cluster or promoting a Performance Replica.
Initialization Ceremony
While every action in Vault can be API-driven, and as such automated, the initialization process guarantees
the cryptographic trust in Vault. The Key Holders, who hold either the Unseal Keys or, most commonly, the
Recovery Keys, are the guardians of Vault's trust. This process is only carried out once for any Vault
installation.
9 https://learn.hashicorp.com/vault/operations/ops-reference-architecture#load-balancing-using-consul-interface
10 https://learn.hashicorp.com/vault/operations/ops-reference-architecture#load-balancing
11 https://learn.hashicorp.com/vault/operations/monitoring
12 https://www.consul.io/docs/commands/snapshot/agent.html
__
It is recommended that the initialization ceremony is carried out in a single room, where an operator and the
Vault key holders are present throughout the process, which would be as follows:
1. The Operator starts the Vault daemon and the initialization process, as described in the
documentation13, providing the public GPG keys from the Keyholders, and the Operator's own public
GPG key for the root token.
2. Vault will return the GPG-encrypted recovery keys, which should be distributed among the keyholders.
3. The operator uses the root token to load the initial policy and configure the first authentication
backend, traditionally Active Directory or LDAP, as well as the Audit backend.
4. The operator validates that they can log into Vault with their directory account and can add further policy.
5. The operator revokes the root token. The Keyholders can now leave the room with the assurance that
no one person has full and unaudited access to Vault.
13 https://www.vaultproject.io/docs/commands/operator/init.html
__
The full set of options for initialization is described in the Vault Documentation quoted in the footnote,
though the following parameters should be considered:
§ -root-token-pgp-key: The Operator Public PGP key that will be used to encrypt the root token
§ -recovery-shares/threshold: Number of keys to provision (based on number of keyholders), and
quorum of key holders needed to perform recovery operations (generally half of the keyholders plus
one)
§ -recovery-pgp-keys: Similar to the parameter above, list of Public PGP keys from the keyholders in
order to provide an encrypted output of the Key Shares
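Taken together, the ceremony might translate to an initialization command along these lines (the share counts and key file names are illustrative):

```shell
vault operator init \
    -root-token-pgp-key="operator.asc" \
    -recovery-shares=5 \
    -recovery-threshold=3 \
    -recovery-pgp-keys="alice.asc,bob.asc,carol.asc,dan.asc,erin.asc"
```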
Basic Configuration
In order to proceed with further configuration without the need of a Root Token, an alternate
authentication method must be configured. Traditionally this is done through the LDAP authentication
backend, configuring an Active Directory / LDAP integration, or more recently through OIDC or Okta support.
A group or claim is then defined as granting certain administrative attributes in Vault.
Furthermore, policy is defined with regards to administrative actions in Vault (like adding mounts, policies,
configuring further authentication, audit, etc.), cryptographic actions (like starting key rotation operations),
and ultimately consumption patterns, which are generally defined at a later time based on requirements. An
example administrative policy could be defined as follows:
## Operations
# CORS configuration
path "/sys/config/cors" {
  capabilities = ["read", "list", "create", "update", "sudo"]
}
# Initialize Vault
path "/sys/init" {
  capabilities = ["read", "update", "create"]
}
# Manage policies
path "/sys/policies*" {
  capabilities = ["read", "list", "create", "update", "delete"]
}
# Manage Mounts
path "/sys/mounts*" {
  capabilities = ["read", "list", "create", "update", "delete"]
}

## Audit
# Remove audit devices
path "/sys/audit/*" {
  capabilities = ["delete"]
}

## Key Officers
path "/sys/generate-root/update" {
  capabilities = ["create", "update"]
}
# Verify update
path "/sys/rekey-recovery-key/verify" {
  capabilities = ["create", "update"]
}
These policies are not exhaustive, and while three profiles are defined here, in most organizations role
segregation runs even deeper.
It is also worth noting that, due to the sensitivity of certain endpoints, most organizations choose to use
Control Groups14 in order to require an approval workflow for some configuration changes, for example when
adding or modifying policy, like in the example below:
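A sketch of such a control group, matching the approvers described in the text (the policy path it protects is an assumption):

```hcl
# Changes to ACL policies require out-of-band approval:
path "sys/policies/acl/*" {
  capabilities = ["create", "update"]
  control_group = {
    # Two approvals from either managers or leads...
    factor "approvers" {
      identity {
        group_names = ["managers", "leads"]
        approvals   = 2
      }
    }
    # ...plus one approval from infosec.
    factor "infosec" {
      identity {
        group_names = ["infosec"]
        approvals   = 1
      }
    }
  }
}
```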
In this way, policy changes would require two approvers from either the “managers” or “leads” group, and one
from the “infosec” group.
14 https://www.vaultproject.io/docs/enterprise/control-groups/index.html
__
An operator along with a quorum of key holders can re-generate the root token in an emergency scenario.
Organizational Roles
In most organizations where Vault has been deployed at scale, there is no requirement for extra staffing or
hiring in order to deploy and run the solution. Vault has no predefined roles and profiles, as the
policy system allows very granular definitions of the duties for different teams, but generally speaking these
have been defined in most organizations as follows:
Consumers: Individuals or teams that require the ability to consume secrets or have a need for a
namespaced Vault capability.
Operators: Individuals or teams who onboard consumers, as well as secret engine capabilities,
policies, namespaces and authentication methods.
Crypto: Individual or teams who manage key rotation and audit policies.
__
Namespacing
Vault Enterprise allows an organization to logically segment a Vault installation.
Traditionally, in organizations where Vault is run as a central capability, teams that are required to
maintain a large set of policies would get their own namespace. This can be further segmented, aligning,
for example, with a set of Kubernetes namespaces.
For some organizations where there is a dedicated team running Vault, namespaces may not add significant
advantages, and could potentially increase complexity.
Namespaces are generally provisioned in an automated manner, in line with team or project onboarding, for
example when a new Kubernetes namespace or AWS account is provisioned.
Secure Introduction
Ahead of being able to consume secrets, a user or an application has to log in to Vault and obtain a short-
lived token. The method used for applications is generally based on either the platform where the application
is running (AWS, GCE, Azure, Kubernetes, OIDC) or the workflow used to deploy it (AppRole with a CI tool,
like Jenkins or CircleCI).
Tokens must be maintained client-side and renewed before they expire. For short-lived workflows,
traditionally tokens would be created with a lifetime matching the average deploy time and left to
expire, securing new tokens with each deployment.
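A sketch of that lifecycle using AppRole (the role and environment variable names are illustrative):

```shell
# Authenticate and obtain a short-lived token:
vault write auth/approle/login \
    role_id="$ROLE_ID" secret_id="$SECRET_ID"

# Renew the token before it expires, when the workflow outlives the TTL:
VAULT_TOKEN="$APP_TOKEN" vault token renew
```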
Most commonly, systems perform the authentication process automatically, though responsibility for carrying
out the process is generally agreed as part of a handover when multiple teams take responsibility for
provisioning a system or deploying an application.
Based on existing attributes (like LDAP Groups, OIDC Claims, IAM roles, Google Project ID, etc...) roles are
created in the different authentication backends, which map to policies (that ultimately grant access to
secrets).
__
Storing and Retrieving Secrets
The most common pattern for adoption is to start by storing secrets in Vault that have traditionally been
spread across files, desktop password management tools, version control systems and configuration
management, and to use some of the consumption patterns described below as the means to get the secrets
to the applications consuming them.
Moving forward, most organizations start adopting Secret Engines as a way to reduce overhead, such as
using PKI as an intermediate CA to generate short-lived certificates automatically, or generating short-lived
database credentials.
API
As described before, Vault provides a RESTful HTTP API that allows applications to consume secrets
programmatically. There are a large number of client libraries and in-language abstractions that allow for
simpler programming.
This is traditionally the most secure pattern to retrieve secrets from Vault, as secrets are generally stored as
a variable in memory and cleaned up when the application is no longer using them.
This pattern is also the most invasive, as it requires modifying the application to retrieve secrets from Vault
(generally upon initialization, or by consuming an external service).
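A minimal sketch of such a programmatic read (the path is an example; the token would have come from a prior login):

```shell
# Read a secret over the API, authenticating with a previously obtained token:
curl -s \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/secret/sales/password"
```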
Non-invasive patterns
HashiCorp provides two applications that provide workflows to consume credentials without modifying the
application, using the most common pattern for secrets consumption, via a configuration file or environment
variables.
These applications either render a configuration file template16 interpolating secrets, or pass environment
variables17 with values obtained from Vault.
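As a sketch of both approaches (template contents, file names and secret paths are illustrative):

```shell
# consul-template renders a template interpolating a secret; config.tpl
# might contain:
#   {{ with secret "secret/sales" }}password = "{{ .Data.password }}"{{ end }}
consul-template -template="config.tpl:config.ini" -once

# envconsul exposes Vault values as environment variables to a child process:
envconsul -secret secret/sales ./app
```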
15 https://learn.hashicorp.com/vault/identity-access-management/iam-secure-intro
16 https://github.com/hashicorp/consul-template
17 https://github.com/hashicorp/envconsul
__
Encryption as a Service
Vault provides cryptographic services, where consumers can simply encrypt / decrypt information by virtue of
an API call, and key lifecycle and management is generally managed by an external team (often assisted by
automation).
Each generated key has separate API paths for management and for each service action (encrypt / decrypt /
sign / verify), allowing policy to be set at a very granular level, aligned to the roles existing in the
organization. As an example, the security department may have the following permissions over a particular
transit mount:
## Crypto officers
# Create key material: non-deletable, not exportable in unencrypted
# fashion, only aes256-gcm96 or rsa-4096
path "/transit/keys/*" {
  capabilities = ["create", "update"]
  allowed_parameters = {
    "allow_plaintext_backup" = ["false"]
    "type" = ["aes256-gcm96", "rsa-4096"]
    "convergent_encryption" = []
    "derived" = []
  }
}
# List keys
path "/transit/keys" {
  capabilities = ["list"]
}
# Rotate Key
path "/transit/keys/*/rotate" {
  capabilities = ["create"]
}

## Consumers
# Encrypt information
path "/transit/encrypt/keyname" {
  capabilities = ["create"]
}
# Decrypt information
path "/transit/decrypt/keyname" {
  capabilities = ["create"]
}
# Rewrap information
path "/transit/rewrap/keyname" {
  capabilities = ["create"]
}
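Consumption against the policy above might look as follows (key name as in the policy; the ciphertext shown is a placeholder, and plaintext is always base64-encoded on the wire):

```shell
# Encrypt a value with the named key:
vault write transit/encrypt/keyname \
    plaintext=$(echo -n "my secret data" | base64)

# Decrypt, passing back the returned ciphertext (illustrative value):
vault write transit/decrypt/keyname \
    ciphertext="vault:v1:abc123..."
```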
__
Storage Key Rotation
The Storage Key encrypts every secret that is stored in Vault, and only lives unencrypted in memory. This key
can be rotated online by simply sending a call to the right API endpoint, or from the CLI:
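Either form of the rotation call, assuming a token whose policy permits it:

```shell
# CLI:
vault operator rotate

# Equivalent API call:
curl -s --header "X-Vault-Token: $VAULT_TOKEN" \
    --request PUT "$VAULT_ADDR/v1/sys/rotate"
```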
This requires the right privileges as set in policy. From the point of rotating the key onwards, every new
secret is encrypted with the new key. This is a fairly straightforward process that most organizations carry
out every six months, unless there is a compromise.
Master Key Rotation
This procedure should be carried out whenever a Key Holder is no longer available for an extended period of
time. Traditionally there is a level of collaboration with a Human Resources department, or alternatively
these procedures already exist for organizations using HSMs and can be leveraged for Vault.
An illustration of the procedure can be found below.
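A sketch of the rekey flow (share counts are illustrative; with an external seal, the target is the recovery keys):

```shell
# The operator starts the rekey operation and obtains a nonce:
vault operator rekey -init -target=recovery \
    -key-shares=5 -key-threshold=3

# Each Key Holder then submits their current share, quoting the nonce:
vault operator rekey -target=recovery -nonce="$NONCE"
```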
__
Seal Key Rotation
For a PKCS#11 seal, traditionally an operator would simply change the key_label and
hmac_key_label parameters.
Upon detecting that the label has changed, and that it does not match the label used to encrypt the Master Key,
Vault will simply re-wrap it:
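The resulting seal stanza might look like this (library path, slot, PIN and labels are examples; the old key must remain available until the re-wrap completes):

```hcl
seal "pkcs11" {
  lib            = "/usr/vault/lib/libCryptoki2_64.so"
  slot           = "0"
  pin            = "AAAA-BBBB-CCCC-DDDD"
  key_label      = "vault_key_2"
  hmac_key_label = "vault_hmac_key_2"
}
```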
__
The existence of the key can be verified by using HSM tooling or the pkcs11-list command:
$ pkcs11-list
Enter Pin:
object[0]: handle 2 class 4 label[11] 'vault_key_2' id[10] 0x3130323636303831...
object[2]: handle 4 class 4 label[16] 'vault_hmac_key_1' id[9] 0x3635313134313231...
object[3]: handle 5 class 4 label[11] 'vault_key_1' id[10] 0x3139303538303233...
DR Promotion
The DR promotion procedure would be carried out to continue service upon a catastrophic failure that leaves
the primary cluster unusable. This procedure shouldn't be carried out while the Primary cluster is in service,
as it may have unintended consequences.
Possible causes to trigger a DR promotion would include, but not be limited to:
§ Primary Datacenter failure
§ Data corruption on Primary Cluster
§ Seal key loss
§ Seal key tampering
This procedure involves a Vault operator and the Key Officers, and is similar to an unsealing or key rotation
in the sense that a quorum of officers needs to approve the operation by submitting their key shares.
1. The operator submits a request to promote the Secondary cluster to Primary. This is one of the only
endpoints available on a DR Cluster, which is otherwise not operational under normal parameters. The
request returns a nonce for the operation.
2. The Key Officers proceed to submit their key shares.
3. Once a quorum of Key Officers has been reached, the operator can retrieve an operation key.
4. The operation key is one-time use, and the operator uses it to start the promotion process.
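The steps above can be sketched as follows (values are placeholders):

```shell
# 1. Start generating a DR operation token; this returns a nonce:
vault operator generate-root -dr-token -init

# 2-3. Each Key Officer submits their share until quorum is reached,
# after which the operation token can be decoded:
vault operator generate-root -dr-token -nonce="$NONCE"

# 4. Use the one-time operation token to promote the secondary:
vault write sys/replication/dr/secondary/promote \
    dr_operation_token="$DR_OPERATION_TOKEN"
```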
__
Policy Maintenance Patterns
Policy templates are also used as a way to reduce the number of policies maintained, by interpolating
values from identity attributes or key/value pairs attached to entities.
As an example, for an entity that contains an "application" key with an "APMA" value:
__
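Such an entity could be created along these lines (the entity name is illustrative):

```shell
vault write identity/entity \
    name="apma-app" \
    metadata=application="APMA"
```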
A policy could exist to give access to a static secret mount that matches the value of a defined key:
path "{{identity.entity.metadata.application}}/*" {
  capabilities = ["read", "create", "update", "delete"]
}
Governance oriented policies can be introduced via Sentinel, as either Role or Endpoint Governing Policies.
Application onboarding
Adding new secrets into Vault and enabling new applications to consume them is the most regular
operation in Vault.
Each organization has its own guidelines in regards to secret generation, consumption, and handover
points, but generally speaking, the following aspects are agreed in advance:
§ What the consumer's involvement in the process will be. Will the consumer be responsible
for authenticating with Vault, securing a token and then obtaining a secret, or is the expectation
that a certain degree of automation is present that secures the token ahead of time? The handover
point needs to be agreed in advance.
§ Whether there is a team handling the runtime, such that the developer has no
involvement, and tools external to the application are used to consume the secret.
§ Whether the consumer is part of a namespace and as such expected to create a role for the new application.
§ If using static secrets, whether the consumer is expected to manually store the secret in Vault, or whether it
will be refreshed by an automated process.
__
§ Set up authentication with Vault, or simply create a role mapping to the policy (unless one of the techniques
mentioned above is used)
§ Define where secrets will be stored
§ Set up the application to consume it
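Put together, a minimal onboarding for a hypothetical application "apma" might look like this (all names are illustrative):

```shell
# Define where secrets will be stored:
vault secrets enable -path=apma kv

# Load the consumption policy:
vault policy write apma-read apma-read.hcl

# Create a role mapping to the policy (AppRole as an example):
vault write auth/approle/role/apma token_policies="apma-read"
```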
__