OpenShift 4 Technical Deep Dive
Functional Overview
Value of OpenShift
● Self-Service
● Standards-based
● Multi-language
● Web-scale
● Multi-tenant
● Secure
Automated Operations, built on Kubernetes
An IMAGE is the binary; a CONTAINER is the runtime instance of that image.
Container images are stored in an image registry.
An image repository contains all versions of an image in the image registry, e.g.:
myregistry/frontend: frontend:latest, frontend:2.0, frontend:1.1, frontend:1.0
myregistry/mongo: mongo:latest, mongo:3.7, mongo:3.6, mongo:3.4
Containers are wrapped in pods, which are the units of deployment and management; each pod gets its own IP address (e.g. 10.140.4.44, 10.15.6.55).
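A minimal sketch of a pod wrapping one container (the image comes from the registry example above; the port is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  containers:
  - name: frontend
    image: myregistry/frontend:2.0   # one tag from the repository above
    ports:
    - containerPort: 8080            # illustrative port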
ReplicationControllers & ReplicaSets ensure a specified number of pods are running at any given time. Both are defined by an image name, a replica count, labels, and resource requests (cpu, memory, storage).
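A sketch of a ReplicaSet carrying those fields (replica count and request sizes are illustrative):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
spec:
  replicas: 3                        # desired pod count
  selector:
    matchLabels:
      app: frontend                  # labels used to find and adopt pods
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: myregistry/frontend:2.0
        resources:
          requests:
            cpu: 100m                # illustrative resource requests
            memory: 128Mi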
Deployments and DeploymentConfigs define how to roll out new versions of Pods. In addition to the image name, replicas, and labels, they carry a version and a rollout strategy.
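A Deployment sketch showing the extra rollout fields (the version label and strategy choice are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 3
  strategy:
    type: RollingUpdate              # how new versions roll out
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
        version: "2.0"               # version label for the rollout
    spec:
      containers:
      - name: frontend
        image: myregistry/frontend:2.0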
A DaemonSet ensures that all (or some) nodes run a copy of a pod; its spec likewise names the image, labels, and resource requests (cpu, memory, storage), but has no replica count, since the node set determines the pod count.
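A DaemonSet sketch (the log-collector name and image are invented for illustration):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
      - name: log-collector
        image: myregistry/log-collector:1.0   # hypothetical image
        resources:
          requests:
            cpu: 50m
            memory: 64Mi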
ConfigMaps keep environment-specific configuration out of the image: Dev and Prod each carry an appconfig.conf ConfigMap, with MYCONFIG=true in Dev and MYCONFIG=false in Prod.
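The Dev ConfigMap above might be written as (a minimal sketch; the object name is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: appconfig
data:
  appconfig.conf: |
    MYCONFIG=true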
Secrets provide a mechanism to hold sensitive information such as passwords. Like ConfigMaps they are per-environment: a hash.pw key holds the base64-encoded value ZGV2Cg== in the Dev Secret and cHJvZAo= in the Prod Secret.
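A sketch of the Dev Secret (the object name is illustrative; the value is the base64 string from the diagram):

apiVersion: v1
kind: Secret
metadata:
  name: hash-pw
type: Opaque
data:
  hash.pw: ZGV2Cg==   # base64-encoded password hash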
Services provide internal load-balancing and service discovery across pods, selecting them by label (e.g. role: backend).
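A sketch of such a Service (port numbers are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    role: backend      # every pod with this label becomes an endpoint
  ports:
  - port: 80           # service port
    targetPort: 8080   # container port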
My app is stateful: the pod's PersistentVolumeClaim (2Gi) is bound to a matching PersistentVolume (2Gi).
Liveness and Readiness: the platform asks each container "alive?" and "ready?" and reacts when checks fail (❌).
OpenShift 4
Architecture
Architecture: MASTER nodes run the Kubernetes services (API server, scheduler), etcd, and infrastructure services (web console, cluster management). WORKER nodes run application pods and the Routers. Every node runs Monitoring | Logging | Tuned | SDN | DNS | Kubelet, all on top of COMPUTE, NETWORK, and STORAGE.
Full Stack Automation: simplified, opinionated "best practices" for cluster provisioning; fully automated installation and updates, including the host container OS.
Pre-existing Infrastructure: customer-managed resources & infrastructure provisioning; plug into existing DNS and security boundaries.
Azure Red Hat OpenShift: deploy directly from the Azure console; jointly managed by Red Hat and Microsoft Azure engineers.
Full Stack Automation: openshift-install deploys both the cloud resources and the OCP cluster; every node runs RHEL CoreOS.
Pre-existing Infrastructure: the customer deploys the cloud resources and the OCP cluster; workers may run RHEL CoreOS or RHEL 7.
Note: Control plane nodes must run RHEL CoreOS!
Upgrade: 4.5 → 4.6 EUS → 4.7; reaching an EUS release (4.6) is a migration or serial upgrade through the intermediate versions.
WHEN TO USE: Pre-existing Infrastructure when customization and integration with additional solutions is required; Full Stack Automation when cloud-native, hands-free operations are a top priority.
Red Hat Enterprise Linux CoreOS is versioned with OpenShift.
RHEL CoreOS admins are responsible for:
Minimal and Secure Architecture | Optimized for Kubernetes | Runs any OCI-compliant image (including docker)
CRI-O tracks and versions identically to Kubernetes, simplifying support permutations.
systemd-managed native binaries: kubelet, CRI-O
kubelet static containers: etcd, kube-apiserver, kube-scheduler, kube-controller-manager
scheduled containers: coredns, openshift-apiserver, openshift-controller-manager, openshift-oauth
OpenShift 4
installation
How to boot a self-managed cluster:
The Bootstrap instance boots in the cloud and brings up the cluster; the NodeLink Controller then links each (future) Node to its Machine.
OpenShift Cluster Management | Machine Configuration
Operator/Operand Relationships: the Machine Config Controller merges the MachineConfigs that match a pool's label (e.g. 5-chrony, 50-kargs, 50-motd, all role:worker) into a single rendered config, rendered-worker-<hash>.
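One such MachineConfig might look like this (a sketch; the motd content is invented):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-motd
  labels:
    machineconfiguration.openshift.io/role: worker   # pool selection label
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/motd
        mode: 0644
        contents:
          source: data:,Welcome%20to%20OpenShift      # illustrative file content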
MachineConfigs are selected by pool label and merged in lexical order; when two configs write the same file, the later name wins. For example, pool role:worker merges 5-chrony (/etc/ntp.conf), 50-args (/etc/args), 50-motd and 51-motd (both /etc/motd, so 51-motd wins), and 60-args (/etc/args); pool role:highperf instead pulls in 5-other (/etc/other.conf).
When a new VM / server boots, its instance metadata points at the Machine Config Server (https://api-int.xxx.local:22623/config/worker), which hands the machine its Ignition config.
The Machine Config Server serves {.spec.config} of rendered-worker-<hash> to new workers; existing workers pick up changes in place.
On each node the Machine Config Daemon applies the rendered config, rendered-worker-<hash>, writing the merged files to disk (e.g. 50-registries → /etc/containers/registries.conf, 5-chrony → /etc/chrony.conf, 50-motd → /etc/motd; all role:worker).
OpenShift Cluster Management
Upgrade process: the Cluster Version Operator walks each component operator (e.g. some-component) to the new version, and each operator in turn upgrades its own operands.
Over-the-air updates: Nodes. The machine-config-operator rolls changes out to every node through its Machine Config Daemon.
● Certificate rotation is automated
● Optionally configure external endpoints to use custom certificates
✓ INGRESS CONTROLLER ✓ CONSOLE ✓ REGISTRY ○ etcd
Service CA certificates:
● Annotate a ConfigMap with service.beta.openshift.io/inject-cabundle="true" to have the CA bundle injected as service-ca.crt
● Annotate a Service with service.alpha.openshift.io/serving-cert-secret-name=<secret-name> to have a serving certificate generated into that Secret as tls.crt / tls.key
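For example, annotating a Service (a sketch; the service and secret names are arbitrary):

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: my-service-tls
spec:
  selector:
    app: my-service
  ports:
  - port: 443
    targetPort: 8443

The service CA then creates and maintains the Secret my-service-tls containing tls.crt and tls.key.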
Identity providers include LDAP, Google, Keystone, OpenID, GitLab, and Basic auth (e.g. user userXX).
● Project scope & cluster scope available
Application Logs: containers write to stdout and stderr; on each Node (OS) the kubelet captures the streams to journald on the OS DISK.
Persistent Storage options: NFS, OpenStack Cinder, iSCSI, Azure Disk, AWS EBS, FlexVolume, NetApp Trident*, Container Storage Interface (CSI)**
Storage
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: myfrontend
    image: nginx
    volumeMounts:
    - mountPath: "/var/www/html"
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: z

The kubelet on the pod's node mounts the volume bound to claim z (here a 2Gi NFS PersistentVolume) into the container at /var/www/html.
PersistentVolumes: a claim (VolumeMount: Z, requesting 2Gi RWX) is matched against the available PersistentVolumes (NetApp Flash, NetApp SSD, a 2Gi NFS volume) and binds the one that fits.
StorageClass: PVs are grouped into named classes (e.g. Good: VMware VMDK, NetApp SSD, Block) so a claim (VolumeMount: Z, 2Gi RWX, class Good) can be satisfied without knowing the backing technology.
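The claim from the diagram, written out (a sketch; the class name is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: z
spec:
  accessModes:
  - ReadWriteMany        # RWX from the diagram
  resources:
    requests:
      storage: 2Gi
  storageClassName: good # tier requested, not a specific volume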
Special Resources and Devices
NFD finds certain resources: the Node Feature Discovery Operator (NFD) runs an NFD Worker DaemonSet on every node (alongside kubelet and CRI-O), which reports hardware features to the Kubernetes API (Master) as node labels, e.g. feature.node.kubernetes.io/pci-10de.present=true.
The Special Resource Operator (SRO) reacts to that label by deploying the GPU stack as DaemonSets on matching nodes: GPU Feature Discovery, GPU Driver, CRI-O Plugin, Device Plugin, and Node Exporter.
GPU Feature Discovery DaemonSet: adds detailed GPU labels to the worker node (CoreOS).
GPU Driver DaemonSet: installs the driver on the node (kmod-nvidia, nvidia-driver-userspace); the node now carries labels such as nvidia.com/gpu.family=tesla and nvidia.com/gpu.memory=16130 describing its GPUs.
CRI-O Plugin DaemonSet: installs a prestart hook into CRI-O.
Device Plugin DaemonSet: informs the kubelet of resource details, advertising the node's GPUs as schedulable resources.
Node Exporter DaemonSet: provides metrics on the GPUs, exposed at /metrics and scraped by Prometheus (cluster monitoring).
GPU workload deployment: a pod requests the GPU resource and lands on a matching node:

mypod
...
resources:
  requests:
    nvidia.com/gpu: 1
...

mypod is then placed on the GPU node, where the kubelet and CRI-O start it.
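Written out as a full manifest (a sketch; the image is hypothetical, and extended resources such as nvidia.com/gpu are usually set under limits):

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: cuda-app
    image: myregistry/cuda-app:1.0   # hypothetical CUDA workload image
    resources:
      limits:
        nvidia.com/gpu: 1            # schedules the pod onto a GPU node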
Automated kernel module matching
NFD detects the kernel version and labels the node: the NFD Worker DaemonSet reports kernel=4.18.0-80 to the Kubernetes API (Master) for the worker node (CoreOS, kernel 4.18.0-80).
The Special Resource Operator (SRO) deploys a GPU Driver DaemonSet whose driver container matches that kernel: driver-container-4.18.0-80.
After an OS update the node reports kernel=4.18.0-147*; NFD relabels it in the Kubernetes API (Master), and the existing driver-container-4.18.0-80 no longer matches the node.
The Special Resource Operator (SRO) reacts and rolls out driver-container-4.18.0-147, matching the new kernel (kernel=4.18.0-147).
Virtualization
OpenShift Virtualization (deep-dive deck available on the PnT Portal and Google Slides)
Load Balancing and DNS
with OpenShift
For physical, OSP, RHV, and vSphere IPI deployments
On-prem OpenShift IPI DNS and Load Balancer
Excruciating detail: https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md
mDNS with CoreDNS
● CoreDNS is used by Kubernetes (and OpenShift) for internal service discovery
○ Not used for node discovery
● Multicast DNS (mDNS) works by sending DNS packets, using UDP, to a specific multicast address
○ mDNS hosts listen on this address and respond to queries
● mDNS in OpenShift
○ Nodes publish IP address/hostname for themselves to the local mDNS responder
○ The mDNS responder on each node replies with the local value ("What are the etcd servers?" — "I'm etcd-0!", "I'm etcd-1!", "I'm etcd-2!")
● DNS SRV records are not used for etcd in OCP 4.4 and later
keepalived
● Used to ensure that the API and Ingress (*.apps) Virtual IPs (VIPs) are always available
● Utilizes Virtual Router Redundancy Protocol (VRRP) to determine node health and elect an active node
○ Only one host owns the VIP at any time; the IP owner uses RARP to claim traffic
API load balancer
1) Client creates a new request to api.cluster-name.domain.name
2) HAProxy on the node actively hosting the API VIP (as determined by keepalived) load balances across control plane nodes using round robin
3) The connection is forwarded to the chosen control plane node, which responds directly to the client, a.k.a. "direct return"
Ingress load balancer
● The VIP, managed by Keepalived, will
only be hosted on nodes which have a
Router instance
○ Nodes without a Router continue to
participate in the VRRP domain, but
fail the check script, so are ineligible
for hosting the VIP
● Traffic destined for the *.apps Ingress
VIP will be passed directly to the Router
instance
Requirements and limitations
1) Multicast is required for the Keepalived (VRRP) and mDNS configuration used
2) VRRP needs layer 2 adjacency to function
a) All control plane nodes must be on the same subnet
b) All worker nodes capable of hosting a router instance must be on the same subnet
c) The VIPs must be on the same subnet as the hosts
3) Ingress (*.apps) throughput is limited to a single node
4) Keepalived failover will result in disconnected sessions, e.g. oc logs -f <pod> will terminate
a) Failover may take several seconds
5) There cannot be more than 119 on-prem IPI cluster instances on the same L2 domain
a) Each cluster uses two VRRP IDs (API, ingress)
b) The function used to generate router IDs returns values of 1-239
c) There is no collision detection between clusters for the VRRP ID
d) The chance of collision goes up as additional clusters are deployed
Alternatives
“I don’t like this,” “I can’t use this,” and/or “this does not meet my needs”. What other options are there?
● Ingress
○ 3rd party partners, such as F5 and Citrix, have certified Operators that are capable of replacing
the Ingress solution as a day 2 operation
● API
○ There is no supported way of replacing the API Keepalived + HAProxy configuration
● DNS
○ There is no supported way of replacing mDNS in this configuration
● DHCP
○ DHCP is required for all IPI deployments; there is no supported way of using static IPs with IPI
Remember that IPI is opinionated. If the customer’s needs cannot be met by the IPI config, and it’s not
an option to reconfigure within the scope of supported options, then UPI is the solution. Machine API
integration can be deployed as a day 2 operation for node scaling.
Developer
Experience
Deep Dive
Application
Probes
Application Probes
A probe checks the CONTAINER inside a POD in one of three ways: httpGet, exec, or tcpSocket.
Application Probes
Liveness probes ask "alive?": a container that fails its liveness probe is restarted. Readiness probes ask "ready?": a pod that fails its readiness probe is removed from its SERVICE's endpoints.
Important settings
initialDelaySeconds: How long to wait after the pod is launched to begin checking
timeoutSeconds: How long to wait for a successful connection (httpGet, tcpSocket only)
failureThreshold: How many consecutive failed checks before the probe is considered failed
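Putting the settings together in a pod (a sketch; the image, path, and port are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
  - name: app
    image: myregistry/frontend:2.0
    livenessProbe:
      httpGet:
        path: /healthz           # illustrative health endpoint
        port: 8080
      initialDelaySeconds: 15    # wait after launch before checking
      timeoutSeconds: 1          # how long to wait for a connection
      failureThreshold: 3        # consecutive failures before restart
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5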
Build and Deploy Container Images
Three entry points: DEPLOY YOUR SOURCE CODE, DEPLOY YOUR APP BINARY, or DEPLOY YOUR CONTAINER IMAGE. Source CODE and APPLICATION binaries go through a BUILD IMAGE step; a ready-made image goes straight to DEPLOY.