Accelerating Cloud Native in Telco
Challenges of Cloud Native Telco Transformation today and how to overcome them - A CSP
perspective
Preamble
This document is a product of the initial joint work of several Communication Service
Providers (CSPs) who are active in the Cloud Native Computing Foundation (CNCF)’s
Cloud Native Network Function Working Group (CNF WG), NGMN Alliance, and projects like
Linux Foundation (LF) Europe’s Sylva and Linux Foundation Networking (LFN) Anuket
project. It is a draft that has been published to invite feedback from other CSPs and
motivate discussion and improvements from the broader telecommunication industry. We
hope that through public discourse we can make the document more complete, relevant,
and ready for final release. If you would like to contribute to the discussion and
document, please feel free to open an issue or create a pull request.
Introduction
The recently published Cloud Native Manifesto from the Next Generation Mobile Networks
(NGMN) Alliance does an excellent job outlining the vision and target picture for
cloud native telecommunication networks. The transformation towards a cloud native
production model has already commenced in many Communication Service Providers (CSPs).
On this journey, practical challenges and pain points that hinder progress towards the
target expressed in the NGMN Cloud Native Manifesto have been identified and are being
felt. These hindrances are especially prominent in CSPs that are already taking
practical transformation steps and are trying to follow the described vision closely.
We, the group of CSPs gathered around Cloud Native Computing Foundation (CNCF)’s Cloud
Native Network Function Working Group (CNF WG), live on the frontlines of this
transformation and have gathered valuable experience. We firmly believe that to attain
the envisioned outcome, the entire industry needs to work together to align around key
strategic and operational principles. Besides building a sound understanding of what
it would take for CNFs to become cloud native, it is also important to emphasize the
ecosystem that would support that evolution.
The industry is still maturing and searching for the right formula to reach a cloud
native operating model. CNF vendors have not yet been able to comply with the cloud
native and openness requirements of CSPs, partly because these requirements are still
emerging and not yet stable; however, part of the hesitance also stems from an
aversion to giving up a lucrative professional services business model and control
over vertical integration. This hesitance cannot override a CSP's need to evolve
toward a cloud native architecture, largely based on the 12 factors for CNFs (see
Annex 1 and Reference 5). Supporting that evolution will require changing existing
commercial models, which could greatly benefit CSPs and vendors alike and create a
new win-win equilibrium.
Vendors must provide open APIs, clear documentation, and cloud native architectures
and implementations that empower CSPs with self-service capabilities in the cloud
ecosystem. For the new model to work, vendors and CSPs must provide mutual SLAs: the
CSP must guarantee a certain level of quality at the platform layer, while CNF vendors
need to guarantee that the application will perform on the platform with SLAs that
meet defined KPIs. This will help drive agility and innovation, and reduce Opex
within CSPs. CNF vendors can monetize the value of openness to evolve business models
that move away from closed solutions and professional services.
Validation. This step did not exist in the previous scenarios due to reliance on pre-
validation and pre-integration. Due to the number of permutations found in cloud
native ecosystems, pre-validation has limited value. Only validation of CNFs on CSP
premises, with the CSP's flavor of cloud native infrastructure and its specific
integrations, is truly relevant for concluding whether the CNF can be deployed and
promoted to production. Today, we still see that many CNFs are not ready to be
validated in the local CSP environment; their vendors instead insist on conformance
with a pre-validation stack. This practice is unsustainable and requires a fresh and
flexible local validation approach. Automation (Continuous Testing) is especially
important when validating frequently released cloud native applications and checking
conformance with frequently updated cloud platforms.
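A continuous local validation gate of this kind might be sketched as follows. This is a minimal illustration, not drawn from any specific CSP pipeline; the check names and the promotion policy are invented for the example.

```python
# Sketch of a local CNF validation gate (illustrative only).
# Each check is a callable returning (name, passed, detail); the CNF is
# promoted towards production only if every check passes. In a real
# pipeline, checks would run against the CSP's own flavor of cloud
# native infrastructure on every CNF or platform change.

from typing import Callable, List, Tuple

Check = Callable[[], Tuple[str, bool, str]]

def lint_artifacts() -> Tuple[str, bool, str]:
    # Placeholder: would run strict linters over Helm charts and other
    # artifacts to prove portability.
    return ("artifact-lint", True, "all charts pass strict linting")

def lifecycle_smoke() -> Tuple[str, bool, str]:
    # Placeholder: would deploy, upgrade, and roll back the CNF locally.
    return ("lifecycle-smoke", True, "deploy/upgrade/rollback OK")

def run_validation(checks: List[Check]) -> dict:
    results = [check() for check in checks]
    return {
        "results": results,
        "promote": all(passed for _, passed, _ in results),
    }

report = run_validation([lint_artifacts, lifecycle_smoke])
```

The key design point is that promotion is a pure function of locally observed results, not of any vendor pre-validation claim.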
Many CNFs are still reliant on manual artifact deployment and are rooted in
traditional telco methods, such as NETCONF and YANG for configuration management.
These practices pose significant challenges for CSPs aiming for a fully automated CNF
lifecycle. Moreover, the ETSI standard follows an imperative top-down approach, often
characterized as "fire and forget". This approach does not readily support
reconciliation and depends on orchestration entities operating "out-of-band",
externally to Kubernetes. Even when CNFs follow the Kubernetes native
approach, we face challenges with the quality of artifacts like Helm Charts which are
not generalized nor easily customizable, as well as with divergent configuration
schemas. This all creates further complexities in the transition to the declarative
and GitOps-driven automation models prevalent in the cloud native ecosystem.
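The problem of divergent configuration schemas can be illustrated with a minimal sketch. The required keys below are an invented stand-in for a CSP/vendor-aligned schema; real alignment would more likely use a standardized schema language such as JSON Schema.

```python
# Minimal sketch of checking a CNF's configuration values against a
# shared, aligned schema. REQUIRED_KEYS is hypothetical, chosen only to
# demonstrate the mechanism.

REQUIRED_KEYS = {"replicaCount", "image.repository", "image.tag", "resources"}

def flatten(d: dict, prefix: str = "") -> set:
    """Flatten nested dicts into dotted key paths."""
    keys = set()
    for k, v in d.items():
        path = f"{prefix}{k}"
        if isinstance(v, dict):
            keys |= flatten(v, path + ".")
        else:
            keys.add(path)
    return keys

def missing_keys(values: dict) -> set:
    """Return schema keys the supplied values do not provide."""
    return REQUIRED_KEYS - flatten(values)

values = {
    "replicaCount": 3,
    "image": {"repository": "registry.example/cnf", "tag": "1.2.0"},
}
gaps = missing_keys(values)  # the schema's "resources" key is absent
```

A check like this only becomes useful once the schema itself is agreed between CSPs and vendors; without that alignment every chart needs bespoke handling.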
Architecture. We are still witnessing CNFs whose architecture exhibits properties of
Virtualized Network Functions (VNFs). For example, we see the "pinning" of Pods to
specific cluster nodes. We also still see 1+1 redundancy models for Pods within the
cluster instead of N+1. Although it is
technically possible to run such Network Functions on cloud native infrastructure,
this increases the burden of operating them and risks having a negative impact on
service quality, as small disruptions which are normal in cloud native infrastructures
result in problems within the CNFs. Furthermore, the scalability of today’s CNFs is
still sub-optimal. In many cases, it still relies on vertical scaling and manual
interventions. Sudden increases in demand cause performance degradation and even
downtime if the system is not dimensioned in advance for that peak load.
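Anti-patterns such as node pinning and fixed 1+1 redundancy can be flagged mechanically from the rendered manifests. The heuristics below are simplistic assumptions for demonstration, not a complete architecture audit.

```python
# Illustrative scan for VNF-style anti-patterns in a Kubernetes workload
# manifest (represented here as a plain dict): Pods pinned to specific
# nodes, and a fixed replica pair suggesting 1+1 redundancy.

def findings(manifest: dict) -> list:
    issues = []
    spec = manifest.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    if "nodeName" in pod_spec or "nodeSelector" in pod_spec:
        issues.append("Pod pinned to specific node(s); breaks rescheduling")
    if spec.get("replicas") == 2:
        # Heuristic only: two replicas often indicates a 1+1 model
        # instead of N+1 with horizontal scaling.
        issues.append("replicas=2 may indicate 1+1 redundancy; prefer N+1")
    return issues

deployment = {
    "kind": "Deployment",
    "spec": {
        "replicas": 2,
        "template": {"spec": {"nodeName": "worker-07"}},
    },
}
```

Running `findings(deployment)` on the example above reports both issues; a clean, horizontally scalable workload would report none.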
Security. In our experience so far, we have noticed that CNFs, in their default setup,
have quite a relaxed posture when it comes to cluster security-relevant
aspects like Roles, ClusterRoles, privileged access, cluster node level access, and
similar functionalities. We frequently observe that the principle of least privilege
is not consistently followed and that Roles frequently require rights for everything
(“*”) and ClusterRoles are used without real need. CNFs sometimes use problematic
practices (such as hostPath mounting to write their logs, hostPorts for communication,
privilege escalation, running containers as root, managing the configuration of the
node networking stack, and performing dangerous sysctl calls), none of which are
allowed in a properly hardened environment. Such CNFs appear to assume that the
infrastructure can be consumed with unrestricted cluster-admin rights, which in
realistic circumstances is never the case. Such an expectation could be reasonable in
a combined silo package where CNF and infrastructure come from a single vendor as a
managed package. In all other cases, however, CNFs are "guests" on the infrastructure
and as such must operate under appropriately imposed security restrictions and
limitations.
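The problematic defaults described above can be surfaced with simple static checks. The sketch below is illustrative; in a hardened environment these rules would be enforced by admission policies (e.g. Kubernetes Pod Security Admission) rather than an ad-hoc script.

```python
# Illustrative least-privilege checks over RBAC rules and a Pod spec,
# flagging wildcard rights, hostPath volumes, and privileged containers.

def rbac_violations(role: dict) -> list:
    issues = []
    for rule in role.get("rules", []):
        if "*" in rule.get("verbs", []) or "*" in rule.get("resources", []):
            issues.append("wildcard rights requested; violates least privilege")
    return issues

def pod_violations(pod_spec: dict) -> list:
    issues = []
    for vol in pod_spec.get("volumes", []):
        if "hostPath" in vol:
            issues.append(f"hostPath volume '{vol.get('name')}' used")
    for c in pod_spec.get("containers", []):
        sc = c.get("securityContext", {})
        if sc.get("privileged") or sc.get("allowPrivilegeEscalation"):
            issues.append(f"container '{c.get('name')}' requests elevated privileges")
    return issues

role = {"rules": [{"verbs": ["*"], "resources": ["pods"]}]}
pod = {
    "volumes": [{"name": "logs", "hostPath": {"path": "/var/log/cnf"}}],
    "containers": [{"name": "app", "securityContext": {"privileged": True}}],
}
```

Here `rbac_violations(role)` and `pod_violations(pod)` each flag exactly the practices the text calls out as unacceptable in a hardened cluster.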
3. Validation. CNFs shall be delivered with a series of automated tests that can
be used to validate the CNF operation on the spot in the CSP's context.
1. This validation shall count as the only relevant one, taking precedence
over any pre-validation (or the lack thereof).
2. The validation shall ensure that all artifacts pass strict linters to
prove that portability is assured.
3. It shall serve as a condition for support and SLAs.
4. The validation shall be a continuous process and shall be run immediately
on any change, be it on the CNF or on the infrastructure side.
5. The validation tests shall cover CNF basic functionality, lifecycle, and
disaster recovery.
1. Mainstream open source deployment tools from the CNCF ecosystem, like
FluxCD or ArgoCD, shall be supported by default.
2. All configuration shall be done via ConfigMaps and/or similar cloud
native constructs (e.g. the Kubernetes Resource Model).
3. A CNF is allowed to use traditional telco mechanisms internally as a
transition step; however, these should be fully encapsulated and
abstracted away.
4. Microservices should be loosely coupled (with NO tight dependency on
each other) to ensure scalability and ease of deployment, e.g. without
the need to wait for NETCONF day-1 configuration before further
microservices get deployed.
5. Artifacts are delivered via OCI (Open Container Initiative)-compliant
repositories.
6. The CNF LCM should be described declaratively and support continuous
intent-based deployments, for example IP address assignment during
deployment.
7. Each newly released software version (CNF/microservices) shall include
machine-readable code to run health checks.
8. Release notes and impact reports should be included as machine-readable
code in every published release.
9. The CSP-internal automation pipeline shall be allowed to hook into the
vendor software delivery solution (e.g. to subscribe to CNF releases).
10. Artifacts delivered with CNFs (e.g. Helm charts) shall be customizable
for efficient multi-purpose deployments.
11. CNF configuration schemas shall follow standards aligned among CSPs and
vendors.
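Items 7 and 8 above call for machine-readable health checks, release notes, and impact reports. A minimal sketch of such release metadata follows; the field names are hypothetical, as no standard format for this metadata exists yet.

```python
# Sketch of machine-readable CNF release metadata that a CSP pipeline
# could consume directly: health checks to run, release notes, and an
# impact report. All field names are invented for illustration.

RELEASE = {
    "version": "2.4.1",
    "release_notes": "Fixes session teardown race; no schema changes.",
    "impact": {"requires_restart": False, "api_breaking": False},
    "health_checks": [
        {"name": "app-ready", "kind": "http", "path": "/healthz"},
        {"name": "db-migrated", "kind": "exec", "command": ["check-schema"]},
    ],
}

def pipeline_gate(release: dict) -> dict:
    """Derive what the CSP-internal pipeline must do with this release."""
    return {
        "version": release["version"],
        "checks_to_run": [hc["name"] for hc in release["health_checks"]],
        "needs_maintenance_window": release["impact"]["requires_restart"]
                                    or release["impact"]["api_breaking"],
    }

gate = pipeline_gate(RELEASE)
```

Because the metadata is data rather than prose, the CSP pipeline can subscribe to vendor releases and act on them without manual interpretation.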
7. Tracing. The CNFs shall be instrumented to emit protocol tracing data
directly from their microservices to configurable targets (e.g.
application-level tracing, https://opentracing.io/).
1. These traces shall be sufficient for typical e2e analysis that is done
with standard telco tools such as NetScout, Gigamon, Polystar etc.
8. Architecture. The CNFs need to be architected in line with the 12 factors for
CNF compliance with cloud native (Annex 1 and Reference 3), and in a way that
does not depend on any particular cluster node or reasonably small group of
cluster nodes.
9. Security. To run in a generic cloud native environment, CNFs have to strip down
their expectations and require exactly the minimum rights needed to
function.
1. Any practice that poses a risk, such as usage of hostPaths, privilege
escalation, root containers, etc., needs either to be eliminated or
replaced with an alternative cloud native approach.
2. The application must adhere to cluster policies enforced by the cluster
manager, including cases where these override the application's default
policies.
3. The application should follow the Principle of Least Privilege.
4. RBAC definitions must declare the minimal set of rights needed for the
application to function. The application should not request open-ended /
all rights in its RBAC definitions.
5. Applications should not require elevated privileges to run, including
privileged Pods and cluster roles.
6. Applications that require privileges must declare which components
require privileges in both machine-readable and human-readable open
formats.
7. The application must be isolated with Namespaces and not use the default
namespace.
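Item 6 above requires a machine-readable declaration of which components need elevated privileges. A minimal sketch of such a declaration and a consumer for it follows; the CNF name, component names, and field names are all invented for illustration.

```python
# Sketch of a machine-readable privilege declaration shipped with a CNF,
# letting the CSP review and scope security exceptions per component.
# Everything here (names, fields) is hypothetical.

PRIVILEGE_DECLARATION = {
    "cnf": "example-upf",
    "components": [
        {
            "name": "fastpath",
            "privileges": ["NET_ADMIN"],
            "reason": "programs kernel networking for the user plane",
        },
        {"name": "config-agent", "privileges": [], "reason": ""},
    ],
}

def privileged_components(decl: dict) -> list:
    """Return only the components that actually request elevated rights."""
    return [c["name"] for c in decl["components"] if c["privileges"]]
```

A declaration like this lets the CSP grant a narrowly scoped exception for the one component that needs it, instead of treating the whole CNF as privileged.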
ANNEX 2
GitOps for cloud native applications and infrastructure
GitOps is not a single product, plugin, or platform. While the practices and patterns
in GitOps existed before Cloud Native (and the term GitOps), they happen to be a great
match for cloud native applications and infrastructure alike.
Here are some principles for GitOps (as defined by the OpenGitOps community):
Declarative - A system managed by GitOps must have its desired state expressed
declaratively.
Versioned and Immutable - The desired state is stored in a way that enforces
immutability, versioning and retains a complete version history.
Pulled Automatically - Software agents automatically pull the desired state
declarations from the source.
Continuously Reconciled - Software agents continuously observe actual system
state and attempt to apply the desired state.
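The "pulled automatically" and "continuously reconciled" principles can be sketched as a simple convergence loop. Real agents such as Flux or Argo CD do this against a Git repository and the Kubernetes API; the version below is a minimal illustration over plain dictionaries.

```python
# Minimal sketch of GitOps reconciliation: compare the declared desired
# state with the observed actual state and compute the actions needed
# to converge. An agent would run this continuously.

def reconcile(desired: dict, actual: dict) -> list:
    """Compute (action, name) pairs that converge actual state to desired."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {"cnf-a": {"replicas": 3}, "cnf-b": {"replicas": 2}}
actual = {"cnf-a": {"replicas": 2}, "cnf-c": {"replicas": 1}}
actions = reconcile(desired, actual)
```

Note that the loop also deletes what is no longer declared: drift in either direction is corrected, which is what distinguishes reconciliation from "fire and forget" orchestration.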
References:
https://opengitops.dev/ - The GitOps Working Group under the CNCF App Delivery
SIG.
https://www.gitops.tech/ - Collection of information on GitOps by INNOQ.
REFERENCES
1. Cloud Native Networking principles whitepaper - https://networking.cloud-native-principles.org/cloud-native-networking-preamble
2. NGMN
NGMN publishes Cloud Native Manifesto - https://www.ngmn.org/highlight/ngmn-publishes-cloud-native-manifesto.html
Cloud Native Manifesto "An Operator View" (PDF)
CONTRIBUTORS
Bell Canada
Daniel Bernier, Technical Director
Roger Lupien, Sr. Mgr - Enterprise Architecture, Cloud Transformation
Charter Communications
Mohammad Zebetian, Head of Cloud, Network, Edge, and Infrastructure
Architecture
DNA Oyj
Johanna Heinonen, Development Manager
Orange
Philippe Ensarguet, VP of Software Engineering
Guillaume Nevicato, Network Anticipation & Research Manager
Swisscom
Ashan Senevirathne, Product Owner Mobile Cloud Native Orchestration
Josua Hiller, Product Manager Mobile Data Services
TELUS
Andrei Ivanov, Principal Technology Architect
Sana Tariq, Ph.D - Principal Technology Architect | Cloud and Service
Orchestration
Vodafone
Tom Kivlin, Principal Cloud Architect
Riccardo Gasparetto Stori, Principal Cloud Architect