Tech Director of Dosec.cn Discusses Best Practices for Cloud-native Security Architecture Design
With its efficient, stable, and responsive features, cloud-native has become a key driver of digital innovation in enterprises. At the same time, security risks are also increasing in cloud-native environments, prompting enterprises to seek appropriate architecture design solutions.
In this article, we invited Mr. Bai Liming, technology director of Dosec.cn, to present some best practices for building cloud-native security architectures based on the company's expertise and experience.
1. Development of cloud-native
The concept of cloud-native was first introduced in 2013 by Pivotal, a company recognized for its multi-cloud application platform Cloud Foundry. Two years later, Matt Stine, Pivotal's technical product manager, defined the five principles of cloud-native architecture in his book "Migrating to Cloud-Native Application Architectures":
· Compliance with the twelve-factor app methodology;
· Microservice-oriented architecture;
· Self-service agile infrastructure;
· API-based collaboration;
· Antifragility.
According to the CNCF Cloud Native Definition v1.0, which was approved on June 11, 2018, cloud-native should have the following characteristics:
· Containers;
· Service meshes;
· Microservices;
· Immutable infrastructure;
· Declarative APIs.
Applications that meet all five of the characteristics above are considered cloud-native.
Throughout the evolution of cloud-native, containerization has further simplified the capabilities and features of the operating system. Cloud-native operating systems were developed to meet the immutable infrastructure requirement. Such an OS features a streamlined kernel, retains only container-related dependency libraries, and uses a container client as its package manager.
In cloud-native operating systems, all processes must run in containers. As no application can be installed on the host OS, the OS becomes completely immutable. This is what is known as immutable infrastructure, and it is expected to be the future of OS development.
In the past, applications ran on physical machines, but as the infrastructure evolved, they moved to virtual machines and later to containers. In the era of cloud computing, serverless architecture is the newest trend.
A physical machine's life cycle is typically measured in years, and the machine is retired after one to five years. For virtual machines, the unit of measurement is the month.
With the advent of containerization, each update requires building a new container; as a result, container lifecycles are measured in days. As serverless computing progresses, the lifecycle of a function will be measured in minutes.
The emergence of containerization accelerated the process of standardizing containers. Containers and DevOps complement each other, and application container platforms should follow a DevOps development model to speed up the release process. In general, containerization promotes DevOps, and containers rely on DevOps to speed up iteration.
With containers as the basic unit, services form the network boundary in cloud-native. Cloud-native has no stable concept of IP addresses: they are all dynamic, so they cannot be configured on conventional firewalls. Container services are updated every day, their IP addresses change accordingly, and the original network policies are no longer valid.
In the era of physical machines, deploying physical devices was more difficult, so running several applications on one physical machine was common. With virtual machines, each service was usually placed on its own virtual machine to improve service availability. Today, service interfaces increasingly depend on microservices, so applications must be adapted to microservice architectures.
Take Weibo (a Chinese microblogging site similar to Twitter) as an example: when a hot event occurs, physical and virtual machines need a build period measured in hours before the business can recover. In a containerized scenario, a container starts in seconds, whereas physical and virtual machines start up much more slowly. Therefore, since Weibo adopted a container architecture, hot events have rarely caused downtime. This can also be attributed to the self-healing and dynamic scaling capabilities of the K8S platform.
During the early days of container runtimes, Docker was commonly equated with containers. Similar to containers, which have four modules, Docker includes four interfaces. Docker, however, is a complete development kit, whereas K8S only uses the runtime. Therefore, to improve operational efficiency, K8S deprecated dockershim in version 1.20 and moved toward CRI-compatible runtimes such as Containerd.
However, neither Containerd nor Docker provides comprehensive security features. CRI-O can meet relative security needs, and it has no daemon. Each CRI-O process consists of a parent and child process, which can run as a service. In addition, the next aspect of containers to consider is the security of the underlying infrastructure, including the containerization of security technology itself.
2. Risks associated with cloud-native
A cloud-native architecture needs to address five main security concerns:
· Image security
· Image repo security
· Cluster component security
· Container network risks
· Microservices risks
The risks associated with image security are by far the most extensive. Unlike infrastructure security, cloud-native focuses more on performance optimization and infrastructure containerization. At the moment, 51% of DockerHub images have high-risk vulnerabilities, while 80% have low- to medium-risk vulnerabilities. Yet it is common for enterprises to download images from DockerHub.
As for image repositories, enterprises cannot upload all of their R&D and business images to a public repository but must store the source code in their own repository. However, enterprise repositories can also contain vulnerabilities that attackers may exploit to replace images in the repository. The image a node actually pulls may then be an attacker's image carrying a Trojan horse.
Cluster components such as Docker, K8s, OpenShift, and CRI-O have vulnerabilities, and 45 vulnerabilities have been reported in other container runtimes such as Containerd and Kata Containers. Vulnerabilities associated with cluster components are relatively few, but they do exist.
An attacker who exploits these vulnerabilities can also gain access to other containers within the cluster. Physical firewalls can only block traffic originating outside the cluster; attacks that originate inside the firewall, such as those crossing the K8S overlay and underlay networks, are not covered, which poses an internal network risk to clusters.
Vulnerable business images also lead to a second problem: vulnerable components built into the image. If a developer uses a vulnerable API or development framework, this type of security problem arises when the developer packages those components into an image. For example, the widely felt Spring Framework 0-day was an infrastructure vulnerability that affected approximately 90% of Chinese Internet enterprises. R&D is typically responsible for introducing this type of microservices risk.
3. Design of a cloud-native security architecture
In the past, infrastructures were primarily protected by firewalls and physical security measures. In a container computing environment, container runtime security and image security require specialized protection. Container security also involves the discovery of microservices and the protection of serverless applications.
A cloud-native scenario requires security to be integrated into the R&D system, which differs from a traditional security system. R&D personnel should be involved in the security design process, and they should always pay attention to cloud-native data security in R&D and to the permissions related to security management.
Dosec.cn's container security solution includes many built-in policies and machine behavior learning policies, as well as other disposal policies and events.
Auditing orchestration files is one of its features. It can read all the existing Dockerfiles, YAML files, and orchestration files directly from the developer's code repository. By parsing the syntax of a Dockerfile, it can detect errors in its commands.
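To make the idea of a Dockerfile audit concrete, here is a minimal sketch in Python. It is not Dosec.cn's implementation: the function names and the three rules (unknown instruction, a base image on the `latest` tag, `USER root`) are assumptions chosen for illustration.

```python
import re

# Standard Dockerfile instructions; anything else is reported as a syntax error.
KNOWN_INSTRUCTIONS = {
    "FROM", "RUN", "CMD", "LABEL", "EXPOSE", "ENV", "ADD", "COPY",
    "ENTRYPOINT", "VOLUME", "USER", "WORKDIR", "ARG", "ONBUILD",
    "STOPSIGNAL", "HEALTHCHECK", "SHELL",
}


def audit_dockerfile(text):
    """Return a list of (line_number, message) findings for a Dockerfile."""
    findings = []
    continued = False  # True while inside a backslash-continued instruction
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not continued:
            keyword = line.split(None, 1)[0].upper()
            if keyword not in KNOWN_INSTRUCTIONS:
                findings.append((number, f"unknown instruction {keyword!r}"))
            elif keyword == "FROM" and re.search(r":latest\b", line):
                # Hypothetical rule: mutable tags make builds unrepeatable.
                findings.append((number, "base image uses the mutable 'latest' tag"))
            elif keyword == "USER" and line.split()[1:] == ["root"]:
                # Hypothetical rule: flag containers that run as root.
                findings.append((number, "container runs as root"))
        continued = line.endswith("\\")
    return findings


def build_allowed(text):
    """The image build is blocked whenever the audit reports any finding."""
    return not audit_dockerfile(text)
```

A CI step could call `build_allowed` on every Dockerfile in the repository and report the findings to the R&D team when it returns `False`.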
If an issue is discovered during the audit, it is reported to the R&D team and image building is blocked. Once the required modifications have been made and the audit finds no problems, the image is generated. Next, the image is reversed into a Dockerfile and compared with the original. A warning is issued if any tampering with the Dockerfile is detected.
Moreover, the container workload running on the image is also reversed in order to check whether the image on which the container depends is correct and whether the processes running in the container match the processes packaged in the Dockerfile. If an inconsistency is found, an alert is raised, reporting that the business may be at risk.
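The process check described above can be sketched as follows. This is a simplified illustration under stated assumptions: only `ENTRYPOINT` and `CMD` are treated as declaring processes, processes are compared by executable name, and the list of running processes is supplied by the caller. The function names are hypothetical.

```python
import json
import shlex


def declared_executables(dockerfile_text):
    """Collect the executable names given by ENTRYPOINT and CMD instructions."""
    executables = set()
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split(None, 1)
        if len(parts) != 2 or parts[0].upper() not in ("ENTRYPOINT", "CMD"):
            continue
        args = parts[1].strip()
        # Exec form is a JSON array; shell form is a plain command line.
        argv = json.loads(args) if args.startswith("[") else shlex.split(args)
        if argv:
            # Keep only the file name so /usr/bin/python3 matches python3.
            executables.add(argv[0].rsplit("/", 1)[-1])
    return executables


def unexpected_processes(dockerfile_text, running):
    """Return running process names that the Dockerfile never declared."""
    return sorted(set(running) - declared_executables(dockerfile_text))
```

A non-empty result from `unexpected_processes` corresponds to the inconsistency alert: something is running in the container that was not packaged as its declared process.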
The cloud-native approach is immutable: the underlying OS and the image are both part of the immutable infrastructure, so the image is also immutable. An image is built according to the Dockerfile, and the running containers are associated with the image.
Another feature is the ability to read YAML files directly from the code repository and to control their permissions. A warning is raised if a YAML file contains deprecated or incorrect syntax, high-risk commands, or other dangerous parameters. The purpose is to link the security, O&M, and R&D teams. A cloud-native security strategy must be developed jointly by the operations team, developers, and security personnel, and should never be solely the responsibility of the security department.
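A check for dangerous parameters in an orchestration file might look like the sketch below. It assumes the YAML has already been parsed into a Python dict (for example by a YAML library), and the rule set is a small hypothetical sample rather than the platform's actual rules.

```python
# Hypothetical sample rules: workload API versions that Kubernetes has removed,
# and pod-level flags that share a host namespace with the container.
REMOVED_API_VERSIONS = {"extensions/v1beta1", "apps/v1beta1", "apps/v1beta2"}
HOST_NAMESPACE_FLAGS = ("hostNetwork", "hostPID", "hostIPC")


def pod_spec_of(manifest):
    """Return the pod spec of a Pod, or of a workload that embeds a pod template."""
    spec = manifest.get("spec", {})
    template = spec.get("template")
    return template.get("spec", {}) if template else spec


def audit_manifest(manifest):
    """Return human-readable warnings for one parsed manifest (a dict)."""
    warnings = []
    api_version = manifest.get("apiVersion")
    if api_version in REMOVED_API_VERSIONS:
        warnings.append(f"apiVersion {api_version} is deprecated")
    pod_spec = pod_spec_of(manifest)
    for flag in HOST_NAMESPACE_FLAGS:
        if pod_spec.get(flag) is True:
            warnings.append(f"{flag} shares a host namespace with the pod")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        if security.get("privileged") is True:
            warnings.append(f"container {name} runs privileged")
    return warnings
```

Because the warnings are plain strings, the same output can be shown to the security, O&M, and R&D teams without translation between tools.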
A range of open-source image component scanning tools are available on the market. Currently, Dosec.cn's Jingjie Container Security Platform is available in both open-source and commercial editions, and the main differences are the custom rules and the vulnerability library. The open-source vulnerability library is based on open-source CVE vulnerability libraries, while the commercial edition also supports the Chinese vulnerability database CNNVD. Access to CNNVD requires a cooperation agreement, and ordinary open-source vendors may not be able to obtain this database. This is one of the key differences between the open-source and commercial editions.
Some custom features are available only in the commercial edition, such as trusted images, base image identification, and host image scanning. There are always security risks associated with image repositories, so the repositories need to be scanned for vulnerabilities to build security capabilities within the enterprise.
Furthermore, Dosec.cn has worked with Harbor on its vulnerabilities, which gives it some advantages in this area.
Components of the cluster are also at risk. To find the cluster components at risk, it is necessary to inventory the cluster's components and compare their versions with the vulnerable versions listed in the vulnerability database. However, version matching does not work for API interface and permission vulnerabilities; for those, POC tests are required to determine the risk.
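Version matching against a vulnerability database can be sketched as a range comparison. The advisory records below are invented placeholders, not real vulnerability entries, and the record layout (`introduced`, `fixed`) is an assumption made for this example.

```python
def parse_version(text):
    """Turn '1.20.3' or 'v1.20.3' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Hypothetical advisory records: a component is affected from the version that
# introduced the flaw up to, but not including, the version that fixed it.
ADVISORIES = [
    {"id": "EXAMPLE-0001", "component": "kube-apiserver",
     "introduced": "1.18.0", "fixed": "1.20.5"},
    {"id": "EXAMPLE-0002", "component": "containerd",
     "introduced": "1.4.0", "fixed": "1.4.4"},
]


def matching_advisories(component, version, advisories=ADVISORIES):
    """Return the ids of advisories whose affected range contains the version."""
    current = parse_version(version)
    return [
        advisory["id"]
        for advisory in advisories
        if advisory["component"] == component
        and parse_version(advisory["introduced"]) <= current
        < parse_version(advisory["fixed"])
    ]
```

As the text notes, this only covers flaws tied to a version range; misconfigured permissions or exposed API interfaces would not show up here and need an active POC test.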
Scanning each component's configuration in the cluster also reveals permission misconfigurations. In early versions of K8S, authentication was not enabled by default, whereas access now defaults to HTTPS.
Moreover, settings such as whether audit logs are turned on need to be configured according to the cluster's security requirements and scanned against compliance check baselines.
With cloud-native microservices, splitting services leads to exponential growth in scale. Security software therefore needs to discover microservices automatically and identify the types of services so that vulnerability scanning can run automatically. This approach saves a great deal of manual work.
Two methods can be used to detect security problems inside a running container. The first is to learn and standardize all the behaviors of containers. Reads and writes on container files, process start-ups and shutdowns, and access calls are captured and recorded in a behavior model. Behavior that matches the model is then considered normal, while anything outside it is treated as an exception.
Learning takes time, however, and if an attack is encountered or executed during the learning process, the results will be biased. The second method therefore builds attack-model policies into the system, so that behaviors found to violate a policy are excluded. Combined with machine learning, this protects against zero-day attacks while also preventing attacks during the learning process. With blacklisting policies integrated into the system, runtime security detection forms a closed loop. This seems to be the best practice for container runtime security at the moment.
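The interplay between behavior learning and a built-in blacklist can be illustrated with a small sketch. It is a deliberately simple allowlist model, not a machine-learning system; the class name, the event format `(kind, detail)`, and the return values are assumptions for this example.

```python
class BehaviorModel:
    """Learn normal container events, but never learn blacklisted ones."""

    def __init__(self, blacklist):
        self.blacklist = set(blacklist)  # built-in attack policy
        self.learned = set()             # events accepted as normal
        self.learning = True

    def finish_learning(self):
        """End the learning period; later unknown events raise alerts."""
        self.learning = False

    def observe(self, event):
        """Classify one (kind, detail) event as 'learned', 'allowed' or 'alert'."""
        if event in self.blacklist:
            # Blacklisted behavior is rejected even during learning, so an
            # attack that happens while learning cannot bias the model.
            return "alert"
        if self.learning:
            self.learned.add(event)
            return "learned"
        return "allowed" if event in self.learned else "alert"
```

After `finish_learning()`, an event that was never seen (for example a new process start) is reported even though no rule names it, which is how the model can catch behavior from previously unknown attacks.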
Microsegmentation in cloud-native must provide the following features. First, it must enable visualization of access relationships. Cloud-native segmentation inherently meets the zero-trust requirement. K8s does not rely on IP addresses and instead relies solely on Labels. These labels are assigned by the R&D and business teams, who use them to implement microsegmentation dynamically. It is therefore necessary to generate and rehearse container policies automatically based on the learned access relationships.
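Label-based segmentation can be sketched as follows. The policy format (`from` and `to` selectors) is a simplification invented for this example and is not the Kubernetes NetworkPolicy schema; the point is only that decisions depend on labels and never on IP addresses.

```python
def selects(selector, labels):
    """A selector matches when every key/value pair in it appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def is_allowed(policies, source_labels, dest_labels):
    """Default deny (zero trust): traffic passes only if a policy matches both ends."""
    return any(
        selects(policy["from"], source_labels) and selects(policy["to"], dest_labels)
        for policy in policies
    )


# Hypothetical policy: only the frontend may talk to the orders service.
POLICIES = [{"from": {"app": "frontend"}, "to": {"app": "orders"}}]
```

When a container is rebuilt and receives a new IP address, its labels stay the same, so a policy such as `POLICIES` remains valid without any firewall reconfiguration.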
When policy learning is complete and confirmed, the policy enters rehearsal mode, for which a rehearsal period can be set. During this period, normal traffic is not blocked. If traffic is found to be affected by the policy, a warning is raised. The company's R&D or business team can then make a judgment manually, and if the business traffic is safe, the machine behavior learning model is edited to exclude it.
If no more exceptions are found after a certain period, the trained policy will not affect regular traffic patterns and can effectively defend against attacks. By clicking policy execution, the automatically generated policy can then be applied to the production environment without disrupting it.
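The rehearsal workflow amounts to running a policy in a warn-only mode before enforcing it. The sketch below assumes a flow is a `(source, destination)` pair of service names; the class and method names are hypothetical.

```python
class SegmentationPolicy:
    """A learned policy that can be rehearsed (warn only) before enforcement."""

    def __init__(self, allowed_flows, rehearsal=True):
        self.allowed_flows = set(allowed_flows)
        self.rehearsal = rehearsal
        self.warnings = []  # flows the policy would have blocked

    def handle(self, flow):
        """Return True if the flow is forwarded, False if it is blocked."""
        if flow in self.allowed_flows:
            return True
        self.warnings.append(flow)
        # In rehearsal mode traffic is never blocked, only reported.
        return self.rehearsal

    def approve(self, flow):
        """A reviewer confirms the flow is legitimate business traffic."""
        self.allowed_flows.add(flow)
        self.warnings = [w for w in self.warnings if w != flow]

    def enforce(self):
        """Switch from rehearsal to enforcement once warnings have stopped."""
        self.rehearsal = False
```

The design choice here is that `handle` records every would-be block, so the team can review `warnings` and only call `enforce()` once the list stays empty for the chosen rehearsal period.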
Lastly, in cloud-native environments, the security platform's own software must follow a three-layer architecture. First, there is the management layer, which must be decoupled from the task center so that all clusters converge on a single point of management.
If the image repository contains too much data, scanning can be integrated directly with the image repository. Instead of relying on network bandwidth to pull each image, the scanner reads the storage path and scans images in place. In this manner, network utilization and disk IO usage can be significantly reduced. Currently, this is the most influential architecture design for container security.
4. Best practices in cloud-native security
There are three main components of DevSecOps design in cloud-native environments. First, there is the build phase. Dosec.cn provides a golden image repository in which all the images are hardened. R&D personnel can pull images directly from the golden repository and build business images on them.
Having cooperated with CNNVD, Dosec.cn updates its vulnerability library directly after each synchronization. Additionally, Dosec.cn maintains its golden image repository in real time according to the daily vulnerability updates. Moreover, Dosec.cn has its own scanner and security researchers investigating the latest vulnerabilities and zero-day attacks.
The recommendation for enterprises is to maintain two image repositories and to set trust judgments for the production image repository in the cluster. This prevents attackers from getting into the clusters and pulling down business containers directly.
During business development, image scanning checks the configuration of the application layer, and if a vulnerability is discovered, it blocks synchronization. A trust judgment can be set up in the production environment that incorporates all conditions, such as whether the image comes from the enterprise's own image repository.
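A trust judgment of this kind can be reduced to two conditions: the image must come from a trusted repository, and it must have passed scanning. The sketch below is illustrative only; the registry host name is a made-up placeholder, and the host extraction follows the common convention that the first path component is a registry only if it contains a dot or colon or is `localhost`.

```python
def registry_of(image_ref):
    """Extract the registry host from an image reference, defaulting to docker.io."""
    first, _, rest = image_ref.partition("/")
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_trusted(image_ref, trusted_registries, scan_passed):
    """An image may run only if it comes from a trusted registry and passed scanning."""
    return registry_of(image_ref) in trusted_registries and scan_passed


# Hypothetical production registry used for illustration.
TRUSTED = {"registry.prod.example.internal"}
```

With two repositories, only images that pass scanning are synchronized into the production registry, and `is_trusted` rejects anything pulled from elsewhere, including public images.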
Using the platform, it is also possible to assess the risks associated with vulnerabilities in cluster components and microservices. Among other things, scanning and analyzing vulnerabilities in images makes it possible to filter images so that each one can be traced to its creator and to the components affected, drawing on software component analysis, source code scanning, development security scanning, and application vulnerability scanning.
When a container security platform detects an attack, it provides overall security protection before, during, and after the event. Beforehand, a full evaluation and hardening of clusters is conducted, and all behavior learning starts after the hardening. When an event occurs, the platform checks for and applies zero-day defenses and sends out real-time notifications.
When an attack is detected, the container running the image should be terminated first. The image is then blocked from being uploaded during R&D, downloaded to storage, or run in production. For images that already have running containers, segmentation policies can be executed automatically or manually, and rules can be set up for both automatic and manual execution.
As the network domains of different clusters vary, and the K8S network plug-in operates as an overlay network plug-in by default, the network domain can naturally serve as the security domain between clusters.
Microsegmentation in cloud-native must support both Label-based blocking, which fits the zero-trust model, and IP-based blocking and configuration.
The design of cloud-native security platforms is based on this principle. Meanwhile, we should not only deploy a dedicated cloud-native security firewall but also take full advantage of traditional firewalls.
The prevention of zero-day attacks can be modeled on the following five factors:
· Learn in-container behaviors to build a security model;
· Analyze the resulting list of risk events, such as file accesses, abnormal network connections, and system calls detected outside the model;
· Have team members respond to and take responsibility for blocking abnormal behavior or correcting errors as soon as possible;
· Develop models in the test environment and apply them directly to the production environment without re-learning;
· Aim for zero vulnerabilities, supporting 0-day mitigation.
During a learning cycle, process starts and stops and the files read and written by each process all need to be learned. Suppose that, after the learning cycle, a brute-force attack is launched on a database, causing a large number of network and validation errors in a short period; this behavior can be directly judged as not matching the learned specification.
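The brute-force example can be expressed as a sliding-window rate check. In this sketch the threshold and window are assumed to come from the learning cycle; the class name and the use of plain numeric timestamps (seconds) are choices made for illustration.

```python
from collections import deque


class FailureRateMonitor:
    """Flag bursts of validation failures that exceed the learned baseline."""

    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures      # most failures seen while learning
        self.window_seconds = window_seconds  # length of the sliding window
        self.timestamps = deque()

    def record_failure(self, timestamp):
        """Record one failure; return True if the learned rate is exceeded."""
        self.timestamps.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while timestamp - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_failures
```

Occasional login failures spread over time stay under the threshold, while a burst of failures within one window returns `True` and can be reported as behavior outside the learned specification.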
The first four factors above concern learning the behavior of running containers, while the last one predicts the state of containers before they run. In addition, the learning records of historical containers are kept in order to prevent zero-day attacks in the future.
Guest Introduction
Mr. Bai Liming is a technical partner with Dosec.cn and was previously responsible for the cloud-native platform for OurGame.com. He has over seven years of experience in DevSecOps R&D and is one of the key developers of the first cloud-native security product in China. He was also a key contributor to the establishment of "Classified Protection of Cybersecurity 2.0" issued by the Ministry of Public Security and to the white paper on Cloud Native Architecture Security from the China Academy of Information and Communications Technology (CAICT).