Six Elements of Developing an Automated Ops Architecture

Operations automation is something we all aspire to, but while we emphasize its power, we tend to overlook a key factor that determines how well it can be implemented: the business architecture, which we love to hate and spend so much time with.

Element 1: Architecture independence

Every architecture is created to meet specific business requirements. If, while meeting those requirements, it also takes into account the non-functional requirements that operations and maintenance (O&M) places on managing the architecture, it is reasonable to call such an architecture O&M friendly.

From an O&M perspective, the required architectural independence has four aspects: independent deployment, independent testing, component specification, and technical decoupling.

Independent deployment

This means that a piece of source code can be deployed, upgraded, and scaled according to O&M management requirements, and that differences between geographic regions can be handled through configuration. Services call each other through interface requests. Deployment independence is also a prerequisite for O&M independence.

Independent testing

O&M can verify the availability of a business architecture or service with convenient test cases or tools. A business architecture or service with this capability lets O&M go live on its own, without involving developers or testers in every release or change.

Component specification

Component specification means that a company provides good framework support for the technologies it uses, so that different development teams do not adopt different technology stacks or components and leave the company with an uncontrolled technology architecture.

This approach limits the number of objects that can be added to the production environment, and allows operations to stay in control of the production environment. It also allows Ops to stay focused on building more efficiency and quality around standard components.

Technical decoupling

Technical decoupling means reducing the dependencies between services, and also reducing the code's dependency on configuration files. It is the basis for implementing microservices and for enabling independent deployment, independent testing, and component specification.

Element 2: Deployment Friendly

DevOps devotes a lot of space to the technical practice of continuous delivery, which aims to connect development, testing, and O&M end to end in order to deploy quickly and deliver value. Deployment is a very important part of daily O&M work: it is planned work, it is highly repetitive, and its efficiency must be improved.

Efficient and reliable deployment requires good global planning, so that O&M keeps full control during both deployment and the operational phase. Six dimensions make an architecture deployment friendly:

CMDB Configuration

Before each deployment operation, O&M needs to have a clear grasp of how the application relates to the architecture and to the business, in order to have a better global understanding and assessment of the workload and potential risks.

In WeavingCloud's automated O&M platform, we manage configuration information such as business relationships, cluster management, operational status, importance levels, and architectural layers as O&M management objects in the CMDB (configuration management database). The benefit is obvious: storing the configuration of O&M objects centrally provides configuration data and decision support for the automation of operations, monitoring, and alerting later on.

Environmental Configuration

In enterprises with a low degree of O&M standardization, environment configuration is one of the original sins that impede efficient deployment and delivery, and it is one of the main O&M pain points that containerization hopes to address.

Tencent's O&M practice is to standardize the management of the three main environments (development, testing, and production) by enumerating and managing the resources and O&M operations associated with each environment, and combining them with automated initialization tools to put standard environment management into practice.

Dependency management

Dependency management addresses the application's dependencies on libraries and on the runtime environment. In our experience, package management resolves the challenge of deploying applications to different environments: the dependent library files or environment configuration are packaged together as a whole, along with scripts that run before and after installation. The lighter-weight containerized delivery methods used in the industry are also good options.

Deployment methods

One principle of continuous delivery is to create a reliable, repeatable delivery pipeline, and we plan application deployment strictly around this goal. There are many industry examples to draw on, such as Docker's Build, Ship, and Run, and Weaving Cloud's one-click deployment through configuration descriptions and standardized processes.

Release self-testing

Release self-testing consists of two parts:

Lightweight testing of applications;

Proofreading of release/change content.

These two capabilities serve different O&M scenarios. For example, during an incremental release, proofreading the release content lets O&M staff quickly obtain the MD5 of each changed file, or check the configuration of the relevant processes and ports, to confirm that each release or change is reliable.
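The MD5 proofreading described above can be sketched as follows. This is a minimal illustration, not the tooling the article refers to: the function names (`file_md5`, `build_manifest`, `diff_manifests`) and the idea of shipping a manifest with each release are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its MD5 digest."""
    return {
        p.relative_to(root).as_posix(): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(expected: dict[str, str],
                   deployed: dict[str, str]) -> dict[str, list[str]]:
    """Compare the manifest of a release (hypothetical) with what is on the host."""
    common = set(expected) & set(deployed)
    return {
        "missing": sorted(set(expected) - set(deployed)),
        "unexpected": sorted(set(deployed) - set(expected)),
        "changed": sorted(n for n in common if expected[n] != deployed[n]),
    }
```

An empty result in all three lists would indicate that the files on the host match the release content exactly.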

Similarly, lightweight testing meets the need for service availability checks at release time. This step can test service connectivity and also run some core test cases.
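A connectivity check of this kind can be as small as a TCP probe. The sketch below assumes that "service connectivity" means a port accepting connections; the names `port_is_open` and `smoke_test` and the endpoint table are hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def smoke_test(endpoints: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Probe every named endpoint (hypothetical table) and report the result."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in endpoints.items()
    }
```

A release step could call `smoke_test` right after the processes start and stop the rollout if any value is `False`. Running core test cases would need application-level requests on top of this.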

Grayscale release

There is a saying in the 36 Tips for Daily Operations and Maintenance: "For irreversible deletions or modifications, try to delay or slow down the execution as much as possible." This is the idea behind grayscale release. Whether the rollout is staged by user, by time, or by server, the aim is to minimize the risk of going live. A business architecture that supports grayscale release makes application deployment less risky and friendlier to O&M.
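Grayscale release staged by server can be illustrated with a small batching helper. This is a sketch under assumptions: the cumulative fractions and the function name `grayscale_batches` are invented for the example, and a real rollout would also check health between batches.

```python
import math


def grayscale_batches(servers: list[str],
                      fractions: list[float]) -> list[list[str]]:
    """Split servers into rollout batches.

    fractions are cumulative shares of the fleet, e.g. [0.1, 0.5, 1.0].
    Each batch holds only the servers not covered by earlier batches.
    """
    if not fractions or fractions[-1] != 1.0:
        raise ValueError("the last fraction must be 1.0 to cover every server")
    if fractions[0] <= 0 or any(b <= a for a, b in zip(fractions, fractions[1:])):
        raise ValueError("fractions must be positive and strictly increasing")
    batches, done = [], 0
    for fraction in fractions:
        # Round up so even a tiny first fraction touches at least one server.
        upto = min(len(servers), math.ceil(len(servers) * fraction))
        if upto > done:
            batches.append(servers[done:upto])
            done = upto
    return batches
```

With ten servers and fractions `[0.1, 0.5, 1.0]`, the first batch holds one server, the second four, and the last five, so a bad release is seen on one machine before it reaches the rest.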

Element 3: Maintainability

In the eyes of O&M, the ideal microservice architecture is first and foremost one that is highly maintainable. An application or architecture that is not maintainable is not just a black mark against the operations team; it also damages their careers, because maintaining an unmaintainable architecture is a waste of an operations engineer's life.

In terms of operational and management specifications, maintainability can be broken down into the following seven points:

Configuration management

In microservices architecture management, we propose managing application binaries separately from their configuration, so that each can be deployed independently.

The application configuration that is separated out is managed in three ways:

File mode;

Configuration item mode;

Distributed configuration center mode.

Space does not permit a discussion of the advantages and disadvantages of these three approaches. Each enterprise can choose the approach that suits it best. The key is to require the business to use one consistent scheme, so that O&M can build targeted tools and systems to manage configuration well.

Version management

One of the eight principles of DevOps Continuous Delivery is to "bring everything under version control". For an O&M object, this means that to manage it you must first be able to describe it clearly.

As with source code management, O&M needs to describe and version the objects it operates on every day, such as packages, configurations, and scripts, so that an automated operation can accurately select the object and version it acts on.

Standardized operations

O&M has many repetitive tasks to perform every day, and from a lean perspective they contain a lot of waste: learning costs, operations that add no value, duplicated scripts and tools, and the risk of human error during execution.

If an enterprise can form a unified operation specification, so that operations such as file transfer, remote execution, and application start and stop are standardized, centralized, and available as one-click actions, the efficiency and quality of O&M will improve greatly.

Process management

Process management covers application installation paths, directory structure, standardized process names, standardized port numbers, start/stop methods, monitoring programs, and so on. Good global planning of process management can greatly raise the degree of O&M automation and reduce unplanned tasks.

Space management

Managing the use of disk space is an effective means to ensure the orderly storage of business data and to reduce the occurrence of unplanned tasks.

This requires planning in advance (backup strategy, storage solution, capacity warnings, cleanup strategy, and so on), supplemented by proven tools, so that these tasks no longer plague O&M.
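As a rough illustration of capacity warning and cleanup planning, the sketch below classifies disk usage against thresholds and lists old files as cleanup candidates. The thresholds (80% and 90%), the keep-newest policy, and all function names are assumptions for the example; the code only reports and never deletes.

```python
import shutil
from pathlib import Path


def usage_ratio(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def capacity_level(ratio: float, warn_at: float = 0.80,
                   critical_at: float = 0.90) -> str:
    """Classify a usage ratio against the (assumed) warning thresholds."""
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


def cleanup_candidates(log_dir: Path, keep_newest: int) -> list[Path]:
    """List files beyond the newest `keep_newest`, oldest first; deletes nothing."""
    files = sorted(
        (p for p in log_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    return files[:-keep_newest] if keep_newest > 0 else files
```

Keeping the deletion step separate from the selection step makes it easy to review what a cleanup strategy would remove before it runs.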

Log management

Implementing a logging specification requires close cooperation with R&D. Based on experience gained in practice, the ideal logging specification from an O&M perspective includes these requirements:

Separation of business data and logs

Logs decoupled from the business logic

Log format unity

Return code and clear comments

Accessible business metrics (request volume/success rate/latency)

Definition of key events

Output level

Management scheme (retention period, compression and backup, etc.)

When the logging requirements above are put into practice, development, operations, and the business all gain better monitoring and analysis capabilities.
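One way to picture a unified log format with return codes and accessible metrics is a single JSON line per request. The field names and the convention that return code 0 means success are assumptions for this sketch, not a specification from the article.

```python
import json
import time
from typing import Optional


def format_request_log(service: str, interface: str, return_code: int,
                       latency_ms: float, level: str = "INFO",
                       timestamp: Optional[float] = None) -> str:
    """Render one request as a single JSON line with a fixed set of fields.

    Field names are illustrative assumptions, not an established standard.
    """
    record = {
        # gmtime(None) falls back to the current time.
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp)),
        "level": level,
        "service": service,
        "interface": interface,
        "return_code": return_code,  # assumed convention: 0 means success
        "latency_ms": round(latency_ms, 3),
    }
    return json.dumps(record, sort_keys=True)


def parse_request_log(line: str) -> dict:
    """Parse a line written by format_request_log back into a dict."""
    return json.loads(line)
```

Because every line carries the return code and latency in the same place, request volume, success rate, and latency can be derived from the logs without touching business logic.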

Centralized control

O&M work is easily fragmented into separate parts: release changes, monitoring and analysis, troubleshooting, project support, multi-cloud management, and so on. We are looking for a one-stop O&M management platform that connects all work information and passes on experience, eliminates the operational risks caused by information silos or manual hand-offs of information, and improves the overall efficiency and quality of O&M control.

Element 4: fault tolerance and disaster recovery

The four major responsibilities of Tencent's technical operations (O&M) are quality, efficiency, cost, and security. Quality is the first duty of the position. Translated into an architectural perspective, the ideal highly available architecture design in the eyes of O&M should include the following points:

Load balancing

Whether load balancing is handled by software or by hardware, from an O&M perspective we always want the business architecture to be stateless, routing and addressing to be intelligent, and cluster fault tolerance to be automatic.

In Tencent's routing software practice over the years, software load balancing solutions have been widely used to achieve high availability for the business architecture.

Schedulability

In the mobile Internet era, schedulability is an extremely important O&M tool for disaster tolerance. Moving users or services away from an abnormal region when the business hits a failure that cannot be resolved immediately is a tried-and-true technique in large-scale operations, and one of the core O&M capabilities with which Tencent QQ and WeChat safeguard the quality of their platform services.

Combining domain names, VIPs, access gateways, and other technologies lets the architecture support scheduling, enriches the O&M toolbox, and makes it possible to handle a variety of failure scenarios more comfortably.

Multi-site active-active

Multi-site active-active deployment addresses the need for highly available data and is a prerequisite for schedulability. Depending on the business scenario, there are many possible technical implementations.

For Tencent's practice in social networking, see Mr. Zhou Xiaojun's article, "Architecture Design and Efficient Operation Behind the Scheduling of 200 Million QQ Users".

Master-slave switching

Master-slave switching is the most common disaster recovery and fault-tolerance solution in database high-availability schemes. Implementing read-write separation in the business logic, and combining it with intelligent routing to automate unattended master-slave switching, is undoubtedly the best gift that architectural design can give to DBAs.
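Read-write separation combined with routing can be sketched in a few lines. The class below is a toy model under stated assumptions: node names are plain strings, reads rotate round-robin over replicas, and failover simply promotes the first replica. A real system would also have to pick the most up-to-date replica and deal with replication lag.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads over replicas (toy model)."""

    def __init__(self, master: str, replicas: list[str]):
        self.master = master
        self.replicas = list(replicas)
        self._counter = itertools.count()

    def route(self, is_write: bool) -> str:
        """Return the node that should serve the request."""
        if is_write or not self.replicas:
            return self.master
        # Round-robin over the current replica list.
        return self.replicas[next(self._counter) % len(self.replicas)]

    def fail_over(self) -> str:
        """Promote the first replica when the master is reported down."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.master = self.replicas.pop(0)
        return self.master
```

Because callers only ever ask the router where to send a request, a switch of master is invisible to the business logic, which is what makes unattended switching possible.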

Flexible Availability

"Carry the load first, then optimize" is one of Tencent's ideas for large-scale operations, and it also points the way for the high-availability design of business architecture.

How can business availability be maximized when business volume surges? This is an unavoidable question in architecture planning and design. Setting up flexible switches, or building logic into the architecture to automatically reject excess requests, keeps back-end services from avalanching at critical moments and preserves the high availability of the business architecture.
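The two mechanisms named above, a flexible switch and automatic rejection of excess requests, can be sketched together. The class name `FlexibleGate`, the in-flight limit, and the split between core and optional requests are assumptions for the example.

```python
import threading


class FlexibleGate:
    """Admit requests up to a concurrency limit and shed the rest."""

    def __init__(self, max_in_flight: int):
        self._lock = threading.Lock()
        self._max = max_in_flight
        self._in_flight = 0
        # The "flexible switch": when on, optional features are turned away.
        self.degraded = False

    def try_acquire(self, optional: bool = False) -> bool:
        """Return True if the request may proceed, False if it is rejected."""
        with self._lock:
            if optional and self.degraded:
                return False
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A request handler would call `try_acquire` on entry, return a fast failure when it gets `False`, and call `release` in a `finally` block otherwise. Rejecting early keeps the back end within the load it can actually serve.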

Element 5: Quality Monitoring

Assuring and improving business quality is a goal that O&M strives for, and monitoring is an important technical means of achieving it. O&M hopes that the architecture makes quality monitoring convenient and supplies the data it needs, which requires the following:

Metrics

Every architecture must be measurable, and ideally it should have exactly one defining metric. As the business becomes more sophisticated and multi-dimensional, the number of metrics grows exponentially, so we hope each architecture can be measured by a single, unique metric.

Basic monitoring

Basic monitoring refers to low-level measurement of networks, leased lines, hosts, systems, and so on. Most of these monitoring points are non-intrusive, and their data is easy to collect.

In enterprises with robust automated O&M capabilities, the vast majority of alarms generated by basic monitoring are converged. This monitoring data also gives higher-level business monitoring its data support and basis for decisions, or is packaged into business monitoring data closer to upper-level application scenarios, such as capacity and multi-dimensional indicators.

Component monitoring

Tencent customarily refers to development frameworks, routing services, middleware, and the like collectively as components. This type of monitoring sits between basic monitoring and business monitoring. O&M usually wants the monitoring logic embedded in the component, so that monitoring coverage grows as the component is adopted; the cost of obtaining the data is moderate. For example, by monitoring the routing component, O&M can obtain status and quality indicators such as request volume and latency for each routing service.

Business monitoring

Business monitoring can be implemented actively or passively, and either intrusively or through a bypass. This type of monitoring requires cooperation from development, because it touches coding and architecture.

Business monitoring metrics can usually be summarized as request volume, success rate, and latency. There are many ways to implement it: log monitoring, streaming data monitoring, dial testing, and so on. Business monitoring is high-level monitoring and often reflects business problems directly, but analyzing the root cause of a problem requires combining it with the necessary O&M monitoring and management specifications, such as return code definitions and logging specifications. This requires the business architecture to be designed with the demands of O&M monitoring and management in mind, and the scope to be well defined in the global plan.
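The three summary metrics can be computed from per-request records in a few lines. In this sketch the record shape (`Request`), the rule that return code 0 means success, and the choice of the 95th percentile (nearest-rank method) as the latency figure are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    return_code: int      # assumed convention: 0 means success
    latency_ms: float


def summarize(requests: list[Request]) -> dict[str, float]:
    """Return request volume, success rate, and p95 latency for a batch."""
    volume = len(requests)
    if volume == 0:
        return {"volume": 0, "success_rate": 0.0, "p95_latency_ms": 0.0}
    succeeded = sum(1 for r in requests if r.return_code == 0)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile, computed with integers to avoid float rounding.
    rank = (95 * volume + 99) // 100
    return {
        "volume": volume,
        "success_rate": succeeded / volume,
        "p95_latency_ms": latencies[rank - 1],
    }
```

The same function works whether the records come from parsed logs, a stream, or dial tests, which is one reason a shared return code convention matters.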

All-link monitoring

Basic, component, and business monitoring focus mostly on individual points. In the business scenarios of a distributed architecture, good monitoring must also cover the links that a service request travels.

Based on a unique transaction ID or on the RPC invocation relationship, the call chain is reconstructed by technical means, and monitoring alarms are then triggered by models or events to report on the status and quality of the service link. This is a high-level application of monitoring, and it also requires good up-front planning and code instrumentation during business architecture planning.
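Reconstructing a call chain from a shared transaction ID can be sketched as follows. The `Span` record and its fields are assumptions modeled on common tracing practice, and the code assumes span IDs are unique within a trace and form a tree.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # the unique transaction ID shared by one request
    span_id: str
    parent_id: Optional[str]  # None marks the entry point of the request
    service: str


def call_chains(spans: list[Span], trace_id: str) -> list[list[str]]:
    """Return every root-to-leaf service path recorded for one trace."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span.trace_id != trace_id:
            continue
        if span.parent_id is None:
            roots.append(span)
        else:
            children[span.parent_id].append(span)

    paths: list[list[str]] = []

    def walk(span: Span, prefix: list[str]) -> None:
        path = prefix + [span.service]
        kids = children.get(span.span_id, [])
        if not kids:
            paths.append(path)
        for kid in kids:
            walk(kid, path)

    for root in roots:
        walk(root, [])
    return paths
```

Once the paths exist, an alarm model can compare them against the expected topology or attach latency and error data to each hop.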

Quality assessment

Any advance in monitoring capability or quality optimization needs a closed management loop, and assessment is a good way to provide one. From monitoring coverage, metric completeness, and event management mechanisms through to reported assessments and scoring, O&M and development can work together to build a continuous-feedback quality management loop, so that the business architecture keeps evolving and improving.

Element 6: Performance Costs

At Tencent, all technical operations staff carry an important responsibility: keeping business operating costs reasonable. To this end, we must have appropriate management practices for application throughput performance, business capacity planning, and operating costs.

Throughput Performance

In the DevOps Continuous Delivery methodology, one of the most important non-functional requirement tests in the testing phase is stress testing the throughput performance of the architecture, to make sure the application's business capacity is healthy after launch.

Tencent's practice is not limited to performance stress testing in the testing phase. Using the capabilities of the routing component, we also stress test business modules and business SETs with real requests to establish a benchmark for business capacity modeling. This also provides indirect evidence of whether the throughput performance of the business architecture meets the requirements of cost assessment, and comparing performance data across businesses drives continuous improvement in architecture performance.

Capacity planning

The word "capacity" can refer to application performance, service capacity, or total business requests. Capacity planning in O&M means planning service capacity sensibly on the basis of total business requests, provided that application performance meets the standard.
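The relationship between the three meanings can be shown with simple arithmetic: total requests divided by the usable throughput of one instance gives the service capacity to provision. The target utilization of 70%, the redundancy of one spare instance, and the function name are assumptions for this sketch.

```python
import math


def instances_needed(peak_requests_per_sec: float,
                     per_instance_qps: float,
                     target_utilization: float = 0.7,
                     redundancy: int = 1) -> int:
    """Estimate how many instances a service needs at peak.

    per_instance_qps is the throughput one instance sustained in a stress
    test while still meeting the performance standard. target_utilization
    and redundancy are assumed planning parameters.
    """
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps = per_instance_qps * target_utilization
    return math.ceil(peak_requests_per_sec / usable_qps) + redundancy
```

For example, a peak of 10,000 requests per second on instances that sustain 500 requests per second, planned at 50% utilization with one spare, comes to 41 instances.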

Operating Costs

Reducing operating costs reduces the company's cash outflow, and its value to the organization is no less than that of improvements in quality and efficiency.

Tencent's rich-media business, which is dominated by social, UGC, cloud computing, gaming, video, etc., consumes a huge amount of bandwidth, equipment, and other operating costs every year. When O&M wants to optimize operating costs, this often involves optimizing product features and the business architecture. Therefore, the ideal business architecture design in the eyes of O&M needs to be sufficiently cost-conscious.

Summary

This article is purely a personal O&M perspective, with some humble opinions on microservice architecture design. To maximize the value of O&M and to improve business quality, efficiency, and cost across the board, the business architecture is a hard bone that has to be gnawed.

O&M people need a sense of architecture and should be able to put forward proposals or requirements for the business architecture from different perspectives. This is also the spirit that DevOps advocates: development and O&M join hands to keep optimizing toward the best business architecture.