Network Automation in the Age of the Internet

The Internet has two main elements: "content" and "eyeballs". Content refers to the network services provided by Internet companies (Internet content providers, or ICPs), such as web pages, games, and instant messaging; "eyeballs" refers to the huge number of Internet users. An Internet company's content is often distributed across multiple IDCs, large and small. As more and more eyeballs turn to the content that ICPs provide, the infrastructure that Internet companies use to store content has grown explosively. To guarantee the access experience, Internet companies need to deploy business servers in bulk across different carriers and provinces/cities, build IDC internal networks, MANs, and WANs for communication between business modules, and cover service blind spots through self-built CDNs or professional CDN service providers. As the business grows, the operation and maintenance department therefore becomes more and more important, and after years of accumulation these departments have gradually formed efficient operation and maintenance systems. Drawing on the experience of domestic Internet companies, this article discusses the new generation of automated operation and maintenance systems for IT infrastructure.

I. The three stages of operation and maintenance

● The first stage: everyone's operation and maintenance

In the early days, a company's IT infrastructure has not yet reached a significant scale (usually a few to a few dozen machines). There are not necessarily dedicated operation and maintenance personnel or a dedicated department, and operation and maintenance work is shared among various positions. Developers have access to the servers and maintain and manage the online code and business themselves.

● The second stage: vertical automation

As the business grows, the IT infrastructure develops to another level (usually hundreds to thousands of machines). Dedicated operation and maintenance personnel appear, doing day-to-day installation and maintenance work and playing the role of "firefighters" who handle alarms. Operation and maintenance specifications exist, but operation and maintenance still mainly provides back-end support for R&D.

At this stage, work begins to move gradually toward defined processes. Operation and maintenance departments begin to produce lists of common problems and automation scripts applicable to their own business scope, and begin to assemble open source software to complete most of the work.

Specifically, each product line writes and uses its own scripts, for example using SVN with Puppet or Chef to bring servers online, manage configuration, and so on.

● The third stage: everything is automated

In the tide of Internet adoption, more and more dark horse teams have emerged, and all of them have seen user visits double within a short period of time. During such a traffic explosion, whether the ICP's infrastructure can keep up directly determines whether the business content can serve the concurrent access of massive numbers of users.

At the same time, the operation and maintenance system needs to be sufficiently complete, efficient, and process-oriented. Companies at the scale of Google, Tencent, Baidu, and Ali generally have a unified operation and maintenance team and one or more automated operation and maintenance systems that can serve as references, and their operation and maintenance departments stand on an equal footing with their development departments. These companies have also begun to pay more attention to optimizing IT infrastructure at the architectural level and to automated management and switchover in hyperscale clusters (as shown in Figure 1).

Figure 1. Overview of the IT infrastructure of large Internet companies

II. Analysis of the BAT (Baidu, Ali, Tencent) operation and maintenance systems

The domestic Internet companies Baidu, Ali, and Tencent (hereinafter referred to as BAT) provide different main business content and have different IT architectures and operation and maintenance systems, so each has had different concerns in the course of its development.

1. Tencent O&M: ITIL-based O&M service management

Tencent is expected to have 600,000 servers across the country by 2015. Following the success of its automated deployment practice in 2012, work on automated acceptance is currently underway. For network equipment, the whole chain starting from the demand side will subsequently be fully automated: automatic generation of the equipment list -> automatic issuing of the procurement list -> automatic generation of port connection and topology relationships -> automatic delivery of configuration -> automatic acceptance. The overall operation and maintenance process has also evolved from traditional IT management to an ITIL-based service management process (as shown in Figure 2).

Figure 2. Tencent's ITIL-based operation and maintenance service management

2. Ali operation and maintenance system: CMDB-based infrastructure management + logical hierarchical modeling

A CMDB (Configuration Management Database) stores all components of the IT infrastructure as configuration items. It maintains detailed data on each configuration item and the relationships between configuration items, as well as management data such as events and change history. By consolidating this data into a central repository, the CMDB helps organizations understand and manage the cause-and-effect relationships between data types. The CMDB is also closely linked to all service support and service delivery processes: it supports their operation and puts configuration information to use, and it relies on those processes to keep its data accurate. It enables process integration and automation within and between IT service support, IT operation and maintenance, and IT asset management. In actual projects, the CMDB is often regarded as the foundation for other ITIL processes and is built first, and the success or failure of an ITIL project depends heavily on whether the CMDB is successfully established.
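To make the idea of configuration items, relationships, and change history concrete, here is a minimal in-memory sketch in Python. It is not how any of the companies discussed implement their CMDB; the class names, the relation names ("connects_to", "runs_on"), and the sample devices are all assumptions invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConfigItem:
    """One configuration item (CI) with its attributes and change history."""
    ci_id: str
    ci_type: str
    attributes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)


class CMDB:
    """Toy in-memory store for CIs and the relationships between them."""

    def __init__(self):
        self.items = {}
        self.relations = set()  # (source_id, relation, target_id)

    def add_item(self, ci_id, ci_type, **attributes):
        self.items[ci_id] = ConfigItem(ci_id, ci_type, dict(attributes))
        return self.items[ci_id]

    def relate(self, source_id, relation, target_id):
        # Both ends must already be registered CIs.
        if source_id not in self.items or target_id not in self.items:
            raise KeyError("both configuration items must exist")
        self.relations.add((source_id, relation, target_id))

    def update(self, ci_id, key, value):
        # Record the old and new value so the change history is kept.
        item = self.items[ci_id]
        old = item.attributes.get(key)
        item.attributes[key] = value
        item.history.append((datetime.now(timezone.utc), key, old, value))

    def related_to(self, target_id):
        """Return the sorted IDs of CIs that point at target_id."""
        return sorted(s for s, _, t in self.relations if t == target_id)


# Hypothetical inventory, invented for illustration.
cmdb = CMDB()
cmdb.add_item("sw-01", "switch", ports=48)
cmdb.add_item("srv-01", "server", os="linux")
cmdb.add_item("vm-01", "virtual_machine", vcpus=4)
cmdb.relate("srv-01", "connects_to", "sw-01")
cmdb.relate("vm-01", "runs_on", "srv-01")
cmdb.update("vm-01", "vcpus", 8)
print(cmdb.related_to("srv-01"))
```

Keeping relationships as plain triples makes it easy to ask which items reference a given item, which is the kind of question other processes put to a CMDB. A production system would add persistence, access control, and auto-discovery.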

3. Baidu's automated O&M: deployment + monitoring + business systems + correlation

Baidu's main O&M challenges include sudden changes in traffic, correlated impacts in a complex environment, a rapidly iterating development model, and the balance between O&M efficiency, O&M quality, and cost. Baidu's operation and maintenance team believes that once the number of servers reaches the tens of thousands, the operation and maintenance perspective needs to shift to the service as the unit of granularity. 10,000 machines are not the same as "100 machines * 100"; the running state of a machine no longer represents the working state of the business; the O&M department provides front-end services for R&D; and the relationships between services become increasingly complicated as the cluster expands.

Figure 3. Baidu's automated O&M technology framework

Baidu's automated O&M technology framework is divided into four parts: deployment, monitoring, business systems, and correlation. The framework as a whole emphasizes the integration of business and IT infrastructure and focuses on linkage through "correlation". Correlation here mainly refers to timing dependencies between tasks, data dependencies between tasks, and reference dependencies between tasks and resources. These correspond to the service processes of task scheduling, data transmission, and resource location, and together they form multiple service chains.

Operating and maintaining these correlations is closely tied to the business. It requires a system that can map out the whole picture of the relationships, so that an operation can be located within a complex service chain and, when a failure occurs, the scope of impact can be predicted, the fault located in time, and the corresponding departments notified. In such a system, automated monitoring is very important. Baidu's technical monitoring framework passes the results of data collection, service detection, third-party information collection, and monitoring evaluation to the data processing and alarm linkage modules, and is extended functionally through API interfaces (as shown in Figure 4).

Figure 4. Baidu automated technical monitoring framework
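Predicting the impact scope of a failure from the dependency relationships between services can be illustrated with a small graph traversal. The Python sketch below is only one possible approach, not Baidu's implementation; the service names and the dependency map are hypothetical.

```python
from collections import deque


def impact_scope(dependencies, failed):
    """Return every service that directly or indirectly depends on `failed`.

    `dependencies` maps a service name to the services it depends on.
    """
    # Invert the edges: for each service, record who depends on it.
    dependents = {}
    for service, upstreams in dependencies.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical service chain, invented for illustration.
dependencies = {
    "web-frontend": ["search-api"],
    "search-api": ["index-store", "task-scheduler"],
    "report-job": ["task-scheduler"],
    "index-store": [],
    "task-scheduler": [],
}

print(sorted(impact_scope(dependencies, "task-scheduler")))
```

In this example a failure of the task scheduler reaches the report job and the search API directly, and the web front end indirectly, which is the list of owners an alarm linkage module would need to notify.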

In fact, both Internet companies such as BAT and companies in other industries follow the best practices of the IT Infrastructure Library (ITIL) or ISO20000 service management when building their IT, and use automated IT management solutions to achieve important business goals such as reducing service interruptions, lowering operating costs, and improving IT efficiency. With the release and promotion of ISO20000 and ITIL v3.0, both have become de facto standards of a sort, and demand for them in enterprise IT management is pressing. In particular, ISO20000 certification has become an increasingly common requirement for enterprises. ITIL v3.0 covers the full service life cycle of IT operation and maintenance management: strategy, design, transition, operation, and improvement. Related solutions often span several areas and several products, which makes planning, implementation, and tool selection more tangled. Choosing open source tools means a lot of development work, starting with the CMDB. Companies that focus on the cost-benefit ratio can consider this route, but because performance and effectiveness cannot be guaranteed, it is not necessarily suitable. Therefore, a mature commercial solution would be a better choice.

The latest version of iMC, V7, innovates around the three dimensions of resources, users, and business. It introduces components such as SOM (Service Operation and Maintenance management, based on the ISO20000 and ITIL standards) and adds server management, so it can meet the needs of more Internet-based scenarios.

It is generally accepted that an efficient and useful configuration management database needs to meet six important criteria: federation, flexible information model definition, standards compliance, support for built-in policies, auto-discovery, and strict access control. Enterprise IT infrastructure usually contains more than one type of element and management data, such as network devices, servers, and virtual machines, so a suitable federated approach is needed to store multiple types of information. The iMC intelligent management platform already meets the needs of network devices and servers well, but with the development of server virtualization, virtual machines are increasingly becoming a major element of IT infrastructure. In response, H3C uses its CAS CVM virtualization management system to comprehensively manage finer-grained resources such as server CPU, memory, disk I/O, and network I/O, together with virtual machine resources. Unlike the BAT systems, H3C's network management software is built for all industries. Although it does not manage special resources such as domain names, it can be linked to specific systems through API interfaces and other means, and so meets the needs of customized operation and maintenance. In Internet-based scenarios in particular, many customized integration requirements can be met for different business needs. For example, the iMC+WSM component has been integrated with the in-house Portal system of a major domestic Internet company, connecting the iMC tool with the user's own operation and maintenance platform and achieving architectural integration.

In addition, similar to Ali's logical hierarchical modeling, H3C's "iMC+CAS" software system also does a lot of logical abstraction and layering in its upper layers, which results in a number of modules: the various components that users see.

III. The network automation operation and maintenance system

"Even a stranger with only basic technical skills can do professional IT operation and maintenance; even an operation and maintenance engineer with only a junior high school education can lead a team to build a small or medium-sized server room node and be responsible for maintaining and managing hundreds of thousands of servers." This is how some companies describe their overall level of IT operation and maintenance. It may seem a bit of an exaggeration, but relying on a strong IT operation and maintenance system, many domestic Internet companies have in fact reached or come close to this standard.

These companies have gone through various stages in developing their operation and maintenance. The operation and maintenance department was once a passive, isolated, and scattered "fire brigade". Later, as the IT system architecture gradually moved toward standardization and modeling, the department established a complete equipment and system resource management database and knowledge base, covering all hardware configurations, all software parameter configurations, purchase dates, maintenance records, operation and maintenance risk Kanban boards, and so on. During operation and maintenance, the system collects all problems, events, changes, service levels, and other information and enters them into the management system, which is continuously improved to form a set of operation support mechanisms that tend toward automation. In a cloud computing system architecture, the main IT resources in such a system are computing, storage, and network resources. In recent years, driven by network equipment manufacturers, automation technology for network equipment management has developed considerably.

To summarize, when an enterprise begins building its Internet infrastructure, it needs to consider how to expand resources as user visits increase. This can be broken down into five aspects: planning, construction, management, monitoring, and operation and maintenance.

1. Planning modeling

To ensure that the business can later expand smoothly and that the network management system can keep pace, Internet companies generally take standardization and modeling fully into account when designing the overall system architecture in the early stages. Adding resources for a new business then becomes like ordering fast food: available on demand.

Standardization: first, build with standard protocols and technologies for scalability, and use uniform products for ease of management; second, use data center-class equipment to ensure reliability and flexibility, and give full consideration to the low-latency requirements of the business systems.

Modeling: design a network architecture model based on business requirements and validate it to form a baseline. The baseline can be replicated in batches and managed in a unified way, and it lends itself to automation, which improves deployment and network management efficiency.

Figure 5. Common Internet IDC architecture

2. Construction automation

Once the Internet IT infrastructure can be replicated in batches, automation technology can be used to bring it online more efficiently. When building a new node, a small team of 3 to 5 people can bring a server room online. For example, when one Internet company had an urgent overseas business need, a **** sent 2 engineers to the site for equipment installation, deployment, and basic configuration. The equipment then connected over an Internet link to the headquarters management system, automatically obtained its configuration and software version, and downloaded the business system. From equipment installation to the server room going online took less than 1 week.

To achieve automated operation and maintenance, the construction process needs to focus on two aspects: batch replication and automated on-line (as shown in Figure 6).

Batch replication: Based on business needs, sort out technical concerns, design network models, conduct sufficient testing and piloting, and output software and hardware configuration templates, which can then be deployed in batches.

Automated on-line: make full use of technologies such as TR069 and Autoconfig, and use zero-configuration functions to bring equipment online automatically in batches, which can multiply efficiency.

Figure 6. Batch configuration and automated on-line
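Batch replication from a validated configuration template can be sketched with Python's standard string.Template. The template text, the device names, and the addresses below are invented for illustration, and the command syntax is only indicative, not a verified configuration for any particular device.

```python
from string import Template

# Hypothetical baseline template; the command syntax is illustrative only.
BASELINE = Template(
    "sysname $hostname\n"
    "interface $uplink\n"
    " description uplink-to-$core\n"
    " ip address $ip $mask\n"
)

# Per-device parameters for one batch (invented example values).
DEVICES = [
    {"hostname": "idc1-tor-01", "uplink": "Ten-GigabitEthernet1/0/49",
     "core": "idc1-core-01", "ip": "10.1.0.1", "mask": "255.255.255.252"},
    {"hostname": "idc1-tor-02", "uplink": "Ten-GigabitEthernet1/0/49",
     "core": "idc1-core-01", "ip": "10.1.0.5", "mask": "255.255.255.252"},
]


def render_batch(template, devices):
    """Render one configuration per device, keyed by hostname."""
    # substitute() raises KeyError if a device is missing a parameter,
    # so an incomplete entry fails fast instead of producing a bad config.
    return {d["hostname"]: template.substitute(d) for d in devices}


configs = render_batch(BASELINE, DEVICES)
print(configs["idc1-tor-02"])
```

The rendered files are the kind of per-device output that a zero-configuration mechanism could then hand to each device when it first comes online.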

Autoconfig and TR069 differ in three main ways:

○ Scope: Autoconfig is suitable for zero-configuration deployment, after which a dedicated network management system is generally needed. TR069 is a complete management solution that is useful not only for initial zero-configuration but also for subsequent monitoring, configuration management, and software upgrades of devices.

○ Protocols: Autoconfig uses DHCP and TFTP, which is simple. TR069 zero-configuration uses DHCP and HTTP, which is more complex and requires a dedicated ACS server.

○ Security: TR069 is more secure and can be based on HTTPS/SSL.

H3C iMC BIMS implements the ACS (Auto-Configuration Server) function of the TR-069 protocol and remotely manages CPE devices through TR-069. BIMS offers zero-configuration and flexible networking, and can manage devices that obtain addresses through DHCP as well as private network devices behind NAT. The BIMS workflow is shown in Figure 7.

Figure 7. H3C iMC BIMS workflow

3. Management Intelligence

The network management team needs to provide other teams with convenient tools for querying information, managing alarms, and other operations. Early network management tools were often inseparable from the command line and supported batch operations poorly. For example, the MIB library of network equipment, compared with the newer and more intelligent Netconf technology, seems as clunky as C compared with C++. From a usage perspective, graphical, intelligent management tools are therefore often preferred.

Intelligence: use new technologies to improve on the processing efficiency of traditional MIB-style management, and introduce an embedded automation architecture to enable management from apps on smart terminals (as shown in Figure 8).

Figure 8. Message and event processing intelligence

● Netconf technology

The main network management protocols today are SNMP and Netconf. SNMP runs over UDP; it is simple to implement and technically mature, but it cannot meet management requirements for security and reliability, operational efficiency, interactive operation, or complex operations. Netconf uses XML to encode configuration data and protocol messages, uses SSHv2 over TCP as its transport, and performs operation and control through RPC. XML can express complex, modeled management objects with internal logic, such as ports, protocols, services, and the relationships between them, which improves the efficiency and standardization of operations on those objects. SSHv2 as the transport provides better reliability, security, and interactivity. The main differences between the two are shown in Table 1.

Table 1 Comparison of network management technologies
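To show what an XML-encoded Netconf RPC looks like, the following Python sketch builds a get-config request with the standard library. It only constructs the message; it does not open an SSH session or talk to a device. The function name and the message-id value are assumptions made for this example, while the namespace is the base namespace defined for Netconf messages.

```python
import xml.etree.ElementTree as ET

# Base namespace defined for NETCONF messages (RFC 6241).
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"


def build_get_config(message_id, source="running"):
    """Build a <get-config> RPC request as an XML string."""
    # Serialize the NETCONF namespace as the default (unprefixed) namespace.
    ET.register_namespace("", NETCONF_NS)
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": str(message_id)})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    source_el = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    # The datastore to read is named by an empty child element.
    ET.SubElement(source_el, f"{{{NETCONF_NS}}}{source}")
    return ET.tostring(rpc, encoding="unicode")


request = build_get_config(101)
print(request)
```

The request is a structured document rather than a flat list of object identifiers, which is what allows Netconf to address modeled objects and their relationships in a single operation.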

● EAA Embedded Automation Architecture

The execution of the EAA automation architecture includes the following three steps.

○ Define the event sources of interest. These are software or hardware modules in the system, for example specific commands, logs, and TRAP alerts.

○ Define EAA monitoring policies, such as saving the device configuration, performing a master/standby switchover, or restarting a process.

○ Trigger the execution of the EAA monitoring policies when a defined event source is detected.
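The three steps above follow a general event-to-action pattern, which can be sketched in a few lines of Python. This is a conceptual illustration only and is not the EAA command syntax or implementation on any device; the class, the event source names, and the action functions are all hypothetical.

```python
class PolicyEngine:
    """Toy event-driven engine: policies bind an event source to actions."""

    def __init__(self):
        self.policies = {}   # event source name -> list of actions
        self.executed = []   # log of (event source, action name, result)

    def define_policy(self, event_source, *actions):
        # Steps 1 and 2: name the event source and attach the actions to run.
        self.policies.setdefault(event_source, []).extend(actions)

    def notify(self, event_source, detail):
        # Step 3: when a defined event source fires, run its actions in order.
        actions = self.policies.get(event_source, [])
        for action in actions:
            self.executed.append((event_source, action.__name__, action(detail)))
        return len(actions)


# Hypothetical actions; a real device would perform these itself.
def save_configuration(detail):
    return f"configuration saved after: {detail}"


def restart_process(detail):
    return f"process restarted after: {detail}"


engine = PolicyEngine()
engine.define_policy("syslog:LINK_DOWN", save_configuration)
engine.define_policy("trap:PROCESS_FAULT", save_configuration, restart_process)

engine.notify("trap:PROCESS_FAULT", "routing process exited")
engine.notify("cli:unwatched-command", "no policy matches this source")
for record in engine.executed:
    print(record)
```

Events from sources without a policy are simply ignored, so only the conditions an operator has chosen to watch trigger any action.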

4. Monitoring platform

Build an integrated monitoring platform on top of basic monitoring tools such as show and display commands, SNMP, and Syslog to achieve comprehensive monitoring (as shown in the figure).
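As one small building block of such a platform, the Python sketch below decodes the priority field of a Syslog line and decides whether the message is severe enough to raise an alarm. It relies on the standard Syslog rule that the priority value equals facility * 8 + severity. The sample line, the function names, and the default alarm threshold are assumptions made for this example.

```python
import re

# Syslog severity names, indexed by severity value (0 is most severe).
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_syslog(line):
    """Split a syslog line into facility, severity, and message text."""
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> field")
    pri = int(match.group(1))
    # PRI encodes both values: PRI = facility * 8 + severity.
    facility, severity = divmod(pri, 8)
    return {"facility": facility,
            "severity": SEVERITIES[severity],
            "message": match.group(2).strip()}


def needs_alarm(record, threshold="error"):
    """True if the record is at least as severe as the threshold."""
    # A lower index means a more severe message.
    return SEVERITIES.index(record["severity"]) <= SEVERITIES.index(threshold)


# Invented sample line for illustration.
record = parse_syslog("<187>Jan 10 10:00:00 sw-01 IFNET: interface down")
print(record["facility"], record["severity"], needs_alarm(record))
```

A platform would feed records like this, together with SNMP data and command output, into a common alarm pipeline.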
