IT Operation and Maintenance Solution

It is suggested that the IT operation and maintenance service system be built in the order of "easy to use, easy to summarize, and easy to manage", solving practical problems from the most serious to the least serious, so as to speed up construction of the system as much as possible. The operation and maintenance service system consists of six parts: service policies, service processes, the service organization, the service team, the technical service platform, and the operation and maintenance objects. Together these involve four elements: policies, people, technology, and objects.

Operation and maintenance policies are the basic guarantee for standardized operation and maintenance management, and they are also the basis for establishing processes. Relevant personnel of the operation and maintenance organization use an advanced operation and maintenance management platform to carry out standardized management and technical operations on the various operation and maintenance objects, according to the policy requirements and standardized processes.

Fault location refers to diagnosing the direct cause or root cause of a fault, and it helps make fault recovery actions more effective. Fault location is usually the most time-consuming step in the whole fault handling process. The goal of fault location is fast recovery, not finding the root of the problem, which is the responsibility of problem management. Usually, most availability failures are resolved by operation and maintenance experts making experience-based hypotheses or executing known procedures, but some failures, especially performance, application logic, and data failures, need multi-party cooperation and tool support.

In the data center, many technical operation and maintenance personnel have a keen ability to recognize known faults and can quickly find the root of the problem from the symptoms they observe. More experienced experts can use their understanding of the system's internal principles to infer the likely causes behind a common fault phenomenon. Judging the possible diagnosis path from the symptoms of a fault is an essential ability of an operation and maintenance expert, and it is usually accumulated through a large number of operation and maintenance cases. This is also what distinguishes experts from ordinary operation and maintenance personnel. Accurate data collection, in turn, depends on operation and maintenance knowledge.

For example, suppose we need CPU data for fault analysis. How should the data be collected? Should we take the average or the peak CPU utilization over a certain period of time? Is CPU utilization of 100% a problem? It is not that simple. In fact, sudden CPU peaks are mostly harmless and may have no bad influence on the system. Only when CPU utilization stays close to a high level for a long time is the CPU likely to be a bottleneck of insufficient resources that affects system performance.
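The distinction between a harmless spike and sustained high utilization can be sketched as follows. This is a minimal illustration, not a monitoring implementation: the function names, the 90% threshold, and the run length of 5 samples are all assumptions chosen for the example, and the sample values are made up.

```python
from typing import List


def longest_high_run(samples: List[float], threshold: float) -> int:
    """Return the length of the longest consecutive run of samples >= threshold."""
    longest = current = 0
    for value in samples:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_cpu_bottleneck(samples: List[float],
                      threshold: float = 90.0,
                      min_run: int = 5) -> bool:
    """Flag a bottleneck only when utilization stays high for min_run samples.

    threshold and min_run are illustrative values, not recommendations.
    """
    return longest_high_run(samples, threshold) >= min_run


# Made-up utilization samples (percent), one per collection interval.
spiky = [20, 35, 100, 30, 25, 100, 40, 30, 22, 28]
sustained = [60, 92, 95, 97, 99, 96, 94, 93, 70, 65]

print(is_cpu_bottleneck(spiky))      # isolated peaks only
print(is_cpu_bottleneck(sustained))  # long run above the threshold
```

Looking at the longest run above a threshold, rather than the maximum or the average, is one simple way to keep short peaks from triggering an alarm.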

I. Principles of Operation and Maintenance

An IT system will inevitably have problems or failures during its operation. The principles of troubleshooting can be summarized in two points:

All measures or methods give priority to the rapid recovery of business.

Faults or problems need to be escalated in time.

1.1. Business recovery comes first.

Business recovery priority means that no matter what level of failure occurs, and under any circumstances, business should be recovered first. This is different from fault location. Many people find this confusing and ask: if the root of the problem has not been found, how can business be resumed? Here is a simple example:

Suppose calls from system A to system B fail. How do we find the problem and solve it?

(1) From the server of application A, test the network to the server of application B, for example with ping. If the port and the network are reachable, bind the hosts entry directly to server B.

(2) Troubleshoot link by link: find out which links the traffic passes through between A and B, including cross-server-area and cross-network-segment links, and identify the problematic one. If, for example, the HA connection is abnormal, restart it or expand capacity to recover.

Usually, the first method takes little time. If there is cross-data-center access between A and B, the second method will take even longer to check. Although the first method breaks the balanced architecture between A and B, it takes effect immediately, which is what we call business priority recovery.

1.2. Escalate in time.

This is easy to understand. When a fault occurs, any one person can make only a rough prediction of its impact, so it is necessary to escalate to your leader in time, so that they have first-hand information and can coordinate resources.

4. Security upgrade packages, equipment, or system upgrades from major vendors;

II. Operation and Maintenance Mode

A complete operation and maintenance plan is built, and the service standard is determined, according to the requirements of the operation and maintenance work and the required response time. On-site software and hardware inspection is the main way to improve the execution of the plan. Generally speaking, the operation and maintenance workflow of a data center is as follows:

(1) Build a complete operation and maintenance plan: the plan is the core of the whole workflow. Following the principle of planning first, the sub-item work plans and the time-based plans are derived from the current year's work plan, and then carried out and guaranteed according to the process and the plan.

(2) Importance of on-site inspection: the on-site inspection plan is the focus of the operation and maintenance work plan. On-site inspection reveals the weak links, key business nodes, and hidden dangers of the system, which is especially important for making emergency plans and spare parts plans.

(3) Importance of execution: carrying out the operation and maintenance plan is the focus of operation and maintenance work. During execution, work should follow the process specifications strictly, and attention should be paid to controls that reduce operation and maintenance risks. Users should be given regular feedback on the execution.

(4) Operation and maintenance service standard: sign an after-sales service commitment letter and agree on the service level with customers. The promised service level, including the resources provided (spare parts, etc.) and the solution provided, shall be implemented in strict accordance with the agreement.

III. Operation and Maintenance Processing Methods

First, ITIL, especially ITIL 4, which is the latest version of the international IT service management standard and a new version oriented toward agile IT. It builds on ITIL V3 and adds support for DevOps.

Second, SRE (Site Reliability Engineering), the agile IT operation and maintenance methodology, that is, the operation and maintenance service methodology of Internet companies and public clouds;

Third, infrastructure as code, which integrates infrastructure automation processes with operation and maintenance, together with global best practices and cases.

Fourth, strengthen the connection between operations and development, and combine the organization, culture, and processes of IT service management with DevOps.

Operation and maintenance services cover the network equipment, security equipment, computer room infrastructure, host equipment, operating systems, databases, and storage equipment related to the information systems, as well as other information systems. The aim is to ensure the normal operation of users' existing information systems, reduce overall management costs, and improve the overall service level of the network information systems. At the same time, the daily maintenance data and records are used to provide an overall construction plan and suggestions for the user's information systems, giving stronger support to the user's information technology development.

A user's information system can be divided into two main categories: hardware equipment and software systems. Hardware equipment includes network devices, security devices, host devices, storage devices, etc. Software systems can be divided into operating system software, typical application software (such as database software and middleware software), business application software, etc.

Fault handling is generally divided into three stages: before, during, and after. The stage before handling is fault location and analysis, the stage during is the fault handling process itself, and the stage after is the fault summary, which is very important.

(A) Operation and maintenance methods for handling faults, from the perspective of the faulty service.

From the perspective of the faulty service, the three most important operation and maintenance methods for recovering business are isolation, restart, and degradation.

(1) Isolation

Isolation means removing the failed object from the cluster so that it no longer serves traffic. There are two isolation methods, listed in order of how commonly they are used:

Set the weight of the failed object in the upstream to zero. If the architecture has a health-check mechanism, the service on the failed object can also be stopped directly, so that the upstream health check detects it and stops sending traffic to it.

Bypass the failed object by binding hosts or configuring routes, for example by shutting down one line in intelligent routing or domain name management. What needs attention here is to prevent an avalanche effect, in which the diverted traffic overloads the remaining objects.
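The first isolation method, together with the avalanche caution, can be sketched with a toy weighted selector. This is an assumption-laden illustration: `pick_backend`, `isolate`, the backend names, and the `min_remaining` capacity guard are all hypothetical and do not correspond to any particular load balancer's API.

```python
import random
from typing import Dict


def pick_backend(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick a backend at random in proportion to its weight; weight 0 is never picked."""
    candidates = [name for name, weight in weights.items() if weight > 0]
    if not candidates:
        raise RuntimeError("no backend available")
    return rng.choices(candidates,
                       weights=[weights[name] for name in candidates],
                       k=1)[0]


def isolate(weights: Dict[str, int], name: str,
            min_remaining: float = 0.5) -> Dict[str, int]:
    """Return a copy of weights with `name` set to zero.

    Refuses when the remaining share of total weight would drop below
    min_remaining (an illustrative guard against overloading the survivors).
    """
    total = sum(weights.values())
    remaining = total - weights[name]
    if total == 0 or remaining / total < min_remaining:
        raise ValueError(f"isolating {name} would leave too little capacity")
    updated = dict(weights)
    updated[name] = 0
    return updated


weights = {"b1": 10, "b2": 10, "b3": 10}  # hypothetical backends
after = isolate(weights, "b2")
rng = random.Random(0)
picked = {pick_backend(after, rng) for _ in range(1000)}
print(sorted(picked))  # the isolated backend never appears
```

The guard makes the capacity question explicit before traffic is shifted; a real system would base it on measured load rather than on static weights.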

(2) Restart

Restart includes service restart and server restart (OS restart). Once a fault occurs, any link involved can be restarted. The general restart sequence is: fault object > upstream of the fault object > downstream of the fault object. In general, the farther an object is from the fault object, the later it is restarted.
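The restart sequence can be sketched as an ordering over a call graph. The sketch assumes one reading of the rule: the fault object first, then its upstream callers from nearest to farthest, then its downstream dependencies from nearest to farthest. The function names and the example services are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(start: str, edges: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first distances from start, following edges in one direction."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def restart_order(fault: str, calls: Dict[str, List[str]]) -> List[str]:
    """Order restarts: fault object, then upstream, then downstream, nearest first.

    calls[x] lists the services that x calls (its downstream).
    Services not reachable from the fault object in either direction are omitted.
    """
    called_by: Dict[str, List[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            called_by.setdefault(callee, []).append(caller)

    upstream = bfs_distances(fault, called_by)
    downstream = bfs_distances(fault, calls)

    ranked = [(0, d, name) for name, d in upstream.items() if name != fault]
    ranked += [(1, d, name) for name, d in downstream.items()
               if name != fault and name not in upstream]
    ranked.sort()  # upstream group first, then by distance, then by name
    return [fault] + [name for _, _, name in ranked]


# Hypothetical call chain: gateway -> web -> order -> (db, cache)
calls = {"gateway": ["web"], "web": ["order"], "order": ["db", "cache"]}
print(restart_order("order", calls))
```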

(3) Degradation

Degradation refers to a plan carried out to prevent a greater failure. Generally speaking, a degraded service is never the optimal state for current users. Even if there is no technical impact, it will bring some business impact. Although users can temporarily get some services restored in other ways, the result is a worse user experience and some impact on users.

Degradation is not only an operation and maintenance problem. It has to be solved jointly with business research and development, or by pushing business research and development to act. Therefore, for any project, the primary consideration is not how much performance the project can achieve, but what should be done if something goes wrong.

This is true for projects, and also for core applications and components. The person in charge of an application must consider whether there are plans for it in case of a major failure, and the trigger conditions of these plans should be made clear to whoever carries them out.

Degradation is, from a certain point of view, the last life-saving means of operation and maintenance, and we must pay attention to it.
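A degradation plan with an explicit trigger condition might look like the sketch below. It is only an illustration under stated assumptions: the `DegradationPlan` class, the consecutive-failure trigger, and the manual `reset` are hypothetical design choices, not a method described by the text.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DegradationPlan:
    """Switch to a fallback after a fixed number of consecutive failures.

    The trigger condition (failure_threshold) is explicit so that whoever
    executes the plan knows exactly when it takes effect.
    """

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def degraded(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.degraded:
            # Trigger condition met: skip the primary path entirely.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0
        return result

    def reset(self) -> None:
        """Leave degraded mode, for example after the fault is confirmed fixed."""
        self.consecutive_failures = 0


attempts = []


def live_recommendations() -> str:
    attempts.append(1)
    raise TimeoutError("recommendation service unavailable")  # simulated fault


def static_recommendations() -> str:
    return "default list"  # degraded but still usable answer


plan = DegradationPlan(failure_threshold=2)
answers = [plan.call(live_recommendations, static_recommendations) for _ in range(4)]
print(answers, plan.degraded, len(attempts))
```

In this sketch the primary path is tried only until the trigger fires; after that, users get the degraded answer without waiting on the failing dependency.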

The above methods, especially restart and isolation, have an important premise: the object must be stateless, and if the developers implement retries, the requests must be idempotent. Stateful objects are not allowed, except for very special businesses where they can exist temporarily, so the objects in production should have only three states:

(B) Operation and maintenance fault handling methods, from the perspective of the parties affected by the fault.

First of all, during fault handling, various internal or external organizations will take part in handling the system fault. Generally, the following three types of people are needed to handle a fault together:

Information communicator: their duty is to pass on useful information for fault handling and fault location, and to report fault progress to the outside;

Fault locator: their duty is to resolve the fault when the fault handler's methods fail, or when the root cause of the problem needs to be found;

Fault handler: their duty is to resume business as soon as possible.

In IT operation and maintenance, these three types of people often do not appear at the same time. For example, during an early-morning shift, only a fault handler is needed. After business is resumed, the fault locator finds the root cause and the optimization measures the next day.

In addition, after the failure, the affected parties will be divided into two categories:

(1) Internal users

Internal users include internal applications that call each other and internal staff who report problems. The latter are handled much like external users.

(2) External users

Dealing with external users is more troublesome. The idea is to turn external users into internal users. For example, if a supplier cannot open the company website, there are two things to do:

If neither of the above two approaches works, handling becomes more troublesome. In that case, some necessary information about the external user must be collected before processing, such as the egress IP and the client version in use. It is suggested to collect the information with a template and to complete it in one pass, because most of the time spent handling external users goes to communication.
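A collection template of this kind can be sketched as a small record with a completeness check, so that everything is requested from the user in one pass. The field names are assumptions: only the egress IP and the client version come from the text, and the `symptom` field is a hypothetical addition for illustration.

```python
import ipaddress
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExternalUserReport:
    egress_ip: Optional[str] = None       # from the text: the user's egress IP
    client_version: Optional[str] = None  # from the text: client version in use
    symptom: Optional[str] = None         # hypothetical extra field


def missing_or_invalid(report: ExternalUserReport) -> List[str]:
    """List the fields that still have to be requested from the user."""
    problems = [f.name for f in fields(report) if not getattr(report, f.name)]
    if report.egress_ip:
        try:
            ipaddress.ip_address(report.egress_ip)
        except ValueError:
            problems.append("egress_ip")  # present but not a valid IP address
    return problems


report = ExternalUserReport(egress_ip="203.0.113.7",
                            symptom="company website does not open")
print(missing_or_invalid(report))
```

Checking the whole template at once lets the handler ask for every missing item in a single message instead of several rounds of communication.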
