May I ask: how intense is the workload of operations and maintenance (O&M) engineers?

Whether the workload is heavy depends on the individual company. At company A the O&M workload was very high, and we basically worked overtime every night, mainly because the team was relatively small. At company B the per-person O&M load was more in proportion to the company. I worked at B myself, but I am not sure what it is like there now.

The requirements for O&M personnel are particularly demanding: because they face many different kinds of problems, they need to keep expanding their knowledge and the range of things they study.

In the early stage, good O&M staff show exceptional initiative and responsibility. Faced with an unfamiliar business, they take the initiative to learn it and the knowledge it requires, so that they become competent to maintain it independently.

In the gradual development stage, engineers who make a habit of summarizing and reflecting on their experience grow into senior O&M personnel, and they usually gain a more systematic understanding of service operations. Some engineers also become project managers thanks to their strong project management and planning skills.

Further on, senior O&M staff come to understand the product very thoroughly. At that point they can even become the product's product manager or a consultant to its development, and play a crucial role in the design and development of product features.

Job content

O&M engineers take part in the whole life cycle of a software product and play different roles in it, so their work covers many areas and directions:

Incident management: the goal is to restore the service as quickly as possible when it becomes abnormal, so as to guarantee its availability; at the same time, to analyze the cause of the failure, drive the fixes for the problems found in the service, and design and develop stop-loss plans so that losses can be contained efficiently when the service fails. In this regard, the main work includes:

Problem discovery: design and develop efficient monitoring and alerting platforms, and use machine learning, big data analysis and other methods to summarize and analyze the large volume of monitoring data in the system, so that when the system becomes abnormal the problem is discovered quickly and the impact of the failure can be determined.

Problem handling: design and develop efficient problem-handling platforms and tools that, when a system anomaly occurs, make decisions quickly or automatically, trigger the relevant stop-loss plans, and restore service quickly.

Issue tracking: determine the root cause of a problem by analyzing the system's behavior at the time it occurred (logs, changes, and monitoring), and formulate the corresponding plans and develop the tools to carry them out.
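As a rough illustration of the problem discovery step, here is a minimal sketch that flags a monitoring sample when it deviates sharply from the samples just before it. It is only an assumption about how such a check might look: the function name `find_anomalies`, the window size, and the threshold are all hypothetical, and a real monitoring platform would use far richer models than a rolling z-score.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Hypothetical helper for illustration only.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Given a series of per-minute request latencies, for example, the returned indices could be fed to an alerting tool, which then decides whether a stop-loss plan should be triggered.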

Change management: complete the iterative changes to product functionality as efficiently as possible while keeping them under control. In this regard, the main O&M work is:

Configuration management: use a configuration management platform (self-developed or open source) to manage the relationships among the service's multiple modules and multiple versions, and to keep the configuration accurate.

Release management: build an automated platform to ensure that every version change is released to the production environment in a safe and controlled manner.
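One common way to make a release "safe and controlled" is to roll it out in stages and stop as soon as something looks wrong. The sketch below assumes that approach; the answer above does not say which method any particular company uses. The names `staged_release`, `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks that the caller would supply.

```python
def staged_release(hosts, deploy, is_healthy, rollback, first_batch=1):
    """Deploy to `hosts` in growing batches and roll everything back on failure.

    `deploy`, `is_healthy`, and `rollback` are caller-supplied functions that
    each take one host. Returns True only if every host was updated and
    passed its health check.
    """
    done = []            # hosts that have received the new version so far
    batch = first_batch  # size of the next batch
    start = 0
    while start < len(hosts):
        current = hosts[start:start + batch]
        for host in current:
            deploy(host)
            done.append(host)
        if not all(is_healthy(host) for host in current):
            # Undo in reverse order so the most recent change goes first.
            for host in reversed(done):
                rollback(host)
            return False
        start += batch
        batch *= 2  # widen the rollout once a batch looks healthy
    return True
```

Starting with a single host keeps the blast radius small: a bad version is caught after touching one machine rather than the whole fleet.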

Capacity management: during the service's operation and maintenance phase, to keep the deployment architecture reasonable and to know the overall redundancy of the service, the system's carrying capacity must be continually evaluated and optimized. In this regard, the main work is:

Capacity assessment: use technical means to simulate real user requests and test the maximum throughput the whole system can bear; build a capacity assessment model to analyze the data from the stress test and so assess the capacity of the entire service.

Capacity optimization: based on the capacity assessment data, determine the bottleneck of the system and provide capacity optimization solutions. For example, system capacity can be improved efficiently by adjusting system parameters, optimizing the service deployment architecture, and similar methods.
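To make the capacity assessment idea concrete, here is a minimal sketch of a very simple assessment model. It assumes the stress test was run in steps of increasing load and that each step recorded a request rate, a latency, and an error rate. The function names, the tuple layout, and the limits are all hypothetical, chosen only for illustration.

```python
def assess_capacity(results, max_latency_ms, max_error_rate):
    """Return the highest load the service sustained within its limits.

    `results` is a list of (qps, latency_ms, error_rate) tuples, one per
    stress-test step. Steps are examined in order of increasing load, and
    the search stops at the first step that breaks either limit.
    """
    capacity = 0
    for qps, latency_ms, error_rate in sorted(results):
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            break
        capacity = qps
    return capacity


def redundancy(capacity_qps, peak_qps):
    """Ratio of assessed capacity to observed peak traffic (above 1 means headroom)."""
    return capacity_qps / peak_qps
```

If the assessed capacity is 400 requests per second and the observed peak is 250, the redundancy is 1.6, meaning the service has about 60% headroom before it needs optimization or more resources.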

Architecture optimization: to support the continuous iteration of the product, the architecture must be constantly optimized and adjusted, so that the product as a whole stays highly available even as its functionality grows richer and more complex.

References:

Baidu Encyclopedia - Operations and Maintenance Engineer