This time I mainly want to share the problems we ran into over the past year of building a private cloud on Docker, how we solved them, and our experience and thinking, in the hope that it helps others running Docker in production.
The private cloud project was launched around Christmas 2014. After more than half a year of development and three major promotion events, it has gradually reached a certain scale.
Architecture
Cluster management
As we all know, Docker's own cluster-management capability was not mature at the time, so we did not choose the then newly emerged Swarm. Instead we used OpenStack, the most mature option in the industry, which lets us manage Docker and KVM at the same time. We run Docker as virtual machines to meet the business need for virtualization. Our future direction is microservices: splitting applications into microservices and doing application-centric PaaS deployment and release.
How do we manage Docker through OpenStack? We adopt the OpenStack + nova-docker + Docker architecture. nova-docker is an open source project on StackForge; as a nova plug-in, it controls the start and stop of containers by calling Docker's RESTful interface.
On top of IaaS we developed components such as scheduling, supporting application elasticity and gray (canary) release along with certain scheduling policies, thus implementing the main functions of the PaaS layer.
At the same time, we built continuous integration (CI) on Docker and Jenkins. When a project in Git receives a git push, it triggers a Jenkins job to build automatically; a successful build produces a Docker image that is pushed to the image registry. An image produced by CI can then update the instances in the development and test environments through the PaaS API or console, and finally the production instances, achieving continuous integration and continuous delivery.
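To make the last step concrete, here is a minimal sketch of how a CI job could call such a PaaS API after pushing an image. The endpoint, payload and token handling are hypothetical placeholders; the actual PaaS interface is not described in this article.

```python
# Sketch: after Jenkins pushes an image, ask the PaaS layer to roll instances.
# The endpoint and payload are hypothetical, not the platform's real API.
import requests

PAAS_API = "https://paas.example.com/api/v1"   # hypothetical endpoint

def update_instances(app, image_tag, env="test", token="..."):
    """Request that instances of `app` be rolled to the new image."""
    resp = requests.post(
        "%s/apps/%s/deploy" % (PAAS_API, app),
        json={"image": image_tag, "environment": env},
        headers={"Authorization": "Bearer " + token},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# e.g. called from a Jenkins post-build step:
# update_instances("shop-web", "registry.example.com/shop-web:42")
```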
Network and storage
On the network side, we do not adopt Docker's default NAT mode, because NAT causes a certain performance loss. Through OpenStack we support Linux bridge and Open vSwitch, with no need to enable iptables; with this setup, Docker performance is close to 95% of a physical machine's.
Container monitoring
For monitoring, we developed a container tool that computes a container's own load values, replacing commands such as top, free, iostat and uptime. This way, when the business side runs a common command inside a container, it sees the container's values rather than the whole physical machine's. We are currently porting lxcfs to our platform.
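As an illustration of the idea (not our actual tool), such container-local values can be derived from the container's memory cgroup rather than the host-wide /proc/meminfo. A minimal sketch, assuming cgroup v1 mounted at /sys/fs/cgroup (the mount point differs on some distributions):

```python
# Sketch: report container-local "free"-style numbers from the memory cgroup.
def _read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def container_memory(cg="/sys/fs/cgroup/memory"):
    limit = _read_int(cg + "/memory.limit_in_bytes")  # huge if unlimited
    usage = _read_int(cg + "/memory.usage_in_bytes")
    # memory.stat holds per-cgroup counters such as the page cache size
    stat = {}
    with open(cg + "/memory.stat") as f:
        for line in f:
            k, v = line.split()
            stat[k] = int(v)
    cache = stat.get("cache", 0)
    return {"total": limit, "used": usage - cache,
            "cached": cache, "free": limit - usage}

if __name__ == "__main__":
    for k, v in sorted(container_memory().items()):
        print("%-7s %d MB" % (k, v // (1024 * 1024)))
```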
We also added several threshold-based monitors and alarms on the host, such as key-process monitoring, log monitoring, real-time pid count, network connection-tracking (conntrack) count, and container OOM alarms.
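Two of these host-side checks are easy to sketch, since the relevant counters are plain procfs files; the thresholds and the alert channel below are placeholders:

```python
# Sketch of host-side threshold checks: host-wide pid usage (pid_max is
# global on these kernels) and the conntrack table.
import os

def alert(msg):
    print("ALERT: %s" % msg)   # stand-in for the real alarm channel

def check_pid_usage(ratio=0.8):
    with open("/proc/sys/kernel/pid_max") as f:
        pid_max = int(f.read())
    npids = 0
    for d in os.listdir("/proc"):
        if d.isdigit():
            try:
                # count threads too: each tid consumes a pid
                npids += len(os.listdir("/proc/%s/task" % d))
            except OSError:
                pass   # process exited while we were scanning
    if npids > pid_max * ratio:
        alert("pid usage %d / %d" % (npids, pid_max))

def check_conntrack(ratio=0.8):
    with open("/proc/sys/net/netfilter/nf_conntrack_count") as f:
        count = int(f.read())
    with open("/proc/sys/net/netfilter/nf_conntrack_max") as f:
        cmax = int(f.read())
    if count > cmax * ratio:
        alert("conntrack %d / %d" % (count, cmax))
```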
Redundancy and isolation
In terms of redundancy and isolation, we have prepared many redundancy schemes and technologies. We can recover the data inside a container offline, without starting the Docker daemon. We also support cold migration of containers across physical machines, dynamic CPU expansion/shrinking, and network IO and disk IO rate limiting.
Problems encountered and solutions
In less than a year of productization and real-world use, we encountered all sorts of problems. Using Docker has been a process of continuously optimizing it, continuously locating and solving problems.
Our current production environment is CentOS 6.5. Once, a business party mistook its Docker container for a physical machine and installed another Docker inside it, which instantly crashed the kernel and affected the other containers on the same physical machine.
Analysis showed that the 2.6.32-431 kernel does not support network namespaces well, and creating a bridge inside a Docker container caused the kernel to crash. Upstream had fixed this bug, and the problem went away after upgrading from 2.6.32-431 to 2.6.32-504.
In another case, a bug in a user's program created threads without reclaiming them in time, producing a huge number of threads in the container. Eventually the host could not execute commands or accept ssh logins, erroring with "bash: fork: unable to allocate memory", even though free showed plenty of memory.
Analysis found that the kernel's pid isolation is imperfect: pid_max (/proc/sys/kernel/pid_max) is globally shared. When the pid count inside one container reaches the upper limit of 32768, the host and the other containers can no longer create new processes. Only the very recent 4.3-rc1 supports a per-container pid_max limit.
We also observed that kernel logs on Docker hosts would get garbled. Analysis showed that the kernel has a single log_buf buffer, and all messages printed by printk go into it first; rsyslogd on both the host and the containers fetch logs from that same kernel log_buf via syslog, producing the mixed-up logs. We solved this by changing the rsyslog configuration inside the containers so that only the host reads the kernel log (for example, by not loading the kernel-log input module in the containers).
In addition, we solved problems such as the kernel crash caused by device mapper's dm-thin discard.
Experience and thinking
Finally, let me share our experience and thinking. Compared with mature virtualization technology such as KVM, containers are still imperfect in many ways. Beyond cluster management, network and storage, the most important issue is stability, and the main factor affecting stability is imperfect isolation: a problem caused by one container may affect the whole system.
The container's memcg cannot reclaim slab cache and does not limit the amount of dirty cache, which makes OOM more likely. Also, some file interfaces under procfs cannot be made per-container, such as pid_max.
Another point is the impact on operations tooling and habits. Some system maintenance tools, such as ss, free and df, either cannot be used inside a container or return results inconsistent with the physical machine, because these tools generally read files under procfs; they need to be modified or adapted.
Although containers are not yet perfect, we remain very optimistic about their future. Container-related open source software such as Kubernetes, Mesos, Hyper, CRIU and runC is the focus of our attention.
Q&A
Q: How is load balancing between containers achieved?
A: Load balancing between containers is handled more at the PaaS and SaaS layers. Our PaaS layer supports dynamic routing at Layer 4 and Layer 7, exposing services through domain names or the naming service, and on top of that we implement container-based gray release and elastic scaling.
Q: Is your OpenStack running on CentOS 6.5?
A: Yes, but we upgraded the packages that OpenStack and Docker depend on; we maintain an internal yum source.
Q: Are container IPs statically assigned or dynamically acquired?
A: This is related to the network model the operations department manages. Our intranet has no DHCP service, so at the IaaS layer container IPs are statically assigned. At the PaaS layer, if a DHCP service exists, the IP and ports exposed by a container's App can be dynamic.
Q: Did you try deploying on Ubuntu? Have you studied the differences between the two systems? Also, how do you monitor these virtual machines on OpenStack?
A: We haven't tried Ubuntu, because the company's production environment uses CentOS. The company's machines are monitored by our middleware team; we worked with the monitoring team to deploy their agent onto the host and into each container, so containers can be monitored like virtual machines.
Container data, of course, has to be fetched from cgroups; we implemented that data-extraction part ourselves.
Q: Any suggestions on network selection between containers? It is said that virtual NICs have a non-trivial performance loss compared with physical NICs. Are weave and OVS up to the job?
A: We do not recommend the default NAT mode for container networking, because NAT causes a certain performance loss. As I mentioned earlier, with iptables not enabled, Docker's performance is close to 95% of a physical machine's. Weave's underlying layer should still use a bridge or Open vSwitch. I suggest reading the nova-docker source code; it will make this easier to understand.
Q: Is static IP implemented through LXC?
A: Static IP is implemented in novadocker/virt/docker/vifs.py in nova-docker. The principle is to create veth pairs with the ip command, then apply a series of commands such as ip link set and ip netns exec. The approach is similar to weave's.
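For illustration, the following sketch shows the kind of plumbing vifs.py does, reduced to its essentials. The interface names, IP address and bridge are made up, and error handling is omitted:

```python
# Sketch of veth-pair plumbing in the spirit of novadocker/virt/docker/vifs.py.
# nova-docker also symlinks /proc/<pid>/ns/net into /var/run/netns so that
# `ip netns exec <pid>` can address the container's namespace.
import subprocess

def run(*cmd):
    subprocess.check_call(cmd)

def plug_static_ip(container_pid, ip_cidr="192.0.2.10/24", bridge="br0"):
    host_if, cont_if = "veth-h0", "veth-c0"
    ns = str(container_pid)
    run("ip", "link", "add", host_if, "type", "veth", "peer", "name", cont_if)
    run("brctl", "addif", bridge, host_if)            # or: ovs-vsctl add-port
    run("ip", "link", "set", host_if, "up")
    run("ip", "link", "set", cont_if, "netns", ns)    # move peer into container
    run("ip", "netns", "exec", ns, "ip", "link", "set", cont_if, "name", "eth0")
    run("ip", "netns", "exec", ns, "ip", "addr", "add", ip_cidr, "dev", "eth0")
    run("ip", "netns", "exec", ns, "ip", "link", "set", "eth0", "up")
```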
Q: How do you handle gdb for processes inside a container? Do you package gdb into the container?
A: gdb works fine inside the container; you can simply yum install gdb.
Q: Can shared memory be mounted directly into a container?
A: I haven't tried it, but it should work through docker -v.
Q: How do you recover the data in Docker offline, without starting the Docker daemon?
A: The principle of offline recovery is to use the dmsetup create command to create a temporary dm device mapped to the dm device number the Docker instance used; by mounting this temporary device, the original data can be recovered.
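A hedged sketch of that recovery flow, assuming the devicemapper graph driver: the thin-pool name, device id and sector count below are placeholders that in practice come from Docker's devicemapper metadata under /var/lib/docker.

```python
# Sketch: recreate a thin device with the same dm parameters the Docker
# instance used, then mount it to copy the data out.
import subprocess

def recover(device_id, sectors, pool="docker-pool", name="docker-recover"):
    # dm-thin target table: "<start> <length> thin <pool_dev> <dev_id>"
    table = "0 %d thin /dev/mapper/%s %d" % (sectors, pool, device_id)
    subprocess.check_call(["dmsetup", "create", name, "--table", table])
    subprocess.check_call(["mkdir", "-p", "/mnt/recover"])
    subprocess.check_call(["mount", "/dev/mapper/%s" % name, "/mnt/recover"])
    # ... copy data out, then umount and `dmsetup remove` the device.
```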
Q: You support Docker cold migration across physical machines, dynamic CPU expansion/shrinking, and network IO/disk IO rate limiting. How are these implemented? Can you elaborate?
A: Cold migration is done by modifying nova-docker to implement OpenStack's migration interface. Concretely, between the two physical machines it does a docker commit, a docker push to the internal registry, and a docker pull on the target machine.
Dynamic CPU expansion/shrinking and network IO/disk IO rate limiting are mainly implemented by novadocker modifying the cpu, iops and bps parameters in cgroups, plus TC (traffic control).
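For illustration, the sketch below shows what such parameter changes look like at the cgroup/TC level; the cgroup path, container id, device number (8:0 = sda) and limits are placeholders:

```python
# Sketch: resize CPU, throttle block IO via cgroup v1 files, and shape
# network egress on the container's host-side veth with tc.
import subprocess

def set_limits(cg="/sys/fs/cgroup", cid="docker/<container-id>"):
    # CPU: quota/period, e.g. 2 cores = 200000us per 100000us period
    open("%s/cpu/%s/cpu.cfs_period_us" % (cg, cid), "w").write("100000")
    open("%s/cpu/%s/cpu.cfs_quota_us" % (cg, cid), "w").write("200000")
    # Disk: cap reads at 50 MB/s and 1000 iops on device 8:0
    open("%s/blkio/%s/blkio.throttle.read_bps_device" % (cg, cid), "w") \
        .write("8:0 52428800")
    open("%s/blkio/%s/blkio.throttle.read_iops_device" % (cg, cid), "w") \
        .write("8:0 1000")
    # Network: token-bucket filter on the host-side veth interface
    subprocess.check_call(["tc", "qdisc", "replace", "dev", "veth-h0",
                           "root", "tbf", "rate", "100mbit",
                           "burst", "300kb", "latency", "50ms"])
```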
Q: Will you consider using the Magnum project in the future, or choose Swarm?
A: Both are candidates, and Swarm is worth considering. Since Magnum's lower layer still calls a cluster-management scheme such as Kubernetes, choosing Swarm or Kubernetes directly may be better than Magnum. Of course, this is just my personal opinion.
Q: Are your services based on the same image? If they use different images, how does the compute node ensure containers start quickly?
A: The operations department maintains a unified base image, and the images of other services are built on top of it. When we initialize a compute node, we docker pull the base image to local disk; this is also common practice at many companies. As far as I know, Tencent and 360 do something similar.
Q: Have you considered using traditional shared storage to support hot migration?
A: Both distributed storage and shared storage are under consideration; next, we plan to work on live migration of containers.
Q: Is the public IP bound directly to the container, or mapped to the container's private IP some other way? If the latter, how do you solve the original Layer-2 VLAN isolation?
A: Since we are a private cloud and floating IPs are not involved, you can think of it as a public IP. Layer-2 VLAN isolation can be done on the switch; we use Open vSwitch to assign different VLANs, achieving network isolation between Docker containers and physical machines.
Q: Can you explain the dm-thin discard problem in detail?
A: In April, two hosts kept rebooting for no apparent reason. My first thought was to check the /var/log/messages log, but I found nothing related near the reboot times. Then, in the /var/crash directory, I found the kernel-crash logs, vmcore-dmesg.txt, generated at the same moments as the reboots, which showed the hosts were rebooting after a kernel crash. The log contained "kernel BUG at drivers/md/persistent-data/dm-btree-remove.c:181!", and from the stack it could be seen that dm-thin was processing a discard at the time. Although we did not find the root cause of the bug, the direct trigger was the discard operation, so we turned off discard support to avoid it.
After we disabled the discard function in all host configurations, the same problem never happened again.
At this year's CNUTCon conference, Tencent and Dianping also mentioned this crash when sharing their Docker experience; their solution was exactly the same as ours.
Q: In your threshold monitoring and alarms, are there high, medium and low severity levels? For a low-severity alarm, do you take measures such as restricting user access or cutting off the service the user is running, or do you let the situation play out?
A: For alarms, the operations department has dedicated PEs responsible for the stability of online business. When an alarm fires, both the service owner and the PE receive it. If a single virtual machine is affected, the PE notifies the business party and may even take the service offline in time; we cooperate with the PE to have the business side migrate the business promptly.
Q: Is your self-developed container tool open source? Is its code on GitHub? If not, why not, and do you expect to open source it later? How do you view fine-grained container monitoring?
A: We have not open sourced it yet, but open sourcing is not a problem; please wait for our good news. As for fine-grained container monitoring, our main approach is that the host monitors each container's health status, while monitoring inside the container is handled by the business party.
Q: Do you care about the number of layers in a container image? Is the underlying file system ext4? Any optimization strategies?
A: Yes, we do. We shorten docker pull time by merging image layers. During docker pull, checking each layer takes a long time; by reducing the number of layers, the image not only gets smaller but also pulls much faster.
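The article does not say exactly how the layers are merged; one common way to flatten an image, shown below as a sketch, is to export a container's filesystem and re-import it, which discards the layer history (and also Dockerfile metadata such as CMD and ENV, which then has to be re-applied):

```python
# Sketch: flatten an image's layers via docker export / docker import.
import subprocess

def flatten(image, flat_tag):
    cid = subprocess.check_output(["docker", "create", image]).decode().strip()
    try:
        export = subprocess.Popen(["docker", "export", cid],
                                  stdout=subprocess.PIPE)
        # `docker import -` reads the filesystem tar from stdin as one layer
        subprocess.check_call(["docker", "import", "-", flat_tag],
                              stdin=export.stdout)
        export.stdout.close()
        export.wait()
    finally:
        subprocess.check_call(["docker", "rm", cid])

# flatten("registry.example.com/shop-web:42", "shop-web:42-flat")
```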
Q: You said the container's memcg cannot reclaim slab cache and does not limit dirty cache, making OOM more likely. How do you handle the caching problem?
A: Based on an empirical value, we count a portion of the cache as used memory, to get as close as possible to the real usage. In addition, we lower the memory alarm threshold for containers appropriately and add a container OOM alarm. If you upgrade to CentOS 7, you can also configure kmem.limit_in_bytes for some restriction.
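A sketch of that accounting, reading the per-cgroup counters from memory.stat; the cache weight of 0.3 is an illustrative placeholder, not the tuned empirical value mentioned above:

```python
# Sketch: count a fraction of the page cache as used, so the memory alarm
# tracks real pressure rather than raw usage.
def adjusted_usage(cg="/sys/fs/cgroup/memory", cache_weight=0.3):
    stat = {}
    with open(cg + "/memory.stat") as f:
        for line in f:
            k, v = line.split()
            stat[k] = int(v)
    used = stat["rss"] + cache_weight * stat["cache"]
    limit = int(open(cg + "/memory.limit_in_bytes").read())
    return used / float(limit)   # alarm when this ratio crosses the threshold
```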
Q: Can you say more about your container network isolation?
A: For access isolation, Layer-2 isolation currently uses VLAN; we will consider VXLAN later. For traffic control, we currently use only the port-based QoS built into OVS, with TC underneath; flow-based traffic control will be considered later.
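As a sketch of what those two mechanisms look like with ovs-vsctl (the port name, VLAN tag and rate are illustrative):

```python
# Sketch: Layer-2 isolation via an OVS VLAN tag, plus OVS port-based QoS
# (ingress policing), which OVS enforces with tc underneath.
import subprocess

def isolate_and_limit(port="veth-h0", vlan=100, kbps=102400):
    # put the container's port in its own VLAN
    subprocess.check_call(["ovs-vsctl", "set", "port", port,
                           "tag=%d" % vlan])
    # rate-limit traffic entering the switch from this port
    subprocess.check_call(["ovs-vsctl", "set", "interface", port,
                           "ingress_policing_rate=%d" % kbps])
    subprocess.check_call(["ovs-vsctl", "set", "interface", port,
                           "ingress_policing_burst=%d" % (kbps // 10)])
```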
Q: Is this whole stack on CentOS 6.5? Was it implemented by operations or by development?
A: Stability comes first in a production environment. CentOS 6.5 is mainly the responsibility of the whole company's operations team; we advise the operations department on major version upgrades, and we take care of the stability of the virtualization layer itself.
Q: How do containers communicate directly with each other? How is the network set up?
A: You mean on the same physical machine? Currently they still communicate via IP. The network can use bridge mode or VLAN mode; we use Open vSwitch to support VLAN mode, which can either isolate containers or let them communicate.
Q: Do you integrate Docker through nova-api? Can you use Docker's advanced features, such as docker-api? Also, why not use Heat to integrate Docker?
A: We use the open source nova-docker. nova-docker is an open source project on StackForge; as a nova plug-in replacing the existing libvirt driver, it controls container start and stop by calling Docker's RESTful interface.
Whether to integrate Docker via Heat or via Nova has long been controversial in the industry; we care more about the problem we are trying to solve. Heat itself has complex dependencies and is not widely used in production, otherwise the community would not have launched Magnum.
Q: Do you currently have any cross-DC container practice, or anything in that direction?
A: We have deployed multiple clusters across multiple data centers, each data center with its own independent cluster. On top of that, we developed our own management platform that manages all the clusters uniformly. We also built a Docker Registry V1 and plan to upgrade internally to Registry V2, which will enable cross-DC distribution of Docker images.
Q: I am also pushing Docker continuous integration and cluster management, but I find that managing many containers is itself a problem, for example elastic container management and resource monitoring. Which is better, Kubernetes or Mesos? And if used for business, how do you resolve external domain names, given that all communication goes through the host, which has only one external IP?
A: We are still in the pre-research stage for Kubernetes and Mesos. Our current PaaS-layer scheduling is self-developed; we maintain instance state, ports and other information in etcd. Layer 7 can be solved with Nginx, and Layer 4 relies on the naming service. We have our own naming service, so these problems can be solved: there is only one IP, but the exposed ports differ.
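As a sketch of the etcd part (the key layout and names are invented for illustration, using the etcd v2 HTTP API): each instance registers its host and port under a TTL key and refreshes it as a heartbeat, and the Layer-7 frontend can be regenerated from those keys.

```python
# Sketch: keep instance state in etcd so a scheduler/router can discover it.
import json
import requests

ETCD = "http://127.0.0.1:2379/v2/keys"   # etcd v2 keys API

def register(app, instance_id, host, port, ttl=30):
    """Register an instance under a TTL key; re-call periodically as heartbeat."""
    resp = requests.put(
        "%s/apps/%s/instances/%s" % (ETCD, app, instance_id),
        data={"value": json.dumps({"host": host, "port": port}), "ttl": ttl},
    )
    resp.raise_for_status()

# An Nginx upstream block can be regenerated from /apps/<app>/instances/,
# while Layer 4 resolves through the naming service mentioned above.
# register("shop-web", "i-42", "10.1.2.3", 8080)
```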
Q: Have you considered Hyper? It isolates the container from the host kernel while preserving startup speed.
A: We have been following Hyper. Hyper is a very good idea, and we don't rule out using it in the future. Actually, what we most want Hyper to deliver is live migration, which Docker cannot do at present.
Q: What configuration do your hosts usually use? Standalone servers or cloud servers?
A: We have our own data centers and use independent servers, physical machines.
Q: What solution do containers use for cross-host communication?
A: Cross-host containers must communicate at Layer 3, that is, via IP. A container can have an independent IP, or use host IP plus port mapping. We currently use independent IPs, which are easier to manage.
Q: It feels like your company uses Docker more like a virtual machine. Why not adopt the container perspective directly? Is it for historical reasons?
A: Our first considerations were user acceptance and the cost of transformation. From the user's point of view, he doesn't care whether the business runs in a container or a virtual machine; he cares more about application deployment efficiency and the impact on the application's own stability and performance. From the container perspective, some of the business parties' existing applications would need major rework, such as the logging system and full-link monitoring. Most important of all, the impact on the existing operations system would be even larger: container management is a challenge for operations, and acceptance takes time.
Of course, using Docker as a container to package applications and achieve PaaS deployment and dynamic scheduling is our goal, and we are indeed working in that direction. It also requires the business side to split applications into microservices, which takes a process.
Q: Actually, we also want to use containers as virtual machines. What middleware do you run in them? We need to resolve the tension between testing and a large number of relatively independent environments (WebLogic)?
A: We run many kinds of business, from front-end Web to back-end middleware services. Our middleware services are products developed by another team, which separate the front-end and back-end business logic.
Q: Does your company use OpenStack to manage both Docker and KVM? Did you develop your own Web interface, or do you manage it purely through the API?
A: We have a self-developed Web management platform. We want to manage multiple clusters through one platform, integrate it with the operations, logging and monitoring systems, and expose a unified API.
Q: Regarding the 2.6-kernel namespace bug in the case you shared: can that lower-version kernel run a Docker environment at all? Docker's procfs isolation is still imperfect; is the container tool you developed implemented at the application layer, or does it require kernel modification?
A: Installing and using it is no problem, but for a production environment you need to weigh it carefully, mainly because stability and isolation are insufficient. Lower-version kernels are more likely to cause system crashes or other serious problems; many of these are actually not bugs but incomplete functionality. For example, creating a bridge inside a container crashes the kernel because the kernel's network-namespace support is incomplete.
The container tool we developed is application-level and requires no kernel modification.
Q: Is there a more detailed introduction to your redundancy schemes, for example how data is recovered offline?
A: I answered this above. Concretely, we use the dmsetup create command to create a temporary dm device mapped to the dm device number the Docker instance used, and recover the original data by mounting that temporary device. The other disaster-recovery schemes are too extensive to cover here; we will write them up and share them later. You can follow http://mogu.io/, where we will publish them.
Q: Are the systems you run in containers online stateless or stateful? What were the considerations or difficulties in choosing scenarios?
A: Internet companies' applications are mostly stateless. Stateful business can often be reworked at the business level into partially or fully stateless applications. I don't quite understand your "scenario choice"; we try our best to meet the business side's needs.
For services that demand very high stability, or are especially sensitive to IO latency, such as Redis, we do not recommend containers; such services cannot be completely isolated or made stateless.