How to manage data well in power enterprises.

From the perspective of technical implementation, data governance consists of five steps: sorting out business and data resources, data collection and cleaning, database design and storage, data management, and data use.

Sorting out data resources: The first step of data governance is to define the organization's data resource landscape and data resource inventory from a business perspective, covering organizational units, business matters, information systems, and data items in the form of databases, web pages, files, and API interfaces. The output of this step is a classified data resource inventory.
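To make the inventory concrete, the sketch below models one entry of a classified data resource inventory in Python; all field names and values are illustrative assumptions rather than a prescribed standard.

    from dataclasses import dataclass, field

    # Illustrative model of one entry in a classified data resource inventory.
    # Field names are assumptions for this sketch, not a prescribed standard.
    @dataclass
    class DataResource:
        name: str              # e.g. "metering point register"
        owner_org: str         # owning department or business unit
        business_matter: str   # the business activity that produces or uses it
        source_system: str     # information system of record
        source_type: str       # "database" | "web page" | "file" | "api"
        category: str          # classification used in the inventory
        data_items: list = field(default_factory=list)  # field-level items

    inventory = [
        DataResource(
            name="metering point register",
            owner_org="Marketing Department",
            business_matter="electricity billing",
            source_system="CIS",
            source_type="database",
            category="basic/master data",
            data_items=["meter_id", "customer_id", "install_date"],
        ),
    ]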

Data collection and cleaning: the process of extracting, transforming, and loading (ETL) data from source systems to a destination store, usually with visual ETL tools such as Alibaba's DataX or Pentaho Data Integration (Kettle). The goal is to bring scattered, messy data into centralized storage.
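The sketch below hand-codes the same extract-transform-load pattern in Python so the flow is visible; in practice a tool such as DataX or Kettle would be configured instead, and the database paths, tables, and columns here are assumptions.

    import sqlite3

    # Minimal ETL sketch: extract from a source table, clean, load into a target.
    # Database paths, table names, and column names are illustrative assumptions.
    def etl(source_db: str, target_db: str) -> None:
        src = sqlite3.connect(source_db)
        dst = sqlite3.connect(target_db)

        # Extract: pull raw rows from the source system.
        rows = src.execute(
            "SELECT meter_id, customer_id, reading FROM raw_readings"
        ).fetchall()

        # Transform: drop incomplete rows and normalize identifiers.
        cleaned = [
            (meter_id.strip().upper(), customer_id, float(reading))
            for meter_id, customer_id, reading in rows
            if meter_id and customer_id and reading is not None
        ]

        # Load: write the cleaned rows into the centralized store.
        dst.execute(
            "CREATE TABLE IF NOT EXISTS readings "
            "(meter_id TEXT, customer_id TEXT, reading REAL)"
        )
        dst.executemany("INSERT INTO readings VALUES (?, ?, ?)", cleaned)
        dst.commit()
        src.close()
        dst.close()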

Construction of basic and subject databases: Generally speaking, data can be divided into basic data, business subject (thematic) data, and analysis data. Basic data refers to core entity data, or master data, such as population, legal person, geographic information, credit, and electronic license data in a smart city. Thematic data refers to the data of a specific business domain, such as the food supervision, quality inspection, and comprehensive enterprise supervision data of a market supervision and administration bureau. Analysis data refers to results derived from the comprehensive analysis of business subject data, such as comprehensive enterprise evaluations, regional industry distributions, and the distribution of high-risk enterprises compiled by the market supervision administration. Building the basic and subject databases therefore means designing the data storage structures on the principle that data should be easy to store, easy to manage, and easy to use. Put plainly, it is to design the database table structures according to these principles, design the data collection and cleaning processes against the data resource inventory, and store clean data in the database or data warehouse.
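As a minimal illustration of "designing the table structure according to certain principles", the following Python/SQLite sketch creates one basic (master data) table and one subject table that references it; all table and column names are assumptions.

    import sqlite3

    # Illustrative schema: a basic (master data) table plus a subject table
    # that references it. Table and column names are assumptions.
    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS basic_customer (      -- basic / master data
        customer_id   TEXT PRIMARY KEY,
        name          TEXT NOT NULL,
        region_code   TEXT
    );

    CREATE TABLE IF NOT EXISTS subject_billing (     -- business subject data
        bill_id       TEXT PRIMARY KEY,
        customer_id   TEXT REFERENCES basic_customer(customer_id),
        billing_month TEXT,                          -- e.g. '2024-01'
        energy_kwh    REAL,
        amount        REAL
    );
    """)
    conn.commit()
    conn.close()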

Metadata management: Metadata management governs the attributes of data items in the basic and subject databases and links each data item to its business meaning, so that business staff can understand what the fields in the database mean. Metadata is also the foundation of the automatic data sharing, data exchange, and business intelligence (BI) capabilities discussed later. Note that metadata management generally covers the attributes of data items in the basic and subject databases (i.e., the core data assets), whereas the data resource inventory manages data items across all data sources.
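A minimal sketch of such a metadata registry might look like the following; the attribute names are assumptions chosen for illustration, not a standard metadata model.

    from dataclasses import dataclass

    # Minimal metadata registry sketch: each entry ties a physical column to
    # its business meaning. Attribute names are illustrative assumptions.
    @dataclass
    class DataItemMetadata:
        table: str
        column: str
        business_name: str
        business_definition: str
        data_type: str
        owner: str

    metadata_registry = {
        ("subject_billing", "energy_kwh"): DataItemMetadata(
            table="subject_billing",
            column="energy_kwh",
            business_name="billed energy",
            business_definition="Energy consumption billed for the month, in kWh",
            data_type="REAL",
            owner="Marketing Department",
        ),
    }

    def describe(table: str, column: str) -> str:
        """Let business users look up what a database field means."""
        meta = metadata_registry[(table, column)]
        return f"{meta.business_name}: {meta.business_definition}"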

Lineage tracking: When data errors are found while using data in business scenarios, the data governance team needs to quickly locate the data source and repair the errors. That requires knowing which core database a business team's data comes from and which data source that core database's data comes from. Our approach is to link metadata to the data resource inventory: the data items used by business teams are configured as combinations of metadata, which establishes the lineage between data usage scenarios and data sources.

Data resource catalog: A data resource catalog is generally used in data sharing scenarios, such as data sharing between government departments. The catalog is built according to business scenarios and industry standards, and it relies on metadata and the basic and subject databases to enable automated application for, and use of, data.
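A minimal sketch of lineage tracing, assuming a simple mapping from each downstream data item to its upstream source, could look like this; all item names are illustrative.

    # Minimal lineage sketch: map each downstream data item to its upstream
    # source, then walk the chain back to the original data source.
    # All names in the mapping are illustrative assumptions.
    lineage = {
        "report.monthly_energy":       "subject_billing.energy_kwh",
        "subject_billing.energy_kwh":  "CIS.raw_readings.reading",
        "CIS.raw_readings.reading":    None,  # original source system
    }

    def trace_to_source(item: str) -> list:
        """Return the upstream chain for a data item, ending at its source."""
        chain = [item]
        while lineage.get(item) is not None:
            item = lineage[item]
            chain.append(item)
        return chain

    print(trace_to_source("report.monthly_energy"))
    # ['report.monthly_energy', 'subject_billing.energy_kwh', 'CIS.raw_readings.reading']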

Quality management: Unlocking data value depends on high-quality data; only accurate, complete, and consistent data can be used. Data quality therefore needs to be assessed along multiple dimensions, such as offset checks, non-null checks, range checks, format (normative) checks, duplicate checks, correlation checks, outlier checks, volatility checks, and so on. Note that designing a good data quality model depends on a deep understanding of the business. Technically, it is also advisable to use big-data technologies such as Hadoop, MapReduce, and HBase to keep detection performant and to reduce the performance impact on business systems.
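The sketch below shows a few of these checks (non-null, range, duplicate) over a batch of records; the field names and thresholds are assumptions for illustration, and a production implementation would run on the big-data stack mentioned above.

    # Minimal data quality check sketch over a batch of records.
    # Field names and rule thresholds are illustrative assumptions.
    def quality_report(records: list) -> dict:
        issues = {"null": 0, "range": 0, "duplicate": 0}
        seen_ids = set()
        for r in records:
            # Non-null check: key fields must be present.
            if r.get("meter_id") is None or r.get("reading") is None:
                issues["null"] += 1
                continue
            # Range check: a meter reading should fall in a plausible interval.
            if not (0 <= r["reading"] <= 1_000_000):
                issues["range"] += 1
            # Duplicate check: each meter should appear once per batch.
            if r["meter_id"] in seen_ids:
                issues["duplicate"] += 1
            seen_ids.add(r["meter_id"])
        return issues

    print(quality_report([
        {"meter_id": "M001", "reading": 532.0},
        {"meter_id": "M001", "reading": 532.0},   # duplicate
        {"meter_id": "M002", "reading": None},    # null
    ]))
    # {'null': 1, 'range': 0, 'duplicate': 1}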

Business intelligence (BI): The purpose of data governance is use. For a large data warehouse, data usage scenarios and requirements change constantly, and BI products make it possible to quickly obtain the required data, analyze it, and turn it into reports. Parker Data, for example, is a professional BI vendor.

Data sharing and exchange: Data sharing covers sharing within and between organizations, and it comes in three forms: database tables, files, and API interfaces. Sharing database tables is the most direct; file sharing can be implemented with ETL tools via reverse data exchange. We recommend sharing through API interfaces: the central data warehouse retains data ownership while granting the right to use the data through APIs. API sharing can be implemented with an API gateway, whose common functions include automatic interface generation, application review, rate limiting, concurrency limiting, multi-tenant isolation, call statistics, call auditing, black and white lists, call monitoring, quality monitoring, and so on.
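As a hedged sketch of the API sharing pattern, the following standard-library Python service exposes one shared data interface with a crude per-client rate limit; the endpoint path, data, and limit are assumptions, and a real deployment would sit behind an API gateway providing the functions listed above.

    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Data the central warehouse is willing to share (illustrative assumption).
    SHARED_DATA = {
        "/api/billing/monthly": [{"customer_id": "C001", "energy_kwh": 532.0}],
    }

    # Crude per-client rate limit: at most N calls per 60-second window (assumption).
    CALLS_PER_MINUTE = 30
    _calls = {}  # client address -> list of recent call timestamps

    class ShareHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            now = time.time()
            client = self.client_address[0]
            recent = [t for t in _calls.get(client, []) if now - t < 60]
            if len(recent) >= CALLS_PER_MINUTE:
                self.send_error(429, "Rate limit exceeded")
                return
            _calls[client] = recent + [now]

            payload = SHARED_DATA.get(self.path)
            if payload is None:
                self.send_error(404, "Unknown data interface")
                return
            body = json.dumps(payload).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Consumers call the API; the warehouse keeps ownership of the data.
        HTTPServer(("127.0.0.1", 8080), ShareHandler).serve_forever()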