Today I want to review my last two years working on front-end development at a big data company. I recently changed jobs, so I have summarized my experience to share with you; if you have any suggestions, please leave them in the comments. Thank you.
Today's topic, from the perspective of big data development, covers the necessity of big data governance, the best ideas for graphical modeling, data quality control, and finally the application of big data visualization. These are the insights and lessons from my two years in the field; my understanding may well be biased, and I hope you will offer corrections.
Big data development
Big data development generally has several stages:
1. Data collection: gathering the raw data
2. Data aggregation: cleaning and merging raw data into usable data
3. Data transformation and mapping: classifying and extracting specialized thematic data
4. Data application: providing APIs, intelligent systems, application systems, etc.
Data collection
Data collection happens in two ways, online and offline. Online collection is generally done with crawlers, by scraping, or by pulling from existing applications. At this stage, we can build a big data collection platform that relies on automated crawlers (written in python or nodejs), ETL tools, or a customized extraction and transformation engine. If this step runs through an automated system, all the raw data becomes easy to manage, and tagging the data from the moment of collection standardizes the developers' work and makes the target data sources much easier to manage.
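To make the idea concrete, here is a minimal sketch of what one such automated collector might look like in Python, assuming requests and BeautifulSoup are available; the URL, the CSS selector, and the tag fields are all placeholders for illustration.

```python
# A minimal crawler sketch: fetch a page, extract records, and tag them
# at collection time. URL, selector, and tag fields are hypothetical.
import datetime

import requests
from bs4 import BeautifulSoup

def collect(url, source_tag):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    records = []
    for item in soup.select(".item"):  # hypothetical selector
        records.append({
            "raw_text": item.get_text(strip=True),
            "source": source_tag,  # tag the data from the moment of collection
            "collected_at": datetime.datetime.utcnow().isoformat(),
        })
    return records

rows = collect("https://example.com/list", source_tag="example-crawler")
```

Tagging at the source like this is what later makes the raw data manageable at scale.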
The difficulty of data collection lies in the variety of data sources: mysql, postgresql, sqlserver, mongodb, sqlite, local files, excel statistical documents, even doc files. Organizing all of them into our big data process in a regular, programmatic way is an indispensable part of the work.
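One way to tame that variety is to normalize every source into the same in-memory shape before it enters the pipeline; a rough sketch with pandas and SQLAlchemy, where the connection strings and file names are placeholders.

```python
# A rough sketch: pull several heterogeneous sources into one uniform
# shape (DataFrames). Connection strings and file paths are placeholders.
import pandas as pd
from sqlalchemy import create_engine

def load_sql(conn_str, query):
    engine = create_engine(conn_str)
    return pd.read_sql(query, engine)

mysql_df = load_sql("mysql+pymysql://user:pass@host/db", "SELECT * FROM orders")
pg_df = load_sql("postgresql://user:pass@host/db", "SELECT * FROM orders")
excel_df = pd.read_excel("stats.xlsx")  # local statistical documents
csv_df = pd.read_csv("export.csv")      # plain local files

# Once everything is a DataFrame, the rest of the process can treat all
# sources uniformly.
frames = [mysql_df, pg_df, excel_df, csv_df]
```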
Data aggregation
Data aggregation is the most critical step in the big data process. Here you can standardize the data, clean it, merge it, and archive it; after the data's availability is confirmed, a monitoring process organizes and classifies it. The output of this step is the company's data assets, and once the volume reaches a certain scale, those assets effectively become fixed assets.
The difficulty of data aggregation is standardization: standardized table names, table label classification, table purpose, data volume, whether there is a data increment, whether the data is available. It takes a lot of effort on the business side, and if necessary, intelligent processing can be introduced, such as automatic labeling based on training over the content, or automatically suggesting table names and field names. There is also the question of how to import the data from the raw data, and so on.
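A standard is easier to enforce when it lives in a data structure instead of in people's heads. Below is a hypothetical sketch of the kind of metadata record that could accompany every table; the field names are illustrative, not an established standard.

```python
# A hypothetical metadata record attached to every table at aggregation
# time; the schema is illustrative only.
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    name: str                  # standardized table name, e.g. "ods_sales_orders"
    labels: list = field(default_factory=list)  # classification tags
    purpose: str = ""          # what the table is used for
    row_count: int = 0         # current data volume
    incremental: bool = False  # does the source produce increments?
    available: bool = True     # has the table passed availability checks?

meta = TableMeta(name="ods_sales_orders",
                 labels=["sales", "raw"],
                 purpose="raw order intake",
                 row_count=1_200_000,
                 incremental=True)
```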
Data transformation and mapping
How can the data assets produced by aggregation be made available to specific users? The main thing in this step is to consider how the data will be applied: how do two or three data tables become one table that can provide a service? Then the increments are updated on a regular schedule.
After the previous steps, this step is not especially difficult: how to convert the data, how to clean it, deduplicate standardized data, merge the values of two fields into one field, or compute a chart's worth of data from several available tables, and so on.
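For instance, deduplicating, merging two fields into one, and joining tables into a single serving table could look like this pandas sketch; the table and column names are made up.

```python
# A sketch of typical transformation steps: deduplicate, merge two fields
# into one, join two tables into one serving table. Names are made up.
import pandas as pd

users = pd.DataFrame({"user_id": [1, 1, 2],
                      "first_name": ["An", "An", "Bo"],
                      "last_name": ["Li", "Li", "Wang"]})
orders = pd.DataFrame({"user_id": [1, 2], "amount": [30.0, 55.0]})

users = users.drop_duplicates(subset="user_id")  # deduplicate
users["full_name"] = users["first_name"] + " " + users["last_name"]  # two fields -> one

serving = users[["user_id", "full_name"]].merge(orders, on="user_id")
print(serving)  # one table ready to provide a service
```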
Data application
There are many ways to apply the data, both externally and internally. If you have built up a large volume of data assets in the earlier stages, should they be provided to users through a restful API? Or fed to consuming applications through a streaming engine such as KAFKA? Or composed directly into thematic data for your own applications to query? The requirements on the data assets here are relatively high, so if the earlier work was done well, the degree of freedom here is very high.
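As a sketch of the restful route, a thematic table could be exposed in a few lines of Flask; the endpoint name and the data behind it are hypothetical.

```python
# A minimal sketch of serving thematic data over a restful API with
# Flask; the endpoint name and the stub data are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# In practice this would query a thematic table; a stub stands in here.
THEMATIC_DATA = [{"region": "east", "sales": 1200},
                 {"region": "west", "sales": 950}]

@app.route("/api/v1/sales-by-region")
def sales_by_region():
    return jsonify(THEMATIC_DATA)

if __name__ == "__main__":
    app.run(port=8080)
```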
Summary: the difficulty of big data development
The difficulty of big data development is mainly monitoring. How do you keep developers' work on track? A developer can casually collect a pile of garbage data and connect it straight to the database. In the short term these problems are small and can be corrected, but as the volume of assets grows, they become a time bomb that can go off at any moment and set off a chain of effects on the data assets: data disorder, declining asset value, and lower customer trust.
How to monitor developers' development process?
The answer can only be an automation platform. It is the only thing that can keep developers comfortable while embracing new ways of working and leaving the manual era behind.
This is where front-end engineers have an advantage in the big data industry: how do you build a well-designed, interactive visual interface? How do you turn the existing workflow and work requirements into a visual operating interface? Can intelligence replace some of the mindless manual operations?
In a certain sense, I personally think front-end engineers occupy a very important position in big data development, second only to big data development engineers, with backend and system development third. Good interaction is crucial. For how to convert and extract data, predecessors have already stepped on the pitfalls: kettle, for example, or kafka and countless pipeline solutions. The key questions are: how should the interaction work? How do you realize the visual interface? That is an important topic.
Some friends with a different focus think the front-end role is dispensable. I think that is wrong. The backend is indeed very important, but backend solutions are plentiful. The front-end's actual status is higher than it looks, yet there are basically no open-source solutions. If front-end development does not get enough attention, the result is bad interaction, a bad interface, and a poor experience, which drives developers away; and visualization is a vast body of knowledge that demands a lot from developers.
Big Data Governance
Big data governance should run through the entire big data development process; it plays an important role. A brief introduction to a few points:
Data lineage
Data quality review
Platform-wide monitoring
Data lineage
Data lineage should be the entry point of big data governance. Through a single table you can clearly see its origins and destinations: how fields were split, how the data was cleaned, where the table flows, and how its data volume changes. I personally believe the entire goal of big data governance starts from lineage, and that lineage is what gives you the ability to monitor the whole picture.
Data lineage depends on the big data development process and wraps around it: every development step and every data import should leave a corresponding record. Once the data assets reach a certain size, lineage is basically essential.
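A lineage record does not have to be elaborate to be useful. A minimal sketch, assuming every development step and data import writes an entry like this; the schema is illustrative, not a standard.

```python
# A minimal sketch of a lineage entry recorded at every development step
# and data import; the schema is illustrative only.
import datetime

def lineage_entry(source_tables, target_table, transformation, row_count):
    return {
        "source_tables": source_tables,    # where the data came from
        "target_table": target_table,      # where it ended up
        "transformation": transformation,  # what was done: split, clean, join...
        "row_count": row_count,            # data volume after the step
        "recorded_at": datetime.datetime.utcnow().isoformat(),
    }

entry = lineage_entry(["ods_orders", "ods_users"],
                      "dwd_user_orders",
                      "join on user_id, drop duplicates",
                      1_180_000)
```

Chaining these entries together per table is what gives you the "ins and outs" view described above.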
Data quality review
In data development, the creation of every model (table) should end with a data quality review. In a large system environment, approval should also be added at the key steps, for example the data transformation and mapping step, which involves providing data to customers. A comprehensive data quality review system should be established to help the company find data problems at the first opportunity, and when a data problem does occur, to see immediately where it is and fix it at the root, rather than blindly connecting to the database and querying sql over and over.
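A review process can start from a handful of automated checks run whenever a model (table) is created; a sketch with pandas, where the specific checks and the 50% null threshold are assumptions for illustration.

```python
# A sketch of automated quality checks run when a table is created;
# the checks and thresholds here are assumptions for illustration.
import pandas as pd

def quality_review(df: pd.DataFrame, key_column: str) -> list:
    problems = []
    if df.empty:
        problems.append("table is empty")
    if df[key_column].isna().any():
        problems.append(f"null values in key column '{key_column}'")
    if df[key_column].duplicated().any():
        problems.append(f"duplicate keys in '{key_column}'")
    null_ratio = df.isna().mean().max()  # worst per-column null fraction
    if null_ratio > 0.5:  # assumed threshold
        problems.append(f"a column is more than 50% null ({null_ratio:.0%})")
    return problems

df = pd.DataFrame({"user_id": [1, 2, 2], "amount": [10.0, None, 5.0]})
print(quality_review(df, "user_id"))  # flags the duplicate key
```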
Platform-wide monitoring
Monitoring actually covers many points: application monitoring, data monitoring, an early warning system, a work order system, and so on. Every data source and data table we take over needs real-time monitoring, so that the moment a critical machine fails or the power goes out, the specific person in charge is notified immediately by phone or SMS. Here you can borrow from the experience of automated operations platforms; monitoring here is much the same as in operations, and good monitoring is essential protection for data assets.
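Monitoring can start from something as small as a heartbeat check that pages the owner; a sketch where notify() is a hypothetical stand-in for a real phone or SMS gateway, and the health URL is made up.

```python
# A sketch of a heartbeat check on a data source; notify() is a
# hypothetical stand-in for a real phone/SMS gateway.
import time

import requests

def notify(owner, message):
    print(f"ALERT to {owner}: {message}")  # replace with phone/SMS integration

def heartbeat(name, health_url, owner):
    try:
        resp = requests.get(health_url, timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        notify(owner, f"data source '{name}' is down: {exc}")

if __name__ == "__main__":
    while True:
        heartbeat("orders-db-proxy", "http://orders-proxy:8080/health",
                  "owner@example.com")
        time.sleep(60)
```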
Big Data Visualization
Big data visualization is not just graphical presentation. Big data visualization is not just graphical presentation. Big data visualization is not just graphical presentation. Important things get said three times. Big data visualization belongs to data development: part of it is application work, and part of it is development work.
On the development side, big data visualization plays the role of visual operation: how do you build a model through a visual interface? How do you make data operations manageable through drag-and-drop, or even three-dimensional manipulation? Expecting to realize a complex operational process by drawing two tables and a few buttons is unrealistic.
On the application side, it is more about how to convert and display data. Charts and graphs are part of it, but the usual work is more about analysis: how do you express the data more intuitively? That requires a deep understanding of the data, and of the business, to build the right visualization applications.
Intelligent visualization platform
Visualization can itself be built visually. superset, for example, produces charts from sql operations, and some products can even intelligently categorize the data's content and recommend a chart type; this kind of real-time visualization development is the direction the field is heading. We need to deliver a large amount of visualization content to the company. Take the apparel industry: the sales department cares about incoming and outgoing shipments, the effect of color schemes on users, and the effect of season on choices; the production department cares about fabric price trends and statistics on capacity and efficiency; and so on. Every department can have its own data big screen, freely laid out through the platform, so that everyone can watch the movements in their own field every day. That is the concrete meaning of applying big data visualization.
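The "recommend a chart from the data" idea can begin with a very simple heuristic; a toy sketch of the kind of rule such products might apply, not how superset or any particular product actually does it.

```python
# A toy heuristic for recommending a chart type from column types; an
# illustration only, not any product's real algorithm.
import pandas as pd

def recommend_chart(df: pd.DataFrame) -> str:
    numeric = df.select_dtypes(include="number").columns
    temporal = df.select_dtypes(include="datetime").columns
    if len(temporal) >= 1 and len(numeric) >= 1:
        return "line chart"    # time + measure -> show the trend
    if len(numeric) >= 2:
        return "scatter plot"  # two measures -> show correlation
    if len(numeric) == 1:
        return "bar chart"     # one measure across categories
    return "table"

df = pd.DataFrame({"month": pd.to_datetime(["2019-01-01", "2019-02-01"]),
                   "sales": [100, 120]})
print(recommend_chart(df))  # -> "line chart"
```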
Written at the end
I have written a lot; this is a summary of what I have seen, heard, learned, and thought over nearly two years. Some readers will ask: isn't this a tech blog? Why is there no code? I want to say: I do want to learn and write about code, but it is separate from this topic. Code is my personal skill, the important tool I sharpen in my own evenings to realize my own ideas, but code by itself has little to do with the business. At work, the person who knows the business writes better code, because he knows what the company wants. If your business knowledge is poor, that's all right too: write good code and work to other people's specifications, and that is also fine. Technology and business go hand in hand; later I will write up my distilled notes on the code side.
Having written all this, the anxiety has not lessened one bit: my code is still not standardized enough. My current technology stack: js, java, nodejs, python.
My main language is js, at maybe 80% proficiency. I am studying Ruan Yifeng's es6 book (almost finished) and the vuejs source code (somewhat stalled); my vuejs is about intermediate, and my css and layout are decent. Beyond that, d3.js and go.js are at the "can use them for work" level. As for nodejs, express and koa are no problem; I have read some of the express source code and written two middleware.
java and python are at the level where I can deliver projects, and for now I don't want to pour a lot of energy into going deeper; I just want to keep them at "ready when I need them".
In the next few years I will work hard and learn more about artificial intelligence and big data development; that area should stay hot for a while yet.
Lastly, mutual encouragement to everyone. Even more, I hope you can give me some planning advice. When three people walk together, one of them can surely be my teacher.