Guo Huijin, Jia Guofeng, Ma Feifei, Zhang Xi
(National Geological Data Library)
Abstract Starting from the significance of converting digitized geological data (scanned text and graphics) into double-layer PDF, this paper describes the characteristics of double-layer PDF and OCR technology and their application prospects, discusses the choice of conversion method, introduces the OCR digital processing system in detail together with methods for improving the recognition rate, and finally sets out the significance of double-layer PDF for the construction of geological data libraries.
Keywords double-layer PDF; OCR; recognition rate
Geological data collection institutions are currently stepping up their digitization work. By the end of 2013, more than 20 provincial data libraries had completed the digitization of their collections, and the digitization of the geological results data held by the National Geological Data Library was nearing completion. The huge volume of data produced has become an important resource for public geological information services. Such digitized data is static: it is convenient to read and use, but it cannot be searched in full text and does not lend itself to further analysis and processing. Performing OCR recognition on the existing data and converting it into double-layer PDF files turns static data into dynamic data, and makes it possible to build a full-text database and retrieve geological data by its full text. This has become an important task for geological data collection institutions in advancing the digitization of their data.
1 Double-layer PDF and OCR technology
A double-layer PDF is a searchable PDF file generated from scanned data through OCR recognition: the upper layer is the original image, the lower layer is the recognition result, and the positions in the two layers correspond one to one. A double-layer PDF file retains 100% of the original layout while supporting selection, copying, retrieval and other functions. The resulting PDF files can be stored on CD-ROM, hard disk or disk array and managed systematically through an indexed database.
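The two-layer structure can be pictured with a small data model. The following Python sketch is illustrative only and does not read or write real PDF files; the class names, field names and the file name are hypothetical. It shows how a hidden text layer whose fragments carry image coordinates lets a keyword hit be mapped back to a position on the scanned page.

```python
from dataclasses import dataclass, field


@dataclass
class TextBox:
    """One recognized text fragment and where it sits on the page image."""
    text: str
    x: int       # left edge, in image pixels
    y: int       # top edge, in image pixels
    width: int
    height: int


@dataclass
class DoubleLayerPage:
    """A page with a visible image layer and a hidden text layer."""
    image_file: str                                  # upper layer: the scan
    text_layer: list = field(default_factory=list)   # lower layer: OCR result

    def full_text(self) -> str:
        """Concatenate the text layer, one fragment per line."""
        return "\n".join(box.text for box in self.text_layer)

    def find(self, keyword: str) -> list:
        """Return the boxes whose text contains the keyword."""
        return [box for box in self.text_layer if keyword in box.text]


page = DoubleLayerPage(
    image_file="page_001.tif",  # hypothetical file name
    text_layer=[
        TextBox("Geological survey report", 120, 80, 640, 40),
        TextBox("Chapter 1 Regional stratigraphy", 120, 160, 720, 36),
    ],
)
hits = page.find("stratigraphy")
print([(h.x, h.y) for h in hits])  # position of the hit on the image
```

Because each fragment keeps its own coordinates, selection and highlighting can be drawn over the image while the reader still sees the original scan.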
OCR (Optical Character Recognition) refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, text material is scanned, and the resulting image files are analyzed and processed to obtain text and layout information. With the rapid development of computer networks, the move to electronic information has become an inevitable trend. Text is the most important and most concentrated carrier of information, so converting it to electronic form is particularly important. OCR is the key step in that conversion and has changed the traditional notion of how information on paper is entered. With OCR, image information captured from newspapers, books, manuscripts, forms and other printed materials through cameras, scanners and other optical input devices can be converted into text that a computer can recognize and process. Compared with traditional manual entry, OCR therefore greatly improves the efficiency of data storage, retrieval and processing.
2 Application Status
PDF is widely used by governments and in finance, law, engineering, medicine and many other sectors around the world, and has become a standard format for official documents in government, academic and other organizations. Documents in PDF format will therefore make up the bulk of archive collections in the future. Double-layer PDF effectively resolves the conflict between the cost of recognition and the needs of reading and use, and is a promising resource format. OCR technology is relatively mature abroad: major companies including IBM, Motorola, HP and Microsoft have all carried out research in this area and bundle OCR technology with their products.
OCR technology is now also very widely used in China. In information retrieval research, which covers double-layer PDF retrieval, China's "863" program had begun before 2008 to run unified evaluations of Chinese OCR, automatic word segmentation, automatic summarization, automatic search and automatic positioning. On this basis, a series of digitization projects have gradually been established in China, such as digital libraries, digital archives, digital newspapers and magazines, and digital campus networks. Examples include the full-text literature databases of the General Administration of Press and Publication, the Ministry of Outreach, and the Central Committee of the Communist Youth League, and full-text periodical databases such as 75 Years of China's Youth and 20 Years of Xinhua Abstracts. As early as 1999, the National Library set up a "National Library Literature Digitization Center" to digitize various types of collection literature and run OCR on them. On that basis it built three major categories of database (bibliographic, title and full-text) and gradually became the central hub of China's online information resources.
As informatization spreads throughout China, the outlook for OCR applications is broadening. Concepts such as digital libraries and digital archives give OCR an increasingly distinctive role in the digitization of paper archives: it saves manpower and material resources and allows archival information resources to deliver their full value in serving the public.
3 Significance of double-layer PDF conversion of digitized data
3.1 An important part of geological data informatization
As society becomes more information-based, people depend more and more on information resources, and the demand for efficient management, retrieval and use of archive resources is increasingly urgent. Digitization is an important part of informatization, and the core of informatization is resource construction. Resource construction comprises three major tasks: first, scanning and digitizing the paper collections and building catalog databases; second, archiving and managing electronic documents; and third, building full-text databases and full-text retrieval systems. Given the progress of digitization in each library and the needs of users, obtaining electronic information in true text form, so that digitization is more effective and thorough and reaches the widest possible range of users, requires applying OCR technology to convert the scanned raster files into double-layer PDF and then building a full-text database and full-text retrieval for geological data.
3.2 A prerequisite for full-text search of geological data and for building a full-text database
Practice has shown that full-text search based on double-layer PDF documents effectively improves query and utilization efficiency. An index is built from the archive database records and the text layer of the double-layer PDF documents, so queries need not access the database, which effectively reduces the load on the database and the system. Such a system can support data on the order of at least 10 million records, millisecond query times and thousands of concurrent accesses per second, achieving high capacity and high speed; it runs on both Linux and Windows and supports a variety of database interfaces. It has the architecture and functions of a general-purpose search engine: it can segment user input into words, supports multi-keyword and keyword-combination searches, and is user-friendly. It can also mine user data according to customer needs, increasing the value of the full-text archive retrieval system.
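The idea of answering queries from an index built over the text layer, rather than from the database itself, can be sketched with a minimal inverted index. This is an illustrative Python sketch under stated assumptions, not the retrieval system described above: the class name, the document identifiers and the sample text are hypothetical, and the tokenizer simply splits on non-alphanumeric characters, whereas Chinese text would need a real word segmenter.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumeric characters.

    A stand-in for word segmentation; Chinese text would need a
    dedicated segmenter instead of this simple split.
    """
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class FullTextIndex:
    """Inverted index: each term maps to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id: str, text: str) -> None:
        """Index the text layer of one document."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> list:
        """Return documents that contain every keyword in the query."""
        terms = tokenize(query)
        if not terms:
            return []
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return sorted(result)


index = FullTextIndex()
index.add("report_A.pdf", "Copper ore exploration in the northern basin")
index.add("report_B.pdf", "Iron ore reserves and basin stratigraphy")

print(index.search("ore basin"))     # both reports contain both terms
print(index.search("copper basin"))  # only report_A.pdf
```

A lookup touches only the in-memory postings for the query terms, which is why an index of this kind can answer without querying the underlying database.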
3.3 A prerequisite for the standardized construction of modern data centers
The construction of a modern data center should first standardize the storage structure of electronic documents, that is, establish a universal and widely used format for storing and exchanging electronic document information. PDF has been fully adopted internationally under the latest standards as the long-term preservation format for electronic document management, and offers good compatibility, strong fidelity to the original record and a sound security control strategy, making it the best choice for long-term preservation of electronic documents. Converting the digitized collection data to PDF is therefore imperative.
4 Double-layer PDF conversion methods
4.1 Common double-layer PDF conversion methods
Double-layer PDF conversion technology in China is now relatively mature. Under existing technical conditions, the methods fall roughly into the following three categories:
4.1.1 Software conversion
Popular conversion programs include Adobe Acrobat, ABBYY FineReader 12 (English and Chinese recognition), Readiris Corporate 12 (high English recognition rate), Foxit Phantom 5 (can display the text layer alone), Tsinghua Wentong TH-OCR XP8 (higher recognition rate), Hanwang Text King 5800 (better layout recognition, high recognition rate for pure Chinese) and Shangshu seven OCR. These programs generate double-layer PDF files directly after OCR recognition, quickly and efficiently. However, the recognition rate depends directly on the condition of the original paper material (printing method, clarity, paper quality and so on) and on the operator's technical skill. If the original is of good quality, the recognition rate is relatively high; if it is of poor quality, the recognition rate is relatively low.
4.1.2 Workflow processing
The image is put through a new OCR recognition workflow according to the relevant technical requirements, and the PDF document is regenerated, giving a high rate of text correctness and accurate text positioning. This approach amounts to producing the double-layer PDF document through a complete processing workflow, so the workload is heavy and the method is time-consuming and costly.
4.1.3 Identification and reconstruction
The PDF file is regenerated, with the fonts, font sizes and colors of the layout recovered and reconstructed. The text is correct and the pages are clear, but the result differs considerably from the original layout. This method is used mainly for books.
4.2 Double-layer PDF conversion of geological data
In 2011 the National Geological Data Library began trial double-layer PDF conversion on the basis of its scanned, digitized data, mainly using the first method, software conversion, in which double-layer PDF files are produced directly after automatic OCR processing by the software. Geological data differs from ordinary paper files: varied paper types and printing methods, handwritten and old material, and stratigraphic, mathematical and other special symbols all make automatic OCR recognition difficult, so recognition by a single software package cannot meet the recognition rate of more than 90% required for full-text search.
The conversion tests led to the following conclusions:
1) Geological data is itself diverse, and the actual recognition rate is affected mainly by print quality, age and similar factors. Old material and material on poor-quality paper generally have lower recognition rates. For handwritten documents, accuracy depends on the writer's habits and the clarity of the writing and is generally below 30%; for mimeographed documents it is generally below 50%; printed and offset-printed documents have higher recognition rates, generally 90% or more. For every type of document, the recognition rate for punctuation is very low, and the recognition rate for stratigraphic, mathematical and other special symbols is almost zero.
2) Current recognition technology cannot achieve 100% recognition. Depending on actual needs, the initial recognition results must be manually proofread against the paper file in order to meet the needs of full-text search.
3) Scanned geological documents are numerous and large, and conversion speed is limited by computer performance. High-volume conversion and recognition requires high-specification computers, and batch conversion and manual checking are time-consuming and labor-intensive, so dedicated funding is needed to support the work.
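The paper does not state how the recognition rates above were measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between a manually verified reference text and the OCR output, divided by the length of the reference. The Python sketch below implements that assumption; the function names and the sample strings are hypothetical, and the default threshold of 0.90 mirrors the 90% requirement for full-text search mentioned above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def recognition_rate(reference: str, recognized: str) -> float:
    """Character accuracy of OCR output against a verified reference."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


def needs_proofreading(reference: str, recognized: str,
                       threshold: float = 0.90) -> bool:
    """True when a sampled page falls below the required rate."""
    return recognition_rate(reference, recognized) < threshold


# Hypothetical sample: 20 reference characters, one misread ("l" as "1")
good = recognition_rate("geological data file", "geo1ogical data file")
# Hypothetical sample: 10 reference characters, two misread
poor = recognition_rate("stratum 12", "strotum l2")
print(round(good, 2), round(poor, 2))
```

Sampling a few manually transcribed pages per document type and scoring them this way gives a figure that can be compared across handwritten, mimeographed and printed material.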
4.3 The OCR digital processing system and its functions
After comparing the double-layer PDF conversion methods currently available in China, and taking into account the complexity of geological data and the results of the conversion tests, we recommend that double-layer PDF conversion of geological data mainly combine software recognition with workflow processing, that is, use an OCR digital processing system, which can complete the conversion with high efficiency and high quality. The system comprises the following main processes:
Figure 1 Schematic diagram of OCR digital processing system
1) Image processing. To improve the recognition rate, the image undergoes "blue decontamination" processing, which removes noise that affects recognition, such as speckles and underlining. An image quality control program automatically monitors the quality of the image processing.
2) Layout analysis. The system automatically interprets and locates the layout, determines whether each delineated area is a horizontal text area, vertical text area, table area or image area, and marks areas of different types with wireframes of different colors. Automatic layout analysis runs in the background; the operator confirms the results in the foreground and can intervene manually where necessary.
3) Recognition. Text images are converted into computer text codes. The system recognizes printed and handwritten Chinese (both simplified and traditional), mixed Chinese and English text, and tables. The recognized text can be encoded as GB, BIG5, GBK or Unicode. The recognition process runs in the background.
4) Vertical proofreading. This step has a strong capacity for finding and correcting errors: it lists together all the text images, from one or several page images, that were recognized as the same character, and marks suspicious characters in a prominent color so that the operator can easily find and correct errors.
5) Horizontal proofreading. This is the traditional manual proofreading method, in which the operator directly compares the recognized text with the image to find recognition errors. The system automatically calls up the image corresponding to the text for comparison. Text recognized with low confidence is also marked in an eye-catching color.
6) Layout restoration. The recognized and corrected text is restored to the same layout as the scanned original, producing digital documents in RTF, PDF, HTML or SGML/XML format that can be read by computer and searched.
7) Data storage. The layout-restored digital documents are saved.
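The vertical proofreading step above can be sketched as a grouping operation: every glyph image recognized as the same character is listed together, and low-confidence instances are flagged. The Python sketch below is an illustration under stated assumptions, not the system's actual implementation; the `CharResult` record, the confidence score and the 0.8 cut-off are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CharResult:
    char: str          # character code produced by recognition
    page: int          # page image the glyph came from
    box: tuple         # (x, y, width, height) of the glyph on the image
    confidence: float  # 0.0 to 1.0, hypothetical engine score


def group_for_vertical_proofreading(results, suspicious_below=0.8):
    """Group glyphs by recognized character and flag doubtful ones.

    Returns {char: [(CharResult, is_suspicious), ...]} with the least
    confident instances first, so likely errors appear at the top.
    """
    groups = defaultdict(list)
    for r in results:
        groups[r.char].append(r)
    view = {}
    for char, items in groups.items():
        ordered = sorted(items, key=lambda r: r.confidence)
        view[char] = [(r, r.confidence < suspicious_below) for r in ordered]
    return view


results = [
    CharResult("A", 1, (10, 10, 20, 20), 0.98),
    CharResult("A", 2, (40, 10, 20, 20), 0.55),  # doubtful instance
    CharResult("B", 1, (30, 10, 20, 20), 0.91),
]
view = group_for_vertical_proofreading(results)
for char, items in view.items():
    flagged_pages = [r.page for r, suspicious in items if suspicious]
    print(char, len(items), flagged_pages)
```

Showing all instances of one character side by side lets an operator spot the odd glyph out at a glance, which is faster than reading each page line by line as in horizontal proofreading.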
4.4 Methods for improving the OCR recognition rate
In a double-layer PDF generated by the OCR digital processing system, the error rate of the text layer can be as low as one in 10,000. The file presents the background and colors of the original, supports full-text search and copying for citation, and locates retrieved information down to the individual character, making it quick to find the target information. To reduce the workload of horizontal (manual) proofreading and improve efficiency, the recognition rate must be raised at the source. Testing showed that the following methods improve the OCR recognition rate for raster files.
1) Image color settings. Grayscale or color mode preserves the original appearance of the paper material most faithfully and is the first choice for scanning and digitization, but both modes add background noise that lowers the recognition rate. For material that consists only of text and ordinary black-and-white illustrations, it is recommended to set the scanning program's image color to black and white to raise the recognition rate. The final image color setting should nevertheless follow the specific requirements of the specifications for each type of work.
2) Resolution settings. The lower the scanning resolution, the faster the scan, but the poorer the image quality and the lower the text recognition accuracy. Conversely, a high resolution gives slower scanning but higher recognition accuracy. This is not absolute, however: if the resolution is set too high, small defects in the paper may be recognized as punctuation marks or Chinese characters, and recognition accuracy falls. Repeated tests showed that 300 dpi gives the best balance between scanning speed and text recognition accuracy.
3) Image processing. Here image processing refers to tilt correction and decontamination of the scanned image before output. Tilt correction adjusts the orientation of the text so that it is upright, which helps OCR recognition.
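Two of the measures above, conversion to black and white and decontamination, can be illustrated on a tiny image held as nested lists. This Python sketch is a simplified illustration, not the processing used in the tests: the fixed threshold of 128 and the rule that an ink pixel with no ink neighbours is noise are assumptions chosen for brevity, and tilt correction is not covered.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 1 = ink, 0 = paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    cleaned = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        neighbours += binary[ny][nx]
            if neighbours == 0:
                cleaned[y][x] = 0  # isolated speck, treat as noise
    return cleaned


# Hypothetical 4x5 scan: a 2x2 dark stroke plus one isolated dark speck
gray = [
    [250, 250, 250, 250, 250],
    [250,  30,  40, 250, 250],
    [250,  35,  45, 250, 250],
    [250, 250, 250, 250,  20],
]
binary = binarize(gray)
clean = despeckle(binary)
for row in clean:
    print(row)
```

The stroke survives because each of its pixels touches other ink, while the lone speck in the corner is removed. A rule this simple would also erase genuine isolated dots, so real systems use more careful criteria.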
Once double-layer PDF conversion is complete, the data management system can be linked to the PDF documents, connecting the data content with its metadata and other related information to form data packages. The full-text database is then called to create an index file for the original data, and full-text search is finally achieved. A full-text database and full-text search give high recall and precision, greatly increase the utilization value of geological data, promote its compilation and study, and lay the foundation for research on the aggregation of geological data information and for in-depth services.