1 Scope
This standard specifies the main technical requirements for the digitization of paper archives.
This standard applies to the digitization of all types of paper archives using scanners.
2 Normative references
The provisions of the following documents become provisions of this standard through reference in this standard. For dated references, subsequent amendments (excluding errata) and revisions do not apply to this standard; however, parties who have entered into agreements based on this standard are encouraged to examine whether the latest versions of these documents may be used. For undated references, the latest version applies to this standard.
GB/T 17235.1 Digital compression coding for continuous tone still images, Part 1: Requirements and guidelines
GB/T 17235.2 Digital compression coding for continuous tone still images, Part 2: Compliance testing
GB/T 18894-2002 Specification for the Archiving and Management of Electronic Documents
ITU (CCITT) G3 Binary Image Compression Algorithm
ITU (CCITT) G4 Binary Image Compression Algorithm
DA/T 18-1999 Rules for Archival Description
Interim Provisions on Functional Requirements for Archives Management Software, State Archives Administration Document [2001] No. 6
3 Terminology and Definitions
The following terminology and definitions are applicable to this standard.
3.1
Digitization
The process of converting analog images into digital images using computer technology.
3.2
Digitization of Paper-Based Records
The digitization of ordinary paper records whose black text is clearly legible, including mimeographed, letterpress-printed, offset-printed, and other printed or photocopied records.
3.3
Digital Image
An array of integers representing an image of a scene. A sampled and quantized function of two or more dimensions, derived from a continuous image of the same dimensionality. An array obtained by sampling a continuous function on a rectangular (or other) grid and quantizing the values at the sampling points.
3.4
Binary Image
A digital image that has only two gray levels, black and white. It suits two-state black-and-white originals such as text manuscripts, line drawings, and fingerprint images.
3.5
Continuous-tone Still Image
A static digital image made up of more than two levels of gray with different shades of intensity, or of a combination of different color channels.
3.6
Distortion Measure
The difference in color, geometry, compression effects, etc., between a digital image and the archival original after the original has been digitized, measured in the same test environment.
3.7
Intelligibility
The ability of a digital image to convey information to a person or machine.
3.8
Image Compression
Any process that removes redundancy from an image or approximates an image in order to represent it in a more compact form.
3.9
Resolution
The number of points or pixels contained in an image per unit length.
3.10
TIFF Tagged Image File Format
A tag-based image file format that supports lossless compression (no information loss) and is used to exchange files between applications and between computer platforms. Because it preserves fine image detail, it is well suited to storing and reproducing originals as black-and-white files.
3.11
JPEG Joint Photographic Experts Group
A lossy compression format that discards a small amount of information. It is intended mainly for screen display and printing and is supported by all major computer platforms and Web browsers. JPEG files are small, and image quality is acceptable in most cases. For reasons of storage space and transmission efficiency, original color files may be converted to this format for copying and storage.
4 Basic process of digitization
The basic process of digitizing paper archives comprises six stages: file organization, directory construction, batch scanning, data processing, information storage, and retrieval and utilization.
4.1 File organization
The case files to be scanned are properly organized and marked.
4.2 Directory construction
The necessary directory database is established for digital archive retrieval.
4.3 Batch Scanning
Scanning is done in batches as planned according to the overall arrangement of the specific task of digitizing the archives.
4.4 Data Processing
4.4.1 Proofread the scanned images to ensure that they are complete and error-free and, where necessary, apply technical processing such as deskewing, decontamination, and splicing to problem images.
4.4.2 Process the bare (raw) data before acceptance, including file format conversion, logical disc partitioning, and the addition of descriptive documents, and carry out data hookup, inspection, uploading, quality checks, and backup.
4.5 Information storage
Select the data format, encoding method, and storage medium appropriate to each type of scanned image.
4.6 Retrieval and utilization
Provide retrieval and utilization according to user needs.
5 File organization
Before batch scanning, organize files according to the following steps to ensure the quality of archive digitization.
5.1 Separation
5.1.1 Separate the items to be scanned from those not to be scanned within the same file.
5.1.2 Among the items to be scanned, replace each large picture or photograph with an instruction page so that batch scanning can proceed without interruption; the actual picture is scanned separately during image processing and then substituted for the instruction page.
5.1.3 Fill out the "Data Processing Sheet" (see Appendix A) and clearly identify the pages that need special handling.
5.2 Paging
Label the page numbers and piece numbers of the archives before batch scanning. If the piece count or page numbers found during labeling do not agree with those in the original file, the discrepancy should be resolved so that a single numbering serves as the standard.
5.3 Unbinding
Remove the original binding from the file so that the pages can be scanned.
5.4 Rebinding
After scanning, restore the original binding in accordance with archival requirements.
6 Building the directory database
6.1 Directory description
Determine the description items for each file in accordance with DA/T 18 and describe the files accordingly.
6.2 Data format selection
The selected data format should be in common use and should support data exchange, directly or indirectly, through DBF files or XML files.
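As an illustration of XML-based exchange, the sketch below serializes directory records with Python's standard library and reads them back. The element names and the fields `file_number`, `title`, and `page_count` are hypothetical placeholders, not fields defined by this standard or by DA/T 18.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Serialize a list of dict records into an XML string.

    Each dict becomes a <record> element whose children are named
    after the dict keys (hypothetical field names).
    """
    root = ET.Element("directory")
    for rec in records:
        node = ET.SubElement(root, "record")
        for field, value in rec.items():
            ET.SubElement(node, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_records(xml_text):
    """Parse the XML produced above back into a list of dicts.

    All values come back as strings, because XML text is untyped.
    """
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in rec} for rec in root]


sample = [{"file_number": "001", "title": "Annual report", "page_count": 12}]
xml_text = records_to_xml(sample)
restored = xml_to_records(xml_text)
```

A round trip through XML keeps the field names but returns every value as text, so numeric fields such as the page count need converting back on import.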
6.3 Directory input
Enter the described directory records into the computer to create a machine-readable directory database.
7 Batch Scanning
7.1 Scanning Methods
Paper archives can be scanned as black-and-white binary images or as continuous-tone images.
7.1.1 Pages containing single-color text should be scanned as black-and-white binary images; pages containing multi-color text or images may be scanned as continuous-tone images.
7.1.2 Archival materials with clear handwriting and no pictures should be scanned as black-and-white binary images; materials with poor legibility or with pictures may be scanned as continuous-tone images.
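One possible reading of 7.1.1 and 7.1.2 as a decision rule is sketched below. The function name and its three boolean inputs are assumptions made for illustration; the standard only says that continuous-tone scanning may be used in the other cases, so an operator could still choose differently.

```python
def choose_scan_mode(multicolor: bool, has_pictures: bool, clear_handwriting: bool) -> str:
    """Return "binary" or "continuous_tone" for a page.

    Binary scanning is chosen only for single-color pages with clear
    handwriting and no pictures; every other case falls back to
    continuous-tone scanning.
    """
    if not multicolor and not has_pictures and clear_handwriting:
        return "binary"
    return "continuous_tone"


plain_text_page = choose_scan_mode(multicolor=False, has_pictures=False, clear_handwriting=True)
faded_page = choose_scan_mode(multicolor=False, has_pictures=False, clear_handwriting=False)
```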
7.2 Resolution Selection
7.2.1 For monochrome pages, a scanning resolution of 100 to 200 dpi is generally recommended.
7.2.2 For color pages, a scanning resolution of 100 dpi or higher may be selected.
7.2.3 Large-format documents exceeding A3, such as engineering drawings and newspapers, may be captured with a large-format image scanner (for example, A0), with a large-format digital platform, or by microfilming followed by digital conversion of the film; alternatively, they may be scanned in sections on a small-format scanner and the images spliced. The scanning resolution should be 100 dpi or higher.
7.2.4 Where necessary, the scanning resolution may be adjusted according to the clarity of the original. If the original is of poor quality or small in size, the resolution may be increased as appropriate; conversely, it may be reduced. In either case the criterion is that the scanned image is clearly legible when displayed at the size of the original.
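To see what a resolution choice means in practice, the following sketch converts a page size and a dpi value into pixel dimensions. The A4 page size (210 mm x 297 mm) is used only as an example and is not prescribed by this standard.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel width and height of a page scanned at the given resolution."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


a4_at_200 = pixel_dimensions(210, 297, 200)  # (1654, 2339)
a4_at_100 = pixel_dimensions(210, 297, 100)  # (827, 1169)
```

Halving the resolution halves each dimension and so cuts the pixel count to roughly a quarter, which is why the resolution choice has a large effect on storage size.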
7.3 Scanning of special pages
7.3.1 Pasted pages and forms
Pasted or folded pages may be scanned with a large-format scanner, or scanned in sections and spliced afterwards. Where the type is very small or the handwriting dense, the scanning resolution may be increased as appropriate, grayscale or color scanning selected, and local darkening applied. Where the handwriting and the form differ in color depth, local lightening is applied.
7.3.2 General text flow charts
Scan at an appropriate resolution and apply local darkening to ensure that text flow charts are clear. Different equipment may be used as needed to scan text flow charts.
7.3.3 Illustrations
Scan pages that combine illustrations and text with high-resolution grayscale or color scanning so that the original page layout is preserved and the illustrations are clear.
7.3.4 Photographs
For documents with black-and-white or color photographs on the page, scanning to JPEG format keeps the photographs clear while avoiding excessive image storage space.
7.4 File naming
7.4.1 Create folders
When scanning, create folders that follow the hierarchy of the archival entity; each document also needs its own separate folder.
7.4.2 Folder naming: name each folder after the file number of the archival entity, generally using 3 digits; numbers with fewer than 3 digits are padded on the left with "0".
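The zero-padding rule can be sketched as follows. The hierarchy levels passed to `folder_path` (here a fonds code and a year) are hypothetical examples of an archival hierarchy, not levels defined by this standard.

```python
from pathlib import PurePosixPath


def folder_name(file_number: int, width: int = 3) -> str:
    """Left-pad a file number with "0" to the given width."""
    return str(file_number).zfill(width)


def folder_path(levels, file_number: int) -> PurePosixPath:
    """Build a folder path from hierarchy levels plus the padded file number."""
    return PurePosixPath(*levels, folder_name(file_number))


name_short = folder_name(7)      # "007"
name_full = folder_name(123)     # "123"
example_path = folder_path(["F001", "1999"], 42)
```

Numbers that already have 3 or more digits are left unchanged, so a file number such as 1234 is not truncated.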
8 Data Processing
8.1 Image Processing
8.1.1 Deskewing
Correct images that were skewed during scanning so that the skew angle of the digital image is less than 1 degree (that is, the skew across the page does not exceed half a character).
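A minimal sketch of checking the 1 degree limit is given below. It assumes that two points on a text baseline have already been located in pixel coordinates, ordered from left to right; how those points are detected is outside the scope of the sketch.

```python
import math


def skew_angle_degrees(x1: float, y1: float, x2: float, y2: float) -> float:
    """Absolute angle between the baseline through two points and the horizontal.

    Assumes (x1, y1) is the left point and (x2, y2) the right point.
    """
    return abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))


def within_skew_limit(x1, y1, x2, y2, limit_deg: float = 1.0) -> bool:
    """True if the measured skew is below the limit (1 degree by default)."""
    return skew_angle_degrees(x1, y1, x2, y2) < limit_deg


# Baseline drifting 20 px and 40 px over a 1654 px wide page.
small_drift_ok = within_skew_limit(0, 0, 1654, 20)
large_drift_ok = within_skew_limit(0, 0, 1654, 40)
```

On a page 1654 pixels wide, a vertical drift of 20 pixels corresponds to about 0.7 degrees and passes, while a drift of 40 pixels is about 1.4 degrees and needs correction.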
8.1.2 Decontamination
Remove impurities in the digital image that affect intelligibility. Decontamination of digitized archive images should present the original appearance of the archive, on the premise that intelligibility is not affected.
8.1.2.1 Localized decontamination, such as the removal of black edges and stains.
8.1.2.2 Overall decontamination, which removes stains from the whole page in one pass.
8.1.3 Splicing
Splice digital images of pages that were scanned in separate sections to ensure the integrity of the digitized image of the archive.
8.1.4 Proofreading
8.1.4.1 First proofreading: check the quality of the scanned images and mark unqualified images to be returned for rescanning.
8.1.4.2 Corrective processing: process each image according to the problems found in the first proofreading, such as stains, black edges, skew, and image quality, in combination with the "Data Processing Sheet" (see Appendix A).
8.1.4.3 Second proofreading: check the digital images again after the first proofreading and corrective processing, and mark any pages left unclear by image processing. Images that fail are returned for reprocessing.
8.2 Data Quality Inspection
8.2.1 Text Entry Quality
Control the accuracy of the text entered in the fields marked on the entry list, ensuring that the error rate is below 3‰.
8.2.2 Image quality
Control the clarity, stains, black edges, skew, and similar problems of the image files produced by scanning paper documents so that the required image quality is achieved.
Ensure that the scanned digital images are clear and easy to read across the full variety of papers and of handwritten and printed text.
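The 3‰ threshold can be checked with a short calculation. The sketch assumes the error rate is counted per character entered; the source does not say whether characters or fields are the unit, so the unit is an assumption.

```python
def error_rate_per_mille(errors: int, total_characters: int) -> float:
    """Errors per thousand characters entered."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    return errors * 1000 / total_characters


def passes_text_check(errors: int, total_characters: int, limit: float = 3.0) -> bool:
    """True if the error rate is strictly below the limit (3 per mille by default)."""
    return error_rate_per_mille(errors, total_characters) < limit


sample_ok = passes_text_check(errors=25, total_characters=10_000)    # 2.5 per mille
sample_fail = passes_text_check(errors=30, total_characters=10_000)  # exactly 3 per mille
```

Because the requirement is "less than 3‰", a batch with exactly 3 errors per thousand characters does not pass.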
8.3 Data Hookup
Control the accuracy of the correspondence between archive entry data and image files.
Before generating the bare data CD-ROM, verify that the total number of image files equals the actual number expected from the information collated before scanning, the description information, and the page number information of the files in each volume. If the numbers are not equal, the bare data CD-ROM cannot be generated; a list of the discrepancies can be printed and returned to the image processing staff so that the missing pages are scanned.
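The count verification described above might be sketched as follows. The two input dictionaries are hypothetical data structures chosen for the example: one maps each file number to the page count recorded in the directory, the other maps each file number to the image files actually produced.

```python
def find_page_count_mismatches(expected_pages, scanned_images):
    """Compare expected page counts with the number of image files per file.

    expected_pages: dict mapping file number -> page count from the directory
    scanned_images: dict mapping file number -> list of image filenames
    Returns a sorted list of (file_number, expected, actual) for mismatches.
    """
    mismatches = []
    for file_number in sorted(set(expected_pages) | set(scanned_images)):
        expected = expected_pages.get(file_number, 0)
        actual = len(scanned_images.get(file_number, []))
        if expected != actual:
            mismatches.append((file_number, expected, actual))
    return mismatches


expected = {"001": 3, "002": 2}
scanned = {"001": ["0001.tif", "0002.tif", "0003.tif"], "002": ["0001.tif"]}
to_rescan = find_page_count_mismatches(expected, scanned)
```

An empty result means the counts agree and disc generation can proceed; otherwise the returned list plays the role of the printout sent back for supplementary scanning.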
8.4 Data Inspection
8.4.1 Before acceptance, process the bare data accordingly, including file format conversion, logical disc partitioning, and the addition of descriptive documents.
8.4.2 Inspect the data and give a pass or fail conclusion.
8.4.3 Convert the digitally processed data to bare data CD-ROM format and copy it to a removable hard disk.
8.4.4 The system should automatically record the progress of the inspection.
8.5 Data upload
Upload the data from each stage of archive digitization to the data server over the network in a timely manner so that the data can be consolidated. On upload, each digital image is automatically matched to its directory record, and the electronic address (file name) of the digital image is added to that record, establishing a one-to-one correspondence.
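A minimal sketch of the automatic matching step follows. It assumes, hypothetically, that every image path ends in `<file_number>/<page>.tif`, so the name of the parent folder identifies the directory record; real systems may key the match differently.

```python
from pathlib import PurePosixPath


def link_images(records, image_paths):
    """Attach image paths to directory records by file number.

    records: list of dicts, each with a "file_number" key (hypothetical field)
    image_paths: iterable of paths shaped like ".../<file_number>/<page>.tif"
    Returns (linked records, list of image paths with no matching record).
    """
    by_number = {rec["file_number"]: {**rec, "images": []} for rec in records}
    unmatched = []
    for path in sorted(image_paths):
        key = PurePosixPath(path).parent.name
        if key in by_number:
            by_number[key]["images"].append(path)
        else:
            unmatched.append(path)
    return list(by_number.values()), unmatched


records = [{"file_number": "001"}]
paths = ["scan/001/0002.tif", "scan/001/0001.tif", "scan/009/0001.tif"]
linked, orphans = link_images(records, paths)
```

Images left in `orphans` have no directory record, which signals either a missing entry or a misnamed folder.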
8.6 Data Backup
The various types of data on the server are regularly backed up to prevent data loss.
9 Information Storage
Store images using JPEG digital compression coding or the internationally used standard TIFF format. Binary images are compressed with the international standard fax compression formats. Images are stored page by page in page-number order.
The storage organization of the digital images should correspond to the way the original paper files are stored.
9.1 Compressed storage format
Digital images of archives are stored in TIFF format or with JPEG digital compression coding.
9.1.1 Black and white binary image
Black-and-white binary scanned image files use TIFF with the CCITT G3 binary image compression algorithm, at a compression ratio (Cr) of 15:1. TIFF with the CCITT G4 binary image compression algorithm may also be used, at a compression ratio (Cr) of 30:1.
9.1.2 Continuous tone still images
Continuous tone still images use JPEG digital compression coding, at an average compression ratio (Cr) of 15:1.
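To relate the compression ratios to file sizes, the sketch below estimates the stored size of a page, taking the compression ratio as uncompressed size divided by compressed size. The A4 page at 200 dpi (about 1654 x 2339 pixels) is an illustrative assumption, and real ratios vary with page content.

```python
def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> float:
    """Raw image size in bytes before compression."""
    return width_px * height_px * bits_per_pixel / 8


def compressed_bytes(width_px: int, height_px: int, bits_per_pixel: int,
                     compression_ratio: float) -> float:
    """Estimated size after compression at the given ratio (raw / compressed)."""
    return uncompressed_bytes(width_px, height_px, bits_per_pixel) / compression_ratio


# 1-bit binary image of an A4 page at 200 dpi.
binary_g3 = compressed_bytes(1654, 2339, 1, 15)  # ratio 15:1
binary_g4 = compressed_bytes(1654, 2339, 1, 30)  # ratio 30:1
```

Under these assumptions the page occupies about 484 KB uncompressed, roughly 32 KB at 15:1, and roughly 16 KB at 30:1.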
9.2 Encoding
Encoding methods that can operate in real time should be used wherever possible. There are two kinds: lossy (distortion) coding, which achieves good image quality at a lower bit rate, and lossless (distortion-free) coding, which keeps the information undistorted at a lower compression ratio. In general, internationally used coding and decoding algorithms should be adopted.
9.3 Storage media
Different storage media may be selected for online and for offline storage.
10 Retrieval and utilization
10.1 Retrieval and utilization methods
Digital images of archives can be retrieved and used in three ways: on a stand-alone computer, over a local area network (LAN), and over the Internet.
For LAN transmission and use, the display response should complete within 1 second on average at the 10Base-T bandwidth of the internal LAN. For Internet transmission and use, the display response should complete within 5 seconds on average at a bandwidth of 56 kbit/s. For this reason, the storage size of each page of a digital archive image is generally required to be 50 KB or less.
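The relationship between page size, bandwidth, and response time can be checked with simple arithmetic. The sketch assumes an idealized link with no protocol overhead, takes 10Base-T as 10 Mbit/s, and reads "50K" as 50,000 bytes; all three are simplifying assumptions.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Time to move a file over an ideal link with no overhead."""
    return size_bytes * 8 / bandwidth_bits_per_second


def max_bytes_for(seconds: float, bandwidth_bits_per_second: float) -> float:
    """Largest file that fits in the time budget on an ideal link."""
    return seconds * bandwidth_bits_per_second / 8


lan_time = transfer_seconds(50_000, 10_000_000)  # 10Base-T, about 0.04 s
modem_time = transfer_seconds(50_000, 56_000)    # 56 kbit/s, about 7.1 s
modem_budget = max_bytes_for(5, 56_000)          # 35000.0 bytes in 5 s
```

On these assumptions a 50,000-byte page is displayed almost instantly on the LAN but takes about 7 seconds at 56 kbit/s, so a 5 second average suggests that typical pages are closer to 35,000 bytes and that the 50K figure acts as a per-page ceiling.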
10.2 Retrieval software configuration
Retrieval software should meet the requirements of the "Interim Provisions on Functional Requirements for Archives Management Software" issued by the State Archives Administration. It should provide the basic functions of directory retrieval and convenient access to the digital images of archives.