ABSTRACT: This paper describes how the advantages of both SAS(R) and UNIX were combined to create the 'Census Reference System' (CRS), a unique on-line database used by the U.S. Bureau of the Census for conducting its many household-based surveys. This system provides rapid access to over one hundred million data records without the use of a traditional relational database. The unique design of the CRS significantly reduced hardware and software requirements and their associated costs, as compared to other more traditional processing approaches. BACKGROUND: The U.S. Bureau of the Census regularly conducts surveys of households and housing across the United States. In order to collect survey information, the Census Bureau creates a master list called the 'Census Reference File' of all the housing units (homes) in the country. The file is based on information from the Decennial Census of Population and Housing and other current surveys. This reference information is used by the Census Bureau to locate and survey housing units for all of its household-based surveys. Some of the surveys using this file are the Current Population Survey, the American Housing Survey, and the Survey of Income and Program Participation. Prior to the implementation of the CRS only a small portion of the housing unit information was accessible via off-line computer tapes. This was due in part to the time required to search such a large volume of data, and in part, to the high cost of keeping the large volume of data in an on-line status on a mainframe computer. As a consequence, prior to the CRS, only a small portion of the information was easily accessible via computer with the remaining information only available by manually searching through vast amounts of microfiche of original Census forms. Therefore, accessing the reference information was time consuming and prone to errors. Traditional relational database solutions to resolve the problems in accessing the Census Reference File information had been considered and tested. Unfortunately, due to the high volume of data and the number of fields (variables) involved, traditional relational database software proved to be extremely slow and required tens of gigabytes of storage. The CRS, the topic of this paper, was developed out of a need to overcome the problems associated with traditional databases and provide Census Bureau field staff, analysts, and management with timely access to information contained in the Census Reference File in the most cost effective manner possible. The basic original requirements for the system are detailed below. REQUIREMENTS: Design a system that would: 1) Give non-programmers rapid access to over one hundred million data records in a simple to use environment 2) Permit on-line updating of some record level information 3) Store the data as compactly and cheaply as possible 4) Support multiple concurrent users with minimum contention for resources 5) Take advantage of inexpensive workstation hardware 6) Develop, test, and implement quickly. The system must support access to four different types of information: 1) Non-alterable housing unit data ( 40 variables, 405 characters per record) 2) Alterable housing unit 'notes' data (4 variables, 200 characters per record) 3) Non-alterable block summary data (23 variables, 254 characters per record) 4) Alterable block summary data (4 variables, 200 characters per record) While the above requirements would seem to be satisfied by a simple relational database application, the volume of the data prohibits this solution. Traditional relational database software products tend to be stricken with the equivalent of 'data traffic grid-lock' when processing a large volume of very large records stored in multiple tables. This is especially true if more than a very small percentage of records are being requested. As the amount of data and the size of records increase, performance does not degrade linearly, it tends to degrade exponentially. As an additional drawback, artificially splitting large logical records into different database tables and adding indexes to join them frequently increases the size of a data base several times over. This in turn increases disk storage, costs, and degrades processing due to the increased data management overhead. SOLUTION: The CRS solves all of the problems associated with a relational database solution and then some, by taking advantage of UNIX pipes, UNIX file compression, UNIX file system hierarchy, the ability of SAS to read data directly from UNIX pipes, and the many application tools available with SAS. The Census Reference File data has a natural hierarchical structure. Housing units (1 housing unit = 1 record) are grouped by blocks. Blocks are grouped by address register areas (ARAs). Address register areas are grouped by district offices (DOs). The CRS capitalizes on this existing data hierarchy by mirroring it in its use of the UNIX file system. The top level CRS-UNIX file storage directory contains a sub-directory for each DO. Each DO has a sub-directory for each of its ARAs. Each ARA sub-directory contains 1 to 4 files containing all of the data for all of the blocks in the ARA. In order to conserve disk space, all CRS data are stored as compressed ASCII text files. In most cases, standard UNIX compression of each data file enables a 6 to 1 decrease in data storage requirements whereas, a traditional relational database product could require a 3 to 1 increase in storage requirements. This results in a very significant 18 to 1 reduction in disk requirements when comparing the CRS with a traditional database solution. The data storage requirement for the CRS is 3.1 gigabytes of disk space which easily fits on 2 high capacity Small Computer System Interface (SCSI) disk drives (formatted capacity of 2.8 gigabytes each) with enough remaining storage for work space and 'notes' data added by system users. The disk space required for a relational database to store and manipulate the same amount of data is estimated to exceed 60 gigabytes. When initially loaded the Census Reference File Information consumed 3,118,443,000 bytes of disk storage (even when compressed). DESIGN AND IMPLEMENTATION: Several system designs for the CRS were rapidly prototyped using SAS and tested with a sample of the Census Reference File data. The first prototype design involved keeping the data in large compressed ASCII data files, one data file per district office. This approach worked, but it was rejected because it took several minutes and considerable amounts of additional temporary disk space to load the requested data into SAS data sets. Concurrent data requests slowed this prototype design dramatically due to competition for disk access to the very large files and data sets. A second prototype design placed the data in large SAS data sets (one per district office) and added indexes on key fields. While this approach increased access speed dramatically, it was also rejected because of the large amount of disk space required to store the large data sets. The third and final prototype design for the CRS proves to be very efficient for both data access and data storage. This design stores data in thousands of compressed ASCII files (1 to 4 per ARA). Splitting the data up into relatively small compressed files (approximately 60 kilobytes compressed), based on the data's natural hierarchical structure, greatly reduces the time required for the system to load the desired data. Users supply the system via SAS/AF(R) screens, with the DOs and ARAs they wish to examine. The system then un-compresses the appropriate data files into UNIX pipes which are read directly by SAS and loaded into SAS data sets. The data sets are then merged, as necessary, and placed into SAS/FSVIEW(R) screens with custom pop-menus. The average time to load data is approximately one to two seconds of real time. This reduction in data loading time virtually eliminates disk access contention. The performance achieved by the CRS when compared to a relational database processing similar size data requests, was evaluated to be many times faster. File System Hierarchy: UNIX Directory/ UNIX Sub-Directory/ Compressed Text Files District Office (Directories) Count = 449 Address Register Area (Sub-directories) Count = 153,272 Housing Unit Data (Files) Count = 152,833 Variables Per File = 40 Length of File in Characters = 405 Av. Records Per File = from 500 to 2,000 Block Summary Data (Files) Count = 123,690 Variables Per File = 23 Length of File in Characters = 254 Av. Records Per File = from 20 to 500 Housing Unit Notes (Files) Count = 0 (built later by system users) Variables Per File = 4 Length of File in Characters = 200 Av. Records Per File = from 500 to 2000 Block Summary Notes (Files) Count = 0 (built later by system users) Variables Per File = 4 Length of File in Characters = 200 Av. Records Per File = from 20 to 500 The CRS also includes some other interesting innovations. In addition to viewing data, printing reports, and performing extractions for analysis, the system users requested that the system be able to add and modify note fields at the record level for both housing unit and block summary data. All of this was accomplished by means of compressed ASCII files for both source data and related note data. Note files contain only record identification information and the actual note data and only for those records having notes. When a selection of records is requested 1) the appropriate data files are loaded into SAS data sets, 2) their corresponding note files (if any) are loaded into SAS data sets, and 3) the appropriate SAS data sets are then merged for viewing and/or modification. Upon termination of a records request and if a modification has been made by an authorized user, the alterable notes data is written back out to the appropriate ASCII file and compressed. File locking is handled by SAS changing the UNIX file protections and creating a lock file so that only one user at a time can access a specific note file for modification. Another interesting feature of CRS is that the system uses customized SAS pop-menus over SAS/FSVIEW screens under SAS Version 6.07.02 running on a SUN UNIX Version 4.1.3 operating system (see Figures 4 & 5). This feature was designed and implemented in the CRS even though, at the time, there was no documented way to accomplish this. It took several days each to develop working prototypes for the 3 system designs on a SUN SPARC 1+ workstation. The adding of refinements and user requested features to the final design version took less than a week. Loading of the data into the CRS took over 2 months of near continuous processing across 3 major computer systems. First, the data was prepared from hundreds of tapes into separate DO files on the Census Bureau's Unisys mainframes. Next the data were downloaded from the mainframes to Digital Equipment Corporation (DEC) minicomputers where the files were compressed and directly accessible by the SUN-SPARC workstation. The compressed files on the DEC minicomputers were 'mounted' from the SUN workstation with the Network File System (NFS) and processed directly into the hundreds of thousands of directories, sub-directories, and data files on disks local to the SUN workstation. (The data files were compressed on the DEC computers to reduce network transfer traffic and because there was insufficient disk space to both store and process all the uncompressed data directly on the SUN workstation's disks.) In summary, the primary features of SAS and UNIX which made this system possible were: 1. UNIX file compression 2. UNIX pipes 3. The UNIX file system and its hierarchy 4. The ability of UNIX to uncompress data as it flows into a pipe (the zcat command) 5. The ability of SAS under UNIX to read data directly from a pipe as if it were a file CONCLUSION: The CRS has now been in full service for nine months. The system hardware is located in a remote site and provides access for 25 to 100 users a day from both the Census Bureau's headquarters in Suitland, Maryland and its main processing site in Jeffersonville, Indiana via a secure wide area network. On occasion the system provides complete searches and large extracts from the complete collection of CRS data. This is accomplished by using simple custom UNIX and SAS programs. By taking advantage of the best characteristics of both SAS and UNIX, the Census Reference System was developed quickly and was able to meet, if not exceed, the specified design requirements. This paper, and all of the software programs for this system are available via Internet File Transfer Protocol (FTP) connection to 'ftp.census.gov'. This should be more convenient than retyping partial examples from a publication. The paper, catalogs, programs (sorry, no data) will initially be located in a sub-directory under 'src/sas'. If you have any questions about this paper, feel free to contact: Edward 'Cary' Bean, Jr. U.S. Department of Commerce Bureau of the Census Washington, D.C. 20233 phone - 301-457-2157 Internet - cbean@census.gov SAS, SAS/AF, SAS/FSP, and SAS/FSVIEW are registered trademarks of the SAS Institute Inc., in the USA and other countries. (R) indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.