NCSC TECHNICAL REPORT - 005
Volume 1/5
Library No. S-243,039

FOREWORD

This report is the first of five companion documents to the Trusted Database Management System Interpretation of the Trusted Computer System Evaluation Criteria. The companion documents address topics that are important to the design and development of secure database management systems, and are written for database vendors, system designers, evaluators, and researchers. This report addresses inference and aggregation issues in secure database management systems.

ACKNOWLEDGMENTS

The National Computer Security Center extends special recognition to the authors of this document. The initial version was written by Victoria Ashby and Sushil Jajodia of the MITRE Corporation. The final version was written by Gary Smith, Stan Wisseman, and David Wichers of Arca Systems, Inc. The documents in this series were produced under the guidance of Shawn P. O'Brien of the National Security Agency, LouAnna Notargiacomo and Barbara Blaustein of the MITRE Corporation, and David Wichers of Arca Systems, Inc. We wish to thank the members of the information security community who enthusiastically gave of their time and technical expertise in reviewing these documents and in providing valuable comments and suggestions.
TABLE OF CONTENTS

1.0 INTRODUCTION
  1.1 BACKGROUND AND PURPOSE
  1.2 SCOPE
  1.3 INTRODUCTION TO INFERENCE AND AGGREGATION
  1.4 AUDIENCES OF THIS DOCUMENT
  1.5 ORGANIZATION OF THIS DOCUMENT
2.0 BACKGROUND
  2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED
    2.1.1 Inference
      2.1.1.1 Inference Methods
      2.1.1.2 Inference Information
      2.1.1.3 Related Disciplines
    2.1.2 Aggregation
      2.1.2.1 Data Association Examples
      2.1.2.2 Cardinal Aggregation Examples
  2.2 SECURITY TERMINOLOGY
  2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM
  2.4 THE DATABASE SECURITY ENGINEERING PROCESS
  2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE AND AGGREGATION
    2.5.1 The Operational Requirement
    2.5.2 The TCSEC Requirement
3.0 AN INFERENCE AND AGGREGATION FRAMEWORK
  3.1 DETAILED EXAMPLE
  3.2 THE FRAMEWORK
    3.2.1 Information Protection Requirements and Resulting Vulnerabilities
    3.2.2 Classification Rules
    3.2.3 Vulnerabilities from Classification Rules
    3.2.4 Database Design
    3.2.5 Vulnerabilities from Database Design
    3.2.6 Database Instantiation and Vulnerabilities
  3.3 OVERVIEW OF RESEARCH
4.0 APPROACHES FOR INFERENCE CONTROL
  4.1 FORMALIZATIONS
    4.1.1 Database Partitioning
    4.1.2 Set Theory
    4.1.3 Classical Information Theory
    4.1.4 Functional and Multivalued Dependencies
    4.1.5 Sphere of Influence
    4.1.6 The Fact Model
    4.1.7 Inference Secure
    4.1.8 An Inference Paradigm
    4.1.9 Imprecise Inference Control
  4.2 DATABASE DESIGN TECHNIQUES
    4.2.1 Inference Prevention Methods
      4.2.1.1 Appropriate Labeling of Data
      4.2.1.2 Appropriate Labeling Integrity Constraints
    4.2.2 Inference Detection Methods
  4.3 RUN-TIME MECHANISMS
    4.3.1 Query Response Modification
      4.3.1.1 Constraint Processors
      4.3.1.2 Database Inference Controller
      4.3.1.3 Secondary Path Monitoring
      4.3.1.4 MLS System Restrictions
    4.3.2 Polyinstantiation
    4.3.3 Auditing Facility
    4.3.4 Snapshot Facility
  4.4 DATABASE DESIGN TOOLS
    4.4.1 DISSECT
    4.4.2 AERIE
    4.4.3 Security Constraint Design Tool
    4.4.4 Hypersemantic Data Model and Language
5.0 APPROACHES FOR AGGREGATION
  5.1 DATA ASSOCIATION
    5.1.1 Formalizations
      5.1.1.1 Lin's Work
      5.1.1.2 Cuppens' Method
    5.1.2 Techniques
  5.2 CARDINAL AGGREGATION
    5.2.1 Techniques
      5.2.1.1 Single Level DBMS Techniques
        5.2.1.1.1 Database Design and Use of DAC
        5.2.1.1.2 Use of Views
        5.2.1.1.3 Restricting Ad Hoc Queries
        5.2.1.1.4 Use of Audit
        5.2.1.1.5 Store all Data System-High
      5.2.1.2 Multilevel DBMS Techniques
        5.2.1.2.1 Use of Labeled Views
        5.2.1.2.2 Start with All Data System-High
        5.2.1.2.3 Make Some Data High
    5.2.2 Mechanisms
      5.2.2.1 Aggregation Constraints
      5.2.2.2 Aggregation Detection
6.0 SUMMARY
REFERENCES

LIST OF FIGURES

2.1: TRADITIONAL SECURITY ANALYSIS PARADIGM
2.2: DATABASE SECURITY ENGINEERING PROCESS
3.1: ENTITIES AND ASSOCIATIONS
3.2: ATC ATTRIBUTES
3.3: INFERENCE AND AGGREGATION FRAMEWORK
3.4: ATC EXAMPLE DATABASE DESIGN
4.1: PARTITIONING OF THE DATABASE FOR A USER U
4.2: DATABASE WITHOUT AN INFERENCE PROBLEM
4.3: DATABASE WITH AN INFERENCE PROBLEM
4.4: QUERY PROCESSOR
4.5: DBIC SYSTEM CONFIGURATION
4.6: GENERAL-PURPOSE DATABASE MANAGEMENT SYSTEM MODEL
4.7: DATABASE MANAGEMENT SYSTEM MODEL AT THE U. S. BUREAU OF THE CENSUS
5.1: ATC COMPANY DATA ASSOCIATION EXAMPLES

LIST OF TABLES

3.1: TAXONOMY OF PROTECTION REQUIREMENTS AND CLASSIFICATION RULES
3.2: FORMAL DEFINITIONS
3.3: RESEARCH SUMMARY (PART 1)
3.3: RESEARCH SUMMARY (PART 2)

SECTION 1
INTRODUCTION

This document is the first volume in the series of companion documents to the Trusted Database Management System Interpretation of the Trusted Computer System Evaluation Criteria [TDI 91; DoD 85].
This document examines inference and aggregation issues in secure database management systems and summarizes the research to date in these areas.

1.1 BACKGROUND AND PURPOSE

In 1991 the National Computer Security Center published the Trusted Database Management System Interpretation (TDI) of the Trusted Computer System Evaluation Criteria (TCSEC). The TDI, however, does not address many topics that are important to the design and development of secure database management systems (DBMSs). These topics (such as inference, aggregation, and database integrity) are being addressed by ongoing research and development. Since specific techniques in these topic areas had not yet gained broad acceptance, the topics were considered inappropriate for inclusion in the TDI.

The TDI is being supplemented by a series of companion documents to address these issues specific to secure DBMSs. Each companion document focuses on one topic by describing the problem, discussing the issues, and summarizing the research that has been done to date. The intent of the series is to make it clear to DBMS vendors, system designers, evaluators, and researchers what the issues are, the current approaches, their pros and cons, how they relate to a TCSEC/TDI evaluation, and what specific areas require additional research. Although some guidance may be presented, nothing contained within these documents should be interpreted as criteria.

These documents assume the reader understands basic DBMS concepts and relational database terminology. A security background sufficient to use the TDI and TCSEC is also assumed; however, fundamentals are discussed whenever a common understanding is important to the discussion.

1.2 SCOPE

This document addresses inference and aggregation issues in secure DBMSs.
It is the first of five volumes in the series of TDI companion documents, which includes the following documents:

· Inference and Aggregation Issues in Secure Database Management Systems
· Entity and Referential Integrity Issues in Multilevel Secure Database Management Systems [Entity 96]
· Polyinstantiation Issues in Multilevel Secure Database Management Systems [Poly 96]
· Auditing Issues in Secure Database Management Systems [Audit 96]
· Discretionary Access Control Issues in High Assurance Secure Database Management Systems [DAC 96]

This series of documents uses terminology from the relational model to provide a common basis for understanding the concepts presented. This does not mean that these concepts do not apply to other database models and modeling paradigms.

1.3 INTRODUCTION TO INFERENCE AND AGGREGATION

Many definitions of inference and aggregation have been suggested in the literature. In fact, one of the challenges to understanding inference and aggregation is that there are different (but sometimes similar) notions of what they are. We will discuss the differences and similarities at length in later sections. At this point, the following general definitions are sufficient:

Inference, or the inference problem, is that of users deducing (or inferring) higher level information based upon lower, visible data [Morgenstern 88].

Aggregation, or the aggregation problem, occurs when classifying and protecting collections of data that have a higher security level than any of the elements that comprise the aggregate [Denning 87].

Inference and aggregation problems are not new, nor are they only applicable to DBMSs. However, the problems are exacerbated in multilevel secure (MLS) DBMSs that label and enforce a mandatory access control (MAC) policy on DBMS objects [Denning 86a].
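The inference problem defined above can be made concrete with a minimal sketch. The scenario below is invented for illustration (the flight, cargo, and mission values are not from this report): every stored fact is individually labeled LOW, yet a well-known rule of the application domain lets a LOW user deduce HIGH information.

```python
# Hypothetical example: each stored fact is individually labeled LOW,
# but application-domain knowledge allows a HIGH inference.
low_facts = {
    ("flight 707", "destination"): "region Delta",
    ("flight 707", "cargo"): "munitions",
}

def infer_mission(facts, flight):
    # Domain rule known outside the database: a flight carrying
    # munitions is flying a strike mission (a HIGH fact).
    if facts.get((flight, "cargo")) == "munitions":
        return "strike"   # HIGH information deduced from LOW data
    return None

print(infer_mission(low_facts, "flight 707"))  # -> strike
```

Nothing in the sketch penetrates a security mechanism; the leakage follows entirely from the semantics of the data, which is the distinguishing feature of the inference problem.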
Inference and aggregation differ from other security problems since the leakage of information is not a result of the penetration of a security mechanism; rather, it is based on the very nature of the information and the semantics of the application being automated [Morgenstern 88].

1.4 AUDIENCES OF THIS DOCUMENT

This document is targeted at four primary audiences: the security research community, database application developers/system integrators, trusted product vendors, and product evaluators. Inference and aggregation are problems based on the value and semantics of data for the part of the enterprise being automated. Understanding inference and aggregation is important for all of the audience categories, but the audience for whom this document is most relevant includes those involved in designing and engineering automated information systems (AIS)--in particular, the database application developers/system integrators. In general, this document is intended to present a basis for understanding the issues surrounding inference and aggregation, especially in the context of MLS DBMSs. Implemented approaches and ongoing research are examined. Members of the specific audiences should expect to get the following from this document:

Researcher

This document describes the basic issues associated with inference and aggregation. Important research contributions are discussed as various topics are examined. By presenting current theory and debate, this discussion will help the research community understand the scope of the issue and highlight approaches and alternative solutions to inference and aggregation problems. For additional relevant work, the researcher should consult two associated TDI companion documents: Polyinstantiation Issues in Multilevel Secure Database Management Systems [Poly 96] and Entity and Referential Integrity Issues in Multilevel Secure Database Management Systems [Entity 96].
Database Application Developer/System Integrator

This document highlights the need for analysis of the application-dependent semantics of an application domain in order to understand the impact of inference and aggregation on MLS applications. It describes the basic issues and current research into providing effective solutions.

Trusted Product Vendor

This document describes the types of mechanisms from the research literature that can be incorporated into products to assist application developers with enforcing aggregation-based security policies. Research on tools to support inference analysis provides the basis for additional tools that might be provided with a DBMS product.

Evaluator

This document presents an understanding of inference and aggregation issues and how they relate to the evaluation of a candidate MLS DBMS implementation.

1.5 ORGANIZATION OF THIS DOCUMENT

The organization of the remainder of this document is as follows:

· Section 2 discusses several first principles that form the foundation for reasoning about inference and aggregation. Included in Section 2 are subsections that define and explain many aspects of inference and aggregation, present relevant security terminology (e.g., comparing inference to direct disclosure), and describe two fundamental processes: the traditional security analysis paradigm and the database security engineering process.
· Section 3 presents a framework for reasoning about inference and aggregation based on the two processes described in Section 2. Section 3 also introduces a detailed example that is used throughout the document to illustrate the material. Section 3 concludes with a matrix that shows how each research effort discussed in Sections 4 and 5 relates to the framework.
· Section 4 summarizes research efforts on inference.
· Section 5 summarizes research on aggregation.
· Section 6 summarizes the contents of this document.

SECTION 2
BACKGROUND

Inference and aggregation are different but related problems.
In order to understand what they are, what problems they embody, and their relationship, it is essential to revisit several first principles. In Section 2.1, we give further descriptions and discuss inference and aggregation in some detail, including examples. Some important terminology and security concepts are covered in Section 2.2. The next two sections describe fundamental processes that form the basis for reasoning about inference and aggregation: the traditional security analysis paradigm (Section 2.3) and the database security engineering process (Section 2.4). The final section summarizes the operational and TCSEC requirements for inference and aggregation. Although much of what is discussed in this section will be considered by some as intuitively obvious, it is these principles that are often forgotten, leading to misunderstandings. Even experienced researchers and practitioners should review this section.

2.1 INFERENCE AND AGGREGATION - DEFINED AND EXPLAINED

This section provides further definitions and explanations of inference and aggregation problems. Many examples are given to illustrate the relationship and complexity of these problems.

2.1.1 Inference

Inference, the inferring of new information from known information, is an everyday activity of most humans. In fact, inference is a topic of interest for several disciplines, each from its own perspective. In the context of this document:

Inference addresses the problem of preventing authorized users of an AIS from inferring unauthorized information.

Preventing unauthorized inferences is complicated both by the variety of ways in which humans infer new information and by the wide variety of information they use to make inferences. Some researchers characterize the inference problem in terms of inference channels [Morgenstern 88; Meadows 88a].
SRI's research identifies three types of inference channels based on the degree to which HIGH data may be inferred from LOW data [Garvey 94]:

· Deductive Inference Channel. The most restrictive channel, requiring a formal deductive proof (described by propositional or first-order logic) showing that HIGH data can be derived from LOW data.
· Abductive Inference Channel. A less restrictive channel, where a deductive proof could be completed assuming certain LOW-level axioms.
· Probabilistic Inference Channel. This channel occurs when it is possible to determine that the likelihood of an unauthorized inference is greater than an acceptable limit.

These descriptions of inference channels combine the method by which the inference is made and the information used to make the inference. Both of these areas of complexity are discussed further in the following paragraphs. Section 2.1.1.3 identifies related disciplines and how they address inference.

2.1.1.1 Inference Methods

One of the factors that makes the inference problem so difficult is that there are many methods by which a human can infer information. The following list [Ford 90] illustrates the variety of inference strategies that can be used:

· Inference by Deductive Reasoning. New information is inferred using well-formed rules, either through classical or non-classical logic deduction. Classical deductions are based on basic rules of logic (e.g., the assertions "A is true" and "if A then B" deduce "B is true"). Non-classical logic-based deduction includes probabilistic reasoning, fuzzy reasoning, non-monotonic reasoning, default reasoning, temporal logic, dynamic logic, and modal logic-based reasoning.
· Inference by Inductive Reasoning. Well-formed rules are utilized to infer hypotheses from observed examples.
· Inference by Analogical Reasoning. Statements such as "X is like Y" are used to infer properties of X given the properties of Y.
· Inference by Heuristic Reasoning. New information is deduced using various heuristics. Heuristics are not well defined and may be a "rule of thumb."
· Inference by Semantic Association. The association between entities is inferred from the knowledge of the entities themselves.
· Inferred Existence. One can infer the existence of entity Y from certain information (e.g., from the information "John lives in Boston," one can infer that "there is some entity called John").
· Statistical Inference. Information about an individual entity is inferred from various statistics computed on a set of entities.

2.1.1.2 Inference Information

The other factor that makes preventing unauthorized inferences within a DBMS a difficult problem is the wide variety of information that can be used to make an inference. The following are some types of information that can be used to make inferences:

· Schema metadata, including relation and attribute names
· Other metadata such as constraints (e.g., value constraints or other constraints that may be implemented through triggers or procedures)
· Data stored in the relations (i.e., the value of the data)
· Statistical data derived from the database
· Other derived data that can be determined from data in the database
· The existence or absence of data (as opposed to the value of the data)
· Data disappearing (e.g., upgraded) or changing value over time
· Semantics of the data that are not stored in the database but are known within the application domain
· Specialized information about the application domain (both process and data) not stored in the database
· Common knowledge, and
· Common sense

Denning [Denning 86a] and many other researchers adopt a so-called "closed-world assumption" that limits inference to the database. With this limitation, both the information used to make the inference and the inferred information must be represented in the database.
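The statistical inference strategy listed above, using statistical data derived from the database, can be sketched briefly. The names and salary figures below are invented for illustration: two individually permitted aggregate queries that differ by exactly one person disclose that person's protected value by subtraction.

```python
# Hypothetical salary table; individual salaries are protected,
# but aggregate (SUM-style) queries are permitted.
salaries = {"Adams": 40000, "Baker": 52000, "Clark": 61000}

def query_sum(names):
    # Stands in for a permitted aggregate query over a set of records.
    return sum(salaries[n] for n in names)

# Two permitted aggregates differing by one individual...
all_staff = query_sum(["Adams", "Baker", "Clark"])
all_but_clark = query_sum(["Adams", "Baker"])

# ...yield that individual's exact salary by subtraction.
print(all_staff - all_but_clark)  # -> 61000
```

This is the classic concern of statistical database security mentioned later in this section: each query alone reveals only a statistic, yet their combination reveals an individual record.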
The first seven types of information described above encompass information that would be included under this closed-world assumption. Unfortunately, in the real world in which a DBMS must operate, significant amounts of the remaining types of information will be used by an adversary to attempt to infer unauthorized information.

2.1.1.3 Related Disciplines

Inference in the context of this document focuses on the information contained in a DBMS and on preventing inference of unauthorized information by authorized users of the DBMS. Inference is also a key issue in several other disciplines, including:

· Operations Security (OPSEC). OPSEC is concerned with denying external adversaries the ability to infer classified information by observing unclassified indicators. The key difference is that the adversary is external to the organization, while this document addresses individuals that are internal to the organization (i.e., DBMS users).
· Artificial Intelligence (AI). The ability to infer new facts from existing knowledge is a positive goal of several of the sub-disciplines of AI. The AI research community is looking for better ways to represent knowledge and approaches for inferring new information.
· Uncertainty Management. The inverse problem of preventing inferences is the need for individuals, such as intelligence analysts, to infer information about an adversary based upon evidence that can be collected. There are several approaches to evaluating evidence in an effort to quantify one's confidence in the evidence and the resulting inferred information [Schum 87].

2.1.2 Aggregation

Like the word inference, the word aggregation has multiple meanings in the information technology community. An aggregation, or aggregate, is a collection of separate data items. In the database context, data aggregation normally has a positive connotation. Aggregation is an important and desirable side effect of organizing information into a database.
Information has a greater value when it is gathered together, made easily accessible, and structured in a way that shows relationships between pieces of data [National Academy Press 83]. In some sense, every database is an aggregation of data.

The term aggregation has a different, more common use in the commercial database management community, where it usually refers to the use of aggregate operators. The five standard operations are sum, average, count, maximum, and minimum. Others, such as standard deviation or variance, may also be found in query languages [Ullman 88]. These operators are applied to columns of a relation to obtain a single quantity. They allow conclusions to be drawn from the mass of data present in a database. The aggregate operators in commercial DBMSs relate closely to the operations performed on statistical databases. Statistical databases are used to answer queries dealing with sums, averages, and medians, and use aggregation to hide the separate data items found in individual records. The concern in a statistical database is for security in the sense of privacy, not classification. Security in statistical databases has a large body of literature and will only be discussed briefly in this document. In the context of this document:

Aggregation involves the situation where a collection of data items is explicitly considered to be classified at a higher level than the classification assigned to the individual data elements [Lunt 89].

Denning has identified two classes of the aggregation problem [Denning 87]. We describe them here in terms of entities and their attributes.

· Attribute Associations. Associations between instances of different data items, of which there are two sub-classes:
  - Associations between Entities: associations between instances of attributes of different entities.
  - Associations between Attributes of an Entity: associations between instances of different attributes of an entity.
· Size-based Record Associations. Associations among multiple instances of the same entity.

We will use the current terminology to represent these classes of aggregation: Data Association [Jajodia 92] for Attribute Associations (both Inter- and Intra-Entity) and Cardinal Aggregation [Hinke 88] for Size-based Record Associations. The difference between data association and cardinal aggregation is important. Data association refers to aggregates of instances of different data objects; cardinal aggregation refers to aggregates of instances of the same type of entity. Distinguishing between data association and cardinal aggregation as two types of aggregation is also important since data association can be effectively controlled using database design techniques with labeled data [Denning 87]. On the other hand, effective mechanisms to enforce cardinal aggregation policies are still a research issue.

2.1.2.1 Data Association Examples

A traditional example of intra-entity data association involves salary: the identity of an employee and the actual number that represents the employee's salary may both be unclassified, but the association between a particular employee and the employee's salary is considered sensitive. This is an instance of intra-entity data association as defined above: employee name (or other identifying attributes such as employee number) and salary are both attributes of the employee entity. An example of inter-entity data association is the following: the identities of commercial companies and government agencies are unclassified, but the association of a particular company as the contractor for a particular government agency may be considered classified. In this case, the data association is between two entities, commercial company and government agency.
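The salary example above can be sketched as a labeling rule. The level names and attribute names below are invented for illustration: each element carries its own (unclassified) label, while the association between the two elements carries a higher label, and an access check must compare the user's level against the association's label rather than the elements' labels.

```python
# Hypothetical data association labeling sketch.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2}

# Each element is unclassified on its own...
element_labels = {"name": "UNCLASSIFIED", "salary": "UNCLASSIFIED"}

# ...but the pairing (the data association) is labeled higher.
association_labels = {frozenset({"name", "salary"}): "CONFIDENTIAL"}

def can_see_pair(user_level, a, b):
    # Default to the more restrictive of the two element labels,
    # unless an explicit association label overrides it.
    default = max((element_labels[a], element_labels[b]), key=LEVELS.get)
    label = association_labels.get(frozenset({a, b}), default)
    return LEVELS[user_level] >= LEVELS[label]

print(can_see_pair("UNCLASSIFIED", "name", "salary"))  # -> False
print(can_see_pair("CONFIDENTIAL", "name", "salary"))  # -> True
```

This is the sense in which data association can be controlled by database design: the association itself is made a labeled object, so an ordinary label comparison enforces the policy.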
2.1.2.2 Cardinal Aggregation Examples

The traditional example of cardinal aggregation is the intelligence agency telephone book: individual entries are unclassified, but the total book is classified [Lunt 89]. The challenge with this example is understanding what information the agency is trying to protect with the policy. Is it the name and number of staff personnel (and therefore the capability) of each (or some) organizational element? The total staffing size of the agency? Identification of special employees? Special skills? A clear understanding of what information needs to be protected must precede a cardinal aggregation policy.

Another example of a cardinal aggregation policy relates to an Army troop list, that is, a list of Army organizations [Smith 91]. Information about each Army organization is unclassified (even if the organization's mission is classified), but the list as a whole is classified. Determining how many organizations can be aggregated while the information remains unclassified is difficult. If the intent is to protect the Army's war fighting capability, then aggregating the combat arms units (as opposed to the support and civilian-based organizations) will directly disclose that information. Of course, one has to know about the war fighting capabilities of each type of combat arms unit, which is unclassified information and can be obtained externally to the troop list. Unfortunately, there may be other ways to infer the unauthorized information. For example, if one can determine the tactical communications capability (i.e., an aggregation of communications units), can one then infer the war fighting capability? And what about any number of other aggregates of units where there is a reasonable understanding of their relationship to war fighting capability? It quickly becomes a difficult problem to describe, let alone to provide mechanisms to solve, this type of aggregation problem.
Yet keeping information about individual units unclassified is essential for operational efficiency.

A final example is network management data: information about outages of individual circuits is not sensitive, but the aggregation of information about all circuit outages for a network is sensitive. In this case, one can postulate that this cardinal aggregation policy is attempting to protect the identification of vulnerabilities of the network from an adversary that might want to disrupt or render the network inoperative. With information about only one or two circuit outages, an adversary cannot mount an effective attack to bring the network down. On the other hand, if the adversary knows all the circuit outages, the network may have a temporary single point of failure, a vulnerability that the adversary can exploit. It is operationally impractical to treat information about each circuit as sensitive, yet in the face of a credible threat, it is important to protect the aggregate information.

All of these example cardinal aggregation policies were established as a trade-off between the need to protect sensitive information and the need for operational efficiency. To some researchers, cardinal aggregation policies are considered fundamentally unsound [Meadows 91]. Thus, it should not be surprising that effective mechanisms to implement what some consider an unsound policy elude the community.

2.2 SECURITY TERMINOLOGY

This section discusses several security-relevant terms and concepts. We distinguish between two ways in which information can be revealed:

· Direct Disclosure. Classified information is directly revealed to unauthorized individuals.
· Inference. Classified information may be deduced by unauthorized individuals, with some level of confidence often less than 100%.

The primary distinction, then, between direct disclosure and inference is whether the information is revealed (direct disclosure) or must be deduced (inference).
For example, if a user at LOW can execute a query to a DBMS and receive a response where the data displayed to the LOW user is actually classified HIGH (i.e., based on an explicit classification rule), then this is a direct disclosure problem, not an inference problem. In most instances of inference, the adversary has some uncertainty with respect to the accuracy of the inference of the unauthorized information.

For the purposes of this document, we do not distinguish where the HIGH data that is directly disclosed or inferred is located (i.e., in the database or not). The important consideration is that HIGH data was directly disclosed or inferred. This consideration is independent of whether the HIGH data is actually stored in the database or exists somewhere else in the application domain of discourse. Data association policies are stated to prevent direct disclosure of classified information related to an association between data elements. In most cases, cardinal aggregation policies are stated to prevent inferences that could be made from the aggregate data.

The discussion above refers generically to "unauthorized" individuals. In fact, we must distinguish between two types of unauthorized access to information:

· based on authorized access to classified information (i.e., a label-based policy); and
· based on informal "need-to-know" policies.

"Need-to-know" is a well-known concept and is defined in the DoD Information Security Regulation [5100.1-R 86] as:

"A determination made by a possessor of classified information that a prospective recipient, in the interest of national security, has a requirement for access to, or knowledge, or possession of the classified information in order to accomplish lawful and authorized government purposes."

There are two distinct types of need-to-know: "informal" (sometimes called discretionary) and "formal" need-to-know.
With informal need-to-know, individuals may have the clearance (i.e., are considered trustworthy to properly protect the information) but not the job-related need to actually access the information. Informal need-to-know is enforced in the people and paper world by the individual who physically controls access to a classified document. In an AIS, informal need-to-know is normally enforced by discretionary access control (DAC) mechanisms as described in the TCSEC. Formal need-to-know is enforced through compartments (or other release markings) to which individuals are formally granted access. Mandatory access control (MAC) categories are used in an AIS to enforce access based on formal need-to-know.

Direct disclosure is normally only thought of as being a problem when classified information is revealed to an individual without the appropriate clearance, including formal access to compartments (i.e., formal need-to-know). Given that U.S. Government information is classified based on laws or executive orders, direct disclosure of classified information is punishable by imprisonment. On the other hand, disclosing information in violation of an informal need-to-know policy is normally not considered a compromise of classified information, since the individual has the appropriate clearance to see the information and is trusted to properly protect it. Thus, a violation of an informal need-to-know policy may result in an employee being fired or sued, but not imprisoned. Although an authorized user can attempt to infer information for which the user does not have an informal need-to-know, research efforts to date in inference and aggregation have dealt exclusively with direct disclosure or inference based on compromise of classified information, not with violations of informal need-to-know policies.
As such, this document focuses almost exclusively on protecting classified information against inference and aggregation threats from individuals who are not cleared to receive this information.

2.3 THE TRADITIONAL SECURITY ANALYSIS PARADIGM

To better understand inference and aggregation and how they relate to each other, they need to be discussed in the context of the traditional security analysis paradigm shown in Figure 2.1. A brief description of the activities in the paradigm follows:

· Identify information protection requirements. The protection requirements traditionally addressed by the security community relate to confidentiality. However, information protection requirements for integrity and availability should also be specified. Although the inference and aggregation problems strictly address confidentiality protection requirements, solutions to these problems typically adversely impact the ability to meet the integrity and availability requirements.
· Identify vulnerabilities. Given a set of protection requirements, one must identify the vulnerabilities that exist which could be exploited to prevent those requirements from being satisfied by the system. As the next section shows, vulnerabilities need to be assessed continuously throughout the design and engineering process.
· Identify threats. Given identified vulnerabilities, one must identify credible threat agents that can exploit those vulnerabilities. For inference and aggregation, the primary threat is the insider: an authorized user of the system who is authorized access to some, but not all, information in the system. Malicious code is a secondary threat.
· Conduct a risk analysis. The risk analysis must take into account the likelihood of a threat agent actually exploiting a vulnerability. The risk analysis considers tradeoffs between effectiveness, cost, and the ability to meet other requirements (e.g., integrity and availability) to determine appropriate controls.
· Select controls Based on the risk analysis, controls are selected to eliminate or reduce a threat agent's ability to exploit the vulnerability. Note that some controls prevent a threat agent from exploiting a vulnerability, while others only reduce or detect such exploitation. This distinction is important in that most cases dealing with inference involve determining during the risk analysis the likelihood of an individual inferring unauthorized information given a particular set of controls. The goal is to establish controls that provide an acceptable level of risk at an affordable cost, where cost is cast in terms of both money and adverse impact on functionality, ease of use, and performance. Figure 2.1: Traditional Security Analysis Paradigm Details of how this process is related to inference and aggregation are discussed in the next section. 2.4 THE DATABASE SECURITY ENGINEERING PROCESS The process of implementing an AIS with a database that meets the specified security requirements is shown in Figure 2.2. The process should begin with a clear understanding of what information needs to be protected, in this case from unauthorized disclosure. Without a clear understanding of what must be protected, secure database design is difficult. Unfortunately, many inference and aggregation problems are stated at the database design level (e.g., with sensitivity labels on tables) without a description of what information is to be protected from disclosure. Given a precise statement of the information to protect, there are two distinct aspects of establishing rules for implementing those confidentiality protection requirements: · classification rules must be established that specify the confidentiality properties of the information, and · access control rules must be established that specify which authorized users can access information based on the classification rules. 
Figure 2.2: Database Security Engineering Process Classification rules should address all information that needs to be protected, not just information that must be protected for national security reasons (i.e., Confidential, Secret, and Top Secret) based on Executive Order 12356. For example, the Computer Security Act of 1987 requires that unclassified but sensitive information be given confidentiality protection. Categories of unclassified but sensitive information include, but are not limited to, personnel (i.e., subject to the Privacy Act of 1974), financial, acquisition contracting, and investigations. Certain other information is classified under the Atomic Energy Act of 1954 as Restricted Data or Formerly Restricted Data. Confidentiality protection requirements are also applicable to non-government organizations. For example, proprietary information of many types may be considered "bet your company" data and needs to be protected to ensure the viability of the company. Given a set of classification rules, the database designer must provide a database design that labels data objects and correctly implements the classification rules. The database instantiation represents the database as it is populated with data and used. Access control rules are generically described as MAC for access based on the military classification system (including formal need-to-know) and DAC for access based on informal need-to-know. However, to effectively implement the protection requirements, one must explicitly identify the application-dependent information that identifies which authorized users, or classes of users (e.g., roles), are authorized access to each type of information identified in the classification rules. The access control rules must be incorporated into the application system design. 
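The interplay between MAC (a clearance with categories for formal need-to-know) and DAC (an access list for informal need-to-know) can be illustrated with a small sketch. This is not drawn from the report or from any particular MLS product; the label encoding, user names, and access lists are hypothetical.

```python
# Illustrative sketch (hypothetical encoding): a MAC dominance check with
# hierarchical levels plus categories for formal need-to-know, combined
# with a simple DAC access list for informal need-to-know.

LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

def dominates(subject_label, object_label):
    """True if the subject's label dominates the object's label:
    its level is >= and its category set is a superset."""
    s_level, s_cats = subject_label
    o_level, o_cats = object_label
    return LEVELS[s_level] >= LEVELS[o_level] and s_cats >= o_cats

def may_read(user, obj):
    """Both MAC (clearance dominates the data label) and DAC (user is on
    the object's access list) must permit the read."""
    return dominates(user["clearance"], obj["label"]) and user["id"] in obj["acl"]

analyst = {"id": "jones", "clearance": ("SECRET", {"NOFORN"})}
report = {"label": ("CONFIDENTIAL", set()), "acl": {"jones", "smith"}}
print(may_read(analyst, report))   # True: clearance dominates and DAC permits
```

Note that a user who passes the MAC check can still be denied by DAC, which is exactly the informal need-to-know situation described in Section 2.2.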
Although the access control rules can be implemented through mechanisms in the operating system or application software, inference and aggregation concerns are primarily addressed by DBMS access controls. Classification rules are normally found in a Classification Guideline as required by the DoD Information Security Program Regulation [DoD 5200.1-R]. Generic access control rules are documented in the system security policy as required by the TCSEC. The instantiation of MAC and DAC with application-dependent access control rules that identify specific users and data is usually documented in a Security Concept of Operations. The classification rules for an application domain need to be complete. That is, all information to be contained in the supporting information system should be explicitly addressed. 2.5 THE REQUIREMENTS FOR PROTECTING AGAINST INFERENCE AND AGGREGATION In this section, we identify both operational and TCSEC requirements related to protecting against inference and aggregation. Operational requirements, identified in Section 2.5.1, include not only the need to protect against inference and aggregation but also operational considerations, or tradeoffs, that impact decisions on how, or to what extent, inference and aggregation protections are implemented. Section 2.5.2 discusses the lack of direct TCSEC requirements for protecting against inference and aggregation. 2.5.1 The Operational Requirement There are at least four operational requirements related to inference and aggregation protection. 1. The enterprise's information must be properly protected from disclosure either directly or through inference. 2. Employees (i.e., users) must have access to the information they need to do their jobs. 3. The enterprise must conduct its operations in an efficient and cost-effective manner. 4. The enterprise must interact efficiently with external organizations, which may involve mostly unclassified information. 
Unfortunately, these operational requirements often are in conflict. Over-classifying data to meet the first requirement may reduce availability of information to certain employees (conflicting with the second requirement). Cost-effective operations (the third requirement) may demand granting the lowest clearances possible to employees, which may conflict with the established classification rules to protect information (the first requirement). Finally, the need to protect information (the first requirement) may conflict with the efficient interface with other organizations (the fourth requirement). This last conflict is the genesis of most cardinal aggregation policies. 2.5.2 The TCSEC Requirement There are no requirements in the TCSEC, TDI, or Trusted Network Interpretation (TNI) [TNI 87] that directly address inference and aggregation issues. Inference and aggregation are properties intrinsic to data values and to the structure of a database (schema, metadata). Inference and aggregation are based on an interpretation of the enterprise's security policy and on specific requirements of the application domain being automated. Thus, inference and aggregation are best addressed by the system developer, acquisition agency, and certification and accreditation authorities for each system, and should not be examined during a TCSEC/TDI evaluation. Since inference is often characterized as an inference channel, there is often some confusion as to the relationship between inference and covert channels. Inference and aggregation are possible because of the existence of a set of specific known data values and a set of logical implication rules. Knowledgeable database administrators or information security officers may be able to apply inferential engines to the analysis of a populated database to uncover a set of potential unauthorized inferences, which they can then protect against. However, new inferences may be introduced through error on the part of users authorized to update the database. 
It is important to note that the introduction of inference or aggregation is a data value-based problem (including metadata that represents the database design), and not one of DBMS (i.e., the product) design or implementation. Unlike Trojan horses or covert channels, values needed to support an inferential attack are introduced consciously by benign actions of users, and not clandestinely by maliciously implanted code or by flaws in a DBMS trusted computing base (TCB) design or implementation. In this sense, inference problems are different from covert channel issues since all the data used to make the inference are at LOW, and there is no need for an active agent at HIGH to signal information from HIGH to LOW [Jajodia 92]. Note also that inference will still produce the same result, independent of whether the compromised (inferred) data remains in (or ever existed in) the database at HIGH, as opposed to the need for HIGH data to be present in order for it to be signaled to LOW by a covert channel. SECTION 3 AN INFERENCE AND AGGREGATION FRAMEWORK The purpose of this section is to present a framework to reason about inference and aggregation. Section 3.1 begins with a description of a detailed example application domain that will be used to illustrate parts of the framework. Section 3.2 describes the framework using instances from the detailed example for illustration. Section 3.3 presents a matrix which summarizes how the research efforts described in Sections 4 and 5 relate to the framework. 3.1 DETAILED EXAMPLE This section introduces a detailed example used to illustrate the concepts presented in this document. We use a mythical company, the Acme Technology Company or ATC. ATC contracts with many government agencies and commercial companies to provide technology-oriented products and services. ATC also conducts extensive applied research for its clients in several areas of technology. 
ATC needs extensive support for its business operations by AISs that must represent information about: · entities of interest in the application domain, and · functional or business processes used in the application domain. Although there are other security considerations that should be addressed in designing functional processes, for the purpose of this document, we will only address the process view as it relates to access control. We are most interested in understanding and modeling the information about entities. In particular, we are concerned with identifying the entities, associations, and attributes of interest to ATC. Entities (real objects or concepts of interest): · Employees ATC employs a wide variety of personnel from clerks to very specialized engineers. · Education Data about the education of each ATC employee is retained. · Research Projects Although ATC has thousands of projects that cover a wide variety of purposes, research projects represent a special category of projects, especially from a security standpoint. · Critical Technologies Certain technologies are particularly sensitive. · Clients Information about the organization that sponsors a project. Associations (relationships between instances of the entities). Figure 3.1 shows the following associations (in italics) between entities: · Employees work-on Projects. · Employees have Education. · A Client sponsors Projects. · Some Projects use Critical Technologies. Figure 3.1: Entities and Associations Figure 3.2: ATC Attributes Attributes (characteristics or properties of an entity). The addition of attributes adds more detail and, therefore, realistic complexity to the example. Figure 3.2 shows selected attributes for each ATC entity. 3.2 THE FRAMEWORK The framework, shown in Figure 3.3, is an extension to the database security engineering process described in Section 2.4. 
The framework provides the context to discuss how inference and aggregation are applicable at each step in the development process. Given that inference and aggregation are inherently application-dependent, examples, based on the ATC example in the previous section, are given to illustrate the concepts. This framework with examples provides a basis to illustrate and compare the proposed approaches described in Sections 4 and 5. 3.2.1 Information Protection Requirements and Resulting Vulnerabilities There are at least three types of information protection requirements: confidentiality, integrity, and availability. Inference and aggregation relate exclusively to the confidentiality objective of security. Therefore, although integrity and availability protection requirements are also important, this document will only address confidentiality requirements. Given a set of information confidentiality protection requirements, there are two vulnerabilities that must be addressed by appropriate controls: direct disclosure and inference. Both of these vulnerabilities were discussed in Section 2.2. Aggregation is not yet a problem, because the protection requirements address specific facts or information that must be protected as opposed to a protection requirement for an aggregate of data. 3.2.2 Classification Rules Classification rules for information are stated such that direct disclosure will be prevented, and the ability to infer unauthorized information is reduced to an acceptable level. Preventing direct disclosure is fairly straightforward: state classification rules for entities, attributes, and/or associations and then restrict access based on user clearance and data classification. Unfortunately, in practice, protection requirements can be so complex that ensuring that a set of classification rules is correct (i.e., consistent and complete) can be difficult. 
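The difficulty of keeping a rule set consistent and complete can be made concrete with a toy check. The rule encoding below is invented for illustration (real classification guidance is prose, not tuples); the entity and attribute names loosely follow the ATC example, and the deliberately conflicting rule is hypothetical.

```python
# Hypothetical sketch: auditing a classification rule set for completeness
# (every attribute of every entity receives a label) and consistency (no
# attribute is assigned two different labels).

SCHEMA = {
    "Employee": ["Name", "Salary", "HomePhone"],
    "Project":  ["Name", "Description"],
}

RULES = [  # (entity, attribute, label) -- invented encoding
    ("Employee", "Name", "UNCLASSIFIED"),
    ("Employee", "Salary", "LOW"),
    ("Project",  "Name", "UNCLASSIFIED"),
    ("Project",  "Description", "LOW"),
    ("Project",  "Description", "HIGH"),   # deliberately inconsistent
]

def audit(schema, rules):
    """Return (inconsistent, missing): attributes labeled twice with
    different levels, and attributes never labeled at all."""
    labels = {}
    inconsistent, missing = [], []
    for ent, attr, lab in rules:
        key = (ent, attr)
        if key in labels and labels[key] != lab:
            inconsistent.append(key)
        labels.setdefault(key, lab)
    for ent, attrs in schema.items():
        for attr in attrs:
            if (ent, attr) not in labels:
                missing.append((ent, attr))
    return inconsistent, missing

bad, gaps = audit(SCHEMA, RULES)
print(bad)    # [('Project', 'Description')]
print(gaps)   # [('Employee', 'HomePhone')]
```

A real rule set resists this kind of mechanical audit precisely because its rules are stated in natural language, which is one reason flawed rules survive into database designs.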
Each of the following types of classification rules from Figure 3.3 is described in the following paragraphs: · entity; · attribute; · data association; and · cardinal aggregation. Figure 3.3: Inference and Aggregation Framework Table 3.1 summarizes this section by showing a taxonomy of types of confidentiality protection requirements along with examples of protection requirements and supporting classification rules for the ATC example. The ATC classification rules from Table 3.1 are used to illustrate the types of classification rules described in this section. The types of protection requirements are a synthesis of those found in papers by Smith, Pernul, and Burns [Smith 90a; Pernul 93; Burns 92] and are meant to be illustrative of the primary types of protection requirements but not necessarily an exhaustive taxonomy. 

Entity Classification 

1a. Existence of all instances of an entity 
Example protection requirement: 
· the fact that ATC is involved with Critical Technologies must be protected at LOW 
Example classification rules: 
· the existence of all instances of the Critical Technologies entity is classified LOW 

1b. Existence of selected instances of an entity 
Example protection requirement: 
· the fact that a Research Project involves Critical Technologies must be protected at HIGH 
Example classification rules: 
· the existence of Research Projects that use Critical Technologies is classified HIGH 
· the existence of instances of all other entities is UNCLASSIFIED 

2a. Identity of all instances of an entity 
Example protection requirements: 
· the identity of Critical Technologies must be protected at HIGH 
· the identity of Clients of Research Projects must be protected at LOW (this is a requirement of the client organization) 
Example classification rules: 
· the value of all instances (i.e., all attributes) of the Critical Technologies entity is classified HIGH 
· the value of all instances (i.e., all attributes) of the Client entity is classified LOW 

2b. Identity of selected instances of an entity 
Example protection requirement: 
· selected Research Projects will be protected at LOW based on the contents of the Project Description as determined by the classification authority 
Example classification rules: 
· all attributes of the Research Project entity are classified LOW when the Description attribute contains LOW information 
· the identity of instances of all other entities is UNCLASSIFIED 

Attribute Classification 

3. Existence of an attribute (could also be of selected instances of an attribute) 
Example protection requirement: 
· the fact that specific Target Platforms are identified for Research Projects is to be protected at HIGH 
Example classification rules: 
· the existence of the Target Platform attribute of Research Project is classified HIGH 
· all instances of the Target Platform attribute are classified HIGH 
· the existence of instances of all other attributes is UNCLASSIFIED except privacy information 

4. Value of an attribute 
Example protection requirement: 
· privacy information for employees (i.e., home address and phone number) is to be protected at LOW 
Example classification rule: 
· the value of the privacy information attributes of employees is classified LOW 

Data Association 

5. The association between two entities or two attributes of different entities 
Example protection requirement: 
· the association between a specific project and the sponsoring client must be protected at HIGH 
Example classification rules: 
· the association between the Project entity and the Client entity is HIGH 
· the association between all other entities is UNCLASSIFIED 

6. The association between two attributes of the same entity 
Example protection requirement: 
· the association between name and salary is to be protected at LOW 
Example classification rule: 
· the association between the Salary attribute of the Employee entity and Employee identifying attributes (name, number) is LOW 

Cardinal Aggregation 

7. A collection of instances of the same entity 
Example protection requirement: 
· the identity of Critical Technologies must be protected at HIGH 
Example classification rules: 
· the Education entity and all its attributes for all employees are UNCLASSIFIED 
· the collection of the Degree and Major attributes for all employees is LOW 

Table 3.1: Taxonomy of Protection Requirements and Classification Rules 

Entity classification rules can address the need to protect both the identity and the existence of instances of entities of 
interest. In some operational environments, the fact that an entity (or some selected instances of an entity) exists would disclose, directly or through an inference, classified information. In many other operational environments, knowing that something classified exists is not a problem, but one must not be able to learn the classified information itself (i.e., its identity or value). A good example is a compendium of documents including some that are classified. The reader can know that the documents exist without having the clearance to read them. Entity classification rules from the ATC example include: · The existence of all instances of the Critical Technologies entity is classified LOW. · The values of all instances (i.e., all attributes) of the Critical Technologies entity are classified HIGH. · All attributes of the Research Project entity are classified LOW when the Description attribute contains LOW information. Attribute classification rules are similar to entity classification rules, except that they address specific attributes of an entity. Attribute classification rules from the ATC example include: · The existence of the Target Platform attribute of Research Project is classified HIGH. · Instances of the Employee entity are UNCLASSIFIED, but the value (not existence) of two attributes, Home Address and Home Phone#, are classified LOW. Data Association classification rules are stated when a relationship between entities or attributes is considered more sensitive than information about the entities or attributes. This type of classification rule implements inter- and intra-entity aggregation as discussed in Section 2. The ATC example includes these data association rules: · The association between a specific project (classified LOW) and the sponsoring client (classified LOW) must be protected at HIGH. · The association between an employee and the employee's salary is to be protected at LOW. 
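As a rough illustration of how such rules might be represented with tuple-level labels, consider the following sketch. The tables, data values, and label encoding are all hypothetical (they only loosely follow the ATC example); the point is that the Project and Client rows are individually visible at LOW while their association, held in a separately labeled structure, is not.

```python
# Illustrative sketch (hypothetical data): tuple-level labels enforcing an
# entity rule (Critical Technologies values HIGH), entity instances at LOW,
# and a data association rule (the Project-Client pairing is HIGH even
# though each side is individually LOW).

LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}

# each table entry is (value, label)
critical_tech  = [("stealth coating", "HIGH")]
projects       = [("P-100", "LOW")]
clients        = [("Acme Defense", "LOW")]
project_client = [(("P-100", "Acme Defense"), "HIGH")]   # association table

def visible(table, clearance):
    """Return only the values whose label the clearance dominates."""
    return [row for row, label in table if LEVELS[clearance] >= LEVELS[label]]

print(visible(projects, "LOW"))         # ['P-100']
print(visible(clients, "LOW"))          # ['Acme Defense']
print(visible(project_client, "LOW"))   # [] -- the association stays hidden
```

A LOW user sees both the project and the client but cannot see which client sponsors which project, which is the intent of the data association rule.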
These three types of classification rules (entity, attribute, and data association), along with appropriate access control rules (e.g., the military classification system) will define the basis for preventing direct disclosure. Direct disclosure can be prevented if the classification rules correctly implement the protection requirements (i.e., are consistent and complete) and if the AIS correctly implements the rules. On the other hand, eliminating inferences is difficult, if not impossible, in that an individual's capability to infer information is based on their already acquired knowledge and the innovative ways in which information can be combined [Morgenstern 88]. Much of the knowledge that may be used to infer unauthorized information is normally outside of the system in the form of domain- specific knowledge or even common sense. At one end of the spectrum, one can ensure that no unauthorized inferences are possible by classifying all information at HIGH and then giving all users a HIGH clearance. Although this approach solves the inference problem (since every user is authorized to access all information in the system), we now have introduced new vulnerabilities (e.g., users have clearances they should not have, and therefore access to information they should not see) and potentially decreased the operational efficiency of the organization (e.g., having to manually downgrade a large percentage of the information produced by the system). Establishing classification rules that will reduce unwanted inferences involves a constant trade-off between the desire to raise the classification of data to reduce inferences and the desire to maintain or even lower the classification of data to improve operational efficiency without directly disclosing classified data to those not cleared to see it. This process of evaluating the relative advantages of over-classifying information to prevent inferences is part of the risk analysis activity identified in Section 2.3. 
At this point, a fourth type of classification rule, cardinal aggregation, is often added. Cardinal aggregation rules address the classification of collections of instances of the same entity. Often, these rules are established to reduce inference. At this point in the framework we can conclude that the high-level relationships between inference and aggregation are: · inference is a vulnerability, i.e., authorized users can deduce information not authorized by the protection requirements, and · aggregation is a class of classification policies that are stated as one form of control, in many cases, to reduce the inference vulnerability. The ATC example includes a cardinal aggregation classification rule: · the education entity and each of its attributes for all employees are individually UNCLASSIFIED, but · the collection of the Degree and Major attributes of all instances of the education entity is HIGH. The ATC cardinal aggregation classification rule illustrates the difficulty of enforcing the intent of the rule, even though we have specified the information to be protected (i.e., the identity of Critical Technologies must be protected at HIGH). The difficulty is to determine how many employees' educational information has to be aggregated to allow an inference. Is it all employees on a particular project? Ten employees? Ten percent of the employees? All employees with blue eyes? To have an effective implementation of a cardinal aggregation classification rule, someone has to state explicit criteria for which collections must be considered HIGH. Unfortunately, cardinal aggregation classification rules introduce a new vulnerability, as discussed in the next section. At this point it is appropriate to distinguish between an aggregation policy, such as those discussed above, and a Chinese Wall policy. Brewer and Nash introduced an important commercial security policy called the Chinese Wall policy [Brewer 89]. 
They described the policy as follows: It can most easily be visualized as the code of practice that must be followed by a market analyst working for a financial institution providing corporate business services. Such an analyst must uphold the confidentiality of information provided to him by his firm's clients; this means that he cannot advise corporations where he has insider knowledge of the plans, status, or standing of a competitor. However, the analyst is free to advise corporations which are not in competition with each other, and also draw on general market information. This type of policy seems to involve aggregates of information and therefore is sometimes considered an "aggregation problem." In fact, the Chinese Wall policy does not meet the definition of an aggregation problem; there is no notion of some information being sensitive with the aggregate being more sensitive. The Chinese Wall policy is an access control policy where the access control rule is not based just on the sensitivity of the information, but is based on the information already accessed. The situation can be considered similar to having compartmented data and subjects which can be cleared for exactly one compartment. The choice of which compartment the subject is cleared for is effectively made when the first compartmented data is viewed. 3.2.3 Vulnerabilities from Classification Rules Figure 3.3 identifies three residual vulnerabilities given a set of classification rules designed to prevent direct disclosure and reduce inference (i.e., the vulnerabilities of information confidentiality protection requirements): · Flawed Rules that allow Direct Disclosure With complex operational environments, the classification rules can be flawed with respect to how faithfully they implement the information confidentiality protection requirements. They can be inconsistent, incomplete, or just incorrect such that they allow direct disclosure. 
· Flawed Rules that allow an Unintended Inference Although the entire set of classification rules is intended to prevent unwanted inferences, when faced with a complex set of classification rules there may be opportunities for unintended inferences. Again, this may be caused by rules that are inconsistent, incomplete, or incorrect. · Cardinal Aggregation If a cardinal aggregation rule has been stated that classifies an aggregate as LOW but individual elements are UNCLASSIFIED, then a vulnerability has been added since UNCLASSIFIED users may be able to assemble aggregates in an unintended or undetectable way. 3.2.4 Database Design Database designers and application developers attempt to implement classification rules by accurately representing and attaching security labels to database objects. The Relational Model [Codd 70], one of several data models, is the model of choice for existing commercial MLS products. Figure 3.4 shows a relational model representation of the ATC example using tuple-level labeling. The following are examples of how each type of classification rule can be implemented in a database design: · The rule, "the existence of all instances of the Critical Technologies entity is classified LOW," is implemented by classifying the Critical Technologies table metadata at LOW. This is done by the database designer. · The rule, "the value of all instances of the Critical Technologies entity is classified HIGH," is implemented by labeling each tuple in the Critical Technologies table at HIGH. This is enforced by the user entering the data. · The rule, "all attributes of the Research Project entity are classified LOW when the Description attribute contains LOW information," is implemented by having tuples labeled UNCLASSIFIED or LOW based on the content of the Description attribute. This is enforced by the user entering the data. 
· The rule, "the existence of Research Projects that use Critical Technologies are classified HIGH," is implemented by classifying each Research Project tuple at HIGH when the project uses a Critical Technology. This is enforced by the user entering the data. · The rule, "the existence of the Target Platform attribute of Research Project is classified HIGH," is implemented by decomposing the Target Platform attribute (which would normally be an attribute in the Research Project table) into a separate table, Project-Target. All tuples in Project-Target are classified HIGH. Note that the metadata can also be classified HIGH to prevent obtaining the knowledge that projects have a target platform that could contribute to unauthorized inferences. · The rule, "all instances of the Target Platform attribute are classified HIGH," from an implementation standpoint, is redundant with the previous rule. Thus, this rule requires no changes to the database design. · The rules for data association between the Project and Client entities are, "the value of all instances (i.e., all attributes) of the Client entity is classified LOW," "the value of all instances of the Project entity (except those that use Critical Technologies) are classified LOW," and "the association between the Project entity and the Client entity is HIGH." These rules are implemented by labeling tuples (and metadata) in the Client table as LOW, labeling tuples in the Project table as LOW (except those that are labeled HIGH because they use Critical Technologies), and constructing an additional table, Client-Project, to represent the data association between Clients and Projects. The Client-Project table (metadata and tuples) is classified HIGH. Note that in the absence of a data association rule, the association between Project and Client tables would be represented by a foreign key from the Project table to the Client table. 
Figure 3.4: ATC Example Database Design · The data association rule, "the association between the Salary attribute of the Employee entity and Employee identifying attributes (name and number) is LOW," is implemented by decomposing the Salary attribute into a separate table, Employee-Salary. All tuples in the Employee-Salary table are labeled LOW. · The cardinal aggregation rule established by the rules, "the education entity and all its attributes for all employees are UNCLASSIFIED" and "the collection of the Degree and Major attributes of the entity for all employees is LOW," is implemented by labeling all tuples in the Employee table as UNCLASSIFIED and requiring some (unspecified) mechanism to control access to multiple records. 3.2.5 Vulnerabilities from Database Design Given a database design that attempts to faithfully represent the established classification rules, four vulnerabilities are identified in Figure 3.3: · Faithful Representation of Flawed Rules A database design that faithfully implements a flawed set of classification rules will not fix the flawed rules. · Flawed Design Allowing Direct Disclosure A flawed design may allow a direct disclosure at the schema level for uniformly classified data objects or at the content level for data objects where instances may be classified at one of several levels. · Flawed Design Allowing Unintended Inference The flawed design may allow unintended inferences at the schema level for uniformly classified data objects, or at the content level for data objects where instances may be classified at one of several levels. · Cardinal Aggregation Labeled LOW If the data is labeled LOW, then a vulnerability exists, since LOW users may be able to access aggregates in an unintended way. These vulnerabilities are related to the values, application semantics, and the security labels assigned to data objects, not to the functionality (beyond enforcing a MAC policy) or assurance provided by the TCB. 
The TCSEC class of a TCB that enforces MAC has no impact on countering these vulnerabilities. An A1 TCB is just as effective (or ineffective) as a B1 TCB in countering these vulnerabilities. 3.2.6 Database Instantiation and Vulnerabilities For completeness, Figure 3.3 shows the database instantiation as the final activity. Databases are populated with data, and users use the database to conduct their job tasks. The database administrator or security officer is tasked with ensuring the system is operating in a secure manner. Assuming clearances are properly assigned to users, two vulnerabilities are identified: · Inappropriate Labeling Allowing Disclosure Current MLS systems, including DBMSs, depend on the user being logged in at the access class that correctly represents the classification of the data being added to the database. If a user with a HIGH clearance enters HIGH data when the user's current access class is LOW, then HIGH data has been directly disclosed, since users with only a LOW clearance can access the HIGH data that is incorrectly marked LOW. · Inappropriate Labeling Allowing Unintended Inference In a similar manner, a user with a HIGH clearance may enter data at LOW that will allow other users with only a LOW clearance to infer HIGH data. The TCSEC class of a TCB that enforces MAC also has no impact on countering these vulnerabilities. 3.3 OVERVIEW OF RESEARCH This section provides a brief overview of the research that will be discussed in Sections 4 and 5. Each of the papers and research efforts on inference and aggregation address one or more parts of the framework. Table 3.2 lists formal approaches to defining the problems. Table 3.3 maps inference and aggregation research efforts against the framework in three categories (techniques, mechanisms, and analysis tools). As Table 3.3 shows, much of the research has been in developing techniques and designing mechanisms. 
There has also been limited work in developing tools to help address inference and aggregation. The remainder of the table shows which vulnerabilities each of the analysis tools and mechanisms attempts to address.

Inference / Aggregation:
· Database Partitioning (Denning)
· Set Theory (Goguen and Meseguer)
· INFER Function (Denning and Morgenstern)
· FD and MVD (Su and Ozsoyoglu)
· Sphere of Influence (Morgenstern)
· Fact Model (Cordingley and Sowerbutts)
· Inference Secure (Lin)
· Inference Secure (Marks)
· Imprecise Inference Control (Hale)
· Security Algebra (Lin)
· Modal Logic Framework (Cuppens)
Table 3.2: Formal Definitions

Techniques / Mechanisms / Analysis Tools
Classification Rules:
· Entity and Attribute Direct Association
· Database Design (Denning, Lunt)
· Cardinal Aggregation
Classification Rule Vulnerabilities:
· Direct Disclosure
· Unintended Inference
· Cardinal Aggregation Rules
· Faithfully implements flawed classification rules
Table 3.3: Research Summary (Part 1)

Techniques / Mechanisms / Analysis Tools
Data Design Vulnerabilities:
· Direct disclosure at schema level
· Secondary Path Analysis (Hinke, Binns)
· Direct disclosure at content level
· Unintended inference at schema level
· Basic Security Principle (Meadows)
· Semantic Data Models (Buczkowski, Hinke, Smith)
· Classification Constraints (Akl)
· Constraint Satisfaction (Morgenstern)
· Integrity Constraints (Denning)
· Secondary Path Analysis (Hinke, Binns)
· DISSECT (Garvey et al.)
· AERIE (Hinke and Delugach)
· Constraint Processor (Thuraisingham et al.)
· Hyper Semantic Model (Marks et al.)
· Unintended inference at content level
· Cardinal Aggregation - LOW Data
· Separate Structures (Lunt)
· Use of Views (Wilson)
· Restrict Ad Hoc Queries
· Use of Audit
· Some or All Data HIGH
· Aggregation Constraints (Haigh and Stachour)
· Aggregation Detection (Motro et al.)
Database Instantiation Vulnerabilities:
· Labeling allows disclosure
· Constraint Processor (Thuraisingham et al.)
· DB Inference Controller (Buczkowski)
· Secondary Path Monitoring (Binns)
· Polyinstantiation (Denning)
· Auditing Facility (Haigh and Stachour)
· Snapshot Facility (Jajodia)
· Labeling allows unintended inference
Table 3.3: Research Summary (Part 2)

SECTION 4 APPROACHES FOR INFERENCE CONTROL Providing a solution to the inference problem is beyond the capability of currently available MLS database management systems. It is also recognized that a general solution to the inference problem is not possible given the inability to account for external knowledge and human reasoning [Binns 94a]. To provide partial solutions, the inference problem must be bounded. Section 4.1 identifies several efforts by researchers to formally define the inference problem. These approaches are summarized to provide different perspectives on the problem. Implementing theoretical inference solutions has proven to be difficult. Section 4.2 describes database design techniques that could be used to prevent inference problems from occurring. Mechanisms for inference prevention and detection during run-time are described in Section 4.3. Finally, Section 4.4 describes several tools that assist the database designer in detecting and avoiding inference channels. 4.1 FORMALIZATIONS Researchers have proposed numerous ways to characterize or define the inference problem. This section presents several formal definitions from the ongoing research into the discovery of fundamental laws that determine whether the potential for undesirable inferences exists. 4.1.1 Database Partitioning Database partitioning is used by Denning to define the inference problem [Denning 86a]. For each user U, the data in the database can be partitioned into two sets: a VISIBLE set and an INVISIBLE set, as in Figure 4.1. The user U is allowed to access only elements from the VISIBLE set, and is not allowed to know what elements are in the INVISIBLE set. 
(Note that if VISIBLE is null, the user has no access to database information and, therefore, the problem is outside the domain of the DBMS. In what follows, we assume that VISIBLE is not null.) Let KNOWN denote the set of data elements that is known to U. The set KNOWN is constructed by the user either as a result of the previous queries or as a result of some external knowledge that U possesses. There is no inference problem if the intersection of the two sets INVISIBLE and KNOWN is empty, as in Figure 4.2. There is an inference problem if the two sets INVISIBLE and KNOWN intersect, as in Figure 4.3. Figure 4.1: Partitioning of the Database for a User U Figure 4.2: Database without an Inference Problem Figure 4.3: Database with an Inference Problem 4.1.2 Set Theory Another formulation of the inference problem has been given by Goguen and Meseguer [Goguen 84]. Consider a database in which each data item is given an access class, and suppose that the set of access classes is partially ordered. Define the relation > as follows: Given data items x and y, relation x > y is said to hold if it is possible to infer y from x. The relation > is reflexive and transitive. A set S is said to be inferentially closed if whenever x is in S and x > y holds, then y belongs to S as well. Now, for an access class L, let E(L) denote the set consisting of all possible responses that are classified at access classes dominated by L. There is an inference problem if E(L) is not inferentially closed. Goguen and Meseguer do not set forth any one candidate for the relation >; they merely require that it be reflexive and transitive, and say that it will probably be generated according to some set of rules of inference (for example, first order logic, statistical inference, monotonic logic, and knowledge-based inference). 
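Goguen and Meseguer's condition lends itself to a direct check. A minimal sketch follows; the data items and the inference relation are hypothetical, since the paper leaves the relation > abstract (requiring only reflexivity and transitivity):

```python
def inferentially_closed(s, infers):
    """True if s is closed under >: whenever x is in s and x > y, y is in s."""
    return all(y in s for (x, y) in infers if x in s)

# Toy relation: the salary is inferable from the pay table.
infers = {("pay_table", "salary")}

# E(LOW) omits salary (classified HIGH) but includes the pay table,
# so E(LOW) is not inferentially closed: an inference problem exists.
print(inferentially_closed({"pay_table"}, infers))            # False
print(inferentially_closed({"pay_table", "salary"}, infers))  # True
```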
They do note, however, that for most inference systems of interest, determining that A > b (where A is a set of facts and b is a fact) is at best semidecidable; that is, there is an algorithm that will give the answer in a finite amount of time if A > b, but otherwise may never halt. 4.1.3 Classical Information Theory Yet another definition that uses classical information theory has been given by Denning and Morgenstern [Denning 86a]. Given two data items x and y, let H(y) denote the uncertainty of y, and let Hx(y) denote the uncertainty of y given x (where uncertainty is defined in the usual information-theoretic way). Then, the reduction in uncertainty of y given x is defined as follows:

INFER(x > y) = (H(y) - Hx(y)) / H(y)

The value of INFER(x > y) is between 0 and 1. If the value is 0, then it is impossible to infer any information about y from x. If the value is between 0 and 1, then y becomes somewhat more likely given x. If the value is 1, then y can be inferred given x. This formulation is especially useful since it shows that inference is not an absolute problem, but a relative one. It provides a way to quantify the bandwidth of the illegal information flow. On the other hand, Denning and Morgenstern point out its serious drawbacks [Denning 86a]: 1. It is difficult, if not impossible, to determine the value of Hx(y). 2. It does not take into account the computational complexity that is required to draw the inference. To illustrate the second point, they give the following example from cryptography: With few exceptions, the original text can be inferred from the encrypted text by trying all possible keys (so INFER(x > y) = 1 in this case); however, it is hardly practical to do so. 4.1.4 Functional and Multivalued Dependencies Su and Ozsoyoglu have studied inference problems that arise because of the functional and multivalued dependencies that are constraints over the attributes of a relation [Su 86,87,90]. 
Functional dependencies are defined below; the reader is referred to Date for a definition of multivalued dependencies [Date 86]. Let R be a relation scheme defined over a set of attributes U, and let X and Y be subsets of U. A functional dependency X > Y is said to hold in R if for any relation r, the current instance for R, r does not contain two tuples with the same values for X, but different values for Y; that is, given any pair of tuples t and s of r, t[X] = s[X] implies t[Y] = s[Y]. Su and Ozsoyoglu illustrate how inference problems can arise if certain functional dependencies are known to the low-level users. If attributes are assigned security labels in a manner that is consistent with the functional dependencies, then these inference threats can be eliminated. This process is formalized by Su and Ozsoyoglu. 4.1.5 Sphere of Influence Morgenstern expands upon the INFER function (described in Section 4.1.3) to propose a theoretical foundation for inference [Morgenstern 87,88]. He proposes a framework for the analysis of logical inferences and for the determination of logical inference channels. Morgenstern introduces the concepts of sphere of influence (SOI) and inference channel. The SOI delimits the scope of possible inferences given some base data, called the core, which may consist of entities, attributes, relationships, and constraints. Specifically, the sphere of influence relative to some core, i.e., SOI(core), is the set of all information that can be inferred from the core. The SOI is defined in terms of the INFER function. The SOI models the process by which a user's knowledge of an application can give rise to inferences about additional information. The SOI utilizes a forward chaining inference process from the given core to determine the scope of such inferable data. Morgenstern's concept of inference channel serves to isolate the lower level data which could give rise to inferences about higher level data H. 
The computation of an inference channel is a backward-chaining inference process from some resultant data H to determine all information which contributes to upward inferences about H. An inference channel exists if information about some set of data H may be inferred from data in another set C which is at a lower level than H, or is incomparable relative to H in the classification lattice. 4.1.6 The Fact Model Sowerbutts and Cordingley provide a formal definition of inference based on an abstract data model called the Fact Model [Cordingley 89, 90; Sowerbutts 90]. They identify two aspects of the inference problem: static inference and dynamic inference. Static inference occurs when knowledge of the database, together with authorized facts, can be used to determine unauthorized facts. Dynamic inference occurs when database operations can be used, together with knowledge of the structure of the database, to determine unauthorized facts. The Fact Model addresses static inference. The Fact Model is a model of knowledge described as a set of facts and a set of constraints. The authors provide a language for specifying application-dependent facts and constraints. The Fact Model provides a formal statement of the standard constraints, such as functional and multivalued dependencies, and defines explicitly the inference threat associated with each [Cordingley 89]. The relationship between facts and subfacts is defined by construction rules. An INFER function is defined on facts, constraints, and construction rules. Given an arbitrary subset of facts and constraints, the INFER function identifies all the facts that are implicit in the given subset, and hence can be inferred by a user. At this point the classification attribute can be added for each fact. This approach forms the basis for assigning classifications such that any facts inferable from a given subset of facts whose classifications are dominated by a classification level k also have a classification level dominated by k. 
The authors call this a Static Inferentially Secure Database. The Fact Model is also extended to represent operations such that a Dynamic Inferentially Secure Database is defined. The Fact Model was developed as an abstraction of the relational model but can be used to describe any conventional database structure, such as network or hierarchical. 4.1.7 Inference Secure Lin introduces the concept of navigational inference as the process of accessing HIGH data via navigating through legally accessible LOW data [Lin 92]. Lin states that inference problems exist if the security classifications of data are inconsistent with some database structures. He identifies three such inference problems: · Logical Inference Problems The security classification of a theorem (a derivable formula) in a formal theory should be consistent with its proof. If not, then logical inference problems arise. · Algebraic Inference Problems The security classification of "algebraically derivable data" should be consistent with its relational algebraic structure. If not, algebraic inference problems arise. · Navigational Inference Problems The security classification of data should be consistent with its navigational operators (which generate navigational paths). If not, navigational inference problems arise. Lin defines a multilevel data model as inference secure if: (1) it is a Bell-LaPadula Model (secure under MAC), and (2) it is navigational inference free. Lin provides further descriptions and theorems to support the definitions [Lin 92]. 4.1.8 An Inference Paradigm Marks defines database inference as [Marks 94b]: Inference in a database is said to occur if, by retrieving a set of tuples {T} having attributes {A} from the database, it is possible to specify a set of tuples {T'}, having attributes {A'}, where {T'} is not a subset of {T} or {A'} is not equal to {A}. 
The definition may be stated as an inference rule: IF ({T}, {A}) THEN ({T'}, {A'}), which may be denoted as ({T}, {A}) > ({T'}, {A'}). However, it may be possible to retrieve a set of tuples {T} and the associated attributes {A} from a database and to reason, using data and attributes outside the database, to arrive at a set of tuples {T'} and attributes {A'} that are again within the database. That is, it is possible to form a chain of reasoning, ({T}, {A}) > ({T1}, {A1}) > ({T2}, {A2}) > ... > ({T'}, {A'}), where some of the tuples and/or attributes are outside the database. When some of the data or attributes are outside of the database, the chain of reasoning cannot be followed by the database system. Fortunately, it is not necessary to actually follow such a chain of reasoning in order to control the inference threat. If the database contains the endpoints of the chain (({T}, {A}) and ({T'}, {A'})), there will exist what is referred to in logic systems as a material implication relating the two sets. If a material implication exists such that ({T}, {A}) > ({T'}, {A'}) where ({T}, {A}) is classified LOW and ({T'}, {A'}) is classified HIGH, then the database can offer no assurance that there does not exist some chain of inference, using outside knowledge, that can connect the two, enabling a LOW user to infer HIGH information. If, however, ({T}, {A}) not > ({T'}, {A'}) within the database, then it can be guaranteed that no chain of inference, using outside knowledge or not, exists which connects the two sets. That is, the absence of a material implication between two sets of data is sufficient to guarantee the absence of any chain of reasoning between the sets of data. It is not necessary for inference control, however, since material implications may be coincidental, and not related to any reasoning process. 
These arguments may be reduced to: Limitations on Database Inference: ({T}, {A}) > ({T'}, {A'}) is an inference rule capable of being controlled by the database if and only if all the tuples {T} and {T'} are in the database and all the properties {A} and {A'} are attributes in the database. 4.1.9 Imprecise Inference Control Hale, Threet, and Shenoi introduce a powerful, yet practical, formalism for modeling and controlling imprecise functional dependency (FD) based inference in relational database systems [Hale 94]. The existence of an imprecise FD implies that if some tuple components satisfy certain equivalencies, then other tuple components must exist and their values must be equivalent. Imprecise FDs can specify constraints on precise and imprecise data. Examples of imprecise FDs are: "engineers have starting salaries about 40K" and "approximately equal qualifications and more or less equal experience demand similar salary." An imprecise FD is formally defined, from which the authors formally define an imprecise inference channel. Informally, an imprecise inference channel is a chain of imprecise FDs. To control and ultimately eliminate compromising imprecise inferences, it is necessary to specify a set of imprecise inference channels considered to be secure. This compromise specification set is defined by the database administrator. Suspect channels in a database are compared with these secure channels. A potential security compromise exists when a suspect channel allows the inference of information which is not coarser than information inferred by any secure channel. Potentially compromising imprecise inference channels are eliminated by hiding relation attributes or by "clouding" data manifested by imprecise FDs in the compromising channels. 
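The chain-of-dependencies view that underlies several of these formalizations (FD chains in Section 4.1.4, imprecise FD chains in Section 4.1.9) can be sketched as a simple reachability check. The attribute names below are hypothetical, and the similarity measures that real imprecise FDs carry are ignored:

```python
def has_channel(fds, start, target):
    """True if target is reachable from start via a chain of dependencies,
    i.e., a (possibly multi-step) inference channel exists."""
    reachable, frontier = {start}, [start]
    while frontier:
        x = frontier.pop()
        for lhs, rhs in fds:
            if lhs == x and rhs not in reachable:
                reachable.add(rhs)
                frontier.append(rhs)
    return target in reachable

# qualifications -> grade and grade -> salary chain to form a channel.
fds = [("qualifications", "grade"), ("grade", "salary")]
print(has_channel(fds, "qualifications", "salary"))  # True
print(has_channel(fds, "salary", "qualifications"))  # False
```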
4.2 DATABASE DESIGN TECHNIQUES Deciding how to label multilevel database objects - data, schemata, and constraints - should not be a problem for cleared database designers and users [Millen 91]. They ought to know the classification of whatever they are entering into the database. Unfortunately, many inferences are not simple ones, and the number and complexity of potential inferences can be quite large. This raises the question of how well a person can anticipate such inferences in order to classify the data [Morgenstern 88]. The following subsections examine techniques for avoiding inference problems in the first place, and how to detect inference channels during database design. 4.2.1 Inference Prevention Methods Ideally, if all unauthorized disclosures are to be prevented, the Basic Security Principle [Meadows 88a, 88b] should be followed: Basic Security Principle for Multilevel Databases - The security class of a data item should dominate the security classes of all data affecting it. The reason for the Basic Security Principle is clear: if the value of a data item can be affected by data at levels not dominated by its own level, information can flow into the data item from data at other levels. The task of predicting or detecting all inference problems appears to be very difficult. However, many of these problems can easily be prevented by careful consideration of the data items on which a data item is predicated. In practice, unfortunately, two problems with the Basic Security Principle have been identified: 1. The number of potential inferences can be quite large. 2. All possible inferences cannot be anticipated. The combination of these problems makes the task of appropriately classifying data complex. However, several techniques that have been developed for dealing with inferences are discussed below. 
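A check of the Basic Security Principle over a dependency mapping can be sketched as follows; the item names, the affected-by mapping, and the simple total order of levels are all illustrative assumptions (a real system would test dominance in a security lattice):

```python
# Hypothetical total order standing in for a classification lattice.
LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}

def bsp_violations(label, affected_by):
    """Items whose level fails to dominate the level of some item
    affecting them, i.e., violations of the Basic Security Principle."""
    return [item for item, sources in affected_by.items()
            if any(LEVELS[label[s]] > LEVELS[label[item]] for s in sources)]

label = {"budget": "HIGH", "total": "LOW"}
affected_by = {"total": ["budget"]}        # total is derived from budget
print(bsp_violations(label, affected_by))  # ['total']
```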
If information x permits disclosure of information y, one way to prevent this disclosure is to reclassify all or part of information x such that it is no longer possible to derive y from the disclosed subset of x. There are two approaches to dealing with violations of this type: reclassify either the data or the constraints appropriately. These approaches are discussed next. 4.2.1.1 Appropriate Labeling of Data One of the approaches suggested to handle the inference problem is to design the multilevel database in such a way that the Basic Security Principle is maintained [Binns 92a; Burns 92; Hinke 92; Garvey 93; Smith 90b; Thuraisingham 90c]. Security constraints are processed during multilevel database design and subsequently the schemas are assigned appropriate security levels. Several researchers have proposed that semantic database models be used to detect (and then prevent) some inference problems [Buczkowski 90; Hinke 88; Smith 90b, 91]. Conventional data models (such as hierarchical, network, and relational data models) use overly simple data structures (such as trees, graphs, or tables) to model an application environment. Semantic database models, on the other hand, attempt to capture more of the meaning of the data by providing a richer set of modeling constructs. Since integrity and secrecy constraints can be expressed naturally in semantic database models [Smith 90b], they can be used to detect inference problems during the database design phase. Classification (or secrecy) constraints can also be used to eliminate inference problems [Akl 87]. Classification constraints are rules that are used to assign security levels to data as they are entered into the database. Classification constraints are required to be consistent (meaning that each value is assigned a unique security level) and complete (meaning that rules assign each value a security level). 
Inconsistent classification constraints indicate potential inference problems, and incomplete classification constraints point to incomplete labeling rules. Two different methods have been proposed for determining the consistency and completeness of the classification constraints. One is based on logic programming [Denning 86b], and the other is based on computational geometry [Akl 87]. Morgenstern views the overall process of classifying a database as a constraint satisfaction problem of the following form [Morgenstern 88]. Each potential inference which involves one or more relations or objects (data) from the database serves as a constraint. It limits the classifications which can be assigned to an object given the classification(s) of the other object(s). In some cases, the constraint may uniquely determine the classification of the remaining data object. Morgenstern defines safety for inference: a data object O, and its assigned classification label, are said to be safe for inference if there is no upward inference possible from O. That is, object O cannot be used to infer information about other data objects at higher or incomparable levels of the classification lattice. A classification level is safe for inference if all the data objects having that classification are safe. The techniques described in this section rely on increasing the classification of one or more elements in an offending path following its detection. Note that in practice, however, it is not always possible to solve inference problems by raising the classification level of data that cause inference violations. Some data may be public knowledge, and therefore, cannot be reclassified. In some cases, certain data must be made public due to legal requirements or national policy. Binns also points out that an upgraded element may open apparently new inference channels [Binns 94b]. 
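Consistency and completeness of classification constraints can be illustrated with a small check; the rules and salary values below are hypothetical, not drawn from the cited systems:

```python
def check_rules(rules, values):
    """A rule is (predicate, level). Consistent: no value matches rules at
    two different levels. Complete: every value matches at least one rule."""
    inconsistent, unlabeled = [], []
    for v in values:
        levels = {lvl for pred, lvl in rules if pred(v)}
        if len(levels) > 1:
            inconsistent.append(v)
        elif not levels:
            unlabeled.append(v)
    return inconsistent, unlabeled

# The two rules overlap between 40,000 and 50,000: a consistency flaw.
rules = [(lambda s: s < 50_000, "LOW"), (lambda s: s >= 40_000, "HIGH")]
print(check_rules(rules, [30_000, 45_000, 60_000]))  # ([45000], [])
```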
4.2.1.2 Appropriate Labeling Integrity Constraints There are two ways of handling an explicit constraint defined over several data fields. One is to leave the constraint UNCLASSIFIED, but to identify the LOW data that, together with the constraint, will allow users to infer information about the HIGH data, and to reclassify that LOW data upwards. This approach satisfies the Basic Security Principle, but at the cost of reducing data availability to low-level users. The other approach is to classify the constraint itself at the least upper bound of the security classes of the data over which it is defined. This approach satisfies the Basic Security Principle and allows low-level users access to data, but at the possible cost of data correctness; data entered by a low-level user may not satisfy the constraint. This problem can be partially solved by allowing polyinstantiation to take place [Denning 87] and is briefly discussed in Section 4.3.2. When data at a LOW security level is entered, it is not checked against the constraint. However, the LOW data does not replace the HIGH data; both are stored in the database. Such an approach will guarantee correctness from the point of view of the HIGH user. Inference controls mean that complete data correctness is not always possible for LOW users. Since implicit constraints are always available to users whether or not they are allowed access to them in the database, it is useless to attempt to reduce inferences by classifying implicit constraints. Therefore, the data over which the constraints are defined must be uniformly classified. Since the data over which the constraints are defined may be overlapping, it is difficult to assign security levels in a consistent manner while limiting the damage to data availability. 
For example, Su has shown that the problem of classifying data so that data availability is maximized, and the security class of each key in an implicit functional dependency dominates the classes of all data depending on that key, is NP-complete [Su 86] (see also [Su 87,90]). 4.2.2 Inference Detection Methods Inference through secondary path analysis can be characterized as multiple paths between two attributes such that the paths' security levels are inconsistent [Binns 92a]. Given a large enough database, it is not likely that a database designer can be aware of all possible interactions between tables. Alternative (hence secondary) paths may exist that ultimately relate two attributes in a manner unforeseen by the designer. If the classification of the secondary path is different from that of the primary path, then a classification inconsistency exists. Classification inconsistency is not acceptable regardless of which path is conceived to be "primary". Binns describes an approach that aids the database designer by identifying potential aggregates and inference paths automatically. The use of conceptual structures to handle the inference problem was first proposed by Hinke [Hinke 88]. A semantic network can be regarded as a collection of nodes connected via links. The nodes represent entities, events, and concepts. The links represent relationships between the nodes. Numerous variations of semantic networks have been proposed, including the conceptual graphs of Sowa, and are called conceptual structures [Sowa 84]. Conceptual structures are useful in that they can represent and reason about the real world much as humans do. The database security research community became interested in conceptual structures because they can be used to represent a multilevel application [Thuraisingham 91]. Hinke showed how inference may be detected by traversing alternate paths between two nodes in the graph. Graph traversal in semantic nets corresponds to limited inference. 
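Alternate-path detection of this kind can be sketched as path enumeration over a toy schema graph, where a path's classification is the least upper bound (here simply the maximum) of the labels on its links; the attributes and labels below are hypothetical:

```python
LEVELS = {"LOW": 0, "HIGH": 1}

def path_levels(edges, src, dst, level=0, seen=None):
    """Yield the classification of every simple path from src to dst."""
    seen = seen or {src}
    for a, b, lvl in edges:
        if a == src and b not in seen:
            nxt = max(level, LEVELS[lvl])
            if b == dst:
                yield nxt
            else:
                yield from path_levels(edges, b, dst, nxt, seen | {b})

edges = [("name", "salary", "HIGH"),                            # primary path
         ("name", "grade", "LOW"), ("grade", "salary", "LOW")]  # secondary path
# Two paths between the same attributes carry different levels:
# a classification inconsistency, hence a potential inference channel.
print(sorted(set(path_levels(edges, "name", "salary"))))  # [0, 1]
```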
The classification of implied links found by graph traversal is determined by the classifications of the traversed links. The technique enables detection of simple inferences via the transitivity property, but does not enable the detection of more complex inferences. Others have since shown how to represent multilevel database applications using conceptual structures developed for knowledge-based system applications and subsequently how to reason about the application using deduction techniques [Thuraisingham 90f, 91; Delugach 92, 93; Hinke 92; Garvey 92]. Auxiliary semantic nets can be used to express security constraints. Smith suggests extensions to Urban's semantic data model to represent multilevel applications [Smith 90b; Urban 90]. Smith states that eventually the representation could be translated into statements of a logic programming system that could then be used to detect inferences. 4.3 RUN-TIME MECHANISMS The previous section contained many different methods of controlling inferences through careful database design. The methods surveyed in Section 4.2 are, for the most part, application dependent. For example, it was seen that using proper database design, using the designer's knowledge about the data, and classifying data at an appropriate level all help with inference control. These activities occur outside a DBMS and, therefore, are not within its sphere of control. However, certain run-time mechanisms can be provided to help with other approaches. These mechanisms are listed below and discussed in the remainder of the section. 1. Query response modification. Users can build reservoirs of knowledge from responses that they receive by querying the database, and it is from this reservoir of knowledge that they infer unauthorized information. Two methods to address this problem are 1) to restrict queries by modifying the query appropriately during the query evaluation phase, and 2) to modify query responses (see Section 4.3.1). 2. Polyinstantiation. 
Some inference threats arise due to key integrity, which is a basic integrity requirement of the relational model, and these violations can be eliminated if the DBMS allows the use of polyinstantiation (see Section 4.3.2). 3. Audit Facility. Since auditing can be used to detect inferences, a DBMS should provide an adequate audit facility so user activities can be monitored and analyzed for possible inference violations (see Section 4.3.3). 4. Snapshot Facility. There are situations in which it is desirable to release some HIGH information to LOW users for reasons of convenience or necessity (as in the case of the U.S. Bureau of the Census described in Section 4.3.4). In these cases, snapshots can be used effectively to eliminate or minimize the harm done by release of HIGH information. Thus, it is desirable that multilevel secure DBMSs provide a snapshot facility as well. 4.3.1 Query Response Modification To block an inference channel during a session, user queries can be restricted or the query results modified. Diminishing the potential of query restriction control is the finding that, through certain sequences of seemingly innocuous unrestricted queries called trackers, the answers to restricted queries can always be deduced [Denning 78]. Therefore, response modification, which introduces uncertainty so as to reduce the risk of disclosure, has been the focus of much run-time inference control research. 4.3.1.1 Constraint Processors Some MITRE research focuses on handling inferences during query processing [Keefe 89; Thuraisingham 87, 88, 90a]. The query modification technique is illustrated in Figure 4.4 below [Ford 90]. MITRE has augmented Hewlett Packard's Allbase with a logic-based Constraint Processor and knowledge base in their current prototyping effort [Thuraisingham 94]. Security constraints contained in the knowledge base are utilized by the Constraint Processor to modify queries to prevent security violations through inference. 
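The general flavor of query modification can be sketched as appending a dominance predicate to a user's query. The SQL text, the tuple_level column, and the numeric levels below are illustrative assumptions; the actual Constraint Processor works from a logic-based knowledge base rather than simple string rewriting:

```python
LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}

def modify_query(sql, user_level):
    """Append a predicate so only tuples dominated by the user's level
    participate in evaluating the query."""
    return f"{sql} AND tuple_level <= {LEVELS[user_level]}"

q = "SELECT name, grade FROM employee WHERE dept = 'ATC'"
print(modify_query(q, "LOW"))
# SELECT name, grade FROM employee WHERE dept = 'ATC' AND tuple_level <= 1
```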
Factors used by the Constraint Processor for query modification and response sanitization include 1) the security constraints, 2) the previously fetched tuples, and 3) external knowledge. Figure 4.4: Query Processor An Update Constraint Processor was developed to accurately label data as it is being loaded into the database during database update or design. This alleviated the need for the Constraint Processor to process all constraint types, which could impact query response time. The Update Constraint Processor is designed to operate with the Constraint Processor to accurately label data associated with the simple and context-dependent constraints defined for an application. The prototyping effort has a goal of extending from a centralized to a distributed environment with trusted DBMSs interconnected via a trusted network. MITRE implemented modules to augment the Constraint Processors to process security constraints in a distributed environment during the query and update operations. In addition to the citations above, results of the MITRE-based research on the inference problem, including previous prototyping efforts, have been presented at various security conferences and documented in MITRE technical reports [Thuraisingham 89a, 89b, 90b-f; Collins 91]. 4.3.1.2 Database Inference Controller The Database Inference Controller (DBIC) proposed by Ford Aerospace was a knowledge-based tool or set of tools used by the System Security Officer (SSO) to detect and correct logical inferences [Buczkowski 89a, 90]. The DBIC system configuration is given in Figure 4.5 [Buczkowski 89b]: Figure 4.5: DBIC System Configuration Rules would cause the Inference Engine to take the following actions: 1. Determine which queries (along with relevant data) are presented to the user during knowledge elicitation. 2. Check consistency and completeness of data provided by the user. 3. Initiate procedures to calculate the probability of inference and identify violations. 4. 
Determine how the Knowledge Base subsystem is updated with the data received from the user or procedure output. 5. Recommend classification changes to parameters to correct inference problems. The DBIC was based on Probabilistic Knowledge Modeling to identify inference, creating and using a Probabilistic Inference Network built on and integrated with a semantic model of the target database. The Probabilistic Inference Network is a structure that identifies the logical dependencies of classified parameters on aggregates of objects at lower classifications. The Ford Aerospace research culminated in a top-level DBIC design [Buczkowski 89b]. One of the conclusions of the effort was that aggregation is a special case of the inference problem. 4.3.1.3 Secondary Path Monitoring Section 4.2.2 describes secondary path analysis as an approach for the database designer to detect potential inference paths. Binns also notes that removing all potential inference paths is not always feasible and that run-time analysis, based on the potential inference channels detected but not removed during design, may be necessary [Binns 92a]. Path definition yields the attributes, relations, and order of traversal used to substantiate the path. Thus, the definition indicates what the path is. Given the potentially offending inference path, Binns proposes that run-time analysis could monitor its existence. Each time a relation in the primary or secondary path is modified, the run-time analysis would fire a trigger to detect if a classification inconsistency exists. Triggers that monitor the existence of an inference path must dominate both the primary and secondary paths, or run as a trusted process. The run-time mechanism can flag an active inference channel to the SSO. 4.3.1.4 MLS System Restrictions Inference problems can arise if queries are conditioned on data that are supposed to be invisible to the user. 
Inference problems of this type can be eliminated by ensuring that all data used in evaluating a query are dominated by the level of the user. In general, it is difficult to determine when a given operation can be replaced by an operation defined over data classified at lower security levels. Thus, it is necessary to develop algorithms for determining when an operation at a HIGH security level is equivalent to an operation at a LOW security level. Some work has been done on this problem by Denning [Denning 85]. There is a simpler way to deal with this type of inference problem. In certain implementations, the system will be unable to access the higher level data, and the query will not pose an inference problem. For example, a multilevel system rated at class B1 or higher implements a mandatory security policy that includes the simple security condition. Since only those relations which are dominated by the user's clearance level (UNCLASSIFIED in this case) will be available to a subject processing this query, such queries will not result in an inference violation based on access to higher level data. 4.3.2 Polyinstantiation A solution to inference violations that occur because of key integrity enforcement has been proposed by Denning et al. [Denning 87]. Suppose a user operating at the LOW security level wants to enter a tuple in a relation in which the data are classified at either the tuple or element granularity. Whenever a tuple with the same key at a higher security level already exists in the relation, both tuples are allowed to exist, but the security level is treated as part of the key. Thus, the tuple entered at the LOW security level and the tuple entered at the higher security level will always have different keys, since the keys will have different security levels. A multilevel relation is said to be polyinstantiated when it contains two or more tuples with the same apparent primary key values. 
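The key-integrity idea behind polyinstantiation can be illustrated with a minimal sketch; the relation structure, level names, and dominance ordering below are illustrative assumptions only:

```python
# Minimal sketch of polyinstantiation: the security level is treated as part
# of the primary key, so tuples with the same apparent key may coexist at
# different levels instead of one insert being rejected (a rejection would
# itself reveal the existence of the higher-level tuple).

class MLRelation:
    ORDER = ("UNCLASSIFIED", "LOW", "HIGH")   # illustrative total order

    def __init__(self):
        self.rows = {}                        # (apparent_key, level) -> data

    def insert(self, key, level, data):
        # Never rejects on the apparent key alone; (key, level) is the key.
        self.rows[(key, level)] = data

    def select(self, key, user_level):
        # A user sees the instance at the highest level he or she dominates.
        rank = self.ORDER.index
        dominated = [l for l in self.ORDER
                     if (key, l) in self.rows and rank(l) <= rank(user_level)]
        return self.rows[(key, dominated[-1])] if dominated else None
```

A HIGH tuple and a LOW tuple with the same apparent key "V1" can then coexist, each level of user seeing its own instance.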
The TDI companion document on Polyinstantiation [Poly 96] reviews this topic in greater depth. Garvey et al. suggest that development of tools for automatic inference channel detection and cover story generation using polyinstantiation be a long-term goal, to provide automated assistance for System Security Officers [Garvey 91c]. 4.3.3 Auditing Facility In some systems, auditing is used to control inferences [Haigh 90]. For instance, a history can be kept of all queries made by a user. Whenever the user makes a query, the history is analyzed to determine whether the response to this query, when correlated with the responses to earlier queries, could result in an inference violation. If a violation arises, the system can take appropriate action (for example, abort the query). The LOCK Data Views (LDV) project represents an early attempt at data-level monitoring of inference [Haigh 90, 91]. The project's goal is to design and build an MLS DBMS hosted on the Logical Coprocessor Kernel (LOCK). The LOCK system is being developed to satisfy the TCSEC A1 criteria. It includes a type manager that allows assured pipelines to be built to enforce application-specific extensions to the system security policy enforced by LOCK. The word "pipeline" is used because the type manager supports both encapsulation and the security policy by funneling an operation through subjects in different domains, each of which has access only to objects of designated types. The LDV project has used the type manager and assured pipeline feature to address DBMS-specific issues not covered by the LOCK security policy. This basic security policy is extended by incorporation of an explicit classification policy. The policy covers name-dependent, content-dependent, and context-dependent classification policies, as well as inference control [Stachour 90]. LDV maintains a set of history files that keeps track of which files have been accessed in answering queries from LOW users. 
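A history-file monitor of this kind might be sketched as follows; the compromising-set representation and the ALLOW/ABORT responses are hypothetical simplifications, not the LDV design:

```python
# Illustrative sketch of history-based inference control: track which files
# each LOW user has accessed and refuse the query that would complete a set
# of files known to compromise HIGH data. (A real system could instead
# upgrade the process clearance or the result classification.)

class AuditMonitor:
    def __init__(self, compromising_sets):
        self.history = {}                    # user -> set of files accessed
        self.compromising = compromising_sets

    def record_query(self, user, files):
        seen = self.history.setdefault(user, set())
        proposed = seen | set(files)
        # Abort if this query would complete a compromising set.
        for s in self.compromising:
            if s <= proposed:
                return "ABORT"
        seen |= set(files)
        return "ALLOW"
```

Note that, as the text observes, a per-user history does nothing against collusion: two users can each retrieve part of a compromising set without either history tripping the threshold.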
As soon as enough files have been accessed by LOW users to compromise LOW data, LDV upgrades either the clearance of the querying process, the classification of the query result, or the classification of the file that is being queried. The LDV project considers aggregation and inference as aspects of the same problem. Unlike previous methods that seek to prevent inference problems from arising (by reclassifying data or metadata), auditing seeks to detect inference problems when they are about to occur and to take preventive action. There is a side benefit of this approach: it may deter many inference attacks by threatening discovery of violations. There are three disadvantages of this approach: 1. It may be too cumbersome to be useful in practical situations. 2. It can detect only a limited class of inferences (since it is based on the hypothesis that a violation can always be detected by analyzing the audit records for abnormal behavior). 3. The length of time for which history files are kept will affect the amount of audit data recorded. In addition, collusion between users, in which each user retrieves some of the information, remains a difficult threat to counter and is not addressed by this method. 4.3.4 Snapshot Facility There are situations where it is possible to allow limited inferences [Jajodia 92]. These methods are useful in cases in which the inference bandwidth is so small that these violations do not pose any threat. A snapshot can be created based on a sanitized joined relation for LOW users. Although this snapshot cannot be updated automatically without opening a signaling channel, it can be kept updated by periodic recreation by a trusted subject. In off-line, static databases, the "snapshot" or "sanitized file" is an important technique for controlling inferences. In particular, this approach has been used quite effectively by the United States Bureau of the Census [Cox 86, 87; Denning 82]. 
The Bureau of the Census has the responsibility of releasing collected data for public (or unrestricted) use; however, the Bureau must do so in such a way that it is not possible to identify any individual or organization by use of the released data. To achieve both goals, the Bureau uses a conceptual model that is different from the on-line, general-purpose DBMS model [Cox 88]. Unlike the general-purpose DBMS model, which is shown in Figure 4.6, the Bureau constructs a separate database that is designed for public use. This public database is the sanitized version of the Bureau database (see Figure 4.7). Figure 4.7: Database Management System Model at the U. S. Bureau of the Census To sanitize the database for public use, the Bureau starts with a microdata file, which typically consists of a randomly selected sample from the database. All microrecords are first stripped of all identifiers (such as names, exact addresses, or social security numbers of individuals), and then different inference control methods are applied to reduce the risk of inference to an acceptable level. Some of these methods are listed below: 1. Subsampling: A second sample is chosen from the microdata to reduce further the probability of possible identification of individuals. Also, records having extreme values are eliminated. 2. Compression: The microrecords are merged together (e.g., income is reported in five-year intervals rather than every year). 3. Perturbation of data: Noise is added to data (by rounding off numerical values or by actually inserting bogus records). 4. Slicing: The microdata file is divided into several subfiles. While these methods work well in static databases (especially for statistical accesses), they must be used with great care in general-purpose DBMSs. Since these methods perturb the data by adding noise, they have an adverse effect on the integrity of data. 
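The four sanitization controls above can be sketched as a single pass over a microdata file; the field names, sampling rate, bin width, and rounding granularity are illustrative assumptions, not the Bureau's actual procedures:

```python
import random

# Hedged sketch of the four microdata controls described above:
# subsampling, compression (binning), perturbation (rounding), and slicing.
# All parameters are illustrative.

def sanitize(microdata, sample_rate=0.5, bin_width=5, grain=500, slices=2, seed=0):
    rng = random.Random(seed)
    # 1. Subsampling: keep a second random sample of the microrecords.
    sample = [r for r in microdata if rng.random() < sample_rate]
    for r in sample:
        r.pop("name", None)                       # strip identifiers first
        # 2. Compression: report age only in bin_width-year intervals.
        r["age"] = (r["age"] // bin_width) * bin_width
        # 3. Perturbation: round income to the nearest `grain` dollars.
        r["income"] = round(r["income"] / grain) * grain
    # 4. Slicing: divide the microdata file into several subfiles.
    return [sample[i::slices] for i in range(slices)]
```

Every released record then carries no identifier, a binned age, and a rounded income, and the file is split into subfiles that can be released separately.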
Moreover, these methods may be too time-consuming to apply dynamically to on-line databases (in which both statistical and non-statistical accesses must be provided). 4.4 DATABASE DESIGN TOOLS It is a difficult task for the trusted database designer to label data so that the labels accurately reflect the actual sensitivity of the data and adequately protect the information from inference. This section describes tools for detecting potential inference paths. Action can then be taken to minimize or eliminate these inference paths. The tools are based on the techniques described in Section 4.2.2. 4.4.1 DISSECT SRI International (SRI) has produced a prototype tool for detecting specific types of inference channels in a multilevel database. DISSECT (Database Inference System Security Tool) primarily detects inferences where two relations are connected by multiple paths with different security classifications. Once these multiple paths are detected, DISSECT constructs the security constraint formulas that, if satisfied, would eliminate the problem. However, it is up to the user of DISSECT to ascertain whether the paths represent relations that are semantically similar enough that a lower-classification path compromises a higher path. Selected constraints can then be satisfied by manually editing the schema or by an automatic upgrading procedure. A limited set of external information and assumptions are used. DISSECT accepts as input a multilevel database schema provided in the form of MSQL relation definitions (MSQL is the query language of the SeaView prototype MLS DBMS developed by SRI [Lunt 88]). DISSECT is interactive and allows the user to edit relevant features of the schema. 
DISSECT can discover a subclass of deductive inference channels (defined in Section 2.1.1) called compositional channels and, with the help of user-supplied type declarations, type-overlap channels: · Compositional channels - A relationship can be inferred between any pair of entities that are connected by a sequence of foreign-key relationships. An inference channel may exist if two relations connected by a high foreign-key relationship are also connected by a path of one or more low foreign-key relationships. If the two paths between the relations represent essentially the same relationship, this is certainly an inference channel, since the relationship is explicitly classified high but is accessible via an alternative path to a low user. · Type-overlap channels - Attributes whose types overlap are joinable. A type-overlap relationship occurs between two attributes when the two attributes have been declared to be of the same or overlapping types. This corresponds to the case in which a theorem proved using a high axiom is also provable from all low axioms. This may not be a problem, since consequences of high data are not necessarily high. These channels can be eliminated (by changing attribute classifications) or ignored (e.g., because they are not really problems, are unavoidable, or should be deferred to data-level handling) at the discretion of the user. Channels can be eliminated by editing the schema to raise or lower some attribute classifications. DISSECT also provides an automated means of upgrading attribute classifications to eliminate all inference channels chosen for elimination. User-specified costs of upgrading attributes express preferences for how channels should be eliminated. The SRI DISSECT research effort has been completed and documented [Garvey 94]. The DISSECT prototype has demonstrated its ability to detect and to optimally upgrade some attributes to automatically eliminate compositional and type-overlap inference channels. 
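Compositional-channel detection can be approximated by a simple graph search over the schema. The sketch below, which is not SRI's DISSECT algorithm, flags a HIGH foreign-key relationship whose endpoints are also connected by a path made entirely of LOW foreign-key relationships:

```python
from collections import defaultdict

# Illustrative compositional-channel check: relations are nodes, foreign-key
# relationships are edges labeled with a classification. A potential channel
# exists when a HIGH edge's endpoints are also joined by an all-LOW path.

def low_reachable(edges, start):
    """Relations reachable from `start` using only LOW foreign keys."""
    graph = defaultdict(set)
    for a, b, level in edges:
        if level == "LOW":
            graph[a].add(b)
            graph[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen

def compositional_channels(edges):
    return [(a, b) for a, b, level in edges
            if level == "HIGH" and b in low_reachable(edges, a)]
```

For instance, a HIGH Employee-Project relationship alongside LOW Employee-Department and Department-Project relationships would be flagged; whether the two paths really represent the same relationship remains, as in DISSECT, a judgment left to the user.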
Results of the DISSECT-based research have been presented at various security conferences and workshops [Stickel 94; Lunt 94; Qian 92, 93a, 93b, 94; Garvey 91a, 91b, 91c, 93; Binns 92a]. 4.4.2 AERIE The University of Alabama in Huntsville has developed an inference model called AERIE (Activities, Entities, and Relationships' Inference Effects). The project is examining approaches to inference modeling and detection using inference targets (sensitive data that are to be protected from unauthorized disclosure through inference). The approach requires that the specific information that is to be protected be named, and inference paths to this information will be sought. The approach describes the database schema, the data, and the sensitive target information using conceptual graphs. It utilizes a knowledge base of domain-specific information, expressible as a conceptual graph, to be used in inference analysis. A precursor to AERIE was the TRW inference engine analysis and the effort to discover inference rules from fundamental relationships among data that pertain to a domain [Hinke 90]. In TRW's Advanced Secure DBMS project, a tool was proposed to identify potential schema-level inference paths, and queries could be run periodically to detect the existence of data-level inference channels. One of the results of the effort was that the specific rules required to find the classified relationship did not have to be explicitly stated. The AERIE research is based on these results and on conceptual graph structures. AERIE extends the inference techniques suggested by Thuraisingham [Thuraisingham 90f] by classifying inference targets and using a two-phase inference approach. Two supporting prototype tools were developed: an inference detection tool called Merlin to validate the AERIE algorithms; and a rule-based, automated inference database generator called Genie to facilitate the development of a large, coherent set of inference-relevant instances. 
Merlin uses functional dependencies to detect second paths within schema-level data. Merlin adapts a transitive association algorithm for testing a relational database decomposition for the lossless join property [Elmasri 94]. The adapted algorithm is used in the detection of second-path inferences within schema-level data and includes subtype relationships between attributes within the schema. Merlin has demonstrated identification of a secondary path between two attributes and is able to categorize secondary paths based on schema classification. However, it is recognized that Merlin does not yet detect all legitimate inferences. Future plans of the AERIE project are to 1) move beyond functional dependency-based inference detection, 2) integrate path detection with path finding, and 3) adapt Merlin to handle catalog and instance levels of data. Genie was developed to facilitate the creation of an inference benchmark database for the testing of inference detection tools. The database created by Genie is to contain inferences that represent some coherent activity in some microworld, with the potential to grow to a size representative of a real database. Genie is encoded with rules to simulate activities of a microworld, providing the basis for the generation of coherent data. Results of the AERIE-based research have been presented at various conferences and workshops [Hinke 92, 93a, 93b, 94; Delugach 92, 93]. 4.4.3 Security Constraint Design Tool MITRE developed a tool for use during database design using an Informix DBMS [Thuraisingham 94]. The tool provides guidance in assigning minimum security classifications to the database schema, as well as establishing minimum security classifications for the database schema of an existing non-secure database migrating to an MLS environment. The design tool examines simple, associative-based, logical, and level-based security constraints and determines a minimum classification level for database attributes. 
This information can be utilized by the database designer in the design of an MLS database schema. 4.4.4 Hypersemantic Data Model and Language Section 4 began with the recognition that a general solution to the inference problem is not possible. The techniques, mechanisms, and tools described in Sections 4.2, 4.3, and 4.4 are partial solutions that complement each other. Unfortunately, the approaches are highly specialized. No single tool can handle all types of inference problems. Marks proposes that what is needed is a uniform representation scheme which can be transformed into specific representation schemes without much complexity [Marks 94a]. While this approach is not a database design tool in and of itself, once a schema has been expressed in the uniform scheme, various inference analysis approaches can be applied to it. For this objective, a hypersemantic data model called Multilevel Knowledge Data Model (MKDM), a graphical representation scheme called GRAPHICAL-MKDM, and a specification language called Multilevel Knowledge Data Language (MKDL) have all been developed. Future research could include: 1. Developing inference analysis tools for MKDM. 2. Developing transformations to other representations so that current inference analysis tools can be applied. 3. Testing tools with examples. 5.0 APPROACHES FOR AGGREGATION There is no general approach or mechanism for dealing with general aggregation problems, only specific measures taken to protect specifically defined aggregates. This section does not describe how to identify sensitive aggregates. Rather, it presents approaches for protecting certain classes of aggregation problems. A discussion on identifying aggregates is not presented because aggregates can be defined in almost any arbitrary way in order to protect almost any fact. As such, identifying an aggregate is totally up to the owner of the data. 
Once an aggregate which requires protection has been identified, it can ideally be assigned to one of the two sub-classes of aggregation problems defined below, and an appropriate approach from Section 5.1 or 5.2 can be used to protect that aggregate. If an aggregate does not cleanly fit into one of the two aggregation sub-classes, it may be a combination of the two forms of aggregation or another problem altogether, such as access control or inference. If an aggregate exhibits both aspects of the aggregation problem, it needs to be broken down into its constituent parts and each part protected appropriately. This section describes the various approaches that have been identified for dealing with specific classes of aggregation problems. Some of these approaches can be used now, while others are still the subject of ongoing research. As stated in Section 1, the aggregation problem occurs when each individual data item is less sensitive than the collection (or aggregation) of all the data items. This definition distinguishes the aggregation problem from the inference problem, which uses LOW data to infer HIGH data, rather than a collection of LOW data being considered HIGH because of the quantity of, or relationships between, the data. Aggregation has been divided into two sub-problems [Jajodia 92]: · Data Association - associations between instances of different data items. · Cardinal Aggregation - the associations among multiple instances of the same entity. 5.1 DATA ASSOCIATION This section presents the research to date on the data association problem. It specifically describes the work being done to formalize the description of the data association problem and possible solutions, and then presents the primary technique that has been identified for solving specific instances of this problem. No tools for detecting data association problems, nor mechanisms requiring changes to standard DBMSs, are presented, as none have been developed or proposed. 
5.1.1 Formalizations Researchers have attempted to apply mathematical approaches to the data aggregation problem, especially in the context of MLS DBMS implementations. This section discusses the approaches that have been proposed to date. 5.1.1.1 Lin's Work Lin applies a security algebra defined over classifications and extended over relational database operations to mathematically express the aggregation problem [Lin 89a]. The security algebra works with classification labels under the constraints of Denning's classification lattice algebra [Denning 76]. Lin defines a homomorphism to relate the classification of a relational operation's output to the individual classifications of the operation's operands. The reader needs to pay close attention to the algebra's notation, as Lin's analysis requires comparisons between purely label-based operations on classification labels under the lattice algebra (called `algebraic classification') and the real-world security class of data (the classification of operations on data). The algebra of label manipulation is fully intuitive to readers familiar with Bell-LaPadula [Bell 76]: the label of a process that can read a set of data needs to dominate the least upper bound of the labels on its individual inputs; the label on a process that creates a set of outputs needs to be dominated by the greatest lower bound of the labels on the individual outputs. Lin observes that an aggregation problem exists when this dominance constraint fails to hold as an equality -- i.e., when the real-world classification of a set of inputs fails to exactly equal the least upper bound of their individual classifications. 
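Lin's dominance test can be stated concretely for a small, totally ordered example lattice; the level names and function names below are illustrative only:

```python
# Minimal sketch of the label algebra underlying Lin's analysis, using a
# totally ordered three-level lattice (a special case of a general lattice).

LEVELS = ["UNCLASSIFIED", "LOW", "HIGH"]      # illustrative ordering

def lub(labels):
    """Least upper bound of a collection of labels (max in a total order)."""
    return max(labels, key=LEVELS.index)

def aggregation_problem(real_world_label, member_labels):
    # An aggregation problem (in the traditional, data-association sense)
    # exists when the real-world classification of the set strictly
    # dominates the least upper bound of its members' labels.
    return LEVELS.index(real_world_label) > LEVELS.index(lub(member_labels))
```

Thus a set of LOW and UNCLASSIFIED items whose real-world classification is HIGH exhibits an aggregation problem, while a set whose classification equals the lub of its members does not.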
He allows for the traditional case (the classification of the set strictly dominates the classification of its members), but also for the case where the real-world classification of an operation is strictly dominated by the least upper bound of the inputs to the operation, as in the results of certain aggregate functions (`built-in' operators like SUM, AVERAGE, etc.). This algebra is then used to formally define the `aggregation problem for a database'. Lin deals with the data-association aggregation problem, and not with the cardinality aggregation problem, and defines an aggregation to be `simple' if it is a set of data whose real-world classification strictly dominates the algebraic classifications of each of its constituent elements, with the further constraint that no proper subset of the set has this property (i.e., the classification of any proper subset equals the least upper bound of its constituents' classifications). Since under this definition the removal of any element(s) from a simple aggregate suffices to `solve' the aggregation problem, Lin presents an algorithm that requires the SSO to identify sets of elements that can be removed from a simple aggregate and reclassified at a strictly dominating level in order to ensure that none of the resulting collections could result in disclosure of a simple aggregate to inadequately cleared users [Lin 89b]. Using this technique, it is up to the owner of the data to decide which sets of data are to be more highly protected and which are to be left at their original classification. Lin extends this work somewhat in [Lin 90a, 91]. An example problem that his technique would address would be if we had the following simple aggregates: · A, B, C · A, C, F · D, F, Y, Z · E, F, G, H where a `simple aggregation problem' is as defined above. 
Given these four simple aggregates, Lin's technique would identify that A or C, and F, were important elements that must either be protected, or all other elements need to be protected in order to protect the aggregates. For example, if elements A and F were marked HIGH, and all other elements were marked LOW, then no user could compile all the elements of these aggregates without accessing HIGH data. Lin's method would identify that C was equivalent to A, such that it could be protected instead. Conversely, access to A and F, or C and F, could be granted at LOW while all other data was protected at HIGH, and the aggregates would be equivalently protected. It is up to the owner of the data to decide whether it is more important to make available the largest number of elements, or certain critical or more interesting elements, while protecting enough other elements to ensure that an aggregate cannot be collected together solely at the LOW level. Lin goes on to define an important homomorphism, H, to relate the operations of relational algebra and security classification algebra. It is essential that this mapping be a homomorphism, and that it satisfy commutative mapping diagrams (i.e., that the mapping under H into a classification of a relational operation on two operands equal the algebraic classification of the operands' mapped labels in the label-combining algebra -- e.g., the classification of a join equals the least upper bound of the classifications of the join's arguments). Aggregation problems exist when equality fails. Lin notes that there is a need to consider aggregations that could arise if the homomorphism were non-commutative, indicating that this case will be addressed in a forthcoming paper [Lin 90b, 91]. Lin extends the work on simple aggregation problems to towers of aggregation problems [Lin 91]. 
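The choice of elements to upgrade can be viewed as a hitting-set problem over the simple aggregates: upgrade a small set of elements to HIGH so that every aggregate contains at least one protected member. The greedy sketch below is a heuristic illustration, not Lin's algorithm, and happens to reproduce the A-and-F selection from the four aggregates above:

```python
# Greedy hitting-set sketch: repeatedly upgrade the element appearing in
# the most still-unprotected simple aggregates, until every aggregate
# contains at least one HIGH element. Ties break alphabetically so the
# result is deterministic.

def greedy_upgrade(aggregates):
    aggregates = [set(a) for a in aggregates]
    protected = set()
    while aggregates:
        counts = {}
        for agg in aggregates:
            for e in agg:
                counts[e] = counts.get(e, 0) + 1
        best = max(sorted(counts), key=lambda e: counts[e])
        protected.add(best)
        aggregates = [a for a in aggregates if best not in a]
    return protected
```

As the text notes, the data owner may instead prefer to keep the frequently shared elements (such as F) available at LOW and protect the remaining elements; the heuristic only illustrates one side of that trade-off.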
The real-world classification of a tower strictly dominates the least upper bound of the classifications of the tower's components, but there exist proper subsets of the tower that are simple aggregation problems. That is, given · Individual elements A, B, and C may be UNCLASSIFIED · A, B may be a LOW simple aggregation problem · A, B, C may be a HIGH tower aggregation problem Lin would solve this by classifying, say, B as LOW and replacing it with an UNCLASSIFIED element E (e.g., a cover story). This would reduce the HIGH tower aggregation problem to A, E, C, which may not be a problem or which may now be a simple aggregation problem. Lin also discusses the case of statistical aggregation problems, where the application of an aggregate function (built-in operator) produces an output whose classification is strictly dominated by the least upper bound of its inputs [Lin 91]. 5.1.1.2 Cuppens' Method Cuppens has defined a formal method for addressing the aggregation problem which he refers to as a modal logic framework [Cuppens 91]. This logic is based on modeling what a user knows as compared to what a user has permission to know. This logic framework can be used to model a security policy with aggregation and then prove whether a particular specification is secure or not. It does not detect aggregation in a database management system but rather determines whether a particular specification properly protects data as specified by a security policy. An aggregate must be predefined as part of the security policy in order for the technique to determine whether a particular specification will properly protect the data as required by the security policy. 5.1.2 Techniques Since the data association problem is concerned with the sensitive association between two or more distinct, less sensitive data elements, Denning [Denning 86a] and Lunt [Lunt 89] have both proposed fairly straightforward solutions to this problem. 
They point out that since a quantifiable entity (the sensitive relationship) requires more protection than the individual elements, database design and access controls can be used to control the data association problem. The basic solution to this problem is for the individual data items to be protected as normal, but the ability to associate the items together is removed from each item's relation. A separate relation is then created which describes the relationship between the items and is protected appropriately. Assuming an MLS DBMS is available, this new relation is classified at the proper level for the aggregate. Because the data items cannot be aggregated without reference to the higher level relation, only users cleared for the level of the aggregate will be able to view the aggregate as an entity. Less cleared users will be able to access the individual elements at their proper classification. This approach has the additional advantage that the mandatory access control mechanism will ensure that the aggregate is properly labeled at the least upper bound of the accessed data, in this case the new relation's higher classification. Using our fictional company ATC, we see two examples of data association. In the ATC database, Project names are UNCLASSIFIED while Client names are LOW, but the association between a Project and its sponsoring Client is classified HIGH. This example of inter-entity data association is properly protected by placing each entity, as well as the relationship between them, in separate tables that can be independently labeled with the appropriate classification. An example of intra-entity data association is the fact that all information about an employee is UNCLASSIFIED, except for the employee's salary, which must be protected at LOW. To properly protect the association between the employee and the employee's salary, the salary field is stored in a table separate from the rest of the employee data and protected at LOW. 
These two associations and their solutions are shown as follows: Figure 5.1: ATC Company Data Association Examples If an MLS database is not available, then these more sensitive relationships will have to be removed from the database or protected using discretionary access controls (DAC). The SSO can define roles or individual users that are associated with each type of data, and then each user or role can be restricted to access only the appropriate tables. The restrictions can be in the form of DAC on the underlying base tables, as well as predefined views that are restricted to the appropriate users or roles. For the example above, to use DAC, one would protect the Client-Project and Employee-Salary tables with DAC controls that restrict access to only those users who should be able to see this information. DAC may be a reasonable solution for mitigating the data association problem in some cases, but, given the inherent flaws of DAC, it may not provide sufficient assurance that the data are properly separated, depending on the sensitivity of the aggregate. Such a solution can be used to separate categories of data (e.g., budget, sales, or cost data). However, if aggregating all the data within a specific category is an issue, this solution will not, for example, prevent a budget clerk from accessing all of the budget data. Accessing all the data within a particular category is a cardinal aggregation issue, and approaches to this problem are discussed in the next section. The fundamental technique, then, for solving aggregation problems that fit into the sub-class called data association is to first identify whether a more sensitive aggregate is truly a data association problem, and if so, to properly construct the database to protect the more sensitive relationship. This protection implies separating the entities' relationship and defining a separate table which associates the entities. 
The table that relates the two entities together must then be appropriately protected at the level of the aggregate. This solution protects each entity at the appropriate level through careful database design, without the need to overclassify any individual element of the data. 5.2 CARDINAL AGGREGATION A fairly straightforward technique was presented above for solving the data association problem. Unfortunately, similarly straightforward techniques that address the cardinal aggregation problem in general have not yet been identified. A number of techniques and mechanisms have, however, been proposed to help; this section presents those that have been identified. No tools or formal methods have yet been developed for dealing with the cardinal aggregation problem. 5.2.1 Techniques This section describes those techniques that rely solely on standard capabilities available in the typical commercial DBMS, as well as those that would be expected to be present in Multilevel Secure (MLS) DBMSs. The following techniques each address a certain aspect of the cardinal aggregation problem. They vary in complexity, effectiveness, and side effects or drawbacks. Remember that aggregation is a concern only when the security policy for a system specifically indicates that a collection of less sensitive items is considered more sensitive when aggregated together. If the policy does not indicate this, then no aggregates have been defined which must be dealt with. Once an aggregate has been identified, the most appropriate approach for dealing with that particular aggregate will depend on its nature. The number of aggregates and the interrelationships between them will also affect which techniques would be most appropriate. To date, most proposed solutions to the cardinal aggregation problem do not consider interrelationships between aggregates or the complexity those interrelationships can introduce.
Such interrelationships can make it more difficult to provide acceptable solutions which balance the need to protect sensitive aggregates while still providing access to the less sensitive individual elements. 5.2.1.1 Single Level DBMS Techniques Cardinal aggregation control is a problem in databases used in commercial applications as well as databases used in secure, system-high applications. In both these environments, techniques used for cardinal aggregation control have been developed in an ad hoc fashion and not uniformly applied. Most of these techniques are administrative and physical; examples are control and review of hard copy output and segregation of data across multiple machines. The following techniques rely on the use of DBMS security mechanisms and can be applied using almost any commercial DBMS. 5.2.1.1.1 Database Design and Use of DAC One of the simplest techniques to minimize cardinal aggregation in single level databases is to divide data that could aggregate into separate structures [Lunt 89]. Each of these separate structures could then be protected with discretionary controls that restrict individual users from aggregating enough information to construct an aggregate whose size was considered more sensitive than the individual elements. For example, if there were 1000 data elements which were divided into 10 blocks of 100, and an aggregate of over 300 was considered sensitive, then each user could be limited to accessing at most three of the 10 data blocks. The particular three blocks may be selected arbitrarily or assigned based on need. Such a structure would mitigate the cardinal aggregation problem to some degree, but it has the significant drawback that users with a legitimate need to access certain individual elements may be prevented from doing so, depending on whether the element resides in a part they are permitted to see or in a part to which their access is restricted.
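The partitioning arithmetic above can be sketched as follows. This is a minimal sketch with hypothetical helper names; in a real deployment the per-user block limit would be enforced by granting DAC permissions on the underlying structures, not by application code:

```python
# Illustrative numbers from the example: 1000 elements, 10 blocks of 100,
# and an aggregate of over 300 elements considered sensitive.
ELEMENTS = 1000
BLOCKS = 10
BLOCK_SIZE = ELEMENTS // BLOCKS                       # 100 elements per block
SENSITIVE_SIZE = 300                                  # over 300 is sensitive
MAX_BLOCKS_PER_USER = SENSITIVE_SIZE // BLOCK_SIZE    # 3 blocks = 300 elements

user_blocks = {}                                      # user -> granted block ids

def grant_block(user, block_id):
    """Grant access to a block unless it would exceed the per-user limit."""
    held = user_blocks.setdefault(user, set())
    if block_id in held:
        return True                     # already granted, nothing new
    if len(held) >= MAX_BLOCKS_PER_USER:
        return False                    # would permit a sensitive aggregate
    held.add(block_id)
    return True
```

A user holding the maximum three blocks can reach at most 300 elements, which does not exceed the sensitive size in this example.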
It is also difficult to administer such a structure if the amount of data rapidly changes over time, or the number of users changes to any significant degree. If the data in an aggregate can be distinguished in some way (e.g., budget, sales, and cost data), it becomes easier to decide how to divide access to the aggregate among sets of users. However, this is a data association problem, not a cardinal aggregation problem, since cardinal aggregation deals only with multiple instances of the same entity. A single aggregate can have both data association and cardinal aggregation aspects to it. As in the budget, sales, and cost example just discussed, the distinction between the elements is a data association problem, while the collection of a sufficient number of elements within a single category is a cardinal aggregation problem. Such an aggregate needs to be broken into its constituent components, and each aspect of the problem dealt with using its respective technique. An inherent vulnerability in any solution relying on DAC is that any user with access to data can make that data available to others, because of the limitations of DAC in controlling data propagation. As such, this type of solution may not be sufficiently strong to protect the data, in addition to the inherent drawback of restricting access to data from users with a legitimate need for it. 5.2.1.1.2 Use of Views Views can be used to provide access to specific portions of the database. Use of a view can then be granted to individuals who might not have the right to access the underlying base relations. Views can thus be used to more precisely control what a user sees when accessing the database. These controls can help protect aggregates in a number of ways. Summary and grouping data can be presented that shields users from accessing individual elements.
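A summary view of the kind just described can be sketched as follows. The salary table and view names are hypothetical; in a real DBMS, users would be granted SELECT only on the view, never on the base table (a restriction sqlite3 itself cannot demonstrate):

```python
import sqlite3

# Hypothetical salary data; the view exposes only a count and an average,
# never an individual salary value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salary (emp TEXT, amount INTEGER)")
conn.executemany("INSERT INTO salary VALUES (?, ?)",
                 [("ann", 50000), ("bob", 60000), ("cathy", 70000)])
conn.execute("""CREATE VIEW salary_summary AS
                SELECT COUNT(*) AS n, AVG(amount) AS avg_amount FROM salary""")

# A user restricted to the view sees only the summary row.
summary = conn.execute("SELECT n, avg_amount FROM salary_summary").fetchone()
```

As noted below, such group functions can still be vulnerable to statistical inference attacks, so this design shields individual elements only from straightforward retrieval.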
Depending on the nature of the aggregate, such data may not be considered more sensitive than the individual elements, while the individual elements seen together may be. For example, a view may allow a user to ask how many of a particular element exist, or the average element value, and this might be acceptable to reveal as long as the values of all the individual elements are not. Another view may show individual element values but restrict the user to only a certain number of elements in any given query. Both techniques may help protect the aggregate. However, it is typically difficult to be sure that views provide sufficient protection, depending on the number of views available, their complexity, and the size and volatility of the database. Normal views do not keep track of what each user has accessed over time, so a user could assemble an aggregate by issuing numerous queries which each access only a few elements at a time. Group functions which attempt to hide the underlying data using sums, averages, counts, etc. may be subject to the statistical inference attacks described in Section 4. Views should be able to protect data from the casual viewer, but a persistent attacker may be able to circumvent the restrictions they impose. 5.2.1.1.3 Restricting Ad Hoc Queries One desirable feature of a relational DBMS is that it gives users the ability to formulate ad hoc queries against any database object. However, if all users are allowed unrestricted ad hoc queries, the SSO may not be able to pre-identify the likely aggregation results and therefore may not be able to define controls appropriately. The decision to allow ad hoc queries is a tradeoff between convenience and security.
If ad hoc queries are restricted, then the administrator can rely on protecting access to specific views as just described, using the access controls they provide to limit each user's access to only the data needed to perform their assigned tasks. It may not be necessary to restrict ad hoc queries from all users. If certain users are cleared for all the aggregates contained on the system anyway, it may be perfectly reasonable to allow them to execute ad hoc queries; in this way, the capability is not arbitrarily restricted from all users. In fact, most systems would not restrict this capability from every user, because at least one user (the database administrator) typically needs to be able to define new views and issue arbitrary queries. 5.2.1.1.4 Use of Audit Another approach relies on the use of auditing tools to track the accesses of each user to elements of an aggregate. Assuming sufficient data has been audited, an administrator may be able to review the data and detect whether a user is trying to collect an abnormally large number of values of a particular data element (e.g., too many employee phone numbers). Ideally, this audit analysis would be done by automated means, but few automated audit analysis (i.e., intrusion detection) tools currently exist. More typically, these analyses are done manually, long after the event, by an SSO on a set schedule. The audit trail analyses look for outputs that contain aggregates more sensitive than their component data elements. Based on these results, the administrator can then either restructure the database or apply other controls that would prevent the same problem in the future.
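The audit-review idea can be sketched as a toy accumulation check. The monitored element (employee phone numbers) and the threshold are illustrative assumptions, and a real analysis would work from DBMS audit records rather than an in-memory table:

```python
from collections import defaultdict

# Hypothetical threshold: flag any user who has retrieved more than 50
# distinct employee phone numbers.
PHONE_THRESHOLD = 50

phone_accesses = defaultdict(set)      # user -> set of employee ids seen

def record_phone_access(user, employee_id):
    """Record (from the audit trail) that a user read one phone number."""
    phone_accesses[user].add(employee_id)

def flagged_users():
    """Users whose accumulated accesses suggest an aggregation attempt."""
    return sorted(u for u, ids in phone_accesses.items()
                  if len(ids) > PHONE_THRESHOLD)
```

Tracking distinct identifiers, rather than raw access counts, avoids flagging a user who repeatedly reads the same few records.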
This approach has a number of significant drawbacks: (1) it attempts to detect, rather than prevent, the assembly of an aggregate; (2) today's audit analysis tools are typically difficult and time consuming to use; (3) it is questionable whether an administrator would have the time or ability to detect a non-obvious aggregation attempt; and (4) a determined adversary can easily avoid detection through slow and careful gathering of data. 5.2.1.1.5 Store all Data System-High One of the simplest solutions to the aggregation problem, but one with significant drawbacks, is to store all the data in a single level (system-high) database which is classified at the same level as the most sensitive aggregate in the database. By definition, all users of the database must then be cleared for all the data, eliminating the possibility of exposing an aggregate to an individual who is not cleared to see it. In essence, this solution removes from the system all users who should not have access to all the data. This has the significant drawback that all the less sensitive data, which may be the bulk of the data, is overclassified, possibly highly so, and inaccessible to individuals who should otherwise be able to access it. Thus, although this addresses the security concerns related to the aggregation problem, it is not normally a practical alternative, depending on the nature of the data. For a particular dataset, if only a small portion of the data is normally classified below this system-high level, then this may be a practical solution. If this type of solution is used, the less sensitive data must be reviewed and regraded to a lower level before dissemination. This review can be automated for specifically formatted, tightly controlled data, but is typically performed manually. If done manually, the output is typically hardcopy rather than electronic.
Depending on the nature and volume of data, this review task can be extremely burdensome to perform, sufficiently so that this solution is impractical in many situations. 5.2.1.2 Multilevel DBMS Techniques There are additional techniques that address the aggregation problem by specifically labeling the sensitivity of data (e.g., UNCLASSIFIED, LOW, HIGH) and using this labeling capability, together with the ability to control access based on labels, to help control the cardinal aggregation problem. These techniques build upon those described in the previous section and are discussed below. 5.2.1.2.1 Use of Labeled Views Wilson proposed that views in a multilevel system be used to properly label aggregates with their sensitivity level [Wilson 88]. Views are predefined queries which are labeled. They allow a user to access certain portions of the database in a well-known, controlled manner. Views can be preconstructed to access portions of an aggregate. If a view allows access to a small portion of an aggregate, then the result that is returned can be labeled at the level of the data accessed. However, if certain views access large portions of an aggregate, then the result returned should be labeled at the level of the aggregate, which may be higher than the sensitivity of the individual elements accessed by the view. This can be accomplished by labeling the sensitivity of the view definition at the level of the aggregate. This will cause the results returned to be marked at least at the level of the aggregate and prevent users with insufficient clearance from accessing the view. The idea here is to use views to define aggregates and have them properly label aggregates whenever they are accessed. This technique relies on the fact that users cannot construct their own queries to get around the classification constraints provided by predefined views.
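The labeling rule just described can be sketched with the totally ordered labels used in this report's examples. The function names are hypothetical; an MLS DBMS would apply the equivalent rule internally:

```python
# Totally ordered label lattice used in the report's examples.
LEVELS = {"UNCLASSIFIED": 0, "LOW": 1, "HIGH": 2}

def lub(*labels):
    """Least upper bound of labels in this totally ordered lattice."""
    return max(labels, key=LEVELS.__getitem__)

def label_view_result(view_label, row_labels):
    # The result is marked at least at the level of the view definition, so a
    # view defined over a sensitive aggregate yields HIGH results even when
    # every row it touches is individually LOW or UNCLASSIFIED.
    return lub(view_label, *row_labels)
```

A view labeled HIGH thus forces its results to HIGH regardless of the rows accessed, while a view labeled UNCLASSIFIED still labels results at the least upper bound of the data it returns.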
If ad hoc queries are permitted, they should be labeled at system-high, independent of the user's session level, to prevent unauthorized access to aggregates. This means that ad hoc queries by lower cleared users will have to be reviewed and downgraded before being displayed to those users. Based on audit trail analyses, any ad hoc queries that show up frequently could be converted into standard queries with an appropriate label. Use of labeled views and query restriction appears to be a workable strategy in some applications, but has some significant drawbacks. If users rarely need to execute ad hoc queries, this solution may be reasonable. However, if ad hoc queries are normally necessary to accomplish the tasks at hand, then the review and downgrade process could quickly become onerous, depending on the nature and volume of the data being reviewed. This technique will not prevent a user from accumulating sufficient information on their own to assemble an aggregate large enough to be considered more sensitive than the individual elements. What it does do is automatically label aggregates at the appropriate level for those users who are cleared to see the aggregate. It also makes it more difficult for an uncleared user to do so, and auditing may be able to detect the accumulation of data in an aggregate that a user is not cleared for. This technique may also draw attention to the fact that a sensitive aggregate exists, which may be inappropriate for some operational environments. 5.2.1.2.2 Start with All Data System-High Another approach is to start with all data on the system stored at system-high and then downgrade individual elements as necessary to satisfy the access needs of less cleared users. This causes the data that is needed to migrate down to its actual level, into the hands of individual users, while the rest of the data remains at system high.
The security advantage here is that a review process is automatically invoked each time a user requires access to an element which is part of a sensitive aggregate, and that the data is given only to that particular user, rather than to all users at that level. This allows the reviewer to monitor the necessity for, and volume of, data granted to a particular user. This is similar to the paper world, where one can call an organization's operator and request an individual phone number but cannot gain direct access to the entire phone book. This allows the operator to verify that the caller is legitimate (e.g., the caller at least knows the employee's name and place of work). This approach can work in certain situations, as in the phone book example above, but clearly could be infeasible when the volume of requests to downgrade data is high. 5.2.1.2.3 Make Some Data High Similar to the technique of using discretionary access controls to separate data into chunks so that an aggregate of sufficient size cannot be pulled together, sensitivity labeling can also be used to separate data such that it cannot be aggregated beyond an allowable size. In this case, a sufficient portion of the aggregate is marked at a higher sensitivity level (e.g., the level of the aggregate) in order to prevent, or make more difficult, the creation of an aggregate which exceeds the allowed size at the lower level. If such an aggregate were pulled together, it would be marked at the higher level, since it would require the inclusion of some of the higher level data. This has the advantage that an aggregate of sufficient size would most likely be labeled at the appropriate level, but the disadvantage that some, but not all, of the data would be inaccessible to users at the appropriate level of the data. 5.2.2 Mechanisms The previous section described techniques that could be used to reduce the cardinal aggregation problem.
These techniques relied on database design techniques and mechanisms available in standard single and multilevel DBMSs. This section describes new mechanisms that have been proposed to address the cardinal aggregation problem which require some degree of modification to standard DBMS mechanisms. In proposed or prototype database implementations using MLS DBMSs, data aggregation control problems that have arisen have been handled by application-specific ad hoc controls [Doncaster 90]. In addition, other control strategies, which are still ad hoc but are less application-specific, have been proposed. These are described below. 5.2.2.1 Aggregation Constraints Section 4.3.3 described the Lock Data Views (LDV) approach for dealing with the inference aspects of classification consistency problems. The following describes the LDV approach for dealing with aggregation through the use of aggregation constraints. Aggregation constraints are defined using extended Structured Query Language (SQL) [ANSI 92, 94], and are enforced after the query results have been built. Aggregation constraints are similar to classification constraints, except the assigned security levels are applied only to the entire aggregate, and not to the individual elements [Denning 86a]. The module responsible for this examination may reclassify the query result based upon the constraints [Stachour 90]. The aggregation levels are then used to restrict how much data can be released to a given user. The constraints are enforced by building a rule-based classification scheme into the pipelines. In addition, a history-recording mechanism is consulted and updated whenever a query response is prepared by the DBMS [Haigh 90]. 5.2.2.2 Aggregation Detection The mechanisms just described attempt to control aggregates that are accessed by a single query. 
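As a rough illustration (a toy sketch, not the actual LDV design), a cardinality-based aggregation constraint applied after the response is built might look like the following; the function name, levels, and threshold parameter are all hypothetical:

```python
# Toy sketch of a single-query aggregation constraint: after the query result
# is built, it is reclassified at the aggregate's level if it draws more than
# an allowed number of elements from a defined aggregate.
def classify_response(rows, element_level, aggregate_level, max_elements):
    """Label a query response, upgrading it when the constraint is exceeded."""
    if len(rows) > max_elements:
        return rows, aggregate_level   # the response constitutes the aggregate
    return rows, element_level
```

The label attached to the response would then determine whether it can be released to the requesting user at that user's session level.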
In order to completely protect an aggregate, a solution ultimately needs to track which data elements have been accessed by a particular user and then grant or deny access to additional data based on all previous data that user has seen. In this way, the system can ensure that the user cannot collect sufficient information to compile an aggregate which is more sensitive than the user's clearance. However, the difficulty of tracking all the data that any particular user has seen has long seemed insurmountable. Motro, Marks, and Jajodia have proposed an accounting-based approach for tracking all information that a user has accessed previously [Motro 94]. This information is then used to make access decisions which ensure that the user cannot assemble an aggregate using multiple queries. This method uses views to define secrets that can be disclosed to a limited degree, and associates a threshold value with each such view; these views are called sensitive concepts. If any view incorporates a sensitive concept, that broader view is protected as well. Access to sensitive concepts is then accounted for on a per-user basis. In fact, to avoid overcounting due to multiple accesses to the same data, the scheme keeps track of the actual tuples that have been disclosed, rather than just the number of tuples disclosed from a particular sensitive concept. In this way, the method can determine exactly what a particular user has seen before, in order to decide whether a subsequent query is allowed or should be denied because a threshold would be exceeded. The method also determines whether other related data is equivalent to the data protected by a particular sensitive concept and controls access to this related data as well. For example, if in a particular database all the employees in Division A are located in Building 1, and the set of employees in Building 1 is a sensitive concept, then the set of employees in Division A would have to be protected as well.
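The accounting idea in [Motro 94] can be sketched as follows. The interface is hypothetical (the actual method operates on projection-selection views); the key point illustrated is that the scheme remembers the actual tuples disclosed, so re-retrieving the same tuple does not consume more of the threshold:

```python
# Per user and per sensitive concept, remember the actual tuples already
# disclosed (not just a count), so repeated disclosure is not double-counted.
disclosed = {}   # (user, concept) -> set of tuple identifiers already seen

def request_tuples(user, concept, tuple_ids, threshold):
    """Permit the query only if cumulative distinct disclosure stays in bounds."""
    seen = disclosed.setdefault((user, concept), set())
    if len(seen | set(tuple_ids)) > threshold:
        return False                   # threshold would be exceeded: deny
    seen.update(tuple_ids)
    return True
```

Because access decisions depend on the cumulative set of distinct tuples, a user cannot assemble a protected aggregate by spreading small queries over time.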
The method described is limited to databases that are single relations and to queries and sensitive concepts that are projection-selection views, where all selections are conjunctions of simple clauses of the form attribute = value. The authors conjecture that this method can be extended to overcome these limitations. SECTION 6 SUMMARY Finding effective approaches for dealing with inference and aggregation continues to be a significant challenge for the security community. This report provides a survey of the state of the art of various aspects of the inference and aggregation problems. It defines and characterizes inference and aggregation as they relate to multilevel secure database systems and describes methods that have been developed for resolving some of these problems. Inference addresses users deducing (or inferring) higher level information based upon lower, visible data [Morgenstern 88]. The aggregation problem occurs when classifying and protecting collections of data that have a higher security level than any of the elements that comprise the aggregate [Denning 87]. Inference and aggregation problems are not new, nor are they only applicable to database management systems (DBMSs). However, the problems are exacerbated in multilevel DBMSs that label and enforce a mandatory access control policy on DBMS objects [Denning 86a]. Inference and aggregation differ from other security problems since the leakage of information is not a result of the penetration of a security mechanism but rather it is based on the very nature of the information and the semantics of the application being automated [Morgenstern 88]. As a result, inference and aggregation are not amenable to complete solution through mechanisms in the DBMS. They inevitably have to be addressed in the context of the semantics of the data. Inference and aggregation are different but related problems. Each of them can in turn take many different forms. 
Researchers have addressed different aspects of the problems, making it difficult to relate different research efforts. In order to more precisely understand what inference and aggregation are, and to provide a framework for understanding partial solutions, it is essential to revisit several first principles related to security concepts and processes. These concepts, described in Section 2, include: · Identifying methods and types of information that can be used to make inferences. · Characterizing two classes of aggregation problems: data association and cardinal aggregation. · Distinguishing between direct disclosure and inference. · Understanding the relevance of the traditional security analysis paradigm. · Identifying the steps in a database security engineering process. The framework presented for reasoning about inference and aggregation identified types of classification rules and vulnerabilities at each step of the database security engineering process. This approach allows one to: · Establish the relationship among information protection requirements, implementing classification rules, database design, and database instantiation. · Distinguish between the types of classification rules (entity, attribute, data association, and cardinal aggregation). · Distinguish between direct disclosure and inference vulnerabilities. · Identify specific vulnerabilities that are introduced at each step in the database security engineering process. · Differentiate between inference as a vulnerability and cardinal aggregation as both a class of policies and as vulnerabilities. Utilizing this framework, Sections 4 and 5 describe research efforts and other approaches that provide techniques, mechanisms, and analysis tools to address aspects of inference and aggregation. Eliminating or reducing unauthorized inference remains a difficult problem because of the diversity of methods and types of information that can be used to make inferences.
The inference problem remains one that is dependent on understanding the value and semantics of data for the part of the enterprise being automated. The techniques, mechanisms, and tools described in Section 4 provide support for identifying inference vulnerabilities during database design. The data association class of aggregation policies can be solved today through the database design techniques described in Section 5. Effective mechanisms to enforce cardinal aggregation policies remain a topic for research. REFERENCES [Akl 87] S. G. Akl and D. E. Denning. Checking Classification Constraints for Consistency and Completeness. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 196-201, April 1987. [ANSI 92] ANSI. Database Language SQL2. In ANSI X3.135-1992, American National Standards Institute, New York, December 1992. [ANSI 94] ANSI. SQL3 - Working Draft. Draft Version, Available From X3 Secretariat, Computer and Business Equipment Manufacturers Association (CBEMA), 1250 Eye St. N.W., Suite 200, Washington, D.C. 20005-3992, September 1994. [Audit 96] National Computer Security Center, Auditing Issues in Secure Database Management Systems, NCSC Technical Report-005, Volume 4/5, May 1996. [Bell 76] D. E. Bell and L. J. LaPadula. Secure Computer Systems: Unified Exposition and Multics Interpretation, The MITRE Corporation, March 1976. [Binns 91] L. J. Binns. Inference Through Polyinstantiation. In Proceedings of the Fourth Rome Laboratory Database Security Workshop, 1991. [Binns 92a] L. J. Binns. Inference through Secondary Path Analysis. In Proceedings of the Sixth IFIP 11.3 WG on Database Security, August 1992. [Binns 92b] L. J. Binns. Inference and Cover Stories. In Proceedings of the Sixth IFIP 11.3 WG on Database Security, August 1992. [Binns 94a] L. J. Binns. Inference: A Research Agenda. In L. Notargiacomo, editor, Research Directions in Database Security V, June 1994. [Binns 94b] L. J. Binns.
Implementation Considerations for Inference Detection: Intended vs. Actual Classification. In IFIP Transactions A (Computer Science and Technology), A-47:297-306, 1994. [Brewer 89] D. F. Brewer and M. J. Nash. The Chinese Wall Security Policy. In Proceedings of the 1989 IEEE Symposium on Security and Privacy, May 1989. [Buczkowski 89a] L. J. Buczkowski and E. L. Perry. Database Inference Controller. Interim Technical Report, Ford Aerospace, February 1989. [Buczkowski 89b] L. J. Buczkowski and E. L. Perry. Database Inference Controller Draft Top-Level Design. Ford Aerospace, July 1989. [Buczkowski 90] L. J. Buczkowski. Database Inference Controller. In Database Security, III: Status and Prospects, ed. D. L. Spooner and C. Landwehr, pp. 311-322, North-Holland, Amsterdam, 1990. [Burns 92] R. K. Burns. A Conceptual Model for Multilevel Database Design. In Proceedings of the 5th Rome Laboratory Database Security Workshop, 1992. [Codd 70] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. In Communications of the ACM, June 1970, pp. 377-387. [Collins 91] M. Collins, W. Ford, J. O'Keefe, and M. B. Thuraisingham. The Inference Problem in Multilevel Secure Database Management Systems. In Proceedings of the 3rd Rome Laboratory Database Security Workshop, 1991. [Cordingley 89] S. Cordingley and B. J. Sowerbutts. The Fact Database - A Formal Model of the Inference Problem for Multi-Level Secure Databases. Plessey Report No: 72/89/R420/U, November 1989. [Cordingley 90] S. Cordingley and B. J. Sowerbutts. A Formal Model of Dynamic Inference for Multi-Level Secure Databases. Plessey Report No: 72/90/R078/U, March 1990. [Cox 86] L. S. Cox, S. McDonald, and D. Nelson. Confidentiality Issues at the United States Bureau of the Census. In Journal of Official Statistics, Vol. 2, No. 2, pp. 135-160, 1986. [Cox 87] L. S. Cox. Practices of the Bureau of the Census with the Disclosure of Anonymized Microdata. In Forum der Bundesstatistik, pp. 26-42, 1987. [Cox 88] L. S.
Cox. Modeling and Controlling User Interface. In Database Security: Status and Prospects, ed. Carl E. Landwehr, North-Holland, Amsterdam, pp. 167-171, 1988. [Cuppens 91] F. Cuppens. A modal logic framework to solve aggregation problems. In Proceedings of the 5th IFIP 11.3 WG on Database Security, September 1991. [DAC 96] National Computer Security Center, Discretionary Access Control Issues in High Assurance Secure Database Management Systems, NCSC Technical Report-005, Volume 5/5, May 1996. [Date 86] C. J. Date. An Introduction to Database Systems, Vol. I, 4th Edition, Addison-Wesley, Reading, MA, 1986. [Delugach 92] H. S. Delugach and T. Hinke. Aerie: Database inference modeling and detection using conceptual graphs. In Proceedings of the 7th Annual Workshop on Conceptual Graphs, New Mexico State University, July 1992. [Delugach 93] H. S. Delugach, T. Hinke, and A. Chandrasekhar. Applying conceptual graphs for inference detection using second path analysis. In Proceedings of ICCS 1993, Intl. Conf. on Conceptual Structures, pages 188-7, Laval University, Quebec City, Canada, August 4-7, 1993. [Denning 76] D. E. Denning. A Lattice Model of Secure Information Flow. Communications of the ACM, Vol. 19, No. 5, May 1976, pp. 236-243. [Denning 78] D. E. Denning, P. J. Denning, and M. D. Schwartz. The Tracker: A Threat to Statistical Database Security. In ACM Transactions on Database Systems, Vol. 4, No. 1, pp. 7-18, March 1978. [Denning 82] D. E. Denning. Cryptography and Data Security, Addison-Wesley, Reading, MA, 1982. [Denning 85] D. E. Denning. Commutative Filters for Reducing Inference Threats in Multilevel Database Systems. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 134-146, April 1985. [Denning 86a] D. E. Denning. A Preliminary Note on the Inference Problem in Multilevel Database Management Systems. In Proceedings of the National Computer Security Center Invitational Workshop on Database Security, June 1986. [Denning 86b] D. E. Denning and M. Morgenstern.
Military Database Technology Study: AI Techniques for Security and Reliability, SRI Technical Report, August 1986. [Denning 87] D. E. Denning, S. G. Akl, M. Heckman, T. F. Lunt, M. Morgenstern, P. G. Neumann, and R. R. Schell. Views for Multilevel Database Security. In IEEE Transactions on Software Engineering, Vol. 13, No. 2, pp. 129-140, February 1987. [DoD 85] Department of Defense Trusted Computer System Evaluation Criteria, DoD 5200.28-STD, Washington, DC, December 1985. [Doncaster 90] S. Doncaster, M. Endsley, and G. Factor. Rehosting Existing Command and Control Systems into an MLS Environment. In Proceedings of the 6th Annual Computer Security Applications Conference, 1990. [Elmasri 94] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems, Second Edition. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994. [Entity 96] National Computer Security Center, Entity and Referential Integrity Issues in Multilevel Secure Database Management Systems, NCSC Technical Report-005, Volume 2/5, May 1996. [5100.1-R 86] Department of Defense, Information Security Program Regulation 5100.1-R, June 1986. [Ford 90] W. R. Ford, J. O'Keefe, and M. B. Thuraisingham. Database Inference Controller: An Overview. Technical Report MTR 10963, Vol. 1, The MITRE Corporation, August 1990. [Garvey 91a] T. D. Garvey, T. F. Lunt, and M. E. Stickel. Abductive and approximate reasoning models for characterizing inference channels. In Proceedings of the Fourth Workshop on the Foundations of Computer Security, June 1991. [Garvey 91b] T. D. Garvey, T. F. Lunt, and M. E. Stickel. Characterizing and Reasoning about Inference Channels. In Proceedings of the Fourth Rome Laboratory Database Security Workshop, 1991. [Garvey 91c] T. D. Garvey and T. F. Lunt. Cover Stories for Database Security. In Proceedings of the 5th IFIP 11.3 WG on Database Security, November 1991. [Garvey 93] T. D. Garvey, T. F. Lunt, X. Qian, and M. E. Stickel.
Toward a tool to detect and eliminate inference problems in the design of multilevel databases. In M. B. Thuraisingham and C. E. Landwehr, editors, Database Security, VI: Status and Prospects, pages 149-167, North-Holland, 1993. [Garvey 94] T. D. Garvey, T. F. Lunt, X. Qian, and M. E. Stickel. Inference Channel Detection and Elimination in Knowledge-Based Systems. Final Report ECU 2528, SRI International, October 1994. [Goguen 84] J. A. Goguen and J. Meseguer. Unwinding and Inference Control. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 75-86, April 1984. [Haigh 90] J. T. Haigh, R. C. O'Brien, P. D. Stachour, and D. L. Toups. The LDV Approach to Security. In Database Security, III: Status and Prospects, ed. D. L. Spooner and C. Landwehr, North-Holland, Amsterdam, pp. 323-339, 1990. [Haigh 91] J. T. Haigh, R. C. O'Brien, and D. J. Thomsen. The LDV secure relational DBMS model. In S. Jajodia and C. Landwehr, editors, Database Security, IV: Status and Prospects, pages 265-279, North-Holland, 1991. [Hale 94] J. Hale, J. Threet, and S. Shenoi. A Practical Formalism for Imprecise Inference Control. In Proceedings of the 8th IFIP 11.3 WG on Database Security, August 1994. [Hinke 88] T. H. Hinke. Inference Aggregation Detection in Database Management Systems. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pp. 96-106, April 1988. [Hinke 90] T. H. Hinke. Database Inference Engine Design Approach. In Database Security, II: Status and Prospects, Proceedings of the IFIP 11.3 WG on Database Security, Kingston, Ontario, Canada, October 1988, C. E. Landwehr, ed., North-Holland, 1990. [Hinke 92] T. H. Hinke and H. Delugach. AERIE: Database inference modeling and detection for databases. In Proceedings of the 6th IFIP 11.3 WG on Database Security, August 1992. [Hinke 93a] T. H. Hinke, H. Delugach, and A. Chandrasekhar. Layered Knowledge Chunks for Database Inference. 
In Proceedings of the 7th IFIP 11.3 WG on Database Security, Lake Guntersville State Park Lodge, Alabama, September 1993. [Hinke 93b] T. H. Hinke and H. Delugach. AERIE: Database inference modeling and detection for databases. In M. B. Thuraisingham and C. E. Landwehr, editors, Database Security, VI: Status and Prospects: Proceedings of the Sixth IFIP 11.3 WG on Database Security, Vancouver, Canada, August 1992, Elsevier Science Publishers, 1993. [Hinke 94] T. H. Hinke and H. Delugach. Inference Detection in Databases. In Proceedings of the Sixth Rome Laboratory Database Security Workshop, June 1994. [Jajodia 92] S. Jajodia. Aggregation and Inference Problems in Multilevel Secure Systems. In Proceedings of the 5th Rome Laboratory Database Security Workshop, 1992. [Keefe 89] T. F. Keefe, M. B. Thuraisingham, and W. T. Tsai. Secure Query Processing Strategies. In IEEE Computer, Vol. 22, No. 3, pages 63-70, March 1989. [Lin 89a] T. Y. Lin. Commutative Security Algebra and Aggregation. In Proceedings of the Second RADC Database Security Workshop, 1989. [Lin 89b] T. Y. Lin, L. Kerschberg, and R. Trueblood. Security Algebra and Formal Models. In Proceedings of the 3rd IFIP 11.3 WG on Database Security, September 1989. [Lin 90a] T. Y. Lin. Database, Aggregation, and Security Algebra. In Proceedings of the 4th IFIP 11.3 WG on Database Security, September 1990. [Lin 90b] T. Y. Lin. Messages and Noncommutative Aggregation. In Proceedings of the 3rd RADC Database Security Workshop, 1990. [Lin 91] T. Y. Lin. Multilevel Database and Aggregated Security Algebra. In Database Security, IV: Status and Prospects, ed. S. Jajodia and C. E. Landwehr, North-Holland, Amsterdam, pp. 325-350, 1991. [Lin 92] T. Y. Lin. Inference Secure Multilevel Databases. In Proceedings of the Sixth IFIP 11.3 WG on Database Security, August 1992. [Lunt 88] T. F. Lunt, R. R. Schell, W. R. Shockley, M. Heckman, and D. Warren. Toward a multilevel relational data language. 
In Proceedings of the Fourth Aerospace Computer Security Applications Conference, December 1988. [Lunt 89] T. F. Lunt. Aggregation and Inference: Facts and Fallacies. In Proceedings of the IEEE Symposium on Security and Privacy, May 1989. [Lunt 94] T. F. Lunt. The Inference Problem: A Practical Solution. In Proceedings of the 17th NCSC Conference, pages 507-509, Baltimore, MD, October 1994. [Marks 94a] D. Marks, L. Binns, and M. B. Thuraisingham. Hypersemantic Data Modeling for Inference Analysis. In Proceedings of the 8th IFIP 11.3 WG on Database Security, August 1994. [Marks 94b] D. Marks. An Inference Paradigm. In Proceedings of the 17th NCSC Conference, pages 497-506, Baltimore, MD, October 1994. [Meadows 88a] C. Meadows and S. Jajodia. Integrity Versus Security in Multi-Level Secure Databases. In Database Security: Status and Prospects, ed. Carl E. Landwehr, North-Holland, Amsterdam, pp. 89-101, 1988. [Meadows 88b] C. Meadows and S. Jajodia. Maintaining Correctness, Availability, and Unambiguity in Trusted Database Management Systems. In Proceedings of the 4th Aerospace Computer Security Applications Conference, pp. 106-110, December 1988. [Meadows 91] C. Meadows. Aggregation Problems: A Position Paper. In Proceedings of the Third RADC Database Security Workshop, May 1991. [Millen 91] J. Millen. Honesty is the Best Policy. In Proceedings of the 3rd Rome Laboratory Database Security Workshop, 1991. [Morgenstern 87] M. Morgenstern. Security and Inference in Multilevel Database and Knowledge-Base Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 357-373, May 1987. [Morgenstern 88] M. Morgenstern. Controlling Logical Inference in Multilevel Database Systems. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 245-255, April 1988. [Motro 94] A. Motro, D. G. Marks, and S. Jajodia. Aggregation in Relational Databases: Controlled Disclosure of Sensitive Information. 
In Proceedings of the 3rd European Symposium on Research in Computer Security (ESORICS), pp. 431-445, Brighton, UK, November 1994. [National Academy Press 83] Multilevel Data Management Security, National Academy Press, Washington, DC, 1983 (FOR OFFICIAL USE ONLY). [Pernul 93] G. Pernul, W. Winiwarter, and A. M. Tjoa. The Deductive Filter Approach to MLS Database Prototyping. In Proceedings of the Tenth Annual Computer Security Applications Conference, December 1993. [Poly 96] National Computer Security Center, Polyinstantiation Issues in Multilevel Secure Database Management Systems, NCSC Technical Report-005, Volume 3/5, May 1996. [Qian 92] X. Qian. Integrity, secrecy, and inference channels. In Proceedings of the Fifth Rome Laboratory Database Security Workshop, October 1992. [Qian 93a] X. Qian. A model-theoretic semantics of the multilevel secure relational model. Technical Report SRI-CSL-93-06, Computer Science Laboratory, SRI International, November 1993. [Qian 93b] X. Qian, M. E. Stickel, P. D. Karp, T. F. Lunt, and T. D. Garvey. Detection and elimination of inference channels in multilevel relational database systems. In Proceedings of the 1993 IEEE Symposium on Security and Privacy, pages 196-205, Oakland, California, May 1993. [Qian 94] X. Qian. Inference channel-free integrity constraints in multilevel relational databases. In Proceedings of the 1994 IEEE Symposium on Security and Privacy, pages 158-167, Oakland, California, May 1994. [Schum 87] D. A. Schum. Evidence and Inference for the Intelligence Analyst, University Press of America, Lanham, MD, 1987. [Smith 90a] G. Smith. A Taxonomy of Security-Relevant Knowledge. In Proceedings of the 13th National Computer Security Conference, Washington, D.C., October 1990. [Smith 90b] G. Smith. Modeling Security-Relevant Data Semantics. In Proceedings of the 1990 IEEE Symposium on Security and Privacy, Oakland, California, May 1990. [Smith 91] G. Smith. Thoughts on the Aggregation Problem. 
In Proceedings of the 3rd RADC Database Security Workshop, May 1991. [Sowa 84] J. Sowa. Conceptual Structures: Information Processing in Minds and Machines, Addison-Wesley, Reading, MA, 1984. [Sowerbutts 90] B. Sowerbutts and S. Cordingley. Data Base Architectonics and Inferential Security. In Proceedings of the 4th IFIP 11.3 WG on Database Security, 1990. [Stachour 90] P. Stachour and B. Thuraisingham. Design of LDV: A Multilevel Secure Relational Database Management System. In IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 2, pp. 190-209, June 1990. [Stickel 94] M. E. Stickel. Elimination of Inference Channels by Optimal Upgrading. In Proceedings of the 1994 IEEE Symposium on Security and Privacy, pages 168-174, Oakland, California, May 1994. [Su 86] T. Su. Inferences in Database. Ph.D. Dissertation, Department of Computer Engineering and Science, Case Western Reserve University, August 1986. [Su 87] T. Su and G. Ozsoyoglu. Data Dependencies and Inference Control in Multilevel Relational Database Systems. In Proceedings of the IEEE Symposium on Security and Privacy, pp. 202-211, April 1987. [Su 90] T. Su and G. Ozsoyoglu. Multivalued Dependency Inferences in Multilevel Relational Database Systems. In Database Security, III: Status and Prospects, eds. D. L. Spooner and C. Landwehr, pp. 293-300, North-Holland, Amsterdam, 1990. [TDI 91] Trusted Database Management System Interpretation of the Trusted Computer System Evaluation Criteria, NCSC-TG-021, National Computer Security Center, April 1991. [TNI 87] Trusted Network Interpretation of the Trusted Computer System Evaluation Criteria, NCSC-TG-005, National Computer Security Center, July 1987. [Thuraisingham 87] M. B. Thuraisingham. Security Checking in Relational Database Management Systems Augmented with Inference Engines. In Computers & Security, Vol. 6, pp. 479-492, 1987. [Thuraisingham 88] M. B. Thuraisingham, W. Tsai, and T. Keefe. Secure Query Processing Using AI Techniques. 
In Proceedings of the Hawaii International Conference on Systems Sciences, January 1988. [Thuraisingham 89a] M. B. Thuraisingham. Security checking with Prolog Extensions. In Proceedings of the 2nd RADC Database Security Workshop, Franconia, NH, May 1989. [Thuraisingham 89b] M. B. Thuraisingham. Secure Query Processing in Intelligent Database Management Systems. In Proceedings of the 5th Computer Security Applications Conference, Tucson, AZ, December 1989. [Thuraisingham 90a] M. B. Thuraisingham. Towards the Design of a Secure Data/Knowledge Base Management System. In Data and Knowledge Engineering Journal, Vol. 5, No. 1, March 1990. [Thuraisingham 90b] M. B. Thuraisingham. The Inference Problem in Database Security. In IEEE CIPHER, 1990. [Thuraisingham 90c] M. B. Thuraisingham. Recursion Theoretic Properties of the Inference Problem in Database Security. MTP-291, The MITRE Corporation, Bedford, MA, May 1990. [Thuraisingham 90d] M. B. Thuraisingham. Novel Approaches to Handling the Inference Problem in Database Security. In Proceedings of the 3rd RADC Database Security Workshop, Castile, NY, June 1990. [Thuraisingham 90e] M. B. Thuraisingham. A Nonmonotonic Typed Multilevel Logic for Multilevel Data/Knowledge Base Management Systems. Technical Report MTR 10935, The MITRE Corporation, June 1990. [Thuraisingham 90f] M. B. Thuraisingham. The use of conceptual structures for handling the inference problem. Technical Report M90-55, The MITRE Corporation, Bedford, MA, August 1990. [Thuraisingham 91] M. B. Thuraisingham. The use of conceptual structures for handling the inference problem. In Proceedings of the 5th IFIP 11.3 WG on Database Security, 1991. [Thuraisingham 92] M. B. Thuraisingham. Knowledge-Based Inference Control in a Multilevel Secure Database Management System. In Proceedings of the 15th National Computer Security Conference, October 1992. [Thuraisingham 94] M. B. Thuraisingham, M. Collins, and H. Rubinovitz. Database Inference Control. 
In Proceedings of the 6th Rome Laboratory Database Security Workshop, 1994. [Ullman 88] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I, Computer Science Press, Rockville, MD, 1988. [Urban 90] S. Urban and L. Delcambre. Constraint Analysis for Specifying Perspectives of Class Objects. In Proceedings of the 5th IEEE International Conference on Data Engineering, Los Angeles, CA, December 1990. [Wilson 88] J. Wilson. Views as the Security Objects in a Multilevel Secure Database Management System. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, April 1988.

[Figure text extracted at end of file: the database security engineering process (protection requirements, threats, risk analysis, vulnerabilities, controls); the example company database schema (Employees, Research Projects, Clients, Education, Critical Technologies, with attributes and classification labels); the inference and aggregation framework (information protection requirements, classification rules, database design, database instantiation, and the vulnerabilities introduced at each stage); example multilevel relations with tuple and metadata classifications; and the Database Inference Controller architecture (user interface, inference engine, knowledge base, query translator, DBMS). The one recoverable formula is the inference measure

    INFER(x -> y) = (H(y) - H_x(y)) / H(y)

where H(y) is the uncertainty (entropy) of y and H_x(y) is the uncertainty of y given knowledge of x.]

Keith F. Brewster
Acting Chief, Partnerships and Processes
May 1996
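The INFER measure above quantifies how much knowing x reduces uncertainty about y, ranging from 0 (x reveals nothing) to 1 (x determines y completely). As a minimal sketch, assuming empirical entropies estimated from observed (x, y) pairs (the function names are illustrative, not from this report):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(Y) of a sequence of observed values, in bits."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def conditional_entropy(xs, ys):
    """H_x(y): uncertainty remaining about y once x is known,
    computed as the weighted entropy of y within each x-group."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def infer(xs, ys):
    """INFER(x -> y) = (H(y) - H_x(y)) / H(y)."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0  # y carries no uncertainty, so nothing can be inferred
    return (h_y - conditional_entropy(xs, ys)) / h_y

# When y is a function of x, the measure is 1; when independent, 0.
print(infer(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # -> 1.0
print(infer(['a', 'b', 'a', 'b'], [0, 0, 1, 1]))  # -> 0.0
```

A rule such as "if INFER(x -> y) exceeds a threshold, classify x at the level of y" is one way such a measure feeds into inference control.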