DUNIA TEKNOLOGI MAKLUMAT: DATA INTEGRITY ISSUES

Introduction

Data integrity issues are common in relational databases, especially in operational (OLTP) systems. These issues are typically fixed by ETL (Extraction, Transformation and Load) jobs that load the data into a data warehouse. However it is not uncommon to have some integrity issues even in data warehouses.
SQL Server 2005 Analysis Services supports cubes built directly from operational data stores, and it offers some sophisticated controls to manage the data integrity issues inherent in such systems. Database administrators can greatly simplify their cube management tasks by exploiting these controls.

Types of Data Integrity Issues

In this section, we will identify some of the common data integrity issues. We will use the following relational schema for our discussions:

The sales fact table has a foreign key product_id that points to the primary key product_id in the product dimension table.
The product dimension table has a foreign key product_class_id that points to the primary key product_class_id in the product_class dimension table.
Figure 1. Relational schema

Referential Integrity

Referential integrity (RI) issues are the most common of data integrity issues in relational databases. An RI error is essentially a violation of a foreign key–primary key constraint. For example:

The sales fact table has a record with a product_id that does not exist in the product dimension table.
The product dimension table has a product_class_id that does not exist in the product_class dimension table.

NULL Values

Although NULL values are common and even valid in relational databases, they need special treatment in Analysis Services. For example:

The sales fact table has a record with NULL values in store_sales, store_cost and unit_sales. These could be interpreted as a transaction with zero sales, or as if the transaction did not exist. MDX query results (NON EMPTY) would differ depending on the interpretation.
The sales fact table has a record with a NULL value in product_id. Although this is not an RI error in the relational database, it is a data integrity issue that Analysis Services needs to handle.
The product table has a record with a NULL value in product_name. Since this column is providing member keys or names in Analysis Services, the NULL value could be preserved, converted to an empty string, etc.

Data Integrity Controls
In this section, we discuss the various controls that Analysis Services offers to database administrators for dealing with data integrity issues. Note that these controls are not mutually independent. For example, Null Processing is dependent on Unknown Member and Error Configuration is dependent on Null Processing and Unknown Member.

Unknown Member
The Dimension object has a property called UnknownMember that takes three possible values—None, Hidden, Visible. When UnknownMember=Hidden/Visible, the Analysis Server automatically creates a special member called the Unknown Member in every attribute of the dimension. UnknownMember=Hidden indicates that the unknown member will be hidden from query results and schema rowsets. The default value of UnknownMember is None.
The UnknownMemberName property can be used to specify a meaningful name for the unknown member. The UnknownMemberTranslations property can be used to specify localized captions for the unknown member.
Figure 2 shows the Product dimension with UnknownMember=Visible and UnknownMemberName="Invalid Product".

Figure 2. Product dimension
Null Processing
The DataItem object is used in the Analysis Services DDL to specify metadata about any scalar data item. This includes:

Key column(s) of an attribute

Name column of an attribute

Source column of a measure

The DataItem object contains many properties including the following:

DataType

DataSize

NullProcessing

Collation

The NullProcessing property specifies what action the server should take when it encounters a NULL value. It can take five possible values:

ZeroOrBlank—This tells the server to convert the NULL value to a zero (for numeric data items) or a blank string (for string data items). This is how Analysis Services 2000 handles NULL values.

Preserve—This tells the server to preserve the NULL value. The server has the ability to store NULL just like any other value.

Error—This tells the server that a NULL value is illegal in this data item. The server will generate a data integrity error and discard the record.

UnknownMember—This tells the server to interpret the NULL value as the unknown member. The server will also generate a data integrity error. This option is applicable only for attribute key columns.

Default—This is a conditional default. It implies ZeroOrBlank for dimensions and cubes, and UnknownMember for mining structures and models.

Note that the NullProcessing options Error and UnknownMember generate data integrity errors, but the others do not.
The following picture shows the DataItem editor for the key columns of a dimension attribute.

Figure 3. DataItem Collection Editor
Error Nomenclature
Before we discuss the Error Configuration control, we need to clearly define the different types of data integrity errors that the server can encounter. We have already learned about two of them in the previous section on Null Processing. Following is the complete list:

NullKeyNotAllowed—This error is generated when an illegal NULL value is encountered and the record is discarded (when NullProcessing = Error).

NullKeyConvertedToUnknown—This error is generated when a NULL key value is interpreted as the unknown member (when NullProcessing = UnknownMember).

KeyDuplicate—This error is generated only during dimension processing when an attribute key is encountered more than once. Since attribute keys must be unique, the server will discard the duplicate records. In most cases, it is acceptable to have this error. But sometimes it indicates a flaw in the dimension design, leading to inconsistent relationships between attributes.

KeyNotFound—This is the classic referential integrity error in relational databases. It can be encountered during partition as well as dimension processing.

Error Configuration
The ErrorConfiguration object is central to the management of data integrity errors. The server comes with a default error configuration (in the msmdsrv.ini config file). The error configuration can also be specified on the database, dimension, cube, measure group and partition. In addition, the error configuration can also be overridden on the Batch andProcess commands.
The ErrorConfiguration object specifies how the server should handle the four types of data integrity errors. It has the following properties:

KeyErrorLogFile—This is the file to which the server will log the data integrity errors.

KeyErrorLimit (Default=zero)—This is the maximum number of data integrity errors that the server will allow before failing the processing. A value of -1 indicates that there is no limit.

KeyErrorLimitAction (Default=StopProcessing)—This is the action that the server will take when the key error limit is reached. It has two options:
StopProcessing—tells the server to fail the processing.

StopLogging—tells the server to continue processing but stop logging further errors.

KeyErrorAction (Default=ConvertToUnknown)—This is the action that the server should take when a KeyNotFound error occurs. It has two options:
ConvertToUnknown—tells the server to interpret the offending key value as the unknown member.

DiscardRecord—tells the server to discard the record. This is how Analysis Services 2000 handles KeyNotFound errors.

NullKeyNotAllowed (Default=ReportAndContinue)

NullKeyConvertedToUnknown (Default=IgnoreError)

KeyDuplicate (Default=IgnoreError)

KeyNotFound (Default=ReportAndContinue)—This is the action that the server should take when a data integrity error of this type occurs. It has three options:
IgnoreError tells the server to continue processing without logging the error or counting it towards the key error limit.

ReportAndContinue tells the server to continue processing after logging the error and counting it towards the key error limit.

ReportAndStop tells the server to log the error and fail the processing immediately (regardless of the key error limit).

Note that the server always executes the NullProcessing rules before the ErrorConfiguration rules for each record. This is important since NULL processing can produce data integrity errors that the ErrorConfiguration rules must then handle.
The following picture shows the ErrorConfiguration properties for a cube in the properties panel.

Figure 4. Properties panel
Scenarios
In this section, we will discuss various scenarios involving data integrity issues and show how the controls described in the previous section can be used to address them. We will continue to use the relational schema specified earlier.

Referential Integrity Issues in Fact Table
The sales fact table has records with product_id that does not exist in the product dimension table. The server will produce a KeyNotFound error during partition processing. By default, KeyNotFound errors are logged and counted towards the key error limit, which is zero by default. Hence the processing will fail upon the first error.
The solution is to modify the ErrorConfiguration on the measure group or partition. Following are two alternatives:

Set KeyNotFound=IgnoreError.

Set KeyErrorLimit to a sufficiently large number.

The default handling of KeyNotFound errors is to allocate the fact record to the unknown member. Another alternative is to set KeyErrorAction=DiscardRecord, to discard the fact table record altogether.

Referential Integrity Issues in SnowFlaked Dimension Table
The product dimension table has records with product_class_id that do not exist in the product_class dimension table. This is handled in the same way as in the previous section, except that the ErrorConfiguration on the dimension needs to be modified.

NULL Foreign Keys in Fact Table
The sales fact table has records in which the product_id is NULL. By default, the NULLs are converted to zero that is looked up against the product table. If zero is a valid product_id, then the fact data is attributed to that product (probably not what you want). Otherwise a KeyNotFound error is produced. By default, KeyNotFound errors are logged and counted towards the key error limit that is zero by default. Hence the processing will fail upon the first error.
The solution is to modify the NullProcessing on the measure group attribute. Following are two alternatives:

Set NullProcessing=ConvertToUnknown. This tells the server to attribute the records with NULL values to the unknown member "Invalid Product". This also producesNullKeyConvertedToUnknown errors, which are ignored by default.

Set NullProcessing=Error. This tells the server to discard the records with NULL values. This also produces NullKeyNotAllowed errors that, by default, are logged and counted towards the key error limit. Modifying the ErrorConfiguration on the measure group or partition can control this.

Figure 5. Edit Bindings dialog box
Note that the NullProcessing needs to be set on the KeyColumn of the measure group attribute. In the Dimension Usage tab of the cube designer, edit the relationship between the dimension and the measure group. Click Advanced, select the granularity attribute, and set the NullProcessing.

NULLs in Snowflaked Dimension Table
The product dimension table has records in which the product_class_id is NULL. This is handled in the same way as in the previous section, except that the NullProcessing needs to be set on the KeyColumn of DimensionAttribute (in the Properties pane of the dimension designer).

Inconsistent Relationships in Dimension Table
As described earlier, inconsistent relationships in the dimension table result in duplicate keys. In the example described earlier, the brand_name "Best Choice" appears twice with different product_class_id values. This produces a KeyDuplicate error that by default is ignored, and the server discards the duplicate record.
Alternatively setting KeyDuplicate=ReportAndContinue/ReportAndStop will cause the errors to be logged. The log can then be examined to determine potential flaws in the dimension design.

Conclusion
Data integrity issues can be challenging for database administrators to manage. SQL Server 2005 Analysis Services provides sophisticated controls such as Unknown Member, Null Processing, and Error Configuration that can greatly simplify cube management tasks.

Sunday, March 27, 2011

DATA INTEGRITY ISSUES

Introduction

Types of Data Integrity Issues

Referential Integrity

NULL Values

Data Integrity Controls

Unknown Member

Null Processing

Error Nomenclature

Error Configuration

Scenarios

Referential Integrity Issues in Fact Table

Referential Integrity Issues in SnowFlaked Dimension Table

NULL Foreign Keys in Fact Table

NULLs in Snowflaked Dimension Table

Inconsistent Relationships in Dimension Table

Conclusion

0 comments:

Post a Comment