
Sunday, March 27, 2011

DATA INTEGRITY ISSUES


Introduction

Data integrity issues are common in relational databases, especially in operational (OLTP) systems. These issues are typically fixed by the ETL (Extraction, Transformation and Load) jobs that load the data into a data warehouse. However, it is not uncommon to have some integrity issues even in data warehouses.
SQL Server 2005 Analysis Services supports cubes built directly from operational data stores, and it offers some sophisticated controls to manage the data integrity issues inherent in such systems. Database administrators can greatly simplify their cube management tasks by exploiting these controls.

Types of Data Integrity Issues

In this section, we will identify some of the common data integrity issues. We will use the following relational schema for our discussions:
  • The sales fact table has a foreign key product_id that points to the primary key product_id in the product dimension table.
  • The product dimension table has a foreign key product_class_id that points to the primary key product_class_id in the product_class dimension table.
Figure 1. Relational schema

Referential Integrity

Referential integrity (RI) issues are the most common data integrity issues in relational databases. An RI error is essentially a violation of a foreign key–primary key constraint. For example:
  • The sales fact table has a record with a product_id that does not exist in the product dimension table.
  • The product dimension table has a product_class_id that does not exist in the product_class dimension table.

NULL Values

Although NULL values are common and even valid in relational databases, they need special treatment in Analysis Services. For example:
  • The sales fact table has a record with NULL values in store_sales, store_cost and unit_sales. These could be interpreted as a transaction with zero sales, or as if the transaction did not exist. MDX query results (NON EMPTY) would differ depending on the interpretation.
  • The sales fact table has a record with a NULL value in product_id. Although this is not an RI error in the relational database, it is a data integrity issue that Analysis Services needs to handle.
  • The product table has a record with a NULL value in product_name. Since this column is providing member keys or names in Analysis Services, the NULL value could be preserved, converted to an empty string, etc.

Data Integrity Controls

In this section, we discuss the various controls that Analysis Services offers to database administrators for dealing with data integrity issues. Note that these controls are not mutually independent. For example, Null Processing depends on Unknown Member, and Error Configuration depends on both Null Processing and Unknown Member.

Unknown Member

The Dimension object has a property called UnknownMember that takes three possible values—None, Hidden, Visible. When UnknownMember=Hidden/Visible, the Analysis Server automatically creates a special member called the Unknown Member in every attribute of the dimension. UnknownMember=Hidden indicates that the unknown member will be hidden from query results and schema rowsets. The default value of UnknownMember is None.
The UnknownMemberName property can be used to specify a meaningful name for the unknown member. The UnknownMemberTranslations property can be used to specify localized captions for the unknown member.
Figure 2 shows the Product dimension with UnknownMember=Visible and UnknownMemberName="Invalid Product".
Figure 2. Product dimension
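In the XML-based DDL, these settings are simple properties of the Dimension object. Below is a minimal sketch of the relevant fragment of a Product dimension definition, using the values shown in Figure 2 (the rest of the dimension definition is omitted):

<Dimension>
  <ID>Product</ID>
  <Name>Product</Name>
  <UnknownMember>Visible</UnknownMember>
  <UnknownMemberName>Invalid Product</UnknownMemberName>
  <!-- attributes, hierarchies and other properties omitted -->
</Dimension>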

Null Processing

The DataItem object is used in the Analysis Services DDL to specify metadata about any scalar data item. This includes:
  • Key column(s) of an attribute
  • Name column of an attribute
  • Source column of a measure
The DataItem object contains many properties including the following:
  • DataType
  • DataSize
  • NullProcessing
  • Collation
The NullProcessing property specifies what action the server should take when it encounters a NULL value. It can take five possible values:
  • ZeroOrBlank—This tells the server to convert the NULL value to a zero (for numeric data items) or a blank string (for string data items). This is how Analysis Services 2000 handles NULL values.
  • Preserve—This tells the server to preserve the NULL value. The server has the ability to store NULL just like any other value.
  • Error—This tells the server that a NULL value is illegal in this data item. The server will generate a data integrity error and discard the record.
  • UnknownMember—This tells the server to interpret the NULL value as the unknown member. The server will also generate a data integrity error. This option is applicable only for attribute key columns.
  • Default—This is a conditional default. It implies ZeroOrBlank for dimensions and cubes, and UnknownMember for mining structures and models.
Note that the NullProcessing options Error and UnknownMember generate data integrity errors, but the others do not.
The following picture shows the DataItem editor for the key columns of a dimension attribute.
Figure 3. DataItem Collection Editor
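To make this concrete, here is a hedged sketch of how a Product attribute's key column might be bound in the DDL with NullProcessing set to UnknownMember. Each KeyColumn is a DataItem; the table and column names come from the schema in Figure 1, and the remaining attribute properties are omitted:

<Attribute>
  <ID>Product</ID>
  <Name>Product</Name>
  <KeyColumns>
    <!-- a KeyColumn is a DataItem -->
    <KeyColumn>
      <DataType>Integer</DataType>
      <NullProcessing>UnknownMember</NullProcessing>
      <Source xsi:type="ColumnBinding">
        <TableID>product</TableID>
        <ColumnID>product_id</ColumnID>
      </Source>
    </KeyColumn>
  </KeyColumns>
</Attribute>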

Error Nomenclature

Before we discuss the Error Configuration control, we need to clearly define the different types of data integrity errors that the server can encounter. We have already learned about two of them in the previous section on Null Processing. Following is the complete list:
  • NullKeyNotAllowed—This error is generated when an illegal NULL value is encountered and the record is discarded (when NullProcessing = Error).
  • NullKeyConvertedToUnknown—This error is generated when a NULL key value is interpreted as the unknown member (when NullProcessing = UnknownMember).
  • KeyDuplicate—This error is generated only during dimension processing when an attribute key is encountered more than once. Since attribute keys must be unique, the server will discard the duplicate records. In most cases, it is acceptable to have this error. But sometimes it indicates a flaw in the dimension design, leading to inconsistent relationships between attributes.
  • KeyNotFound—This is the classic referential integrity error in relational databases. It can be encountered during partition as well as dimension processing.

Error Configuration

The ErrorConfiguration object is central to the management of data integrity errors. The server comes with a default error configuration (in the msmdsrv.ini config file). The error configuration can also be specified on the database, dimension, cube, measure group and partition. In addition, the error configuration can be overridden on the Batch and Process commands.
The ErrorConfiguration object specifies how the server should handle the four types of data integrity errors. It has the following properties:
  • KeyErrorLogFile—This is the file to which the server will log the data integrity errors.
  • KeyErrorLimit (Default=zero)—This is the maximum number of data integrity errors that the server will allow before failing the processing. A value of -1 indicates that there is no limit.
  • KeyErrorLimitAction (Default=StopProcessing)—This is the action that the server will take when the key error limit is reached. It has two options:
    • StopProcessing—tells the server to fail the processing.
    • StopLogging—tells the server to continue processing but stop logging further errors.
  • KeyErrorAction (Default=ConvertToUnknown)—This is the action that the server should take when a KeyNotFound error occurs. It has two options:
    • ConvertToUnknown—tells the server to interpret the offending key value as the unknown member.
    • DiscardRecord—tells the server to discard the record. This is how Analysis Services 2000 handles KeyNotFound errors.
  • NullKeyNotAllowed (Default=ReportAndContinue), NullKeyConvertedToUnknown (Default=IgnoreError), KeyDuplicate (Default=IgnoreError), KeyNotFound (Default=ReportAndContinue)—These four properties specify the action that the server should take when a data integrity error of the corresponding type occurs. Each of them has three options:
    • IgnoreError tells the server to continue processing without logging the error or counting it towards the key error limit.
    • ReportAndContinue tells the server to continue processing after logging the error and counting it towards the key error limit.
    • ReportAndStop tells the server to log the error and fail the processing immediately (regardless of the key error limit).
Note that the server always executes the NullProcessing rules before the ErrorConfiguration rules for each record. This is important since NULL processing can produce data integrity errors that the ErrorConfiguration rules must then handle.
The following picture shows the ErrorConfiguration properties for a cube in the properties panel.
Figure 4. Properties panel
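As an illustration, an explicit error configuration on a cube or measure group might look like the following sketch in the DDL. The log file path, the error limit of 100 and the StopLogging choice are purely illustrative; the remaining values match the defaults listed above:

<ErrorConfiguration>
  <KeyErrorLogFile>C:\Logs\KeyErrors.log</KeyErrorLogFile>
  <KeyErrorLimit>100</KeyErrorLimit>
  <KeyErrorLimitAction>StopLogging</KeyErrorLimitAction>
  <KeyErrorAction>ConvertToUnknown</KeyErrorAction>
  <KeyNotFound>ReportAndContinue</KeyNotFound>
  <KeyDuplicate>IgnoreError</KeyDuplicate>
  <NullKeyNotAllowed>ReportAndContinue</NullKeyNotAllowed>
  <NullKeyConvertedToUnknown>IgnoreError</NullKeyConvertedToUnknown>
</ErrorConfiguration>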

Scenarios

In this section, we will discuss various scenarios involving data integrity issues and show how the controls described in the previous section can be used to address them. We will continue to use the relational schema specified earlier.

Referential Integrity Issues in Fact Table

The sales fact table has records with product_id values that do not exist in the product dimension table. The server will produce a KeyNotFound error during partition processing. By default, KeyNotFound errors are logged and counted towards the key error limit, which is zero by default. Hence the processing will fail upon the first error.
The solution is to modify the ErrorConfiguration on the measure group or partition. Following are two alternatives:
  • Set KeyNotFound=IgnoreError.
  • Set KeyErrorLimit to a sufficiently large number.
The default handling of KeyNotFound errors is to allocate the fact record to the unknown member. Another alternative is to set KeyErrorAction=DiscardRecord, to discard the fact table record altogether.
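Since the error configuration can also be overridden on the Batch and Process commands, a one-off reprocess can relax the rules without changing the stored design. The following is a hedged sketch of such a processing command; the database, cube, measure group and partition IDs are made-up names for this example:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>FoodMart</DatabaseID>
      <CubeID>Sales</CubeID>
      <MeasureGroupID>SalesFacts</MeasureGroupID>
      <PartitionID>SalesFacts1997</PartitionID>
    </Object>
    <Type>ProcessFull</Type>
  </Process>
  <ErrorConfiguration>
    <KeyNotFound>IgnoreError</KeyNotFound>
  </ErrorConfiguration>
</Batch>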

Referential Integrity Issues in Snowflaked Dimension Table

The product dimension table has records with product_class_id values that do not exist in the product_class dimension table. This is handled in the same way as in the previous section, except that the ErrorConfiguration needs to be modified on the dimension itself.
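A hedged sketch of where this setting lives on the dimension object (only the error-handling fragment is shown; the dimension name comes from Figure 1):

<Dimension>
  <ID>Product</ID>
  <Name>Product</Name>
  <ErrorConfiguration>
    <KeyNotFound>IgnoreError</KeyNotFound>
  </ErrorConfiguration>
  <!-- attributes and other properties omitted -->
</Dimension>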

NULL Foreign Keys in Fact Table

The sales fact table has records in which the product_id is NULL. By default, the NULL is converted to zero, which is then looked up against the product table. If zero is a valid product_id, the fact data is attributed to that product (probably not what you want). Otherwise a KeyNotFound error is produced. By default, KeyNotFound errors are logged and counted towards the key error limit, which is zero by default. Hence the processing will fail upon the first error.
The solution is to modify the NullProcessing on the measure group attribute. Following are two alternatives:
  • Set NullProcessing=UnknownMember. This tells the server to attribute the records with NULL values to the unknown member "Invalid Product". This also produces NullKeyConvertedToUnknown errors, which are ignored by default.
  • Set NullProcessing=Error. This tells the server to discard the records with NULL values. This also produces NullKeyNotAllowed errors which, by default, are logged and counted towards the key error limit. This can be controlled by modifying the ErrorConfiguration on the measure group or partition.
Figure 5. Edit Bindings dialog box
Note that the NullProcessing needs to be set on the KeyColumn of the measure group attribute. In the Dimension Usage tab of the cube designer, edit the relationship between the dimension and the measure group. Click Advanced, select the granularity attribute, and set the NullProcessing.
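In the DDL, this corresponds to the KeyColumn of the granularity attribute inside the measure group's dimension binding. A hedged sketch, with the fact table and column names taken from Figure 1 and the other measure group properties omitted:

<MeasureGroup>
  <!-- measures, partitions and other properties omitted -->
  <Dimensions>
    <Dimension xsi:type="RegularMeasureGroupDimension">
      <CubeDimensionID>Product</CubeDimensionID>
      <Attributes>
        <Attribute>
          <AttributeID>Product</AttributeID>
          <KeyColumns>
            <KeyColumn>
              <DataType>Integer</DataType>
              <NullProcessing>UnknownMember</NullProcessing>
              <Source xsi:type="ColumnBinding">
                <TableID>sales_fact</TableID>
                <ColumnID>product_id</ColumnID>
              </Source>
            </KeyColumn>
          </KeyColumns>
        </Attribute>
      </Attributes>
    </Dimension>
  </Dimensions>
</MeasureGroup>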

NULLs in Snowflaked Dimension Table

The product dimension table has records in which the product_class_id is NULL. This is handled in the same way as in the previous section, except that the NullProcessing needs to be set on the KeyColumn of DimensionAttribute (in the Properties pane of the dimension designer).

Inconsistent Relationships in Dimension Table

As described earlier, inconsistent relationships in the dimension table result in duplicate keys. For example, suppose the brand_name "Best Choice" appears twice with different product_class_id values. This produces a KeyDuplicate error, which by default is ignored, and the server discards the duplicate record.
Alternatively, setting KeyDuplicate=ReportAndContinue or ReportAndStop will cause the errors to be logged. The log can then be examined to determine potential flaws in the dimension design.

Conclusion

Data integrity issues can be challenging for database administrators to manage. SQL Server 2005 Analysis Services provides sophisticated controls such as Unknown Member, Null Processing, and Error Configuration that can greatly simplify cube management tasks.



Real Time Processing

Real Time (Transaction)

Another type of real-time system involves transactions. As soon as a transaction is received by the computer, it is processed and any data files are updated. The system is real-time. If a customer books a seat at a performance then the details are added to the bookings file immediately and that seat is flagged as 'Reserved'. Another customer who asks a few seconds later for a seat at the same performance will not be able to book the same seat.

Example: Theater Booking System


About the Booking System

  • Theater booking systems can be built to any specification
  • Payments for bookings can be taken online by debit and credit card
  • The systems are web based and can be accessed from anywhere in the world
  • Each system is custom designed to match your company branding
  • Bookings can still be taken by phone and in person using the administration panel

Real Time (Process)

Real-time means that data is processed immediately.

A real-time system is always 'up-to-date'. The computer used in a real-time system is 'dedicated' - it does nothing else.


An example of a real-time system is a process-control system, in which computers control a manufacturing process. Input data received from sensors is processed and analyzed immediately, and any necessary actions are taken without delay.


Another example would be a flight simulator.

KEY FIELD


A key is a data item that allows us to uniquely identify individual occurrences of an entity type. You can sort and quickly retrieve information from a database by choosing one or more fields (i.e. attributes) to act as keys. For instance, in a table of students you could use a combination of the last name and first name fields (or perhaps last name, first name and date of birth, to ensure you identify each student uniquely) as a key field.
There are several types of key field:
  • Primary Key
  • Secondary Key
  • Foreign key
  • Simple key
  • Compound key
  • Composite key


Primary Key


A primary key consists of one or more attributes that distinguish a specific record from any other. For each record in the table, the primary key acts like a driver's licence number or a National Insurance number: only one value exists for each person.
For example, your student number is a primary key as this uniquely identifies you within the college student records system. An employee number uniquely identifies a member of staff within a company. 
An IP address uniquely addresses a PC on the internet.
A primary key is mandatory. That is, each entity occurrence must have a value for its primary key.


Secondary Key

An entity may have one or more choices for the primary key. Collectively these are known as candidate keys. One is selected as the primary key. Those not selected are known as secondary keys.
For example, an employee has an employee number, a National Insurance (NI) number and an email address. If the employee number is chosen as the primary key, then the NI number and email address are secondary keys. However, it is important to note that if any employee does not have an NI number or email address (i.e. the attribute is not mandatory), then that attribute cannot be chosen as the primary key.



Foreign Key
A foreign key is one or more attributes in one entity that enable a link (or relationship) to another entity. That is, a foreign key in one entity links to a primary key in another entity. However, if the business rules permit, a foreign key may be optional.
For example, an employee works in a department. The department number column in the employee entity is a foreign key, which links to the department entity.
Foreign keys will be explained in more detail when we explore the normalisation process later in this section.

What is a Key field in a Database and How should I choose one?

Keys are crucial to a table structure for many reasons, some of which are identified below:
  • They ensure that each record in a table is precisely identified.
  • They help establish and enforce various types of integrity.
  • They serve to establish table relationships.

Now let's see how you should choose your key(s). First, let's make up a little table to look at:
PersonID   LastName   FirstName   D.O.B
1          Smith      Robert      01/01/1970
2          Jones      Robert      01/01/1970
4          Smith      Henry       01/01/1970
5          Jones      Henry       01/01/1970

A superkey is a column or set of columns that uniquely identifies a record. This table has many superkeys:
  • PersonID
  • PersonID + LastName
  • PersonID + FirstName
  • PersonID + DOB
  • PersonID + LastName + FirstName
  • PersonID + LastName + DOB
  • PersonID + FirstName + DOB
  • PersonID + LastName + FirstName + DOB
  • LastName + FirstName + DOB

All of these will uniquely identify each record, so each one is a superkey. Of those keys, a key that is made up of more than one column is a composite key; a key of only one column is a simple key.
A candidate key is a superkey that has no unique subset; it contains no columns that are not necessary to make it unique. 

This table has 2 candidate keys:
  • PersonID
  • LastName + FirstName + DOB

Not all candidate keys make good primary keys. These may work for our current data set, but (LastName + FirstName + DOB) would likely be a bad choice for future data: it is quite possible for two people to share a full name and date of birth.

We select a primary key from the candidate keys. This primary key will uniquely identify each record. It may or may not provide information about the record it identifies. It must not be nullable; that is, if it exists in a record it cannot have the value Null. It must be unique, and it cannot be changed. Any candidate keys we do not select become alternate keys.

We will select (PersonID) as the primary key. This makes (LastName + FirstName + DOB) an alternate key.
Now, if this field PersonID is meaningful, that is, it is used for some purpose other than making the record unique, it is a natural key or intelligent key. In this case PersonID is probably not an AutoNumber field, but is rather a "customer number", much like a UPC or ISBN.

However, if this field is not meaningful, that is, it exists strictly so the database can internally identify a unique record, it is a surrogate key or blind key. In this case PersonID probably is an AutoNumber field, and it should not be used except internally by the database.

There is a long-running debate over whether one should use natural or surrogate keys, and I'm not going to foolishly attempt to resolve it here. Whichever you use, stick with it. If you choose to generate an AutoNumber that is only used to identify a record, do not expose that number to the user. They will surely want to change it, and you cannot change primary keys.

I can now use my chosen primary key in another table, to relate the two tables. It may or may not have the same name in that second table. In either case, with respect to the second table it is a foreign key, and if the foreign key field in that second table is indexed, it is sometimes called a fast foreign key.
Many thanks to JasonM, mdbmakers.com Moderator, at www.MDBMAKERS.com for permission to use the above article




TRANSACTION PROCESSING AND BATCH PROCESSING

Transaction processing is the handling of individual items of data (transactions) by a system as they occur. This includes different types of transactions, such as reserving airline tickets, processing credit cards and others. Batch transaction processing is based on the same idea, but it deals with multiple transactions executed together by a computer: a collection of data is set up so that the computer can process it without any human interaction.
Here is how batch processing works:
  • Batch Processing - Batch processing is the processing of multiple credit card transactions at the same time. A credit card processing system works through all the transactions without the need for human interaction. Batch processing is useful for tellers, as it is more convenient to settle different credit card transactions all at the same time.
  • Using Your Credit Card - When you use your credit card to purchase or pay for products and services, the transaction is passed into a processing system and monitored. This is the case for most banking companies, including Visa processing. The transaction system is what processes your credit card.
  • Credit Card Batch Processing - After you have purchased something using your credit card, the transaction enters the bank's processing system. You are not billed for each individual purchase; instead, you are given one bill with all the purchases you have made. The bill works that way because of batch processing: it accumulates all your purchases and puts them in one billing statement when the billing time arrives. This cycle is repeated every month.
Some benefits of batch processing:
  • Convenience - Batch processing is more convenient for bank tellers. They do not need to process each purchase individually online; instead they deal with one statement that contains many transactions. They also have more time for customers, as batch processing systems do not need to be operated by a person in order to work.
  • Saves Money - Processing each transaction individually carries a small fee, but across many transactions the fees pile up and make a big difference in expenses. Batch processing keeps transaction fees nominal, because fewer resources are used than with individual processing.
  • Efficiency - Transactions in a batch statement are easier to find. This is helpful if you are trying to locate a particular purchase or if you have a problem with a transaction: you only have to search one statement rather than going through plenty of them.
This is how batch transaction processing works, along with some of the benefits that come with it. Batch transaction processing is not only for credit cards; it also handles many other transactions that are just as important.

Batch Processing in the Google Data Protocol

Batch processing gives you the ability to execute multiple operations in one request, rather than having to submit each operation individually.

Note: To perform batch operations, you need to be using a recent version of your Google Data API client library. Batch operations are not supported by the JavaScript client library.

Contents

1. Audience
2. Introduction
3. Submitting a batch request
4. Writing a batch operations feed
5. Handling status codes
6. Example batch operations and status feeds

Introduction

Using a GData batch feed, you can collect multiple insert, update, delete, and query operations, and then submit and execute them all at once. For example, the following feed includes four operations:

<feed>
  <entry>
    <batch:operation type="insert"/>
    ... what to insert ...
  </entry>
  <entry>
    <batch:operation type="update"/>
    ... what to update ...
  </entry>
  <entry>
    <batch:operation type="delete"/>
    ... what to delete ...
  </entry>
  <entry>
    <batch:operation type="query"/>
    ... what to query ...
  </entry>
</feed>

The service will perform as many of the requested changes as possible and return status information that you can use to evaluate the success or failure of each operation.
The service attempts to execute each of the operations within a batch, even if some of the operations included in the batch do not succeed.

Submitting a batch request

A batch request should be sent as an HTTP POST to a batch URL. Different feeds support different batch operations; read-only feeds support only queries.
To discover whether a given feed supports batch operations, you can query the feed. If the feed contains a "batch" link relation at the feed level, the feed supports batch operations.
A "batch" link relation is a <link> element with rel="http://schemas.google.com/g/2005#batch". The href attribute of the link relation defines the URL where feed documents for batch operations may be posted.
For example, if you execute GET http://www.google.com/base/feeds/items (the regular Google Base "items" feed), you might get the following response:

<feed xmlns=...
  <id>http://www.google.com/base/feeds/items</id>
  <link rel="http://schemas.google.com/g/2005#feed"
    type="application/atom+xml"
    href="http://www.google.com/base/feeds/items"/>
  <link rel="http://schemas.google.com/g/2005#post"
    type="application/atom+xml"
    href="http://www.google.com/base/feeds/items"/>
  <link rel="http://schemas.google.com/g/2005#batch"
    type="application/atom+xml"
    href="http://www.google.com/base/feeds/items/batch"/>
  ...
</feed>

In this example, the batch URL is http://www.google.com/base/feeds/items/batch.
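The service's response pairs each entry with a status. Each response entry echoes a batch:id that you can supply in the corresponding request entry and carries a batch:status element; here is a hedged sketch of a single response entry (the id and status values are illustrative):

<entry>
  <batch:id>1</batch:id>
  <batch:operation type="insert"/>
  <batch:status code="201" reason="Created"/>
  <!-- the content of the inserted entry appears here -->
</entry>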

Payroll



At the end of each pay period, a business needs to pay its employees. A payroll system uses a computer to calculate the wages of each employee, print out pay-slips and record the information for accounting purposes.

Input : This may come from ...
  • a database of employee details (salaries, pay rates, bonus rates, etc.)
  • if employees are paid by the hour, timesheets would be used to input and validate the number of hours worked and the number of hours of overtime (possibly using OMR or OCR techniques)

Process : The computer needs to calculate ..
  • the gross amount earned by each employee.
  • any bonuses
  • any deductions such as tax, national insurance etc
  • the net amount earned by each employee.

Output : The computer would need to ..
  • print pay-slips with amounts and deductions (using pre-printed stationery)
  • update the employee database.
  • output details of payments to BACS (Bankers Automated Clearing Service) to pay money directly into employees' bank accounts.
  • print summary reports.

A Payroll system is usually run as a batch processing system. Data may be entered for a number of departments or branches of a company and then the processing is done when all the data has been collected.
As there is no urgency for the output, the payroll processing may be run at off-peak times (e.g. overnight).

The database of the employees and the time sheets will need to be kept secure from unauthorized access. (Employees must not be able to alter their data!)
The employee database is updated in the payroll process. A backup copy of the database is made before this is done.
The new database is called the 'son' and the backup is the 'father'. Previous generations of backups are referred to as 'grandfather' etc...
