DATABASE ANOMALIES Database Management System
DATABASE ANOMALIES Database Management System
DATABASE ANOMALIES Database Management System : Database anomalies are the problems in relations that occur due to redundancy in the relations. These anomalies affect the process of inserting, deleting and modifying data in the relations. Some important data may be lost if a relations is updated that contains database anomalies. It is important to remove these anomalies in order to perform different processing on the relations without any problem.
We have said that relations thatform the database must satisfy some properties, for example, relations have no duplicate tuples, tuples have no ordering associated with them, and each element in the relation is atomic. Relations that satisfy these basic requirements may still have some undesirable properties, for example, data redundancy and update anomalies. We illustrate these properties and study how relations may be transformed or decomposed (or normalised) to eliminate them. Most such undesirable properties do not arise if the database modelling has been carried out very carefully using some technique like the Entity-Relationship Model that we have discussed but it is still important to understand the techniques in this chapter to check the model that has been obtained and ensure that no mistakes have been made in modeling.
The Central concept in these discussions is the notion of functional dependency which depends on understanding the semantics of the data and which deals with what information in a relation is dependent on what other information in the relation. We will define the concept of functional dependency and discuss how to reason with the dependencies. We will then show how to use the dependencies information to decompose relations whenever necessary to obtain relations that have the desirable properties that we want without loosing any of the information in the original relations.
Let us consider the following relation student.
The above table satisfies the properties of a relation and is said to be in first normal form (or 1 NF). Conceptually it is Convenient to have all the information in one relation since it is then likely to be easier to query the database. But the above relation has the following undesirable features (Anomalies):
- Redundancy Anomalies : A lot of information is being repeated. Student name, address, course name, instructor name and office number are being repeated often. Every time we wish to insert a student enrolment, say, in CP302 we must insert the name of the course CP302 as well as the name and office number of its instructor. Also everytime we inserta new enrolment for, say Smith, we must repea this name and address. Repetition of information results in wastage of storage as well as other problems.
- Updation Anomalies : Redundant information not only wastes storage but makes updates more difficult since, for example, changing the name of the instructor of CP302 would require that all tuples Containing CP302 enrolment information be updated. If for Some reason, all tuples are not updated, we might have a database that gives two names of instructor for subject CP302. This difficulty is called the update anomaly,
- Insertion Anomalies : inability to represent certain information : Let the primary key of the above relation be (sno, cno). Any new tuple to be inserted in the relation must have a value for the primary key since existential integrity requires that a key may not be totally or partially NULL. However, if one wanted to insert the number and name of a new Course in the database, it would not be possible until a student enrols in the course and we are able to insert values of sno and cno. Similarly information about a new student cannot be inserted in the database until the student enrols in a subject. These difficulties are called insertion anomalies.
- Deletion Anomalies : Loss of Useful information: in some instances, useful information may be lost when a tuple is deleted. For example, if we delete the tuple corresponding to student 85001 doing CP304, we will loose relevant information about Course CP304 (viz. Course name, instructor, office number), if the student 85001 was the only student enrolled in that course. Similarly deletion of course CP302 from the database may remove all information about the student named Jones. This is called deletion anomalies.
The above problems arise primarily because the relation student has information about Students as well as subjects. One solution to deal with the problems is to decompose the relation into two or more smaller relations.
Decomposition may provide further benefits, for example, in a distributed database different relations may be stored at different sites if necessary. Of course, decomposition does increase the cost of query processing since the decomposed relations will need to be joined, Sometime frequently.
The above relation may be easily decomposed into three relations to remove most of the above undesirable properties:
S (sno, sname, address)
C (Cno, Cname, instructor, office)
SC (sno, Cno)
Such decomposition is called normalization and is essential if we wish to overcome undesirable anomalies. As noted earlier, normalization often has an adverse effect on performance. Data which could have been retrieved from one relation before normalization may require several relations to be joined after normalization. Normalization does however lead to more efficient updates since an update that may have required several tuples to be updated before normalization could well need only one tuple to be updated after normalization.
Although in the above case we are able to look at the original relation and propose a suitable decomposition that eliminates the anomalies that we have discussed, in general this approach is not possible. A relation may have one hundred or more attributes and it is then almost impossible for a person to conceptualise all the information and suggest a suitable decomposition to overcome the problems. We therefore need an algorithmic approach to finding if there are problems in a proposed database design and how to eliminate them if they exist.
There are several stages of the normalization process. These are called the first normal form (1NF), the second normal form (2NF), the third normal form (3.NF), Boyce-Codd normal form (BCNF), the fourth normal form (4NF) and the fifth normal form (5NF). For all practical purposes, 3 NF or the BCNF are quite adequate since they remove the anomalies discussed above for most common situations. It should be clearly understood that there is no obligation to normalise relations to the highest possible level. Performance should be taken into account and this may result in a decision not to normalise, say, beyond second normal form.
Intuitively, the second and third normal forms are designed to result in relations such that each relation contains information about only one thing (either an entity or a relationship). That is, non-key attributes in each relation must provide a fact about the entity or relationship that is being identified by the key. Again, a sound E-R model of the database would ensure that all relations either provide facts about an entity or about a relationship resulting in the relations that are obtained being in 3NF.
It should be noted that decomposition of relations has to be always based on principles that ensure that the original relation may be reconstructed from the decomposed relations if and when necessary. If we are able to reduce redundancy and not loose any information, it implies that all that redundant information can be derived given the other information in the database. Therfore information that has been removed must be related or dependent on other information that still exists in the database. That is why the concept of redundancy is important. Careless decomposition of a relation can result in loss of information. We will discuss this in detail later.