SQL Normalization: Unraveling the Fundamentals
Welcome to this comprehensive exploration of SQL Normalization. In this blog, we'll delve into the concept of normalization, its different forms, benefits, and a few examples to put these concepts into perspective.
Introduction to SQL Normalization
Normalization in SQL is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships. The inventor of the relational model, Edgar F. Codd, introduced normalization and its forms are based on the understanding of functional dependency.
Objective of Normalization
The main objective of normalization is to eliminate redundant (repeated) data, which in turn prevents data manipulations anomalies and saves storage. Normalization rules are divided into several forms, each with an increasing level of complexity.
The Normal Forms
The normalization process involves getting our data to conform to a series of 'normal forms' to ensure data integrity and reduce data redundancy. These normal forms, from first (1NF) through fifth (5NF), provide increasingly stringent rules for how the database schema should be organized.
First Normal Form (1NF)
1NF is achieved by ensuring that each column in a table contains atomic, indivisible values, and each cell contains only one value from its attribute (column) domain. It also necessitates that all entries in a column (attribute) are of the same kind.
For example, consider a table:
StudentID | Subject |
---|---|
1 | Math, Physics |
2 | English, History |
This table is not in 1NF because the Subject column has multiple values. A 1NF version of this table would be:
StudentID | Subject |
---|---|
1 | Math |
1 | Physics |
2 | English |
2 | History |
Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and if all non-key attributes are fully functional dependent on the primary key. This means that all non-key attributes must depend on the entire set of primary key attributes.
Consider a table:
StudentID | CourseID | StudentName | CourseName |
---|---|---|---|
1 | 101 | John | Math |
2 | 102 | Sarah | English |
1 | 103 | John | Physics |
In this table, the primary key is (StudentID, CourseID). But the problem is that 'StudentName' is only dependent on 'StudentID', and 'CourseName' is only dependent on 'CourseID'. They are not dependent on the full set of primary keys. This is a partial dependency, which is not allowed in 2NF. So, we break this table up to remove the partial dependency:
Student table:
StudentID | StudentName |
---|---|
1 | John |
2 | Sarah |
Course table:
CourseID | CourseName |
---|---|
101 | Math |
102 | English |
103 | Physics |
StudentCourse table:
StudentID | CourseID |
---|---|
1 | 101 |
2 | 102 |
1 | 103 |
Third Normal Form (3NF)
A table is in 3NF if it is in 2NF and if there are no transitive functional dependencies. A transitive functional dependency is when a non-key attribute depends on another non-key attribute.
Consider a table:
StudentID | StudentName | CourseID | CourseName | CourseInstructor |
---|---|---|---|---|
1 | John | 101 | Math | Mr. Smith |
2 | Sarah | 102 | English | Ms. Johnson |
1 | John | 103 | Physics | Dr. Brown |
In this table, the primary key is (StudentID, CourseID). But 'CourseInstructor' depends on 'CourseName', which depends on 'CourseID'. This is a transitive dependency, which is not allowed in 3NF. So, we break this table up to remove the transitive dependency:
Student table:
StudentID | StudentName |
---|---|
1 | John |
2 | Sarah |
Course table:
CourseID | CourseName | CourseInstructor |
---|---|---|
101 | Math | Mr. Smith |
102 | English | Ms. Johnson |
103 | Physics | Dr. Brown |
StudentCourse table:
StudentID | CourseID |
---|---|
1 | 101 |
2 | 102 |
1 | 103 |
Boyce-Codd Normal Form (BCNF)
A table is in BCNF if it is in 3NF and for each of its dependencies X -> Y, X is a superkey. A superkey is a set of one or more columns that can uniquely identify a row in a table.
Let's consider the following table:
CourseID | CourseName | CourseInstructor |
---|---|---|
101 | Math | Mr. Smith |
102 | English | Ms. Johnson |
103 | Physics | Dr. Brown |
104 | Math | Dr. Stone |
The primary key here is (CourseID, CourseName). However, 'CourseInstructor' is dependent only on 'CourseName', which is part of the primary key. Therefore, we must separate the tables like this:
Course table:
CourseID | CourseName |
---|---|
101 | Math |
102 | English |
103 | Physics |
104 | Math |
Instructor table:
CourseName | CourseInstructor |
---|---|
Math | Mr. Smith |
English | Ms. Johnson |
Physics | Dr. Brown |
Math | Dr. Stone |
Fourth Normal Form (4NF)
A table is in 4NF if it is in BCNF and there are no multi-valued dependencies. A multi-valued dependency occurs when one attribute in a table depends on another, yet they are both independent of the primary keys.
Suppose we have a table:
StudentID | Skill | Hobby |
---|---|---|
1 | Programming | Fishing |
1 | Design | Fishing |
2 | Design | Painting |
2 | Design | Stamp collecting |
Here, a student can have multiple skills and multiple hobbies, but the skills and hobbies are independent of each other. This is a multi-valued dependency. To make it 4NF, we'd split it into two tables:
Skills table:
StudentID | Skill |
---|---|
1 | Programming |
1 | Design |
2 | Design |
Hobbies table:
StudentID | Hobby |
---|---|
1 | Fishing |
2 | Painting |
2 | Stamp collecting |
Fifth Normal Form (5NF) or Project-Join Normal Form (PJNF)
A table is in 5NF or PJNF if every join dependency in the table is a consequence of the candidate keys. This is about situations where information can be re-constructed from smaller pieces of information that can be maintained with less redundancy.
Let's consider an example. Suppose we have a database of musicians, instruments they play, and bands they play for:
Musician | Instrument | Band |
---|---|---|
John | Guitar | Band1 |
John | Bass | Band2 |
Paul | Bass | Band1 |
Paul | Drums | Band2 |
If we decompose this into two tables:
Table1
Musician | Instrument |
---|---|
John | Guitar |
John | Bass |
Paul | Bass |
Paul | Drums |
Table2
Musician | Band |
---|---|
John | Band1 |
John | Band2 |
Paul | Band1 |
Paul | Band2 |
We lost the information that John plays Guitar for Band1, Bass for Band2, etc. Hence, this decomposition leads to loss of information and it's not a correct decomposition. In other words, the original table is in 5NF.
Benefits of Normalization
Avoids Data Redundancy : Normalization splits up the data into additional tables to avoid duplication.
Data Consistency : By having a single update point, normalization brings in data consistency.
Database Optimization : Smaller tables with reduced data tend to increase the search performance.
Saves Storage Space : Normalization helps to eliminate redundant data, saving storage space.
An Example
Consider a table ‘Customer’ with fields: CustomerID, CustomerName, CustomerAddress and OrderID.
This table is not in 1NF as the OrderID field is multi-valued. To convert it into 1NF, we can divide it into two tables ‘Customer’ and ‘Order’ where CustomerID would be a common field.
The 'Customer' table now has: CustomerID, CustomerName, CustomerAddress. The 'Order' table has: CustomerID, OrderID.
Now, both tables are in 1NF. Further normalization can be applied as necessary.
Conclusion
Normalization is a key aspect of relational database design. Proper normalization ensures that your database is efficient, flexible, and it minimizes redundancy without compromising the integrity of your data. Although the process may seem complex, the benefits are worth the effort. It's crucial to remember that normalization, like many aspects of database design, must be carefully tailored to the requirements of the system – there's no one-size-fits-all solution. Keep exploring the world of SQL with us for more insights and guides! Happy learning!