SQL Normalization: Unraveling the Fundamentals

Welcome to this comprehensive exploration of SQL Normalization. In this blog, we'll delve into the concept of normalization, its different forms, benefits, and a few examples to put these concepts into perspective.

Introduction to SQL Normalization

Normalization in SQL is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships. The inventor of the relational model, Edgar F. Codd, introduced normalization and its forms are based on the understanding of functional dependency.

Objective of Normalization

The main objective of normalization is to eliminate redundant (repeated) data, which in turn prevents data manipulations anomalies and saves storage. Normalization rules are divided into several forms, each with an increasing level of complexity.

The Normal Forms

The normalization process involves getting our data to conform to a series of 'normal forms' to ensure data integrity and reduce data redundancy. These normal forms, from first (1NF) through fifth (5NF), provide increasingly stringent rules for how the database schema should be organized.

First Normal Form (1NF)

1NF is achieved by ensuring that each column in a table contains atomic, indivisible values, and each cell contains only one value from its attribute (column) domain. It also necessitates that all entries in a column (attribute) are of the same kind.

For example, consider a table:

StudentID	Subject
1	Math, Physics
2	English, History

This table is not in 1NF because the Subject column has multiple values. A 1NF version of this table would be:

StudentID	Subject
1	Math
1	Physics
2	English
2	History

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and if all non-key attributes are fully functional dependent on the primary key. This means that all non-key attributes must depend on the entire set of primary key attributes.

Consider a table:

StudentID	CourseID	StudentName	CourseName
1	101	John	Math
2	102	Sarah	English
1	103	John	Physics

In this table, the primary key is (StudentID, CourseID). But the problem is that 'StudentName' is only dependent on 'StudentID', and 'CourseName' is only dependent on 'CourseID'. They are not dependent on the full set of primary keys. This is a partial dependency, which is not allowed in 2NF. So, we break this table up to remove the partial dependency:

Student table:

StudentID	StudentName
1	John
2	Sarah

Course table:

CourseID	CourseName
101	Math
102	English
103	Physics

StudentCourse table:

StudentID	CourseID
1	101
2	102
1	103

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and if there are no transitive functional dependencies. A transitive functional dependency is when a non-key attribute depends on another non-key attribute.

Consider a table:

StudentID	StudentName	CourseID	CourseName	CourseInstructor
1	John	101	Math	Mr. Smith
2	Sarah	102	English	Ms. Johnson
1	John	103	Physics	Dr. Brown

In this table, the primary key is (StudentID, CourseID). But 'CourseInstructor' depends on 'CourseName', which depends on 'CourseID'. This is a transitive dependency, which is not allowed in 3NF. So, we break this table up to remove the transitive dependency:

Student table:

StudentID	StudentName
1	John
2	Sarah

Course table:

CourseID	CourseName	CourseInstructor
101	Math	Mr. Smith
102	English	Ms. Johnson
103	Physics	Dr. Brown

StudentCourse table:

StudentID	CourseID
1	101
2	102
1	103

Boyce-Codd Normal Form (BCNF)

A table is in BCNF if it is in 3NF and for each of its dependencies X -> Y, X is a superkey. A superkey is a set of one or more columns that can uniquely identify a row in a table.

Let's consider the following table:

CourseID	CourseName	CourseInstructor
101	Math	Mr. Smith
102	English	Ms. Johnson
103	Physics	Dr. Brown
104	Math	Dr. Stone

The primary key here is (CourseID, CourseName). However, 'CourseInstructor' is dependent only on 'CourseName', which is part of the primary key. Therefore, we must separate the tables like this:

Course table:

CourseID	CourseName
101	Math
102	English
103	Physics
104	Math

Instructor table:

CourseName	CourseInstructor
Math	Mr. Smith
English	Ms. Johnson
Physics	Dr. Brown
Math	Dr. Stone

Fourth Normal Form (4NF)

A table is in 4NF if it is in BCNF and there are no multi-valued dependencies. A multi-valued dependency occurs when one attribute in a table depends on another, yet they are both independent of the primary keys.

Suppose we have a table:

StudentID	Skill	Hobby
1	Programming	Fishing
1	Design	Fishing
2	Design	Painting
2	Design	Stamp collecting

Here, a student can have multiple skills and multiple hobbies, but the skills and hobbies are independent of each other. This is a multi-valued dependency. To make it 4NF, we'd split it into two tables:

Skills table:

StudentID	Skill
1	Programming
1	Design
2	Design

Hobbies table:

StudentID	Hobby
1	Fishing
2	Painting
2	Stamp collecting

Fifth Normal Form (5NF) or Project-Join Normal Form (PJNF)

A table is in 5NF or PJNF if every join dependency in the table is a consequence of the candidate keys. This is about situations where information can be re-constructed from smaller pieces of information that can be maintained with less redundancy.

Let's consider an example. Suppose we have a database of musicians, instruments they play, and bands they play for:

Musician	Instrument	Band
John	Guitar	Band1
John	Bass	Band2
Paul	Bass	Band1
Paul	Drums	Band2

If we decompose this into two tables:

Table1

Musician	Instrument
John	Guitar
John	Bass
Paul	Bass
Paul	Drums

Table2

Musician	Band
John	Band1
John	Band2
Paul	Band1
Paul	Band2

We lost the information that John plays Guitar for Band1, Bass for Band2, etc. Hence, this decomposition leads to loss of information and it's not a correct decomposition. In other words, the original table is in 5NF.

Benefits of Normalization

Avoids Data Redundancy : Normalization splits up the data into additional tables to avoid duplication.
Data Consistency : By having a single update point, normalization brings in data consistency.
Database Optimization : Smaller tables with reduced data tend to increase the search performance.
Saves Storage Space : Normalization helps to eliminate redundant data, saving storage space.

An Example

Consider a table ‘Customer’ with fields: CustomerID, CustomerName, CustomerAddress and OrderID.

This table is not in 1NF as the OrderID field is multi-valued. To convert it into 1NF, we can divide it into two tables ‘Customer’ and ‘Order’ where CustomerID would be a common field.

The 'Customer' table now has: CustomerID, CustomerName, CustomerAddress. The 'Order' table has: CustomerID, OrderID.

Now, both tables are in 1NF. Further normalization can be applied as necessary.

Conclusion

Normalization is a key aspect of relational database design. Proper normalization ensures that your database is efficient, flexible, and it minimizes redundancy without compromising the integrity of your data. Although the process may seem complex, the benefits are worth the effort. It's crucial to remember that normalization, like many aspects of database design, must be carefully tailored to the requirements of the system – there's no one-size-fits-all solution. Keep exploring the world of SQL with us for more insights and guides! Happy learning!