Introduction

In data processing, raw data from source systems is often inconsistent, messy, and unstandardized. Data transformation techniques clean, normalize, and harmonize this data before it is analyzed or loaded into structured schemas such as a data warehouse. This blog explores common SQL-based transformation techniques for converting raw data into clean, meaningful information.

1. Removing Unnecessary Whitespace

Purpose: Trim leading and trailing whitespace to maintain consistency in textual fields.

Example:

Data Before:

customer_id | first_name | last_name
----------- | ---------- | ---------
101         | " John "   | " Doe "
102         | " Alice "  | " Brown "

SQL Transformation:

SELECT customer_id, 
       TRIM(first_name) AS first_name, 
       TRIM(last_name) AS last_name
FROM customers;

Data After:

customer_id | first_name | last_name
----------- | ---------- | ---------
101         | "John"     | "Doe"
102         | "Alice"    | "Brown"
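Transformations like this are easy to verify locally. Below is a runnable sketch using Python's built-in sqlite3 module; the table name and sample rows mirror the example above and are illustrative only.

```python
import sqlite3

# Illustrative in-memory copy of the customers table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(101, " John ", " Doe "), (102, " Alice ", " Brown ")])

# TRIM strips leading and trailing whitespace from each name.
rows = conn.execute("""
    SELECT customer_id, TRIM(first_name), TRIM(last_name)
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(101, 'John', 'Doe'), (102, 'Alice', 'Brown')]
```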

2. Handling Missing or Invalid Data

Purpose: Replace null or incorrect values with default values or meaningful alternatives.

Example:

Data Before:

customer_id | email
----------- | ------------
101         | john@abc.com
102         | NULL

SQL Transformation:

SELECT customer_id, 
       COALESCE(email, 'unknown@xyz.com') AS email
FROM customers;

Data After:

customer_id | email
----------- | ---------------
101         | john@abc.com
102         | unknown@xyz.com

3. Deduplication

Purpose: Identify and remove duplicate records while retaining the most recent or relevant entry.

Example:

Data Before:

customer_id | first_name | last_name | created_at
----------- | ---------- | --------- | ----------------
101         | John       | Doe       | 2024-01-10 10:00
101         | John       | Doe       | 2024-01-12 12:30

SQL Transformation:

WITH RankedCustomers AS (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM customers
)
SELECT customer_id, first_name, last_name, created_at
FROM RankedCustomers
WHERE rn = 1;

Data After:

customer_id | first_name | last_name | created_at
----------- | ---------- | --------- | ----------------
101         | John       | Doe       | 2024-01-12 12:30
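The keep-the-latest pattern can be tried end to end with Python's bundled sqlite3 driver; window functions require SQLite 3.25 or newer, which ships with recent Python builds. Schema and rows mirror the example above.

```python
import sqlite3

# In-memory sketch of the deduplication above (requires SQLite 3.25+ for ROW_NUMBER).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers "
             "(customer_id INTEGER, first_name TEXT, last_name TEXT, created_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (101, "John", "Doe", "2024-01-10 10:00"),
    (101, "John", "Doe", "2024-01-12 12:30"),
])

# Rank each customer's rows newest-first, then keep only rank 1.
rows = conn.execute("""
    WITH RankedCustomers AS (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY created_at DESC) AS rn
        FROM customers
    )
    SELECT customer_id, first_name, last_name, created_at
    FROM RankedCustomers
    WHERE rn = 1
""").fetchall()
print(rows)  # [(101, 'John', 'Doe', '2024-01-12 12:30')]
```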

4. Normalization and Harmonization

Purpose: Standardize categorical values to ensure consistency and accuracy.

Example:

Data Before:

marital_status | gender | product_line | country
-------------- | ------ | ------------ | -------
"S"            | "F"    | "M"          | "DE"
"M"            | "M"    | "R"          | "US"
"Unknown"      | "U"    | "X"          | "USA"
NULL           | ""     | "T"          | ""

SQL Transformation:

SELECT 
    CASE marital_status 
        WHEN 'S' THEN 'Single' 
        WHEN 'M' THEN 'Married' 
        ELSE 'n/a' END AS marital_status,
    CASE gender 
        WHEN 'F' THEN 'Female' 
        WHEN 'M' THEN 'Male' 
        ELSE 'n/a' END AS gender,
    CASE product_line 
        WHEN 'M' THEN 'Mountain' 
        WHEN 'R' THEN 'Road' 
        WHEN 'S' THEN 'Other Sales' 
        WHEN 'T' THEN 'Touring' 
        ELSE 'n/a' END AS product_line,
    CASE country 
        WHEN 'DE' THEN 'Germany' 
        WHEN 'US' THEN 'United States' 
        WHEN 'USA' THEN 'United States' 
        ELSE 'n/a' END AS country
FROM products;

Data After:

marital_status | gender   | product_line | country
-------------- | -------- | ------------ | ---------------
"Single"       | "Female" | "Mountain"   | "Germany"
"Married"      | "Male"   | "Road"       | "United States"
"n/a"          | "n/a"    | "n/a"        | "United States"
"n/a"          | "n/a"    | "Touring"    | "n/a"

5. Deriving Values

Purpose: Compute missing or dependent values from existing fields. Each COALESCE reads the original stored columns (not the other derived alias), so a row missing both inputs, like order 5003 below, stays NULL.

Example:

Data Before:

order_id | quantity | unit_price | total_price
-------- | -------- | ---------- | -----------
5001     | 3        | NULL       | 60.00
5002     | 2        | 20.00      | NULL
5003     | 4        | NULL       | NULL
5004     | 1        | 15.00      | 15.00

SQL Transformation:

SELECT order_id, 
       quantity, 
       COALESCE(unit_price, total_price / NULLIF(quantity, 0)) AS unit_price, 
       COALESCE(total_price, quantity * unit_price) AS total_price
FROM orders;

Data After:

order_id | quantity | unit_price | total_price
-------- | -------- | ---------- | -----------
5001     | 3        | 20.00      | 60.00
5002     | 2        | 20.00      | 40.00
5003     | 4        | NULL       | NULL
5004     | 1        | 15.00      | 15.00
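Here is a runnable sqlite3 sketch of the same derivation; the sample rows mirror the example above, and the NULLIF(quantity, 0) guard keeps a zero quantity from raising a division error.

```python
import sqlite3

# In-memory sketch of the value derivation above; rows mirror the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(order_id INTEGER, quantity INTEGER, unit_price REAL, total_price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (5001, 3, None, 60.0),
    (5002, 2, 20.0, None),
    (5003, 4, None, None),   # both inputs missing: stays NULL
    (5004, 1, 15.0, 15.0),
])

rows = conn.execute("""
    SELECT order_id, quantity,
           COALESCE(unit_price, total_price / NULLIF(quantity, 0)) AS unit_price,
           COALESCE(total_price, quantity * unit_price) AS total_price
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)
# [(5001, 3, 20.0, 60.0), (5002, 2, 20.0, 40.0), (5003, 4, None, None), (5004, 1, 15.0, 15.0)]
```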

6. Conditional Mapping and Formatting

6a. Standardizing Categorical Data

Purpose: Convert inconsistent category formats to a standard format.

Data Before:

product_code | category_code
------------ | -------------
"P100"       | "A-200"
"P101"       | "B-300"
"P102"       | "C400"

SQL Transformation:

SELECT product_code, 
       REPLACE(category_code, '-', '_') AS category_code
FROM products;

Data After:

product_code | category_code
------------ | -------------
"P100"       | "A_200"
"P101"       | "B_300"
"P102"       | "C400"

6b. Handling Date Transformations

Purpose: Format dates from a compact string to a standard date format.

Data Before:

order_date
----------
"20240101"
"20231231"
"20230715"

SQL Transformation:

SELECT order_date, 
       TO_DATE(order_date, 'YYYYMMDD') AS formatted_order_date
FROM orders;

Data After:

order_date | formatted_order_date
---------- | --------------------
"20240101" | "2024-01-01"
"20231231" | "2023-12-31"
"20230715" | "2023-07-15"

6c. Calculating Order Status

Purpose: Derive the shipment status based on shipment and delivery dates.

Data Before:

shipped_date | delivered_date
------------ | --------------
NULL         | "2024-02-10"
"2024-02-05" | "2024-02-07"
"2024-02-08" | NULL
"2024-02-09" | NULL
NULL         | NULL

SQL Transformation:

SELECT shipped_date, delivered_date, 
       CASE 
           WHEN shipped_date IS NULL THEN 'Pending Shipment' 
           WHEN delivered_date IS NOT NULL THEN 'Delivered' 
           ELSE 'In Transit' 
       END AS status
FROM shipments;

Data After:

shipped_date | delivered_date | status
------------ | -------------- | ------------------
NULL         | "2024-02-10"   | "Pending Shipment"
"2024-02-05" | "2024-02-07"   | "Delivered"
"2024-02-08" | NULL           | "In Transit"
"2024-02-09" | NULL           | "In Transit"
NULL         | NULL           | "Pending Shipment"

7. Extracting Year and Month from Dates

Purpose: Isolate year and month components from a date for time-series analysis.

Data Before:

order_date
------------
"2024-05-15"
"2023-11-30"
"2024-01-01"
"2024-12-25"

SQL Transformation:

SELECT order_date, 
       EXTRACT(YEAR FROM order_date) AS order_year, 
       EXTRACT(MONTH FROM order_date) AS order_month
FROM orders;

Data After:

order_date   | order_year | order_month
------------ | ---------- | -----------
"2024-05-15" | 2024       | 5
"2023-11-30" | 2023       | 11
"2024-01-01" | 2024       | 1
"2024-12-25" | 2024       | 12

8. Aggregating Data

Purpose: Summarize data by grouping and applying aggregate functions like SUM, AVG, COUNT, etc.

Data Before:

product_id | quantity
---------- | --------
101        | 5
102        | 3
101        | 2
103        | 4

SQL Transformation:

SELECT product_id, 
       SUM(quantity) AS total_quantity
FROM orders
GROUP BY product_id;

Data After:

product_id | total_quantity
---------- | --------------
101        | 7
102        | 3
103        | 4

9. Pivoting Data

Purpose: Transform rows into columns to create a pivot table.

Data Before:

product_id | month | sales
---------- | ----- | -----
101        | Jan   | 100
101        | Feb   | 150
102        | Jan   | 200
102        | Feb   | 250

SQL Transformation:

SELECT product_id,
       SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) AS Jan_sales,
       SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) AS Feb_sales
FROM sales
GROUP BY product_id;

Data After:

product_id | Jan_sales | Feb_sales
---------- | --------- | ---------
101        | 100       | 150
102        | 200       | 250
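The conditional-aggregation pivot runs unchanged in SQLite; a minimal sketch with Python's sqlite3 module, using the sample rows above:

```python
import sqlite3

# In-memory sketch of the pivot above; rows mirror the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, month TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (101, "Jan", 100), (101, "Feb", 150),
    (102, "Jan", 200), (102, "Feb", 250),
])

# Each CASE picks out one month's sales; SUM collapses the group to a single row.
rows = conn.execute("""
    SELECT product_id,
           SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) AS Jan_sales,
           SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) AS Feb_sales
    FROM sales
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()
print(rows)  # [(101, 100, 150), (102, 200, 250)]
```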

10. Unpivoting Data

Purpose: Transform columns into rows to normalize a pivot table.

Data Before:

product_id | Jan_sales | Feb_sales
---------- | --------- | ---------
101        | 100       | 150
102        | 200       | 250

SQL Transformation:

SELECT product_id, 
       'Jan' AS month, Jan_sales AS sales
FROM sales
UNION ALL
SELECT product_id, 
       'Feb' AS month, Feb_sales AS sales
FROM sales;

Data After:

product_id | month | sales
---------- | ----- | -----
101        | Jan   | 100
101        | Feb   | 150
102        | Jan   | 200
102        | Feb   | 250

11. Joining Tables

Purpose: Combine data from multiple tables based on a related column.

Data Before:

Orders Table:

order_id | customer_id
-------- | -----------
1        | 101
2        | 102

Customers Table:

customer_id | customer_name
----------- | -------------
101         | John Doe
102         | Alice Brown

SQL Transformation:

SELECT orders.order_id, 
       customers.customer_name
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;

Data After:

order_id | customer_name
-------- | -------------
1        | John Doe
2        | Alice Brown

12. Filtering Data

Purpose: Extract specific rows based on a condition.

Data Before:

order_id | order_date | total_amount
-------- | ---------- | ------------
1        | 2024-01-15 | 100.00
2        | 2024-02-20 | 200.00
3        | 2024-03-10 | 150.00

SQL Transformation:

SELECT order_id, order_date, total_amount
FROM orders
WHERE total_amount > 150.00;

Data After:

order_id | order_date | total_amount
-------- | ---------- | ------------
2        | 2024-02-20 | 200.00

13. Calculating Running Totals

Purpose: Compute cumulative totals over a specified order.

Data Before:

order_id | order_date | total_amount
-------- | ---------- | ------------
1        | 2024-01-15 | 100.00
2        | 2024-02-20 | 200.00
3        | 2024-03-10 | 150.00

SQL Transformation:

SELECT order_id, order_date, total_amount,
       SUM(total_amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

Data After:

order_id | order_date | total_amount | running_total
-------- | ---------- | ------------ | -------------
1        | 2024-01-15 | 100.00       | 100.00
2        | 2024-02-20 | 200.00       | 300.00
3        | 2024-03-10 | 150.00       | 450.00

14. Calculating Moving Averages

Purpose: Compute a sliding-window average over the current row and a fixed number of preceding rows.

Data Before:

order_id | order_date | total_amount
-------- | ---------- | ------------
1        | 2024-01-15 | 100.00
2        | 2024-02-20 | 200.00
3        | 2024-03-10 | 150.00
4        | 2024-04-05 | 300.00

SQL Transformation:

SELECT order_id, order_date, total_amount,
       AVG(total_amount) OVER (ORDER BY order_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders;

Data After:

order_id | order_date | total_amount | moving_avg
-------- | ---------- | ------------ | ----------
1        | 2024-01-15 | 100.00       | 100.00
2        | 2024-02-20 | 200.00       | 150.00
3        | 2024-03-10 | 150.00       | 175.00
4        | 2024-04-05 | 300.00       | 225.00
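The two-row window can be checked with a quick sqlite3 sketch (window frames need SQLite 3.25+); table and rows mirror the example above.

```python
import sqlite3

# In-memory sketch of the two-row moving average above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, total_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2024-01-15", 100.0), (2, "2024-02-20", 200.0),
    (3, "2024-03-10", 150.0), (4, "2024-04-05", 300.0),
])

# ROWS BETWEEN 1 PRECEDING AND CURRENT ROW averages each row with the one before it.
rows = conn.execute("""
    SELECT order_id, total_amount,
           AVG(total_amount) OVER (ORDER BY order_date
                                   ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg
    FROM orders
    ORDER BY order_date
""").fetchall()
print(rows)  # [(1, 100.0, 100.0), (2, 200.0, 150.0), (3, 150.0, 175.0), (4, 300.0, 225.0)]
```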

15. Data Type Conversion

Purpose: Convert data from one type to another for consistency and compatibility.

Data Before:

order_id | order_date   | total_amount
-------- | ------------ | ------------
1        | "2024-01-15" | "100.00"
2        | "2024-02-20" | "200.00"

SQL Transformation:

SELECT order_id, 
       TO_DATE(order_date, 'YYYY-MM-DD') AS order_date, 
       CAST(total_amount AS DECIMAL(10, 2)) AS total_amount
FROM orders;

Data After:

order_id | order_date | total_amount
-------- | ---------- | ------------
1        | 2024-01-15 | 100.00
2        | 2024-02-20 | 200.00

16. Removing Duplicates

Purpose: Eliminate exact duplicate rows with DISTINCT to ensure data integrity (when duplicates differ in some column and you need to keep one version, use the window-function approach from Technique 3 instead).

Data Before:

customer_id | email
----------- | -------------
101         | john@abc.com
102         | alice@xyz.com
101         | john@abc.com

SQL Transformation:

SELECT DISTINCT customer_id, email
FROM customers;

Data After:

customer_id | email
----------- | -------------
101         | john@abc.com
102         | alice@xyz.com

17. Calculating Percentages

Purpose: Compute the percentage of a value relative to a total.

Data Before:

product_id | sales
---------- | -----
101        | 100
102        | 200
103        | 300

SQL Transformation:

SELECT product_id, 
       sales, 
       ROUND(sales * 100.0 / SUM(sales) OVER (), 2) AS sales_percentage
FROM products;

Data After:

product_id | sales | sales_percentage
---------- | ----- | ----------------
101        | 100   | 16.67
102        | 200   | 33.33
103        | 300   | 50.00
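One pitfall worth testing: on integer columns, `sales / SUM(sales)` truncates to 0 in dialects with integer division (e.g. PostgreSQL, SQL Server), so multiply by 100.0 first to force floating-point arithmetic. A runnable sqlite3 sketch with the sample rows above:

```python
import sqlite3

# In-memory sketch of the percentage calculation above; multiplying by 100.0 first
# keeps the arithmetic in floating point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(101, 100), (102, 200), (103, 300)])

rows = conn.execute("""
    SELECT product_id, sales,
           ROUND(sales * 100.0 / SUM(sales) OVER (), 2) AS sales_percentage
    FROM products
    ORDER BY product_id
""").fetchall()
print(rows)  # [(101, 100, 16.67), (102, 200, 33.33), (103, 300, 50.0)]
```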

18. String Concatenation

Purpose: Combine multiple string fields into a single field.

Data Before:

first_name | last_name
---------- | ---------
John       | Doe
Alice      | Brown

SQL Transformation:

SELECT first_name, 
       last_name, 
       CONCAT(first_name, ' ', last_name) AS full_name
FROM customers;

Data After:

first_name | last_name | full_name
---------- | --------- | -----------
John       | Doe       | John Doe
Alice      | Brown     | Alice Brown

19. Conditional Aggregation

Purpose: Aggregate data based on a condition.

Data Before:

product_id | sales | region
---------- | ----- | ------
101        | 100   | North
102        | 200   | South
101        | 150   | North
102        | 250   | South

SQL Transformation:

SELECT product_id, 
       SUM(CASE WHEN region = 'North' THEN sales ELSE 0 END) AS north_sales,
       SUM(CASE WHEN region = 'South' THEN sales ELSE 0 END) AS south_sales
FROM sales
GROUP BY product_id;

Data After:

product_id | north_sales | south_sales
---------- | ----------- | -----------
101        | 250         | 0
102        | 0           | 450

20. Data Masking

Purpose: Protect sensitive data by masking it.

Data Before:

customer_id | email
----------- | -------------
101         | john@abc.com
102         | alice@xyz.com

SQL Transformation:

SELECT customer_id, 
       CONCAT(SUBSTRING(email, 1, 2), '*****', SUBSTRING(email, CHARINDEX('@', email), LEN(email))) AS masked_email
FROM customers;

Data After:

customer_id | masked_email
----------- | ---------------
101         | jo*****@abc.com
102         | al*****@xyz.com
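Note that CHARINDEX and LEN in the query above are SQL Server functions; SQLite and MySQL spell them INSTR and LENGTH, and SQLite concatenates with ||. A runnable equivalent against Python's sqlite3 module, with the sample rows above:

```python
import sqlite3

# In-memory sketch of the masking above, using SQLite's INSTR/SUBSTR/|| dialect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(101, "john@abc.com"), (102, "alice@xyz.com")])

# Keep the first two characters and everything from the '@' on; mask the rest.
rows = conn.execute("""
    SELECT customer_id,
           SUBSTR(email, 1, 2) || '*****' || SUBSTR(email, INSTR(email, '@')) AS masked_email
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(101, 'jo*****@abc.com'), (102, 'al*****@xyz.com')]
```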

21. Calculating Differences

Purpose: Compute the difference between values in consecutive rows.

Data Before:

order_id | order_date | total_amount
-------- | ---------- | ------------
1        | 2024-01-15 | 100.00
2        | 2024-02-20 | 200.00
3        | 2024-03-10 | 150.00

SQL Transformation:

SELECT order_id, 
       order_date, 
       total_amount, 
       total_amount - LAG(total_amount) OVER (ORDER BY order_date) AS amount_difference
FROM orders;

Data After:

order_id | order_date | total_amount | amount_difference
-------- | ---------- | ------------ | -----------------
1        | 2024-01-15 | 100.00       | NULL
2        | 2024-02-20 | 200.00       | 100.00
3        | 2024-03-10 | 150.00       | -50.00

22. Data Binning

Purpose: Group continuous data into bins or intervals.

Data Before:

order_id | total_amount
-------- | ------------
1        | 100.00
2        | 200.00
3        | 150.00
4        | 300.00

SQL Transformation:

SELECT order_id, 
       total_amount, 
       CASE 
           WHEN total_amount < 150 THEN 'Low' 
           WHEN total_amount BETWEEN 150 AND 250 THEN 'Medium' 
           ELSE 'High' 
       END AS amount_bin
FROM orders;

Data After:

order_id | total_amount | amount_bin
-------- | ------------ | ----------
1        | 100.00       | Low
2        | 200.00       | Medium
3        | 150.00       | Medium
4        | 300.00       | High

23. Handling Outliers

Purpose: Identify and handle outliers in the data.

Data Before:

order_id | total_amount
-------- | ------------
1        | 100.00
2        | 200.00
3        | 150.00
4        | 1000.00

SQL Transformation:

SELECT order_id, 
       total_amount, 
       CASE 
           WHEN total_amount > 2 * (SELECT AVG(total_amount) FROM orders) THEN 'Outlier' 
           ELSE 'Normal' 
       END AS amount_status
FROM orders;

Data After:

order_id | total_amount | amount_status
-------- | ------------ | -------------
1        | 100.00       | Normal
2        | 200.00       | Normal
3        | 150.00       | Normal
4        | 1000.00      | Outlier
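A runnable check of a multiple-of-average threshold (here 2x the mean: the sample average is 362.50, so only the 1000.00 order is flagged). Real pipelines more often use z-scores or interquartile ranges, which are more robust.

```python
import sqlite3

# In-memory sketch of the outlier flagging above; rows mirror the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (2, 200.0), (3, 150.0), (4, 1000.0)])

rows = conn.execute("""
    SELECT order_id, total_amount,
           CASE WHEN total_amount > 2 * (SELECT AVG(total_amount) FROM orders)
                THEN 'Outlier' ELSE 'Normal' END AS amount_status
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 100.0, 'Normal'), (2, 200.0, 'Normal'), (3, 150.0, 'Normal'), (4, 1000.0, 'Outlier')]
```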

Conclusion

By applying these transformation techniques, ETL pipelines can ensure that raw data is clean, consistent, and ready for analytics. Whether dealing with text normalization, handling missing values, deduplication, or deriving computed fields, these methods help maintain high-quality datasets essential for business insights.