Aggregating Data - SQL
Categories
Simple Aggregate Query vs Grouped Aggregate Query
In SQL, Simple Aggregate Queries operate on all of the rows in a table or result set, providing a single result based on the entire dataset. Grouped Aggregate Queries involves applying aggregate functions to subsets of data based on one or more grouping columns, dividing the data into groups and calculating aggregate values for each group separately using the GROUP BY statement.
- Simple Aggregate Query -
A simple aggregate function operates on all the rows in a table or the result set generated by a query.
It returns a single result based on all of the values in the specified column.
These functions provide a summary of the entire dataset without considering any specific grouping criteria.
Example:
SELECT
COUNT(*)
FROM
orders
Counts all of the rows in the orders table.
- Grouped Aggregate Query -
Grouped aggregate data uses functions on data subsets grouped by columns.
The data is divided into groups based on the values of one or more columns, and then aggregate functions are applied to each group separately.
It provides insights into subsets of the data rather than the entire dataset.
To do this, use the GROUP BY statement in SQL.
Example:
SELECT
department, AVG(salary)
FROM
employees
GROUP BY
department
Calculates the average salary for each department separately.
Commonly Used Aggregate Functions
In SQL, aggregate functions are used to perform calculations on sets of values and return a single value as a result. Below are some of the most commonly used aggregate functions.
- COUNT Function -
This function is essential for returning the number of rows present in a given dataset. It is commonly used to provide a quick insight into the size of the data being analyzed.
Syntax:
SELECT
COUNT(*)
FROM
table_name
- SUM Function -
This function is a powerful tool used to calculate the total sum of the values within a specific column. This function helps to easily add up values step by step, making data analysis faster and more precise.
Syntax:
SELECT
SUM(column_name)
FROM
table_name
- AVG Function -
This function calculates the mean value of a specified column's data. Calculating the average helps to get a quick idea of the dataset's central tendency, which can guide decision-making.
Syntax:
SELECT
AVG(column_name)
FROM
table_name
- MIN Function -
This function is used to retrieve the minimum value from a specific column in a database table. It helps in finding the smallest value present in the data set, which can be useful for various analytical purposes.
Syntax:
SELECT
MIN(column_name)
FROM
table_name
- MAX Function -
This function is used to retrieve the maximum value from a specific column in a database table. It helps in finding the largest value present in the data set, which can be useful for various analytical purposes.
Syntax:
SELECT
MAX(column_name)
FROM
table_name
Grouping Data
Grouping data in SQL involves categorizing rows based on the values in one or more columns and then applying aggregate functions to each group. This allows you to analyze and summarize the data within those groups. The GROUP BY Statement groups rows based on specified columns, often with aggregate functions for group calculations.
Syntax:
SELECT
column1, column2,
aggregate_function(column3)
FROM
table_name
GROUP BY
column1, column2
column1
andcolumn2
are the columns used for grouping.aggregate_function
is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3
in this case).The result set will contain one row for each unique combination of values in
column1
andcolumn2
, with the aggregate function applied to the corresponding values incolumn3
.
Example:
SELECT
department_id, job_title,
AVG(salary)
FROM
company.employees
GROUP BY
department_id, job_title
This SQL query selects the department_id
, job_title
, and average salary
from the employees
table in the company
schema. It groups the data by department_id
and job_title
, calculating the average salary for each unique combination of department and job title.
Filtering Group Data
In SQL, filtering grouped data involves applying conditions to the groups formed by the GROUP BY Statement. You can use the HAVING Statement to filter groups based on aggregate conditions.
Syntax:
SELECT
column1, column2,
aggregate_function(column3)
FROM
table_name
GROUP BY
column1, column2
HAVING
condition
column1
andcolumn2
are the columns used for grouping.aggregate_function
is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3
in this case).The HAVING Statement follows the GROUP BY Statement and is used to apply a condition to the groups formed by the GROUP BY Statement.
condition
is the condition that each group must satisfy. It can involve aggregate functions.
Example:
SELECT
department_id, job_title,
AVG(salary) AS avg_salary
FROM
company.employees
GROUP BY
department_id, job_title
HAVING
salary > 115000
This SQL query retrieves the department_id
, job_title
, and average salary
from the employees
table in the company
schema. It groups the data by department_id
and job_title
, then filters the groups to include only those with an average salary greater than $115,000 using the HAVING Statement.
Partitions
In SQL, partitions refer to dividing a dataset into smaller, manageable parts for analysis or processing. They are commonly used in operations like ranking, aggregation, and analytics. With partitions, you can apply functions and filters within specific subsets of data, allowing for more granular analysis without affecting the overall dataset. Partitions are defined using the PARTITION BY statement in SQL queries, specifying the criteria by which to divide the dataset. This helps in performing operations separately within each partition, offering flexibility and efficiency in data manipulation and analysis.
Syntax:
SELECT
column1, column2, ...,
aggregate_function(column3) OVER (PARTITION BY partition_column)
FROM
table_name
column1
andcolumn2
are the columns used for grouping.aggregate_function
is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3
in this case).The PARTITION BY statement is used within the OVER clause to define the partitioning criteria for window functions.
It divides the dataset into partitions based on the values of one or more columns (
partition_column
).Rows with the same value(s) in the partition column(s) will be grouped into the same partition.
Example:
SELECT
department_id, job_title, salary,
AVG(salary) OVER (PARTITION BY department_id, job_title) AS avg_salary_per_job_title_department
FROM
company.employees
department_id
andjob_title
are the columns used for partitioning the dataset.The
AVG()
function calculates the average salary (salary
) within each partition defined bydepartment_id
andjob_title
.The PARTITION BY clause divides the dataset into partitions based on the values of
department_id
andjob_title
.Each partition represents a unique combination of
department_id
andjob_title
, and theAVG()
function computes the average salary separately for each partition.The result set includes the original columns (
department_id
,job_title
, andsalary
) along with a new columnavg_salary_per_job_title_department
, showing the average salary for each job title within each department.
Pivot
In SQL, pivoting involves transforming data from rows into columns, basically rotating the data to display it in a different manner. Pivoting is useful when you want to summarize data or perform comparisons across different categories. You could pivot data in two different ways, the first being using the PIVOT Operator and the other using CASE Statements.
- PIVOT Operator -
In SQL, the PIVOT Operator allows you to rotate rows of data into columns, effectively transposing the data. Convert data from a tall format to a broad format for simpler analysis and comprehension.
Syntax:
SELECT <non-pivoted column>, [pivoted columns]
FROM
(SELECT <columns> FROM <source_table>)
AS <alias>
PIVOT
(<aggregate_function>(<value_column>) FOR <pivot_column> IN ([pivoted_column1], [pivoted_column2], ...)) AS <pivot_alias>
<non-pivoted column>
: These are the columns that remain unchanged in the result set and are not involved in the pivot operation.[pivoted columns]
: These are the columns that will be created as a result of pivoting. They represent the unique values of the pivot column.<columns>
: Columns selected from the source table for the pivot operation.<source_table>
: The table containing the original data.<alias>
: An alias assigned to the subquery that retrieves data from the source table.<aggregate_function>
: An aggregate function (e.g., SUM, AVG, COUNT) to be applied to the values in the pivot column.<value_column>
: The column containing the values to be aggregated.<pivot_column>
: The column whose distinct values will become the new columns in the pivoted result set.[pivoted_column1], [pivoted_column2], ...
: The distinct values of the pivot column, which will become the new columns in the pivoted result set.<pivot_alias>
: An alias assigned to the result set obtained after pivoting.
Example:
SELECT department_id, [Engineer], [Manager]
FROM (SELECT department_id, job_title, salary FROM employees) AS src
PIVOT
(SUM(salary) FOR job_title IN ([Engineer], [Manager])) AS pivot_table
<non-unpivoted columns>
: These are the columns from the source table that are not being unpivoted and will remain unchanged in the result set. In this query,department_id
is a non-unpivoted column.<unpivoted_column>
: This is the new column that will contain the names of the original columns that are being unpivoted. In this query,[Engineer]
and[Manager]
are unpivoted columns.<value_column>
: This is the new column that will contain the values from the original columns that are being unpivoted. In this query, the aggregatedSUM(salary)
serves as the value column.<columns>
: These are the columns selected from the source table. In this query,department_id
,job_title
, andsalary
are selected from theemployees
table.<source_table>
: This is the table containing the original data. In this query, theemployees
table is the source table.<column1>, <column2>, ...
: These are the columns from the source table that will be transformed into rows. In this query, thejob_title
column values (Engineer
andManager
) are transformed into rows.
Unpivot
In SQL, the UNPIVOT Operator allows you to transform columns into rows, effectively the reverse of the PIVOT Operator. This can be useful when normalizing data that has been spread across multiple columns.
Syntax:
SELECT <non-unpivoted columns>,
<unpivoted_column> AS <new_column_name>,
<value_column> AS <new_value_column>
FROM
(SELECT <columns> FROM <source_table>)
AS <alias>
UNPIVOT
(<value_column> FOR <unpivoted_column>
IN (<column1>, <column2>, ...)) AS <unpivot_alias>
<non-unpivoted columns>
: Columns that are not being unpivoted and will remain unchanged in the result set.<unpivoted_column>
: The new column that will contain the names of the original columns that are being unpivoted.<value_column>
: The new column that will contain the values from the original columns that are being unpivoted.<columns>
: Columns selected from the source table.<source_table>
: The table containing the original data.<column1>, <column2>, ...
: The columns that will be transformed into rows.
Example:
SELECT department_id, job_title, salary
FROM
(SELECT department_id, [Engineer], [Manager] FROM employees) AS src
UNPIVOT
(salary FOR job_title IN ([Engineer], [Manager])) AS unpivoted_data;
<non-unpivoted columns>
: In this query,department_id
is a non-unpivoted column. It remains unchanged and is selected directly from the source table.<unpivoted_column>
: The new column created by the UNPIVOT operation isjob_title
. It contains the names of the original columns ([Engineer]
and[Manager]
) that are being unpivoted.<value_column>
: The new column created by the UNPIVOT operation issalary
. It contains the values from the original columns ([Engineer]
and[Manager]
) that are being unpivoted.<columns>
: The columns selected from the source table aredepartment_id
,[Engineer]
, and[Manager]
.<source_table>
: The source table containing the original data isemployees
.<column1>, <column2>, ...
: The columns[Engineer]
and[Manager]
are transformed into rows as part of the UNPIVOT operation.