Aggregating Data - SQL


Categories

Simple Aggregate Query vs Grouped Aggregate Query


In SQL, Simple Aggregate Queries operate on all of the rows in a table or result set, providing a single result based on the entire dataset. Grouped Aggregate Queries involves applying aggregate functions to subsets of data based on one or more grouping columns, dividing the data into groups and calculating aggregate values for each group separately using the GROUP BY statement.

- Simple Aggregate Query -


  • A simple aggregate function operates on all the rows in a table or the result set generated by a query.

  • It returns a single result based on all of the values in the specified column.

  • These functions provide a summary of the entire dataset without considering any specific grouping criteria.

Example:

SELECT COUNT(*)
FROM orders

Counts all of the rows in the orders table.

- Grouped Aggregate Query -


  • Grouped aggregate data uses functions on data subsets grouped by columns.

  • The data is divided into groups based on the values of one or more columns, and then aggregate functions are applied to each group separately.

  • It provides insights into subsets of the data rather than the entire dataset.

  • To do this, use the GROUP BY statement in SQL.

Example:

SELECT department, AVG(salary)
FROM employees
GROUP BY department

Calculates the average salary for each department separately.


Commonly Used Aggregate Functions


In SQL, aggregate functions are used to perform calculations on sets of values and return a single value as a result. Below are some of the most commonly used aggregate functions.

- COUNT Function -


This function is essential for returning the number of rows present in a given dataset. It is commonly used to provide a quick insight into the size of the data being analyzed.

Syntax:

SELECT COUNT(*) FROM table_name

- SUM Function -


This function is a powerful tool used to calculate the total sum of the values within a specific column. This function helps to easily add up values step by step, making data analysis faster and more precise.

Syntax:

SELECT SUM(column_name) FROM table_name

- AVG Function -


This function calculates the mean value of a specified column's data. Calculating the average helps to get a quick idea of the dataset's central tendency, which can guide decision-making.

Syntax:

SELECT AVG(column_name) FROM table_name

- MIN Function -


This function is used to retrieve the minimum value from a specific column in a database table. It helps in finding the smallest value present in the data set, which can be useful for various analytical purposes.

Syntax:

SELECT MIN(column_name) FROM table_name

- MAX Function -


This function is used to retrieve the maximum value from a specific column in a database table. It helps in finding the largest value present in the data set, which can be useful for various analytical purposes.

Syntax:

SELECT MAX(column_name) FROM table_name


Grouping Data


Grouping data in SQL involves categorizing rows based on the values in one or more columns and then applying aggregate functions to each group. This allows you to analyze and summarize the data within those groups. The GROUP BY Statement groups rows based on specified columns, often with aggregate functions for group calculations.

Syntax:

SELECT column1, column2, aggregate_function(column3)
FROM table_name
GROUP BY column1, column2

  • column1 and column2 are the columns used for grouping.

  • aggregate_function is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3 in this case).

  • The result set will contain one row for each unique combination of values in column1 and column2, with the aggregate function applied to the corresponding values in column3.

Example:

SELECT department_id, job_title, AVG(salary)
FROM company.employees
GROUP BY department_id, job_title

This SQL query selects the department_id, job_title, and average salary from the employees table in the company schema. It groups the data by department_id and job_title, calculating the average salary for each unique combination of department and job title.


Filtering Group Data


In SQL, filtering grouped data involves applying conditions to the groups formed by the GROUP BY Statement. You can use the HAVING Statement to filter groups based on aggregate conditions.

Syntax:

SELECT column1, column2, aggregate_function(column3)
FROM table_name
GROUP BY column1, column2
HAVING condition

  • column1 and column2 are the columns used for grouping.

  • aggregate_function is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3 in this case).

  • The HAVING Statement follows the GROUP BY Statement and is used to apply a condition to the groups formed by the GROUP BY Statement.

  • condition is the condition that each group must satisfy. It can involve aggregate functions.

Example:

SELECT department_id, job_title, AVG(salary) AS avg_salary
FROM company.employees
GROUP BY department_id, job_title
HAVING salary > 115000

This SQL query retrieves the department_id, job_title, and average salary from the employees table in the company schema. It groups the data by department_id and job_title, then filters the groups to include only those with an average salary greater than $115,000 using the HAVING Statement.


Partitions


In SQL, partitions refer to dividing a dataset into smaller, manageable parts for analysis or processing. They are commonly used in operations like ranking, aggregation, and analytics. With partitions, you can apply functions and filters within specific subsets of data, allowing for more granular analysis without affecting the overall dataset. Partitions are defined using the PARTITION BY statement in SQL queries, specifying the criteria by which to divide the dataset. This helps in performing operations separately within each partition, offering flexibility and efficiency in data manipulation and analysis.

Syntax:

SELECT column1, column2, ..., aggregate_function(column3) OVER (PARTITION BY partition_column)
FROM table_name

  • column1 and column2 are the columns used for grouping.

  • aggregate_function is any aggregate function like SUM(), AVG(), COUNT(), etc., applied to another column (column3 in this case).

  • The PARTITION BY statement is used within the OVER clause to define the partitioning criteria for window functions.

  • It divides the dataset into partitions based on the values of one or more columns (partition_column).

  • Rows with the same value(s) in the partition column(s) will be grouped into the same partition.

Example:

SELECT department_id, job_title, salary, AVG(salary) OVER (PARTITION BY department_id, job_title) AS avg_salary_per_job_title_department
FROM company.employees

  • department_id and job_title are the columns used for partitioning the dataset.

  • The AVG() function calculates the average salary (salary) within each partition defined by department_id and job_title.

  • The PARTITION BY clause divides the dataset into partitions based on the values of department_id and job_title.

  • Each partition represents a unique combination of department_id and job_title, and the AVG() function computes the average salary separately for each partition.

  • The result set includes the original columns (department_id, job_title, and salary) along with a new column avg_salary_per_job_title_department, showing the average salary for each job title within each department.


Pivot


In SQL, pivoting involves transforming data from rows into columns, basically rotating the data to display it in a different manner. Pivoting is useful when you want to summarize data or perform comparisons across different categories. You could pivot data in two different ways, the first being using the PIVOT Operator and the other using CASE Statements.

- PIVOT Operator -


In SQL, the PIVOT Operator allows you to rotate rows of data into columns, effectively transposing the data. Convert data from a tall format to a broad format for simpler analysis and comprehension.

Syntax:

SELECT <non-pivoted column>, [pivoted columns]
FROM (SELECT <columns> FROM <source_table>) AS <alias>
PIVOT
(<aggregate_function>(<value_column>) FOR <pivot_column> IN ([pivoted_column1], [pivoted_column2], ...)) AS <pivot_alias>

  • <non-pivoted column>: These are the columns that remain unchanged in the result set and are not involved in the pivot operation.

  • [pivoted columns]: These are the columns that will be created as a result of pivoting. They represent the unique values of the pivot column.

  • <columns>: Columns selected from the source table for the pivot operation.

  • <source_table>: The table containing the original data.

  • <alias>: An alias assigned to the subquery that retrieves data from the source table.

  • <aggregate_function>: An aggregate function (e.g., SUM, AVG, COUNT) to be applied to the values in the pivot column.

  • <value_column>: The column containing the values to be aggregated.

  • <pivot_column>: The column whose distinct values will become the new columns in the pivoted result set.

  • [pivoted_column1], [pivoted_column2], ...: The distinct values of the pivot column, which will become the new columns in the pivoted result set.

  • <pivot_alias>: An alias assigned to the result set obtained after pivoting.

Example:

SELECT department_id, [Engineer], [Manager]
FROM (SELECT department_id, job_title, salary FROM employees) AS src
PIVOT
(SUM(salary) FOR job_title IN ([Engineer], [Manager])) AS pivot_table

  • <non-unpivoted columns>: These are the columns from the source table that are not being unpivoted and will remain unchanged in the result set. In this query, department_id is a non-unpivoted column.

  • <unpivoted_column>: This is the new column that will contain the names of the original columns that are being unpivoted. In this query, [Engineer] and [Manager] are unpivoted columns.

  • <value_column>: This is the new column that will contain the values from the original columns that are being unpivoted. In this query, the aggregated SUM(salary) serves as the value column.

  • <columns>: These are the columns selected from the source table. In this query, department_id, job_title, and salary are selected from the employees table.

  • <source_table>: This is the table containing the original data. In this query, the employees table is the source table.

  • <column1>, <column2>, ...: These are the columns from the source table that will be transformed into rows. In this query, the job_title column values (Engineer and Manager) are transformed into rows.


Unpivot


In SQL, the UNPIVOT Operator allows you to transform columns into rows, effectively the reverse of the PIVOT Operator. This can be useful when normalizing data that has been spread across multiple columns.

Syntax:

SELECT <non-unpivoted columns>, <unpivoted_column> AS <new_column_name>, <value_column> AS <new_value_column>
FROM (SELECT <columns> FROM <source_table>) AS <alias>
UNPIVOT
(<value_column> FOR <unpivoted_column> IN (<column1>, <column2>, ...)) AS <unpivot_alias>

  • <non-unpivoted columns>: Columns that are not being unpivoted and will remain unchanged in the result set.

  • <unpivoted_column>: The new column that will contain the names of the original columns that are being unpivoted.

  • <value_column>: The new column that will contain the values from the original columns that are being unpivoted.

  • <columns>: Columns selected from the source table.

  • <source_table>: The table containing the original data.

  • <column1>, <column2>, ...: The columns that will be transformed into rows.

Example:

SELECT department_id, job_title, salary
FROM (SELECT department_id, [Engineer], [Manager] FROM employees) AS src
UNPIVOT
(salary FOR job_title IN ([Engineer], [Manager])) AS unpivoted_data;

  • <non-unpivoted columns>: In this query, department_id is a non-unpivoted column. It remains unchanged and is selected directly from the source table.

  • <unpivoted_column>: The new column created by the UNPIVOT operation is job_title. It contains the names of the original columns ([Engineer] and [Manager]) that are being unpivoted.

  • <value_column>: The new column created by the UNPIVOT operation is salary. It contains the values from the original columns ([Engineer] and [Manager]) that are being unpivoted.

  • <columns>: The columns selected from the source table are department_id, [Engineer], and [Manager].

  • <source_table>: The source table containing the original data is employees.

  • <column1>, <column2>, ...: The columns [Engineer] and [Manager] are transformed into rows as part of the UNPIVOT operation.