Three Routes: List aggregation

Relevant Scenario(s)

There have been several instances when columns values have to be aggregated in a list of comma separated values. Contrast this to the requirement in where we needed to split up a comma separated list in a column into multiple rows.

This kind of aggregation requirement arises when columns are grouped by and the aggregation function, is applied on a column that is text. You can count the number of values in this column being aggregated or you can collect its values in a list. Unlike numeric data, this column cannot be acted upon by an aggregation function that sums. averages or finds a min or max value.

We have with US cities, states and counties. We want to aggregate the data so that we have only one row for each state and in the column next to the state all the cities within it are in a comma separated list.

Data table

The data we will consider here is an excel sheet with five columns as seen in Fig 1. The columns are :
a) City
b) State_short
c) State_full
d) County
e) City Alias,
In this article we are not concerned with the county or the city alias and will be deleting those or ignoring those.

Fig 1:

Technique in PostgreSQL

Data is loaded using pgAdmins’ Import feature after creating a table. You will find more details on importing data in pgAdmin in the Setup Data section below.

The table state_city_county contains duplicated state_short, state_full and city combinations for every city_alias. We will use a CTE — Common table expression that collects the distinct set of states and cities to remove this duplication. Note the order by clause in the CTE. It includes the City so that the list has the city names in alphabetic order
A function that enables us to create a list of cities when using the “group by” clause is STRING_AGG (the equivalent of Oracle LISTAGG that is an analytic function).

The sweet and short query results in a row for each of the states in the data set with the cities concatenated in a list separated by the concatenation operator specified in STRING_AGG.

with u as (select distinct state_short,
state_full,
city
from state_city_county
order by state_short, city)
select distinct state_short,
state_full,
string_agg(city, ",") as city_list
from u
group by state_short, state_full;

Technique in Power BI

Setting up the data in Power BI is accomplished rather easily by loading the CSV using Get Data->Text/CSV.

The steps you will see in the clip below are:
1) Promote the first row as headers
2) Delete the county and city alias columns
3) Remove the duplicates in the City, State short and State full column combination, since the duplication arose from having multiple City Aliases for this key set.
4) Use the Group By in the menu to group by State short and State full in the advanced group by selection. Select All Rows for the aggregation operation. We name this aggregated column as Combination column
5) This will result in a table being appended to each row of the states in the file under the column name Combination column. Each of these tables has a City column.
6) To extract the cities from the above table out into a list, Add a custom column called City list. The custom formula is

= Table.column([Combination column], “City”)

The entire operation is shown in the gif below to help you follow along:

Technique in Pandas

Data in the CSV file is imported into Jupyter Notebook using Pandas read_csv . Setup data also shows you a new option to copy and paste data (using Pandas read_clipboard) to setup data in a jiffy.

Let’s limit the number of states we are working with in this demo to Connecticut and Washington

In the above code the grouping and aggregation is accomplished using Pandas group by of the columns State short and State full. the aggregation function is a lambda function that grabs only the unique City values and converts into a list. The result is that we end up with two rows and the City column now has all the distinct cities in the State.

Setup Data

CSV files can be imported into PostgresSQL (pgAdmin) , Pandas and Power BI.

CREATE TABLE public.state_city_county
(
city text COLLATE pg_catalog."default" NOT NULL,
state_short text COLLATE pg_catalog."default" NOT NULL,
state_full text COLLATE pg_catalog."default" NOT NULL,
county text COLLATE pg_catalog."default",
city_alias text COLLATE pg_catalog."default"
)
TABLESPACE pg_default;

To capture the data into a Jupyter notebook:

You can use Pandas read_csv (filename) to read the file into Jupyter notebook directly into a dataframe, or copy the values from the columns in the excel file and place it in a dataframe using pandas read_clipboard

To capture CSV data into a Power BI Query
Select Text/CSV under Get Data to import the file.

Do you encounter the need to aggregate a non-numeric column? If so let me know your scenario in the comments.

Certified DS Associate (DP-100, DA-100), pursuing Masters in University Of Michigan’s Applied Data Science Program