Hierarchical indexing is a method of creating structured group relationships in data. These hierarchical indexes, or MultiIndexes, are highly flexible and offer a range of options when performing complex data queries.
Hierarchical indexing is one of the functions in pandas, a software library for the Python programming languages. pandas derives its name from the term “panel data”, a statistical term for four-dimensional data models that show changes over time. The pandas library includes multiple tools that support detailed data queries.
How Does Hierarchical Indexing Work?
The pandas library works with what it refers to as DataFrames. These are similar to tables in a relational database, so you can easily import a relational table into a DataFrame. CSV, JSON and Excel data are also suitable sources for creating a DataFrame.
DataFrames are two-dimensional structures with rows and columns. Each row has an integer range index that is assigned by pandas. A typical DataFrame might look like this:
Range EmployeeID Name RoleType State 1 10001 A
Alvarez Admin TX 2 10002 B Benson Admin CA
3 10003 C Christie Manager TX 4 10004 D
Duane Manager CA 5 10005 E Edison
Executive TX 6 10006 F Franklin Executive CA
Within this data, we can see some logical groupings, by state and by role type. It’s possible to organize this data by running some kind of join, condition or grouping function. pandas includes a groupby() function for this very purpose.
However, we can use hierarchical indexing to create new indexes for this DataFrame, using data values. In pandas, that involves a line of code like this:
df.set_index(['RoleType', 'State'], inplace=True)
Which creates MultiIndexes for this DataFrame. If we sort the data, we see something like this:
State RoleType EmployeeID Name TX Admin 10001 A Alvarez
Manager 10003 C Christie Executive 10005 E Edison CA
Admin 10002 B Benson Manager 10004 D Duane
Executive 10006 F Franklin
In this example, State and RoleType combine to create an index that not only organizes the data, but which also allows for faster sorting.
What’s more, these identifiers now show clear logical relationships between rows. In the example above, we can see the internal structure of each office, which means that we can run queries about each office without any kind of arguments to the query.
To realize the benefits of MultiIndexes, you will need to sort the data before running any queries. Otherwise, pandas will return a copy of the data, not a view. It will also serve up a performance warning.
What are the Advantages of Hierarchical Indexing?
In pandas, MultiIndexes can help to provide optimized queries and preserve relationships. MultiIndexes are themselves data values, and pandas handles them as such in queries, but the full contents of a hierarchical index are displayable in the results view.
One example is in pandas Series. A series is simply a column of data, the equivalent of selecting a single column in Excel or running an SQL query like SELECT column FROM table.
When you attempt to retrieve a series containing MultiIndexes, such as:
print(df['EmployeeID'].head(5))
You receive an output that looks like something like this:
State RoleType EmployeeID TX Admin 10001
Manager 10003 Executive 10005
CA Admin 10002 Manager 10004
The reason for this is that the hierarchical indexing relationship is unbreakable, so all queries return the values used for the purpose of indexing. This can be extremely useful in cases where you want to perform complex queries on grouped data. Rather than attempting to maintain the relationships across your queries, you can treat them as atomic entities.
Using Hierarchical Indexing in Code
pandas is a software library intended to facilitate data analysis. It extends Python with a number of functions that assist developers in this purpose.
For example, the .loc function can help to identify rows that match certain index values. With hierarchical indexing, we can pass multiple arguments to .loc. So, in the example above, the code might look like this:
ds.loc[(‘TX’, ‘Manager’), :]
Which will return matching values. In this case, that will be:
EmployeeID Name
10003 C Christie
You can also look for values using a partial index. For example:
ds.loc[(‘TX’, slice(none), :]
will return a list of all employees based in Texas. This extra syntax allows for greater flexibility when creating queries that cover large, complex DataFrames.
Another useful function is the cross-section, which is .xs in pandas. The cross-slice allows us to retrieve matching values. For example, we can pull an employee from our sample database using this query:
ds.xs(‘A Alvarex’, level=’Name’)
These are some basic examples of how to code and utilize DataFrames within Panda. For more detail, refer to the pandas documentation.
Other Implementations of Hierarchical Indexing
pandas offers one of the few ready-to-use implementations of MultiIndexing. For other implementations, developers generally rely on hand-coded solutions.
For example, some solutions use arrays in R to provide a version of hierarchical indexing. Arrays only hold one mode, but this can include storing a list. Based on this, R programmers can code arrays that will hold values for range indexes, and these are organizable in a hierarchical manner.
However, one of the biggest advantages of the pandas approach is that the you can MultiIndexes prior to the execution of any queries. This means that there's no processing overhead when running a MultiIndexing query. From a resource point of view, it's not very different from any other type of query.
pandas is often used to analyze data cubes, which are multiple views of the same database over time. MultiIndexing is immensely useful in this context, as grouping helps to facilitate many different types of query. However, hierarchical indexing is also a solution for any other type of query that's supported by pandas.