What is Hierarchical Indexing?

Hierarchical indexing is a method of creating structured group relationships in data. These hierarchical indexes, or MultiIndexes, are highly flexible and offer a range of options when performing complex data queries.

Hierarchical indexing is one of the functions in pandas, a software library for the Python programming languages. pandas derives its name from the term “panel data”, a statistical term for four-dimensional data models that show changes over time. The pandas library includes multiple tools that support detailed data queries.

How Does Hierarchical Indexing Work?

The pandas library works with what it refers to as DataFrames. These are similar to tables in a relational database, so you can easily import a relational table into a DataFrame. CSV, JSON and Excel data are also suitable sources for creating a DataFrame.

DataFrames are two-dimensional structures with rows and columns. Each row has an integer range index that is assigned by pandas. A typical DataFrame might look like this:

Range   EmployeeID       Name    RoleType             State 1              10001    A
Alvarez             Admin   TX 2              10002    B Benson             Admin   CA
3              10003    C Christie             Manager              TX 4         10004 D
Duane               Manager              CA 5              10005    E Edison              
Executive            TX 6              10006    F Franklin             Executive         CA

Within this data, we can see some logical groupings, by state and by role type. It’s possible to organize this data by running some kind of join, condition or grouping function. pandas includes a groupby() function for this very purpose.

However, we can use hierarchical indexing to create new indexes for this DataFrame, using data values. In pandas, that involves a line of code like this:

df.set_index(['RoleType', 'State'], inplace=True)

Which creates MultiIndexes for this DataFrame. If we sort the data, we see something like this:

State  RoleType      EmployeeID    Name TX     Admin         10001         A Alvarez       
Manager        10003         C Christie        Executive     10005         E Edison CA    
Admin         10002         B Benson        Manager        10004         D Duane       
Executive     10006         F Franklin

In this example, State and RoleType combine to create an index that not only organizes the data, but which also allows for faster sorting.

What’s more, these identifiers now show clear logical relationships between rows. In the example above, we can see the internal structure of each office, which means that we can run queries about each office without any kind of arguments to the query.

To realize the benefits of MultiIndexes, you will need to sort the data before running any queries. Otherwise, pandas will return a copy of the data, not a view. It will also serve up a performance warning.

What are the Advantages of Hierarchical Indexing?

In pandas, MultiIndexes can help to provide optimized queries and preserve relationships. MultiIndexes are themselves data values, and pandas handles them as such in queries, but the full contents of a hierarchical index are displayable in the results view.

One example is in pandas Series. A series is simply a column of data, the equivalent of selecting a single column in Excel or running an SQL query like SELECT column FROM table.

When you attempt to retrieve a series containing MultiIndexes, such as:

print(df['EmployeeID'].head(5))

You receive an output that looks like something like this:

State    RoleType            EmployeeID TX          Admin                 10001              
          Manager             10003                Executive              10005 
CA         Admin              10002                 Manager               10004

The reason for this is that the hierarchical indexing relationship is unbreakable, so all queries return the values used for the purpose of indexing. This can be extremely useful in cases where you want to perform complex queries on grouped data. Rather than attempting to maintain the relationships across your queries, you can treat them as atomic entities.

Using Hierarchical Indexing in Code

pandas is a software library intended to facilitate data analysis. It extends Python with a number of functions that assist developers in this purpose.

For example, the .loc function can help to identify rows that match certain index values. With hierarchical indexing, we can pass multiple arguments to .loc. So, in the example above, the code might look like this:

ds.loc[(‘TX’, ‘Manager’), :]

Which will return matching values. In this case, that will be:

EmployeeID Name

10003 C Christie

You can also look for values using a partial index. For example:

ds.loc[(‘TX’, slice(none), :]

will return a list of all employees based in Texas. This extra syntax allows for greater flexibility when creating queries that cover large, complex DataFrames.

Another useful function is the cross-section, which is .xs in pandas. The cross-slice allows us to retrieve matching values. For example, we can pull an employee from our sample database using this query:

ds.xs(‘A Alvarex’, level=’Name’)

These are some basic examples of how to code and utilize DataFrames within Panda. For more detail, refer to the pandas documentation.

Other Implementations of Hierarchical Indexing

pandas offers one of the few ready-to-use implementations of MultiIndexing. For other implementations, developers generally rely on hand-coded solutions.

For example, some solutions use arrays in R to provide a version of hierarchical indexing. Arrays only hold one mode, but this can include storing a list. Based on this, R programmers can code arrays that will hold values for range indexes, and these are organizable in a hierarchical manner.

However, one of the biggest advantages of the pandas approach is that the you can MultiIndexes prior to the execution of any queries. This means that there's no processing overhead when running a MultiIndexing query. From a resource point of view, it's not very different from any other type of query.

pandas is often used to analyze data cubes, which are multiple views of the same database over time. MultiIndexing is immensely useful in this context, as grouping helps to facilitate many different types of query. However, hierarchical indexing is also a solution for any other type of query that's supported by pandas.

What is Hierarchical Indexing?

How Does Hierarchical Indexing Work?

What are the Advantages of Hierarchical Indexing?

Using Hierarchical Indexing in Code

Other Implementations of Hierarchical Indexing

Glossary of Terms

A guide to the nomenclature of data integration technology.

ETL Developer vs Data Engineer: Key Differences

What is Customer Data Ingestion?

AWS Data Pipeline Best Practices

Solutions

Support

Company

Language

What is Hierarchical Indexing?

How Does Hierarchical Indexing Work?

What are the Advantages of Hierarchical Indexing?

Using Hierarchical Indexing in Code

Other Implementations of Hierarchical Indexing

Glossary of Terms

A guide to the nomenclature of data integration technology.

What is Data Integrity?

What is Data Conversion?

What is Data Access?

Related Readings

ETL Developer vs Data Engineer: Key Differences

What is Customer Data Ingestion?

AWS Data Pipeline Best Practices

Solutions

Support

Company

Language