This is a guest post with exclusive content by Bill Inmon. Bill Inmon “is an American computer scientist, recognized by many as the father of the data warehouse. Inmon wrote the first book, first magazine column, held the first conference, and was the first to offer classes in data warehousing” (Wikipedia).

Our key points:

  1. Essentially, the lakehouse consists of three types of data — structured, transaction-based, and textual data.
  2. Analytical processing using blended data requires addressing the issue of analyzing data from several environments.
  3. Raw text may come from emails, the internet, surveys, conversations, printed reports, and more.
  4. Fundamental differences between the two types of data make finding an intersection of the data difficult.
  5. Combining data from the textual environment with data from the structured environment allows for a much more inclusive analysis.

One of the characteristics of most computing and analytical environments is that the environment consists of only one type of data. The OLTP environment consists primarily of transaction-based data. The data warehouse environment consists of integrated, historical data. The textual environment consists of text, and so forth.

The Lakehouse

But there is an exception to the singularity of data types when it comes to the lakehouse. The lakehouse consists mainly of three types of data — structured, transaction-based, and textual — as well as other unstructured data types such as IoT and analog-based data. Because of this mixture of data types, the data lakehouse introduces a new problem: blending the different types of data together cohesively.

To perform analytical processing using blended data, the analyst must address the problem of how to analyze data from different environments.

The Origins of Data

Data in the data lakehouse is typically created by two different processing technologies. Structured data and other unstructured data arrive in the lakehouse as the direct output of their source systems, while textual data is produced by passing raw text through textual ETL.

Types of Data Analysis

In the data lakehouse architecture, there are essentially three types of analysis that can be done:

  • Structured data analysis
  • Textual data only analysis
  • Blended text and structured data analysis


Data in the structured environment is highly organized. Typical elements found there are keys, attributes, indexes, and records.

Each record in the structured environment has all these characteristics. A simple example of a key might be a social security number. Simple examples of attributes might be a person's name, address, or telephone number. A simple example of an index might be the town a person lives in. All this information is contained in a single record.

Repetitive Data

Because each record in the structured environment has the same type of information, the records are said to be repetitive. Having the same type of information in each record does not mean the same information is in each record. In one record, the name might be Mary Jones. In the next record, the name is Sam Smith. But in each record, there will be a name. There is a difference between having the same type of data and having the same data.
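To make the distinction concrete, here is a minimal Python sketch. The record layout and values are illustrative, not taken from any particular system; both records carry the same fields, but the data in those fields differs.

```python
from dataclasses import dataclass

# A hypothetical record layout: one key, several attributes,
# and one attribute (town) that might serve as an index.
@dataclass
class CustomerRecord:
    ssn: str      # key
    name: str     # attribute
    address: str  # attribute
    phone: str    # attribute
    town: str     # attribute that could serve as an index

# Same *type* of information in each record, different *data*
records = [
    CustomerRecord("123-45-6789", "Mary Jones", "12 Oak St", "555-0101", "Littleton"),
    CustomerRecord("987-65-4321", "Sam Smith", "9 Elm Ave", "555-0102", "Castle Rock"),
]

for r in records:
    print(r.name, r.town)
```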

As a simple example of the analysis that can only be created from structured data, the analyst may issue a monthly report about cash flow. Cash comes from different sources and varies from month to month. The analysis of the cash flow is one of the corporation's KPIs.

From the Textual Environment

Data from the textual environment starts as raw text. The raw text can come from almost anywhere, such as emails, the internet, surveys, conversations, printed reports, and so forth. Once the raw text is captured and placed into a format where it can be read and managed, the raw text is then transformed into a database. It's necessary to transform the data into a database because if analytical processing is to be done on blended data, the data must be structured.

Stated differently, if text is left in a raw form, it doesn't fit in a database in a useful manner.

The resulting database has several elements after transformation, such as:

  • Identification of the originating document
  • Location of the word of interest in the document being analyzed
  • The word of interest
  • The context of the word of interest


As an example of the types of data found in the textual environment, the document ID might be “YELP Comment 506 on Jan 27, 2020.” The byte address might be “byte 208.” The word might be “liked.” And the context might be “positive sentiment.”
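As a rough sketch, one row of the textual ETL output could be represented as follows. The field names are hypothetical, chosen only to mirror the four elements listed above.

```python
# One row of a hypothetical textual-ETL output table
textual_record = {
    "document_id": "YELP Comment 506 on Jan 27, 2020",  # originating document
    "byte_address": 208,                                # location of the word of interest
    "word": "liked",                                    # the word of interest
    "context": "positive sentiment",                    # the context of the word
}
```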

If you're doing analytical processing exclusively using raw text as a basis for analysis, you can perform either sentiment analysis or correlative analysis.

Combining Text and Structured Data

The most powerful analytical processing you can do combines the data from both the structured environment and the textual environment. To accomplish this type of analysis, it's necessary to join the data from the two environments.

As an example of a join of data from the textual environment and the structured environment, consider the comment made by a customer about an automobile. One day the customer writes, “I am really disappointed in my recent purchase of a Chevrolet.” When the customer comment is matched with the purchase records, it is seen that the customer bought a Chevrolet Camaro brand new in 2007. Now the comment can be attached to a specific automobile.
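A minimal sketch of such a join, written with pandas and invented table and column names:

```python
import pandas as pd

# Structured purchase records (illustrative)
purchases = pd.DataFrame({
    "customer_id": [101, 102],
    "model": ["Chevrolet Camaro", "Ford F-150"],
    "purchase_year": [2007, 2015],
})

# Output of textual ETL: one row per customer comment (illustrative)
comments = pd.DataFrame({
    "customer_id": [101],
    "comment": ["I am really disappointed in my recent purchase of a Chevrolet."],
    "sentiment": ["negative"],
})

# Join the two environments on a shared key so the comment
# can be attached to a specific automobile
blended = comments.merge(purchases, on="customer_id")
print(blended[["model", "purchase_year", "sentiment"]])
```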

The most interesting data (and the most useful for analytics) is the data in which there's an intersection of the different kinds of data.


Because of the fundamental differences between the two types of data, finding an intersection of data is often difficult.

A Common Identifier

The easiest way to blend data from the two environments is when the textual data has an identifier associated with it. In many forms of textual data, a specific identifier can be found. A typical identifier might be a social security number, a passport number in a document, or an employee number. On occasion, the document itself carries some form of identification.

If, in fact, there is an identifier in the textual document, then matching the textual document with the structured document becomes fairly straightforward.

The social security number in the structured environment is matched to the same social security number in the textual document.
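As a rough sketch of how that match might be implemented (the regular expression, names, and data are invented for illustration), the identifier can be pulled out of the raw text and used as the join key:

```python
import re

raw_text = "Claim filed by applicant, SSN 123-45-6789, regarding policy renewal."

# Illustrative SSN pattern: three digits, two digits, four digits
match = re.search(r"\b(\d{3}-\d{2}-\d{4})\b", raw_text)
ssn_from_text = match.group(1) if match else None

# Structured records keyed by social security number (illustrative)
structured = {"123-45-6789": {"name": "Mary Jones", "town": "Littleton"}}

if ssn_from_text in structured:
    print("Matched:", structured[ssn_from_text]["name"])
```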


Note that some raw text documents have a component that's structured. If that's the case, then matching structured data and unstructured data becomes easy.

But many textual documents have neither a structured component nor an identifier. In this case, it's possible to blend the data types on elements other than an identifier.

Date Identifiers

Another simple way to blend documents is to find dates in both the structured data and the unstructured data. It's common for both types of documents to have some form of date.

Note that there are many different kinds of dates. There's a purchase date, manufacture date, sales date, date of data capture, and so forth. The most meaningful type of date is typically one that reflects the date of the transaction, if there happens to be a transaction involved.

A mechanical consideration in matching dates is reconciling the different forms of representation a date might take. In one case, the date might be March 13, 2021; in another, the date may be 3/13/2021. Logically, these are the same date. But physically, they are very different.
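One way to handle this, sketched below with Python's standard library, is to parse every date into a single canonical form before matching. The list of accepted formats is an assumption; a real pipeline would need whatever formats actually occur in the data.

```python
from datetime import datetime

# "March 13, 2021" and "3/13/2021", respectively
FORMATS = ["%B %d, %Y", "%m/%d/%Y"]

def normalize_date(text: str) -> str:
    """Parse a date in any known format and return it as ISO 8601."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

# Physically different, logically the same date
assert normalize_date("March 13, 2021") == normalize_date("3/13/2021") == "2021-03-13"
```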

Location Identifiers

Different types of data can also be matched by location. As a simple example, the state of Texas may be found in both types of documents — the structured document and the textual document. A match can be made on the name of the state.

A mechanical consideration is that a state name can take more than one form. In one case, Texas may be spelled out as Texas. In another document, the state may be abbreviated TX. In yet another form, Texas may appear as Tex.
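A simple normalization table can fold such variants into one canonical form before matching. The variants listed here are just examples:

```python
# Map known variants to one canonical form before matching
STATE_VARIANTS = {
    "texas": "TX",
    "tex": "TX",
    "tx": "TX",
}

def normalize_state(value: str) -> str:
    key = value.strip().rstrip(".").lower()
    return STATE_VARIANTS.get(key, value)

print(normalize_state("Texas"), normalize_state("Tex."), normalize_state("TX"))
# TX TX TX
```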


Name Identifiers

Data can also be blended on a name itself. For example, the name “Jena Smith” might be found both in documents in the structured environment and in documents in the textual environment.

In general, matching on names is the weakest form of matching. There are several reasons why a name identifier is the weakest of matches:

  • Name can be spelled many ways: J Smith, Jena Smith, J H Smith, etc.
  • More than one person may have the same name
  • A person may or may not use a title: Mr, Mrs, Ms, Dr, etc.

Generally, matches on names should not be considered concrete matches. Stated differently, when a match is made by name, the results are always questionable.
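Because exact comparison fails on spelling variants, name matching is usually done with a similarity score and a threshold, and even then the results should be treated as tentative. A standard-library sketch follows; the cutoff of 0.7 is arbitrary.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity between two names, 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = ["Jena Smith", "J Smith", "J H Smith", "Gina Smythe"]
for c in candidates:
    score = name_similarity("Jena Smith", c)
    # Flag, rather than trust, anything above the arbitrary cutoff
    verdict = "possible match" if score > 0.7 else "no match"
    print(f"{c!r}: {score:.2f} ({verdict})")
```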


Product Name Identifiers

Data can also be blended on a product name. For example, a product known as the “4x2 television set” can be matched across the different environments.

The problem with this matching approach is that the same product can be named slightly differently in multiple places.
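As with state names, a light normalization pass (lowercasing, collapsing punctuation and spacing) can fold product-name variants together. This sketch assumes the variants differ only in case and separators:

```python
import re

def normalize_product(name: str) -> str:
    # Lowercase, and collapse runs of whitespace, hyphens,
    # and underscores so that variants compare equal
    return re.sub(r"[\s\-_]+", " ", name.lower()).strip()

variants = ["4x2 Television Set", "4X2  television-set", "4x2 television_set"]
print({normalize_product(v) for v in variants})  # a single canonical form
```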


The Importance of Data Matching

It may not be obvious, but there is great value in matching structured data and text. The value lies in the fact that the matching criteria determine how the analysis is conducted.

For example, you cannot do a product analysis on data matched by location. Because the criteria for matching determine how your analysis can be done, the way your data is matched is important.

Imperfect Matching

Regardless of how you match data, there is always the chance that some documents will not have a corresponding match. In other words, there will be documents in the structured world that match with nothing in the text world. And vice versa. There will be documents in the textual world that have no match in the world of structured data.

As an example, a person buys a car but never offers an opinion of the car. In this case, there will be a structured record of the purchase of the car but no textual opinion of the car. Such mismatches are common and are no cause for concern. The data sources were never designed to be tightly coordinated, so it's no surprise that not all data is matched.


An Example from a Restaurant…

One day, a restaurant chain decides to find out which stores are performing well. The data analyst comes up with an idea and decides to look at the reviews received from customers about their dining experience.

The analyst gathers the text and organizes it in a simple manner. Some comments are positive, some comments are neutral, and some comments are negative.

The analyst takes the negative comments and organizes them according to which store has received the most negative comments.

The results appear as follows.

Store 12 has the most negative comments. Store 2 is second, Store 7 is third, and so forth.

When management looks at the results, the first impulse is to fire the management at stores 12, 2, and 7.

The logic is that the stores with the most negative comments must be doing something wrong.

But the analyst looks at the results and decides that an analysis based only on text may be misleading. The analyst suggests that structured data should be factored into the analysis.

The analyst creates a new analysis that factors in how large each store is and how many customers it serves. When this analysis is done, a ratio is created. The ratio measures the number of negative complaints against the total monthly dollar volume of the store.

This analysis yields an entirely different result. For example, the number one store for complaints is the one in Manhattan, N.Y. This would lead you to believe that Manhattan is a poorly run store. But when you compare the number of customers the Manhattan store has with the number of customers the Gallup, New Mexico store has, there's no contest. Manhattan has 5,000 customers each day, and Gallup has 35 customers per day. Manhattan has many more complaints simply because it has far more customers than Gallup.
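A minimal sketch of the corrected calculation, using the two customer counts given above (the complaint counts are invented):

```python
stores = [
    # (store, negative_comments, customers_per_day); counts are illustrative
    ("Manhattan, NY", 120, 5000),
    ("Gallup, NM", 20, 35),
]

for store, complaints, customers in stores:
    rate = complaints / customers  # complaints per customer per day
    print(f"{store}: {rate:.3f} complaints per customer")

# Gallup's rate dwarfs Manhattan's despite far fewer raw complaints
```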

Bringing in structured data and combining the structured data with textual data gives a much more realistic viewpoint on which stores are the most poorly managed.

When the volume of customers is factored into the number of complaints, it emerges that the most poorly run stores are Gallup, New Mexico; Jackson, Mississippi; and Van Horn, Texas.

This measurement is a far more accurate and realistic picture of stores with poor management.

Note that there could have been many ways to factor in structured data. The total dollar volume each day could have been used. The total number of meals served each day could have been used. The total number of customers each day could have been used, and so forth.

This is an excellent example of using structured data in conjunction with textual data to achieve an accurate picture.

The Impact of COVID

As another example of the need to blend textual data and structured data, consider the number of reported COVID deaths. The analyst looks at hospital reports and finds that hospitals have reported a certain number of COVID-related deaths. The reports are amalgamated, and an analysis of the number of COVID deaths in the U.S. is made.

These numbers are gathered from a variety of reports and sources. When management examines the reports, they find that the total number of COVID deaths nationwide is an interesting number, but it does not give enough information.

Management decides to drill down further. But the reports that provided the nationwide COVID death count do not contain detailed state-by-state information.

The analyst then decides to go to the individual state reports, which are stored in a structured system.
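A hedged sketch of the drill-down (state names and numbers invented): the nationwide total comes from the amalgamated textual reports, while the state-level detail comes from the structured state tables.

```python
import pandas as pd

# State-level structured reports (illustrative numbers)
state_reports = pd.DataFrame({
    "state": ["TX", "NM", "MS"],
    "covid_deaths": [400, 120, 95],
})

# Nationwide total amalgamated from the textual hospital reports
nationwide_total_from_text = 615

by_state = state_reports.groupby("state")["covid_deaths"].sum()
print(by_state)

# Sanity check: the structured detail should reconcile with the textual total
print("Reconciles:", by_state.sum() == nationwide_total_from_text)
```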


By combining data from the textual environment with data from the structured environment, a much more inclusive analysis can be made.

Partner with Integrate.io

With Integrate.io, you get the best of all worlds — a new ETL platform with blazing-fast CDC, reverse ETL, and deep Ecommerce capabilities. Schedule an intro call to learn more.

Bill Inmon, the father of the data warehouse, has authored 65 books and was named by Computerworld as one of the ten most influential people in the history of computing. Bill's company, Forest Rim Technology, is based in Castle Rock, Colorado. Bill Inmon and Forest Rim Technology help companies hear the voice of their customers. See more at www.forestrimtech.com.