This is a guest post by computer scientist Bill Inmon, recognized as the "father of the data warehouse." Bill has written 65 books in nine languages and is currently building a technology called textual ETL.

Here are five things to know about textual data:

  1. Traditionally, systems that collected data operated on transaction-based data and ignored text.

  2. Many businesses base their decisions on structured data rather than unstructured textual data.

  3. Textual ETL is a process that prepares textual data and moves it to a target system for analytics.

  4. Moving textual data to a target system enables Ecommerce organizations like yours to identify patterns and trends in that data through business intelligence tools.

  5. Integrate.io is a solution that integrates textual data (and other data) without lots of code.

In the beginning, simple systems collected data, wrote data to files, and created reports. For the most part, these systems operated on transaction-based data—bank deposits, sales, telephone calls, and the like. An entire infrastructure supported these essential business systems, but there was little or no place for text. All data was highly and tightly structured, and text was ignored.

In this post, learn more about text-based data, how textual ETL can help Ecommerce organizations like yours, and the benefits of a solution like Integrate.io for creating a single source of truth for textual data.

Integrate.io is a data warehousing solution for Ecommerce that moves textual data from documents, emails, chat logs, and other sources to a centralized repository for analytics via no-code/low-code pipelines. You can then operationalize this data in business intelligence tools. Email hello@integrate.io to learn more. 

All the Data in the Corporation

Consider all the data in the corporation, represented by this bar:

thumbnail image

How much of all the data in the corporation is structured data? The answer is not much. Depending on the corporation, only 5–15% is structured data, as you can see (symbolically) below:

thumbnail image

A closer examination of unstructured data in the corporation shows that some of that data is text-based and some is not. The following figure shows this delineation:

thumbnail image

Most structured data is transaction-based data. Ecommerce business transactions and activities such as bank deposits, payments made, sales made, telephone calls, and so forth are included in the structured component of corporate data. As such, there is typically great business value in structured data for online retailers.

Making Business Decisions

For most Ecommerce retailers, the vast majority of business decisions are made on the basis of structured data, as shown below: 

thumbnail image

Most organizations exist happily and blissfully in the state shown in the above graphic. But forward-looking Ecommerce retailers see that something is fundamentally wrong here: the majority of business decisions are based on roughly 10% of the data. This is like a golfer playing with only a driver and a putter, or a racing skier using only their left foot. It doesn't make business sense to use such a small fraction of the corporation's data when making business decisions.


The Progression of Text Analytics

Text-based data has presented a challenge in the world of technology. The simplest (but not the only) reason is that text just doesn’t fit well with the standard configuration of a database. However, advancements in technology have attempted to solve the challenges presented by text.

That progression looks like this:

thumbnail image

In the beginning, programmers defined a database field known as the comments field. Some database users entered lots of text in this field. Other people put in short sentences and a few words. Some people didn’t put anything at all. In early versions of databases, space had to be allocated for the largest comment anyone could make, and most comments were not as large as the largest comment. So, in the earliest renditions of text management, there was a lot of wasted space.

Soon, database vendors created the “blob,” which solved the problem of allocating and wasting variable amounts of space. However, very little else could be done with blobs. Once someone entered text into a blob, that text essentially became useless for analysis.

Next, analysts tried something called Soundex, which attempted to standardize text based on how words sound. Then came stemming, which reduced words to their word stems (often Latin or Greek roots) so that variants of the same word could be treated alike. The problem with Soundex and stemming is that they solved only a small part of the problem of text analytics.
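As a rough illustration of those two techniques, here is a minimal Python sketch. The soundex function follows the commonly documented American Soundex rules; naive_stem is a deliberately simple suffix stripper invented for this example and is not a production stemmer. The function names and the suffix list are assumptions made for illustration.

```python
# Letter-to-digit groups used by American Soundex.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Encode a word as a letter plus three digits based on its sound."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeat across it is kept
    return ("".join(encoded) + "000")[:4]


def naive_stem(word: str, suffixes=("ing", "ed", "es", "s")) -> str:
    """Strip one common English suffix (hypothetical, illustration only)."""
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(naive_stem("returned"), naive_stem("returns"))  # return return
```

Both functions normalize spelling variants, but neither says anything about what a word means, which is why they solved only a small part of the problem.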

Next came tagging. Tagging, in many ways, was the first real attempt to solve the basic problems of managing text inside a computer. But tagging had its own set of problems. The main issue was that the analyst had to know what needed to be tagged before tagging. Unfortunately, foreknowledge was not a skill many analysts had.

Then came taxonomies, a major step forward. With taxonomies, text could be categorized by an external classification scheme that is maintained independently of the text itself. This was a major advance in addressing the issues of text analytics.
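To make the idea of an external taxonomy concrete, here is a small Python sketch. The taxonomy contents, the category names, and the classify function are all hypothetical and chosen only for illustration; real taxonomies are far larger and are usually curated per industry.

```python
import re

# A tiny, hypothetical taxonomy kept separate from any document.
TAXONOMY = {
    "shipping": {"package", "delivery", "courier", "tracking"},
    "payment": {"refund", "charge", "invoice"},
}


def classify(text, taxonomy=TAXONOMY):
    """Return (term, category, word_position) for each taxonomy term found."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = []
    for position, word in enumerate(words):
        for category, terms in taxonomy.items():
            if word in terms:
                hits.append((word, category, position))
    return hits


print(classify("The package arrived late and I want a refund."))
```

Because the taxonomy lives outside the text, the same documents can be re-categorized later simply by swapping in a different taxonomy.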

After taxonomies appeared, NLP (natural language processing) and machine learning (ML) took textual analytics to a new level.

And then, finally, came textual ETL (or “textual disambiguation”). This data integration method is built on the work that preceded it. There are many differences between textual ETL and NLP/ML, but the primary one is that NLP and ML focus on text, while textual ETL focuses equally on text and context.

“What” vs. “Why”

One of the best ways to describe the difference between textual ETL and NLP/ML is to understand the difference between answering the question “what” and answering the question “why.” Textual ETL answers the question “why,” whereas NLP answers the question “what.” 

A really simple way to tell the difference between “what” and “why” is this:

thumbnail image

NLP tells you your girlfriend is upset with you. Textual ETL tells you she is upset with you and that the reason is your bad breath. There could be lots of other reasons she might be upset: you forgot her birthday, you smiled at a waitress, you left a small tip, or you had too much to drink.

Now, knowing that she is mad is valuable information. But knowing why she is mad is even more valuable.

Text presents some challenges that are not found when handling classical structured data. The main challenge is that it is not enough just to handle the text. To handle text properly, you have to handle both text and context.

 


 

To illustrate the value of context, suppose you listen in on a short conversation between two young men. “She’s hot,” one of the men says to the other as a young girl passes by.

Now, what is meant by “She’s hot”? 

thumbnail image

One interpretation is that the young lady is attractive and the young man would like to date her.

Or it could be that the two young men are in Houston, Texas, on a July day, and it is 100 degrees Fahrenheit with 99% humidity. The young lady is pouring with sweat. She is physically hot.

Or it could be that the two young men are doctors, and the young lady has COVID-19 and a temperature of 104 degrees. She is sick, and the doctor has taken her temperature. Her body is hot from fever.

And there are probably lots of other interpretations of “She’s hot.”

So, the meaning conveyed by “She’s hot” depends entirely on the context in which the words are said. You cannot understand the meaning of “She’s hot” without understanding the context.

Trying to interpret words without understanding context is a waste of time.
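The dependence of meaning on context can be sketched in a few lines of Python. This is a toy model only: the cue words, context names, and sense labels are assumptions made up for this example, and they are not how textual ETL actually determines context.

```python
import re

# Hypothetical senses of a term, keyed by the context it appears in.
SENSES = {
    "hot": {
        "weather": "physically warm",
        "medical": "feverish",
        "social": "attractive",
    },
}

# Hypothetical cue words that hint at each context.
CUES = {
    "weather": {"humidity", "sweat", "summer", "sunny"},
    "medical": {"doctor", "fever", "patient", "thermometer"},
    "social": {"date", "party", "flirt"},
}


def infer_context(surrounding_text):
    """Pick the context whose cue words overlap most with the surrounding text."""
    words = set(re.findall(r"[a-z]+", surrounding_text.lower()))
    scores = {name: len(words & cues) for name, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def interpret(term, surrounding_text):
    """Return (context, sense); both are None when no context can be inferred."""
    context = infer_context(surrounding_text)
    sense = SENSES.get(term.lower(), {}).get(context)
    return context, sense


print(interpret("hot", "The doctor checked the patient for a fever"))
print(interpret("hot", "She walked by"))
```

With no contextual cues, the sketch declines to assign any meaning at all, which mirrors the point above: the words alone are not enough.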

Textual ETL Architecture

The overall architecture of textual ETL looks like this:

thumbnail image

Raw text is ingested from a variety of sources. External taxonomies can be introduced as well. Textual ETL reads and analyzes text and creates a standard database. Text can then be analyzed by standard analytical tools.
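The flow described above (raw text in, taxonomy applied, standard database out) can be approximated with a short Python sketch using the standard library's sqlite3 module. This is a simplified illustration under stated assumptions, not the actual textual ETL product: the table name term_hits, the column layout, the sample documents, and the textual_etl function are all hypothetical.

```python
import re
import sqlite3

# Hypothetical external taxonomy supplied to the process.
TAXONOMY = {
    "shipping": {"package", "delivery", "courier", "tracking"},
    "payment": {"refund", "charge", "invoice"},
}


def textual_etl(documents, taxonomy, conn):
    """Read raw text, tag taxonomy terms, and load one row per hit."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS term_hits "
        "(doc_id TEXT, position INTEGER, term TEXT, category TEXT)"
    )
    rows = []
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z']+", text.lower())
        for position, word in enumerate(words):
            for category, terms in taxonomy.items():
                if word in terms:
                    rows.append((doc_id, position, word, category))
    conn.executemany("INSERT INTO term_hits VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample inputs standing in for emails and chat logs.
docs = {
    "email-1": "My package never arrived, please refund the charge.",
    "chat-7": "Tracking shows the courier lost my package.",
}
conn = sqlite3.connect(":memory:")
loaded = textual_etl(docs, TAXONOMY, conn)

# Once the text is in relational form, ordinary SQL works on it.
summary = conn.execute(
    "SELECT category, COUNT(*) FROM term_hits "
    "GROUP BY category ORDER BY category"
).fetchall()
print(loaded, summary)
```

The point of the sketch is the shape of the output: once text has been reduced to rows and columns, any standard analytical or business intelligence tool can query it.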

Some of the specifics of textual ETL are shown below:

thumbnail image

Once a standard database is set up, Ecommerce organizations can analyze the results. One way to do this is through business intelligence tools such as Tableau, Qlik, Excel, and Power BI. Another option for analyzing the output is Forest Rim’s Text Analytics Workbench.


Integrate.io performs ETL, ELT, Reverse ETL, and super-fast Change Data Capture (CDC), helping you integrate text-based data and other information in your Ecommerce organization with little or no code. There’s no jargon and no complicated data pipelines. Email hello@integrate.io to learn more.

How To Manage Large Amounts of Text

Now, for the first time, the massive amount of textual information in the corporation can be read and analyzed. The result is massive new business value for the corporation.

The value of not being constrained by data volume cannot be overstated. As long as you read and process text manually, you will always be limited in how much analysis you can do. But once you automate your processes, you can handle vast amounts of Ecommerce text.

Handling text in an automated manner brings down the cost, enhances the speed, and improves processing accuracy. All of those reasons support the case for automated text handling in almost every environment:

thumbnail image

How Integrate.io Helps With Textual Data

Moving unstructured textual data to a textual or conventional data warehouse can be challenging if you lack coding and programming skills. Integrate.io simplifies the data integration process for Ecommerce companies with out-of-the-box low-code/no-code connectors that sync with major warehouses, making it easier to run data through business intelligence tools.

Other Integrate.io features include simple pricing, compliance with data governance legislation, superior security, exceptional customer service, and Salesforce-to-Salesforce integrations.

Integrate.io is the data warehousing integration solution for Ecommerce companies everywhere. Use it to move textual data to a warehouse (or lake) and identify trends and patterns in that data. Schedule an intro call or email hello@integrate.io now. 

Bill Inmon, the father of the data warehouse, has authored 65 books. Computerworld named him one of the ten most influential people in the history of computing. Bill’s company, Forest Rim Technology, is based in Castle Rock, Colorado. Bill Inmon and Forest Rim Technology help organizations hear the voice of their customers. Learn more at www.forestrimtech.com.