Regular expressions, also known as “regexes,” are an extremely powerful yet tricky tool. Although regular expressions are an essential technology for text searches and natural language processing, they are hard to read and even harder to write—at least for the uninitiated user.
Using regular expressions with big data is even more difficult because your results aren’t instantaneous: you’ll have to wait a while before you find out whether the regular expression was correct or not. In this post, we’ll help clear up what regular expressions are and how to use them when processing big data.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
What is a Regular Expression? Regular Expressions 101
A regular expression is a sequence of characters used to match patterns in strings. For example, to find all numbers in a block of text, you can use the regular expression \d+:
- ‘\d’ acts as a placeholder for a single digit. This could also be written as [0-9].
- The ‘+’ character means we’re looking for one or more consecutive occurrences of the expression that precedes this character.
For example, running the above regex on the lyrics of the song “In The Year 2525” would return all of the different years in the lyrics: “2525,” “3535,” “4545,” “5555,” “6565,” “7510,” “8510,” and “9595.”
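As a minimal sketch of the idea, the snippet below runs \d+ over a short made-up sentence (not the actual lyrics) using Python's built-in re module. Python is used here only for illustration; the sample text is an assumption.

```python
import re

# Made-up sample text for illustration; not the actual song lyrics.
text = "In the year 2525, and later in 3535, then 4545."

# \d+ matches one or more consecutive digits.
years = re.findall(r"\d+", text)
print(years)  # ['2525', '3535', '4545']
```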
Below are some more highly useful regular expression operators:
- ‘.’ matches any character except the newline character ‘\n’.
- ‘*’ finds 0 or more consecutive occurrences of the preceding character. For example, the regex ‘25*’ will return results such as “2,” “25,” “255,” “2555,” etc.
- ‘^’ matches the start of a line.
- ‘$’ matches the end of a line.
- ‘[qwerty]’ matches any of the characters in the square brackets. For example, ‘[bmt]ake’ will return the results “bake,” “make,” and “take.”
- ‘[^asdf]’ matches any characters except the ones in the square brackets. For example, ‘[^0-9]’ will match any single character except a digit from 0 to 9.
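- ‘\w’ matches any single word character. This could also be written as ‘[A-Za-z0-9_]’.
- ‘\W’ matches any single non-word character, i.e. any character not matched by ‘\w’. (Note that ‘\b’ is different: it matches a word boundary, the position between a word character and a non-word character.)
- ‘\s’ matches any single whitespace character.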
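The operators above can be tried out on a small string. This is a rough sketch using Python's re module and an invented two-line sample. Note one Python-specific detail: ‘^’ and ‘$’ only match at every line (rather than only at the start and end of the whole string) when the re.MULTILINE flag is set.

```python
import re

# Invented two-line sample used only for this illustration.
sample = "bake 25 cakes\ntake 2555 more"

# '.' stands for any character except a newline.
any_char = re.findall(r".ake", sample)
# '*' means zero or more of the preceding expression: a '2' followed by any number of '5's.
zero_or_more = re.findall(r"25*", sample)
# '^' and '$' anchor to line starts and ends when re.MULTILINE is set.
line_starts = re.findall(r"^\w+", sample, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", sample, flags=re.MULTILINE)
# A character set matches exactly one of the listed characters.
char_set = re.findall(r"[bmt]ake", sample)
# A negated set: runs of characters that are neither digits nor whitespace.
negated = re.findall(r"[^0-9\s]+", sample)
# '\s+' matches runs of whitespace, which makes it handy for splitting.
tokens = re.split(r"\s+", sample)

print(any_char)      # ['bake', 'cake', 'take']
print(zero_or_more)  # ['25', '2555']
print(line_starts)   # ['bake', 'take']
print(line_ends)     # ['cakes', 'more']
print(char_set)      # ['bake', 'take']
print(negated)       # ['bake', 'cakes', 'take', 'more']
print(tokens)        # ['bake', '25', 'cakes', 'take', '2555', 'more']
```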
Below are some example regular expressions:
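- ‘[0-9a-f]+’ matches lowercase hexadecimal strings, such as “ff02d4ee.” (With ‘*’ instead of ‘+’, the pattern would also match an empty string.)
- ‘[a-z0-9_-]+’ matches usernames, such as “lady-gaga_2014.” Without the ‘+’, the pattern would match only a single character.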
- ‘(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])’ matches dates in the YYYY-MM-DD format, e.g. “2012-01-21,” “1980/03/23,” and “1948 04-24.”
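To check patterns like these, one option is to require that the pattern covers the entire string. The sketch below does this with Python's re.fullmatch. It uses the + quantifier on the character classes so that each pattern must match one or more characters, and the helper name is_full_match is made up for this example.

```python
import re

HEX = re.compile(r"[0-9a-f]+")
USERNAME = re.compile(r"[a-z0-9_-]+")
DATE = re.compile(
    r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"
)

def is_full_match(pattern, text):
    """Return True only if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

print(is_full_match(HEX, "ff02d4ee"))             # True
print(is_full_match(HEX, "xyz"))                  # False
print(is_full_match(USERNAME, "lady-gaga_2014"))  # True
print(is_full_match(DATE, "2012-01-21"))          # True
print(is_full_match(DATE, "1948 04-24"))          # True (the two separators may differ)
print(is_full_match(DATE, "2012-13-01"))          # False (there is no month 13)
```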
Testing Regular Expressions
Before you start trying to use regular expressions with big data, it’s imperative to do some testing first. It’s really easy to get regex syntax wrong, or to formulate an incorrect expression. Check out tools such as RegEx Pal, a website that lets you test your regular expressions online. Start by taking a small chunk of data, writing your desired regex, and checking whether it works as expected.
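If you prefer to test locally, a small script can play the same role as an online tester. The following is only a sketch: the function name failing_cases, the userId pattern, and the sample rows are all invented for illustration. It compares the first capture group against an expected value for each sample and reports the mismatches.

```python
import re

def failing_cases(pattern, cases):
    """Return (text, expected, actual) for each sample whose first capture
    group differs from the expected value. None means "no match"."""
    compiled = re.compile(pattern)
    failures = []
    for text, expected in cases:
        match = compiled.search(text)
        actual = match.group(1) if match else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

# A handful of hypothetical rows standing in for a small chunk of real data.
samples = [
    ("page=1&userId=42&lang=en", "42"),
    ("userId=alice", "alice"),
    ("page=1&lang=en", None),
]

print(failing_cases(r"userId=([^&]*)", samples))  # []
print(failing_cases(r"userId=(\d+)", samples))    # [('userId=alice', 'alice', None)]
```

An empty list means the pattern behaved as expected on every sample. The second call shows how an overly narrow pattern is caught before it ever runs against the full dataset.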
Regular Expressions with Big Data
In Integrate.io, regular expressions can be used as part of either a filter or a select component. They are executed via the function REGEX_EXTRACT(string_expression, regExp, index), which returns the match as a string (or null if there is no match). The function receives the following parameters:
- string_expression: This argument can be any string expression. It can be a field name, a literal value, or a function call that supplies the input for the regular expression.
- regExp: This argument is the regular expression. It comes with the following caveats:
- The regex should be surrounded by single quotes; any single quotes within the regex should be escaped with a backslash character in order to use them properly (e.g. \').
- To return a group from within the matching pattern, surround the relevant part with parentheses.
- A single backslash indicates a special expression (e.g. to match a sequence of digits, enter ‘(\d+)’). To match a backslash character itself, use double escaping: ‘\\’.
- index: This argument indicates which part of the match should be returned. For instance, 3 returns the third parenthesized group of the regular expression, while 0 returns the entire match rather than a single group.
Below are some example use cases for the REGEX_EXTRACT function in Integrate.io:
- REGEX_EXTRACT('213.131.343.135:5020', '(.*)\\:(.*)', 1) returns '213.131.343.135'
- REGEX_EXTRACT('213.131.343.135:5020', '(.*)\\:(.*)', 2) returns '5020'
- REGEX_EXTRACT('/user/superman/cape', '/user/(.*)/', 1) returns 'superman'
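To experiment with these examples outside the platform, the behavior described above can be approximated in Python. This is only a rough emulation, not Integrate.io's implementation: the regex_extract helper is hypothetical, it assumes the pattern may match anywhere in the input, and Python's re syntax is close to but not identical to the Java regex syntax the platform reports errors from. In a Python raw string, a single backslash before the colon is enough.

```python
import re

def regex_extract(string_expression, reg_exp, index):
    """Hypothetical stand-in for REGEX_EXTRACT: return capture group `index`
    of the first match (0 = the entire match), or None if nothing matches."""
    match = re.search(reg_exp, string_expression)
    if match is None or index < 0 or index > match.re.groups:
        return None
    return match.group(index)

print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 1))  # 213.131.343.135
print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 2))  # 5020
print(regex_extract("213.131.343.135:5020", r"(.*)\:(.*)", 0))  # 213.131.343.135:5020
print(regex_extract("/user/superman/cape", r"/user/(.*)/", 1))  # superman
print(regex_extract("no colon here", r"(.*)\:(.*)", 1))         # None
```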
Regular Expression Tutorial with Integrate.io
1) Open the relevant package, or create a new one.
2) Add or open a filter component:
- In the left-hand field, enter the relevant field or function (e.g. source_name).
- In the center operator dropdown, choose "text matches."
- In the right-hand field, enter the pattern. If you are looking for a pattern that could be found anywhere in the string, surround the pattern with .* on both sides.
3) Add or open a select component:
- In the Expression line, click the Edit button to open the expression editor.
- Enter REGEX_EXTRACT (hit CTRL + spacebar for autocomplete) with the relevant parameters.
- Click Save. If there are any parsing errors, please go over the syntax and make sure that it’s correct. Note that the regular expression’s syntax is not checked at this point, only the function syntax.
4) Verify the package by clicking on the checkmark button on the top right of the package editor. If there are any errors, reopen the relevant component and double-check the syntax.
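The advice in step 2 about surrounding the pattern with .* suggests that the "text matches" operator compares the pattern against the entire value. Assuming that is the case, the difference can be sketched in Python with re.fullmatch. The sample value of source_name below is invented.

```python
import re

# Invented example value for the source_name field.
source_name = "logs/2014/web-server-01.log"

# Whole-string matching: the bare pattern does not cover the full value.
bare = re.fullmatch(r"web-server", source_name)
# Wrapping the pattern in .* lets it be found anywhere in the value.
wrapped = re.fullmatch(r".*web-server.*", source_name)

print(bare is not None)     # False
print(wrapped is not None)  # True
```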
If the regex is malformed, you may receive a Java IOException in RegexExtract when running the job on a cluster:
Caused by: java.io.IOException: RegexExtract : Mal-Formed Regular expression : userId=([^&*) ...
In this case, verify that the regular expression syntax is valid. Test your regex on the side with a small chunk of data, as mentioned above.
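One way to catch this kind of mistake before submitting a job is to compile the pattern locally first. The sketch below uses Python's re.compile, so it is only an approximation: the platform's error comes from Java, and the two regex dialects differ in some details. The regex_error helper and both userId patterns are hypothetical examples.

```python
import re

def regex_error(pattern):
    """Return the compiler's error message for a malformed pattern,
    or None if the pattern compiles cleanly."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

print(regex_error(r"userId=([^&]*)"))  # None: the pattern is well formed
print(regex_error(r"userId=([^&]*"))   # an error message: the group is never closed
```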
Conclusion
Regular expressions can seem a bit complicated to start using, but they’re one of the most useful and powerful tools for text searches. After overcoming the hurdle of learning how to use regular expressions, you can use regex to match strings in big data with Integrate.io—whether that’s phone numbers, emails, URLs, or anything else that your big data heart desires.
Want to learn more about all Integrate.io has to offer? Contact us to schedule a demo and start your risk-free trial of the Integrate.io platform.