SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers

Picture courtesy of Pixabay

Welcome to the third post in this deep-dive series on SQL pattern matching using the MATCH_RECOGNIZE feature that is part of Database 12c.

In the first part of this series we looked at a wide range of topics including ensuring query consistency, how to correctly use predicates and how to manage sorting. In the second part we looked at using the built-in measures to understand how a data set is matched to a pattern.
In this post I am going to review the concepts of greedy and reluctant quantifiers. I will breakdown this down into a number of areas: 1) Overview of regular expressions, 2) understanding quantifiers, and 3) greedy vs. reluctant quantifiers. The examples in this post use the built-in measures to help show the difference between greedy and reluctant matching. If you are not familiar with the MATCH_NUMBER() function or the CLASSIFIER() function then please take some time to read the second post in this series.

Overview of regular expressions

The MATCH_RECOGNIZE clause has a lot of component parts but probably the most important one is the PATTERN component which consists of regular expressions. But what exactly is a regular expression?

What is a regular expression?

Here is a very basic definition from Wikipedia:

a regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace” - like operations
Wikipedia: https://en.wikipedia.org/wiki/Regular_expression

Oracle has for quite some time supported regular expressions within the Oracle Database. The 12c Database Development Guide has a section on regular expressions where these are defined as:

A regular expression specifies a search pattern, using metacharacters (which are, or belong to, operators) and character literals
12c Database Development Guide

so you might see something like this in SQL code:

WHERE REGEXP_LIKE((hr.employees.first_name, '^Ste(v|ph)en$')

which uses a regular expression to identify employees with the first name of Steven or Stephen. There are a number of different REFEXP_XXXX functions and these are listed in Table 8-1 of the documentation.
In terms of MATCH_RECOGNIZED, Oracle has implemented regular expressions slightly differently. Each regular expression is specified by a set of strings based on two key parts: tokens and quantifiers. The token is a symbol the definition of which is described in the DEFINE clause. The quantifier after a token or group of tokens specifies how often that token-group is allowed to occur.
First, let’s explore the concept of tokens.

Grouping tokens

As indicated earlier, tokens can be grouped and managed using various methods. So before we go any further lets quickly explore the four key concepts relating to tokens: concatenation, grouping, alternation, and permutes. They are amazingly powerful and deeply expressive and perfectly suited to the problem of describing patterns.

1. Concatenation

Used to list two or more items in a pattern that have to be matched in the specified order. Items are concatenated when there is no operator sign between two successive items. For example: PATTERN (A B C).
Using concatenation we can rewrite our usual pattern definition for w-shapes as follows:

SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))WHERE symbol = 'OSCORP';

..can now be rewritten as follows:

SELECT symbol, first_x, last_y
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(y.tstamp) AS last_y
ALL ROWS PER MATCH
PATTERN ((X+ Y+) (X+ Y+) Z)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
Z AS LAST(y.tstamp) - FIRST(x.tstamp) <= 7 )WHERE symbol = ('OSCORP');

Initially, I was hoping to reduce the DEFINE statement to just two lines. I did try to rewrite the pattern without the additional “z” variable, by simply adding the AND clause from the first example, which tests for the duration of the W-shape, to the Y pattern definition. Whilst this produced the same result, the test for the overall duration of the pattern, i.e. whether it is less than or equal to 7 days, gets applied to the first (X Y) scenario as well as the second (X Y) scenario, when we actually want to test the duration of the complete W-shape. Therefore, I had to introduce Z to manage the test for the duration of the W-pattern. Hope that is clear!

2. Alternation

Matches a single regular expression from a list of several possible regular expressions. The alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first, if A is not matched, it attempts to match B, if B is not matched, it finally attempts to match C.

3. Grouping

Treats a portion of the regular expression as a single unit, enabling you to apply quantifiers to the whole group. A grouping is created using parentheses. For example, PATTERN ((A B){3} C). This attempts to match the group (A B) three times and then seeks one occurrence of C.

4. Permute

The PERMUTE syntax may be used to express a pattern that is a permutation of simpler patterns. Let’s consider a relatively simple example: PATTERN (PERMUTE (X{3}, B C?, D)). This is equivalent to the following to writing the following:

PATTERN ((X{3} B C? D)
|(X{3}DBC?)
|(BC?X{3}D)
|(BC?DX{3})
|(DX{3}BC?)
|(DBC?X{3}))

Note that PERMUTE is expanded lexicographically. This is an important point because it means that alternatives are attempted in the order written in the expansion and that has an impact when we have to start back-tracking (don’t panic this specific topic will be discussed in a future post) after a complete match fails because we have to step back through the alternative permutations in the correct order.
So as you can see we have a very rich way of managing how a group of tokens should be applied to a data set: are they AND’ed together or are they OR’ed, are they grouped together, do different permutations need to be tested against the data set etc.

Getting to know quantifiers

Now we understand about tokens lets look at how we control the frequency a token should appear within our patten. The most common quantifiers used by developers are: 1) question mark, indicates zero or one match, 2) asterisk, indicates the need for zero or more matches and 3) plus sign, indicates the need for one or more matches.
Of course, Oracle supports a large library of built-in quantifiers, which are sometimes referred to as POSIX basic and extended quantifiers:

* 0 = or more matches
+ 1 = or more matches
? 0 = or 1 match
{n} = exactly n matches
{n,} = n or more matches
{n, m} = at least n but not more than m (inclusive) matches
{, m} = at most m (inclusive) matches

Here are a couple of simple examples of using quantifier operators:

PATTERN A* - matches 0 or more iterations of A
PATTERN A{3,6} - matches 3 to 6 iterations of A
PATTERN A{,4} - matches 0 to 4 iterations of A

Note that at this stage we have not defined what ”A“ actually means - that is covered in the DEFINE clause. There are a couple of special quantifiers which are referred to as “anchors”. What is an “anchor”?

Special quantifiers - Anchors

These are partition wide quantifiers. They work slightly different to other quantifiers because they operate in terms of positions rather than rows. They match a position either at the start or end of a partition or they can be used in combination to span an entire partition. The two start and end operators are:

^ matches the position before the first row in the partition.
$ matches the position after the last row in the partition.

As an example the PATTERN (^ A+ $) will only result in a match if all rows in a partition satisfy the condition for A, i.e. the resulting match spans the entire partition.

Special quantifiers - excluding a pattern

When using ALL ROWS PER MATCH with either the OMIT EMPTY MATCHES or SHOW EMPTY MATCHES sub-options, rows matching a portion of the PATTERN may be excluded from the row pattern output table. The excluded portion is bracketed between {- and -} in the PATTERN clause.
This example finds the longest period of increasing prices where the price is not less than ten at the start of the period of increasing prices:

PATTERN ( {- A -} B+ {- C+ -} )
SUBSET S = (A,B)
DEFINE A AS A.Price >= 10,
B AS B.Price > PREV(B.Price),
C AS C.Price <= PREV(C.Price)

NOTE this is probably obvious… you cannot use the “ALL ROWS PER MATCH WITH UNMATCHED ROWS” syntax when there are exclusions in the PATTERN clause.
The “exclude” feature acts like a filtering process within the pattern matching process. It is possible to achieve almost the same result by reversing the definition of A and removing the exclusion of A within the pattern definition and dropping C completely. However, more rows are returned because pattern A is now part of the output, i.e. rows are actually matched to A and returned.
It’s important to remember that the exclude feature only removes rows from the output, they are not excluded from the definitions of union pattern variables, or from the calculations within the DEFINE or MEASURES clauses. Therefore, you can remove rows from the output but still include those rows in calculations that need access to the whole pattern such as SUM() or AVG() or MAX() or MIN() operations.
One benefit of this is that , assuming you want to remove rows from the output, there is no need to make two passes over the data - once to match the pattern and then via the WHERE clause to filter the rows. Everything is managed within the MATCH_RECOGNIZE processing.
For example, let’s consider the following code:

SELECT symbol,
tstamp,
a_tstamp,
matchno,
row_count,
classfr,
price,
TRUNC(avgp,1)
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY Symbol
ORDER BY tstamp
MEASURES FINAL AVG(Price) AS avgp,
COUNT(*) as row_count,
CLASSIFIER() AS classfr,
MATCH_NUMBER() AS matchno,
FIRST(a.tstamp) AS a_tstamp
ALL ROWS PER MATCH
PATTERN ( {- A -} B+ {- C+ -} )
SUBSET S = (A,B)
DEFINE A AS A.Price >= 10,
B AS B.Price > PREV(B.Price),
C AS C.Price <= PREV(C.Price)
)
WHERE symbol= 'ACME'
ORDER BY symbol, tstamp;

generates the following output:
Output from exclude syntax to remove rows

If you look at the count of the number of rows (row_count) within each match it does not match the number of rows in the output table. For example, for the first match there are four rows returned but the count of the number of rows actually matched is five! That is because the rows mapped to A have been removed from the output but were included in the calculations defined in the MEASURES clause. Very clever!
Of course you can’t use the exclude syntax with if WITH UNMATCHED ROWS is specified for ALL ROWS PER MATCH. In contrast the exclusion functionality does not affect the AFTER MATCH SKIP TO option. It is valid to SKIP to a variable that appears is marked as excluded.

Greedy vs. reluctant

Now we have the groundwork in place we can start to explore the concept of greedy vs. reluctant quantifiers. Before we start a quick warning about using SQL Developer for these examples. At the moment, if you try to include a question mark (?) as part of your pattern then there is an issue because this symbol is used to catch parameters. Therefore, if you try to execute a MATCH_RECOGNIZE statement that includes reluctant quantifiers you will get an error message:

Missing IN or OUT parameter at index:: 1

I don’t think there is a fix for this at the moment but the SQL Developer team is aware of the issue and working on a fix. See this thread on OTN. Of course, this issue will also impact any applications that use the JDBC drivers (and ODBC?) because a question mark is taken as a parameter marker. There is a workaround for JDBC connections described in the thread on OTN but it does not appear to be very elegant - at least not to me!
Anyway…..Pattern quantifiers that attempt to match as many instances as possible of the token to which they are applied are referred to as greedy, i.e. they will gobble up as many rows as possible. This is the default mode of operation.
In contrast a reluctant quantifier will attempt to match as few instances as possible of the token on which they are applied. To switch a quantifier from greedy to reluctant you simply add a question mark ? as additional suffix to the pattern. For example:

PATTERN (strt down+? up+)
DEFINE
down AS (price <= PREV(price)),
up AS (price >= PREV(price))

searches for one occurrence of strt followed by at least one occurrence of down and one or more occurrences of up. BUT - and this where greedy vs. reluctant comes into play - if a row could be matched to either down or up then the down pattern will defer to the up pattern. If down had a greedy quantifier then if a row could be matched to either down or up then the down pattern will win. Hope that makes sense, if not then …

A simple example…

I have taken the sample ticker data which you can play with on LiveSQL and made a small change to the data for ticker symbol ACME. I changed the price on Apr-16 to 14 so that there are now three consecutive rows (15-Apr, 16-Apr and 17-Apr) with the same value (14) for price. If we graph the price for ACME it now looks like this:
Updated price data for ACME

You can see that we now have a flat spot in our ticker stream data. If we simplify our usual w-shape pattern so that we are just looking for V-shaped patterns and change the definition of the down and up variables so that there is potential for ties by using <= and >= in their respective definitions then we can explore this idea of greedy quantifiers by using the following code:

SELECT symbol, tstamp, price, mn, pattern,
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES MATCH_NUMBER() AS mn,
CLASSIFIER() as pattern,
FIRST(strt.tstamp) AS first_down,
FIRST(strt.price) as first_price,
LAST(up.tstamp) AS last_up,
LAST(up.price) as last_price
ALL ROWS PER MATCH
PATTERN (strt down+ up+)
DEFINE
down AS (price <= PREV(price)),
up AS (price >= PREV(price))
)
WHERE symbol = 'ACME’
ORDER BY symbol, tstamp;

this will generate the following output:
Results from MATCH_RECOGNIZE using greedy quantifiers

Note that on April 17 we match the row to the down variable because down is being greedy. It could have been matched to the up variable because the price is actually equal to the previous row but down took precedence.
What if we change down and make it reluctant by using the ? quantifier:

SELECT symbol, tstamp, price, mn, pattern,
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES MATCH_NUMBER() AS mn,
CLASSIFIER() as pattern,
FIRST(strt.tstamp) AS first_down,
FIRST(strt.price) as first_price,
LAST(up.tstamp) AS last_up,
LAST(up.price) as last_price
ALL ROWS PER MATCH
PATTERN (strt down+? up+)
DEFINE
down AS (price <= PREV(price)),
up AS (price >= PREV(price))
)
WHERE symbol = 'ACME’
ORDER BY symbol, tstamp;

this will generate the following output:
Pattern matching showing reluctant quantifier

Now you can see that because down is reluctant it matches as few occurrences as possible and where a row could be mapped to either down or up it is up that takes precedence.

Why is this important?

Obviously this can impact the way that rows are matched to your pattern. Therefore, you need to think carefully about how you are going to manage the situation where rows could be matched to more than one variable - do you have preference for which variable wins? If you have measures that are tied to specific variables then these will be impacted by whether the variable is greedy or reluctant. Obvious examples are where you are performing some sort of calculation such as averages, sums or counts.
Therefore, when you are constructing your pattern please think very carefully about how greedy you want your matching process to be as it processes your data set.

What’s next?

In the next post in this series I am going to take a quick diversion to review how we more accurately manage the output from MATCH_RECOGNIZE. In the last post we briefly looked at differences between ONE ROW PER MATCH vs. ALL ROWS PER MATCH. However, there three different permutations of the ALL ROWS syntax and these seem to be causing some confusion. Stay tuned if you want to learn about the key differences between: 1) SHOW EMPTY MATCHES, 2) OMIT EMPTY MATCHES and 3) WITH UNMATCHED ROWS. All will become clear in my next post.
Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.
Technorati Tags: Analytics, Data Warehousing, Database 12c, Pattern Matching, SQL Analytics

Comments

Stew Ashton11 January 2017 at 13:21
Hi Keith,

Concerning the reluctant qualifier ? in JDBC, the 12.2 documentation describes a workaround:

"starting from Oracle Database 12c Release 1 (12.1.0.2), you can use the '{\ ... \}' syntax while using the ? character, so that the JDBC driver does not process it as a parameter marker and allows the SQL engine to process it"

http://docs.oracle.com/database/122/JJDBC/JDBC-reference-information.htm#JJDBC-GUID-3454411C-5F24-4D46-83A9-5DA0BA704F5D

Search This Blog

SQL for Data Warehousing and Analytics