SQL Pattern Matching Deep Dive - Part 4, Empty matches and unmatched rows?

image courtesy of flicker: https://c1.staticflickr.com/1/55/185807556_21c547c02e_b.jpg

I have been asked a number of times during and after presenting on this topic (SQL Pattern Matching Deep Dive) what is the difference between the various matching options such as EMPTY MATCHES and UNMATCHED ROWS. This is the area that I am going to cover in this particular blog post, which is No 4 in this deep dive series.

When determining the type of output you want MATCH_RECOGNIZE to return most developers will opt for one of the following:

ONE ROW PER MATCH - each match produces one summary row. This is the default.
ALL ROWS PER MATCH - a match spanning multiple rows will produce one output row for each row in the match.

The default behaviour for MATCH_RECOGNIZE is to return one summary row for each match. In the majority of use cases this is probably the ideal solution. However, there are also many use cases that require more detailed information to be returned. If you are debugging your MATCH_RECOGNIZE statement then a little more information can help show how the pattern is being matched to your data set.

In some cases you may find it useful, or even necessary, to use the extended syntax of the ALL ROWS PER MATCH keywords. There are three sub options:

ALL ROWS PER MATCH SHOW EMPTY MATCHES <- note that this is the default
ALL ROWS PER MATCH OMIT EMPTY MATCHES
ALL ROWS PER MATCH WITH UNMATCHED ROWS

Let’s look at these sub options in more detail but first a quick point of reference: all the examples shown below use the default AFTER MATCH SKIP PAST LAST ROW syntax. More on this later…

TICKER DATA

Here is part of the ticker data set that we are going to use in this example - if you want to take a look at the full data set then see the example on the LiveSQL site:

Sample ticker data set

Empty matches

An obvious first question is: what’s the difference between an “empty match” and an “unmatched row”? This is largely determined by the type of quantifier used as part of the pattern definition. By changing the quantifier it is possible to produce the similar result using both sets of keywords. To help explore the subtleties of these keywords I have simplified the pattern to just look for price decreases and you should note that we are using the * quantifier to indicate that we are looking for zero or more matches of the DOWN pattern. Therefore, if we run the following code:

SELECT
symbol,
tstamp,
price,
start_tstamp,
end_tstamp,
match_num,
classifier
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(down.tstamp) AS start_tstamp,
LAST(down.tstamp) AS end_tstamp,
match_number() AS match_num,
classifier() AS classifier
ALL ROWS PER MATCH SHOW EMPTY MATCHES
PATTERN (DOWN*)
DEFINE
DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';

We get the following output:

Query showing results of empty matches

You can see that the result set contains all 20 rows that make up the data for my symbol “GLOBEX". Rows 1- 3, 9, and 13-15 are identified as empty matches - the classifier returns null. These rows appear because we have defined the search requirements for pattern DOWN as being zero or more occurrences.

Based on this we can state that an empty match is a row that does not map explicitly to a pattern variable (in this case DOWN). However, it is worth noting that an empty match does in fact have a starting row and it is assigned a sequential match number, based on the ordinal position of its starting row. The above situation is largely the result of the specific quantifier that we are using: * (asterisk).

Given that the DOWN variable can be matched zero or more times there is the opportunity for an empty match to occur. As the complexity of the PATTERN increases, adding more variables and using different combinations of quantifiers, the probability of getting empty matches decreases but it is something that you need to consider. Why? Because the MATCH_NUMBER() function counts the empty matches and assigns a number to them - as you can see above. Therefore, if we omit the empty matches from the results the MATCH_NUMBER() column no longer contains a contiguous set of numbers:
So that if we run the following code where we have specified “OMIT EMPTY MATCHES”:

SELECT
symbol,
tstamp,
price,
start_tstamp,
end_tstamp,
match_num,
classifier
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(down.tstamp) AS start_tstamp,
LAST(down.tstamp) AS end_tstamp,
match_number() AS match_num,
classifier() AS classifier
ALL ROWS PER MATCH OMIT EMPTY MATCHES
PATTERN (DOWN*)
DEFINE
DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';

We get the following output:

Query results based on omitting empty matches

as you can see the MATCH_NUMBER() column starts with match number 4 followed by match 6 followed by match 10. Therefore, you need to be very careful if you decide to test for a specific match number within the MATCH_RECOGNIZE section and/or the result set because you might get caught out if you are expecting a contiguous series of numbers.

Summary of EMPTY MATCHES

Some patterns permit empty matches such as those using the asterisk quantifier, as shown above. Three mains points to remember when your pattern permits this type of matching:

The value of MATCH_NUMBER() is the sequential match number of the empty match.
Any COUNT is 0.
Any other aggregate, row pattern navigation operation, or ordinary row pattern column reference is null.

The default is always to return empty matches, therefore, it is always a good idea to determine from the start if your pattern is capable of returning an empty match and how you want to manage those rows: include them (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula because empty matches increment the MATCH_NUMBER() counter.

Reporting unmatched rows?

Always useful to view the complete result set - at least when you are running your code against test data sets. Getting all the input rows into your output is relatively easy because you just need to include the phrase ALL ROWS PER MATCH WITH UNMATCHED ROWS. Other than for testing purposes I can’t think of a good use case for using this in production so make sure you check your code before you submit your production-ready code to your DBA.

What about skipping?

Note that if ALL ROWS PER MATCH WITH UNMATCHED ROWS is used with the default skipping behaviour (AFTER MATCH SKIP PAST LAST ROW), then there is exactly one row in the output for every row in the input. This statement will lead us nicely into the next topic in this deep dive series where I will explore SKIPPING. Taking a quick peak into this next topic…obviously there are many different types of skipping behaviours that are permitted when using WITH UNMATCHED ROWS. It does, in fact, become possible for a row to be mapped by more than one match and appear in the row pattern output table multiple times. Unmatched rows will appear in the output only once.

Can a query contain all three types of match?

Now the big question: Can I have a query where it is possible to have both UNMATCHED ROWS and EMPTY MATCHES? Short answer: Yes.
When the PATTERN clause allows empty matches, nothing in the DEFINE clause can stop the empty matches from happening. However, there are special PATTERN symbols that are called anchors. Anchors work in terms of positions rather than rows. They match a position either at the start or end of a partition, or it used together then across the whole partition.

^ matches the position before the first row in the partition

$ matches the position after the last row in the partition

Therefore, using these symbols it is possible to create a PATTERN where the keywords SHOW EMPTY MATCHES, OMIT EMPTY MATCHES, and WITH UNMATCHED ROWS all produce different results from the same result set. For example, let’s start with the following:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES match_number() AS mnm,
count(*) AS nmr,
classifier() AS cls
ALL ROWS PER MATCH SHOW EMPTY MATCHES
PATTERN ((^A*)|A+)
DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns the following 5 rows:

Screen Shot 2016 10 07 at 15 32 29

this shows row 1 as an empty match for the pattern A* because we are matching from the start of the partition. This sets the MATCH_NUMBER() counter to 1. After the empty match the state moves to the pattern A+ for the remainder of the rows. The first match for this pattern starts at row 2 and completes at row 4. The final match in our data set is found at the row containing 15-APR-11. Therefore, if we omit the empty match at row 1 we only get 4 rows returned as shown here:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES match_number() AS mnm,
count(*) AS nmr,
classifier() AS cls
ALL ROWS PER MATCH OMIT EMPTY MATCHES
PATTERN ((^A*)|A+)
DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns the following 4 rows:

Now if we use the last iteration of this example the MATCH_RECOGNIZE statement returns all the rows from the input data. The actual “unmatched rows” are identified as having a NULL match number and NULL classifier. The “empty matches” are identified as having a NULL classifier and in this example the COUNT(*) function returns zero.

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES match_number() AS mnm,
count(*) AS nmr,
classifier() AS cls
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN ((^A*)|A+)
DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

returns all 20 rows from our data set:

Screen Shot 2016 10 07 at 15 29 13

LiveSQL

I have taken all the code and the associated explanations and created a tutorial on LiveSQL so you can try out the code for yourself: https://livesql.oracle.com/apex/livesql/file/tutorial_DZO3CVNYA7IYFU1V8H0PWHPYN.html.

Summary

I hope this helps to explain how the various output keywords that are part of the ALL ROWS PER MATCH syntax can affect the results you get back. You should now understand why your results contains match_number values that are not contiguous and why classifier can return a NULL value along with specific aggregate functions. I expect the hardest concept to understand is the idea of empty matches. As I stated earlier it is always a good idea to determine from the start if your pattern is capable of returning an empty match: are you using an asterisk * within the PATTERN clause?

Then you can determine how you want to manage those rows: include the empty matches (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula because empty matches increment the MATCH_NUMBER() counter.

What should be immediately obvious is that in all the examples I have used the default skip behaviour: AFTER MATCH SKIP PAST LAST ROW. In the next post I will explore the various skip keywords and how they can impact the results returned by your MATCH_RECOGNIZE statement.

What’s next?

In the next post in this series I am going to review the keywords that control where we restart searching once a pattern has been found: the keywords SKIP TO. Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.
Technorati Tags: Analytics, Data Warehousing, Database 12c, Pattern Matching, SQL Analytics

Search This Blog

SQL for Data Warehousing and Analytics