SQL Pattern Matching Deep Dive - Part 5, SKIP TO where exactly?

So far in this series we looked at how to ensure query consistency, how correctly use predicates, managing sorting, using the built-in measures to help with optimise your code and the impact of different types of quantifiers:

In this post I am going to review what MATCH_RECOGNIZE does after a match has been found i.e. where the search begins for the next match. It might seem obvious, i.e. you start at the next record, but MATCH_RECOGNIZE provides a lot of flexibility in this specific area (as you would expect).

Basic Syntax

We use the AFTER MATCH SKIP clause to determine the precise point to resume row pattern matching after a non-empty match is found. If you don’t supply an AFTER MATCH SKIP clause then the default is AFTER MATCH SKIP PAST LAST ROW.

Of course there are quite a few options available:

AFTER MATCH SKIP TO NEXT ROW Resume pattern matching at the row after the first row of the current match.
AFTER MATCH SKIP PAST LAST ROW Resume pattern matching at the next row after the last row of the current match.
AFTER MATCH SKIP TO FIRST pattern_variable Resume pattern matching at the first row that is mapped to the pattern variable.
AFTER MATCH SKIP TO LAST pattern_variable Resume pattern matching at the last row that is mapped to the pattern variable.
AFTER MATCH SKIP TO pattern_variable The same as AFTER MATCH SKIP TO LAST pattern_variable.

Using Pattern Variables and ORA-62154

Note that you can set the restart point to be linked to a specific pattern variable which allows you to work with overlapping patterns - i.e. where you are searching for “shapes” within your data set such as “W” shaped patterns within our ticker data stream. But what happens if the pattern variable within the SKIP TO clause is not matched? Let’s look at the following example:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES STRT.tstamp AS start_tstamp,
LAST(UP.tstamp) as end_tstamp,
MATCH_NUMBER() AS match_num,
CLASSIFIER() AS var_match
ALL ROWS PER MATCH
AFTER MATCH SKIP TO DOWN
PATTERN (STRT DOWN* UP)
DEFINE
DOWN AS DOWN.price < PREV(DOWN.price),
UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.tstamp;

here we are stating that we need at least zero or more matches of the variable DOWN to occur and once a match has been found then we will resume the search for the next pattern at the DOWN event. With this pattern it is possible that DOWN will never get matched so the AFTER MATCH SKIP TO DOWN cannot happen even though a complete match for the pattern is found. Therefore, the compiler throws an error to let you know that this code will not work:

ORA-62514: AFTER MATCH SKIP TO variable is not bounded in the match found.
62514. 00000 - "AFTER MATCH SKIP TO variable is not bounded in the match found."
*Cause: AFTER MATCH SKIP TO variable was not bound in the match found due
to pattern operators such as |, *, ?, and so on.
*Action: Modify the query and retry the operation

Therefore, you need to change the pattern to search for at least one or more instances of DOWN rather than zero or more as this will allow the DOWN event to be matched at least once and therefore it will be available for AFTER MATCH SKIP TO processing.

Skipping PAST LAST ROW [DEFAULT]

This is the default behaviour and in many circumstances this is the most obvious choice. In these situations the searching for the next pattern it makes sense to resume at the row after the last match since going back over previous rows does not make any sense and would only result in more rows than necessary being processed. For example, let’s look at the sessionization example: http://oracle-big-data.blogspot.co.uk/2014/02/sessionization-with-12c-sql-pattern.html and if you want to try the code see the tutorial on the LiveSQL site.

Looking at the source data for the sessionization example it’s clear that as we walk through the entries in the log file to check if an entry is part of the current session or not, there is no point in stepping backwards to begin searching again once a match has been found.

Source data for sessionization example

You can run the code for this sessionization example on LiveSQL.

Looking for shapes and controlling skipping

As I previously stated, you might think the obvious position to start searching for the next occurrence of a pattern is the next record after the last row of the current match. But what if there are overlapping patterns where the middle of an earlier match overlaps with the start of the next match? For example if we are looking for a w-shaped pattern within our ticker data set then it is quite possible to have overlapping w-shapes where the next “W” starts within the second down phase of the previous ”W”.

Fortunately MATCH_RECOGNIZE provides great flexibility in terms of being able to specify the restart point. If we look at the source data for the ACME symbol within our ticker data set then we can see that there are overlapping W-shapes (assuming we allow for the flat-top in the middle of the 2nd w-shape by using the <= and >= tests for each pattern variable!).

Twin W-shapes within source data

Let’s use this example to explore the various AFTER MATCH SKIP TO options…starting with the default behaviour:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES STRT.tstamp AS start_w,
LAST(z.tstamp) AS end_w
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (STRT x+ y+ w+ z+)
DEFINE
x AS x.price <= PREV(x.price),
y AS y.price >= PREV(y.price),
w AS w.price <= PREV(w.price),
z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, MR.start_w;

returns only one match within the ACME data set:

Only one w-shape found in ACME data set

and if we expand the output, using ALL ROWS PER MATCH, so we can see how the pattern was matched we can see that it starts on 05-Apr-11 with pattern variable STRT and ends on 14-Apr-11 with pattern variable Z.

All rows report for single W shape

Now let’s change the above code sample so that after the first pattern has been found we begin searching at the row after the end of the matching process for the Y variable - i.e. row 6, 10-Apr-11.

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES STRT.tstamp AS start_w,
LAST(z.tstamp) AS end_w
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST Y
PATTERN (STRT x+ y+ w+ z+)
DEFINE
x AS x.price <= PREV(x.price),
y AS y.price >= PREV(y.price),
w AS w.price <= PREV(w.price),
z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol,start_w;

which now finds two w-shapes with the second W starting on 10-Apr-11 and ending on 18-Apr-11:

Two w shapes within ticker data stream

but what is going on under-the-covers?

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES STRT.tstamp AS start_w,
LAST(z.tstamp) AS end_w,
classifier() AS pv,
match_number() AS mn,
count(*) as row_count
ALL ROWS PER MATCH
AFTER MATCH SKIP TO LAST Y
PATTERN (STRT x+ y+ w+ z+)
DEFINE
x AS x.price <= PREV(x.price),
y AS y.price >= PREV(y.price),
w AS w.price <= PREV(w.price),
z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

now shows us that the records for 10-Apr-11 to 14-Apr-11 were actually processed twice:

Detailed report for 2 w-shapes

Skip to next row?

What about using the SKIP TO NEXT ROW syntax? How does that affect our results? It is important to remember that this will force MATCH_RECOGNIZE to resume pattern matching at the row after the first row of the current match. Using our ticker data we can see that this would actually increase the number of W-shapes to three!

3 W shapes in our ticker data stream

In match 2 we have two occurrences of pattern variable x, there once the second W-shape has been matched the search process restarts on row 12, i.e. the first row of the current match, which is row 12 mapped to STRT.

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES STRT.tstamp AS start_w,
LAST(z.tstamp) AS end_w
ONE ROW PER MATCH
AFTER MATCH SKIP TO NEXT ROW
PATTERN (STRT x+ y+ w+ z+)
DEFINE
x AS x.price <= PREV(x.price),
y AS y.price >= PREV(y.price),
w AS w.price <= PREV(w.price),
z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mr.start_w;

creates the following output:

Three w shapes within ticker stream

and if we change our code to return the more detailed report we can see how the pattern is being matched:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES STRT.tstamp AS start_w,
LAST(z.tstamp) AS end_w,
classifier() AS pv,
match_number() AS mn,
count(*) as row_count
ALL ROWS PER MATCH
AFTER MATCH SKIP TO NEXT ROW
PATTERN (STRT x+ y+ w+ z+)
DEFINE
x AS x.price <= PREV(x.price),
y AS y.price >= PREV(y.price),
w AS w.price <= PREV(w.price),
z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

which produces the following output:

Detailed report showing mapping of 3 w-shapes

Note that match two, the 2nd W-shape, starts on line 11 but we began the search for this second match on row 2, i.e. the next row after the first start variable. Similarly, the search for the third W-shape on row 12 after the second STRT variable. Given that our original data set for ACME only contained 20 rows you can see from this example how it is possible to do a lot more processing when you start to fully exploit the power of the AFTER MATCH SKIP syntax.

Just accept the default?

The AFTER MATCH SKIP clause determines the point at which we will resume searching for the next match after a non-empty match has been found. The default for the clause is AFTER MATCH SKIP PAST LAST ROW: resume pattern matching at the next row after the last row of the current match. In most examples of using MATCH_RECOGNIZE you will notice that the AFTER MATCH clause is not present and the developer blindly assumes that the AFTER MATCH SKIP PAST LAST ROWclause is applied. This obviously does not help the next developer who has to amend the code to fit new business requirements.

Therefore, my recommendation is that you should always clearly state where you want the matching process to start searching for the next match. Never assume the default will behaviour will be good enough!

Summary

We are getting near the end of this series of deep dive posts. Hopefully this post has explained the ways in which you can use the AFTER MATCH SKIP… clause to ensure that you capture all of the required patterns/shapes within your data set. It’s always a good idea to explicitly include this clause because it is very important - if you don’t want to allow for overlapping matches then clearly state this in your code by using AFTER MATCH SKIP PAST LAST ROW clause. Don’t assume the default will kick-in and that the next developer will have time to read all your detailed documentation when making the next round of changes to the code.

Don’t forget to try our pattern matching tutorials and scripts on LiveSQL and all the above code examples are available via the “Skip to where?” tutorial on livesql.oracle.com.

What’s next?

In the next post in this series I am going to review the keywords that control the output from MATCH_RECOGNIZE: ALL ROWS vs. ONE ROW. Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.

Technorati Tags: Analytics, Data Warehousing, Database 12c, Pattern Matching, SQL Analytics

Search This Blog

SQL for Data Warehousing and Analytics