SQL Pattern Matching Deep Dive - Part 1
There has been quite a lot of interest in the new 12c MATCH_RECOGNIZE feature for SQL pattern matching. Therefore, I thought now would be the perfect time to start a series of quick deep dive posts that explain how SQL pattern matching works. Over the coming weeks I will cover the following topics in a series of posts.
This is the start of a series of posts based on a presentation that I put together for the recent annual BIWA conference at Oracle HQ. The Oracle BI, DW and Analytics user community always puts on a great conference and this year was the best yet. You can download any or all of the presentations from this year’s conference by following this link. My pattern matching deep dive presentation started life about a year ago as a post covering some of the new keywords in the explain plan that are linked to pattern matching, see here. It has now expanded to cover a much wider range of topics.
Firstly, a huge vote of thanks to my development team (Lei Sheng and Chun-Chieh Lin) who have spent a lot of time helping me put together this series of blog posts. Without their input it would have not been possible to drill-down to such a deep level of detail about all the processing going on underneath MATCH_RECOGNIZE.
The aim of this group of posts is to help you understand the underlying mechanics of the MATCH_RECOGNIZE clause. During these posts we will explore key concepts such as: how to get consistent results, using built-in debugging functions, deterministic vs. non-deterministic state machines, back-tracking (what is it and how to identify when it is occurring), and finally greedy vs. reluctant quantifiers. If you need a quick refresher on how MATCH_RECOGNIZE works then I would recommend that you take a look at the following links which are on the OTN home page for Analytical SQL:
- Presentation: Analyze This! Analytical Power in SQL, More Than You Ever Dreamed Of
- Whitepaper: Patterns Everywhere - Find then fast!
- Video: Introduction to SQL pattern matching
Hopefully the above links provide a useful recap regarding how to use MATCH_RECOGNIZE and give you a good foundation for the topics that we are about to explore. Let’s get started….
Step 1: Getting consistent results
Of course you want consistent results, right? Well sometimes there can be a conflict between achieving consistency and obtaining simplicity. Let me explain….Pre-12c it was possible to construct many forms of pattern matching solutions using a number of different SQL techniques. A lot pre-12c pattern matching code is constructed using self-joins, recursive queries (WITH clause and/or CONNECT BY) and window functions, probably within multiple query blocks. These Pre-12c solutions tend to be very difficult to understand and maintain. Every developer (along with their DBA) wants to simplify their code and MATCH_RECOGNIZE takes the simplification of pattern matching code to a whole new level. Once you start simplifying your code there is a strong desire to rip out as much as possible: rely on the defaults where ever possible. The MATCH_RECOGNIZE clause has a number of keywords and some of these are optional. If you look at the basic syntax:
the first two keywords “PARTITION BY” and “ORDER BY” are both optional and these obvious first choice candidates for removal in order to simplify SQL code. Of the two, omitting “ORDER BY” has the potential to cause the most problems. I am sure that a lot of developers and DBAs will think “well my data is already sorted so I don’t need ORDER BY”. Well, yes, it is very, very tempting to ignore the ORDER BY clause and assume data will be correctly ordered because who wants to do unnecessary sorts?
However, if you do omit ORDER BY, then what you end up getting inconsistent results because without it we cannot guarantee to return the same results every time the query is executed - for obvious reasons: If the order of two rows in a partition is not determined by ORDER BY results (non-unique order by key), the result will be non-deterministic. If you want consistent results you must always include ORDER BY clause. Most importantly, if you have non unique order by keys within a partition, consider adding additional order by columns to make the order unique and deterministic.
If you think that by adding an ORDER BY clause your execution plan will be littered with sort operations then don’t worry. You are not doomed to suffer lots and lots of sorting by including ORDER BY because if we can suppress the need to reorder then we will do so, i.e. if the input data satisfies MATCH_RECOGNIZE ordering requirements, then sort will be eliminated when generating execution plan.
WHERE clause: when/where/how?
In various conversations about using MATCH_RECOGNIZE there is one question that keeps coming up: when exactly is the WHERE clause processed? Obviously, it is possible, as with any SQL statement, to add a WHERE clause to a statement that includes MATCH_RECOGNIZE. The tricky part is understanding what is allowed and what’s not allowed. In very general terms the WHERE clause will be applied after the MATCH_RECOGNIZE clause has been applied. In more specific terms, things can get a little complicated….Let’s use the standard ticker symbol data set, which is available on our Github page. If we take the following pattern matching statement:
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z
ONE ROWS PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND
z.tstamp - FIRST(x.tstamp) <= 7);
…generates this explain plan:
you will notice that the explain plan shows that we processed all 60 rows. Notice that we are only returning one row per match. Can we reduce the number of rows we have to work on? What happens if we only want to work on rows for two of the symbols? Do we need to create a view that contains a WHERE symbol IN (‘ACME’, ‘OSCORP’) clause or can we just apply the filter directly to our source table, ticker? In this case we can simply add our WHERE clause to our SQL statement and the predicates will be applied before the data is passed to the MATCH_RECOGNIZE clause:
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))
WHERE symbol IN ('ACME', 'OSCORP');
…explain plan below:
the filtering on the SYMBOL column as taking place before the MATCH_RECOGNIZE is processed. Therefore, we are only processing 40 rows (ACME 20 rows and OSCORP 20 rows) and not processing all 60 rows from the source table. Now what happens if we want to apply a filter on our date column tstamp?
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))
WHERE symbol IN (‘ACME', 'OSCORP')
AND tstamp BETWEEN ’01-Apr-11’ AND ’09-Apr-11’;
…well now we get an error message:
ORA-00904: "TSTAMP": invalid identifier
00904. 00000 - "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 14 Column: 5
Why? As with the SELECT clause, the WHERE clause can only refer to columns that are visible from the MATCH_RECOGNIZE operation. As we are using ONE ROW PER MATCH then the user visible columns from the MATCH_RECOGNIZE operation are the PARTITION BY columns and the measure columns (which in our example above are: symbol, first_x and last_z). For reference purposes, if you use ALL ROWS PER MATCH then the visible columns include all the columns from input table (which includes partition by columns, order by columns, and the rest of columns).
Therefore, if we modify our example and use the following statement then the query does in fact run:
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))
WHERE symbol IN ('ACME', 'OSCORP')
AND tstamp BETWEEN '01-Apr-11' and '09-Apr-11';
…but look what now happens to the explain plan:
So what is happening here with this query? Firstly, remember that we are using the ONE ROW PER MATCH syntax. Secondly, and most importantly, we need to understand what we are actually referring to when we use the column name tstamp. This does not in fact refer to column in our source table (TICKER). It actually refers to the tstamp value of the last row of the match found. What this means is that the predicate of tstamp BETWEEN ’01-Apr-11’ AND ’09-Apr-11’ actually only returns those matches where the last row’s tstamp value is between ’01-Apr-11’ and ’09-Apr-11’. The actual syntax of the line should really be LAST(tstamp) AS tstamp so that we make it clear what value we intend to assign to tstamp rather than just relying on default processing to deliver us the right result.
In our example above query, what happens if we change ONE ROW PER MATCH to ALL ROWS PER MATCH and remove tstamp from MEASURE clause? Now the semantics of the predicate is quite different. The predicate is applied to all the rows included within each match, i.e. for each match that is found, all rows within that match satisfying the predicate of tstamp BETWEEN ’01-Apr-11’ AND ’09-Apr-11’ are included.
So what happens if we want to filter on the final column in our ticker table, i.e. using price? When would a predicate on price be applied?
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp,
price as price
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))
WHERE symbol IN ('ACME', 'OSCORP')
AND tstamp BETWEEN ’01-Apr-11' and '09-Apr-11'
AND price > 0;
… as with the ORDER BY column, the predicate on our measure PRICE is also applied after the pattern matching processing has been completed and don’t forget the lessons from the last example: 1) we are not referencing the column price in our source table (TICKER), 2) price refers to the value in last row found within the match:
As with the ORDER BY example, rather than relying on the default processing to deliver is the right result we should explicitly state the value of price that we want to use: LAST(price) AS price.
In summary, it is clear that predicates will be applied at different points in the processing stack depending on whether you are referencing the column listed in the PARTITION BY clause or the ORDER BY clause, a computation/column within the MEASURES clause. Which columns are available for use as predicates and whether they need to be included in the MEASURES clause depends on the type of output (ONE ROW PER MATCH vs. ALL ROWS PER MATCH).
Remember that MATCH_RECOGNIZE is an extension to table expression and so the WHERE clause is applied on the result of MATCH_RECOGNIZE. In the examples above the key reason why we can push down predicates on PARTITION BY columns directly on the input table is that it will not change the query results. This does not hold on pushing predicates on other columns.
What this all means is that if you want to limit the rows flowing into the MATCH_RECOGNIZE clause then you will need to either create a view or use a subquery to apply the necessary predicates as follows:
SELECT symbol, first_x, last_z
FROM (SELECT * FROM ticker
WHERE symbol IN (‘ACME', 'OSCORP')
AND tstamp BETWEEN '01-Apr-11' AND '09-Apr-11')
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ));
which generates the following explain plan:
…which shows that our predicates allow only 19 rows to flow into the MATCH_RECOGNIZE clause, compared with the 40 rows in previous examples.
What about sorting?
What about sorting? I am sure you will have spotted that in all of the explain plans shown above, the SORT keyword is part of the MATCH_RECOGNIZE line. If we are applying predicates and there is an index that provides the correct sort order for our data then only the BUFFER keyword will be included in the MATCH_RECOGNIZE line of the explain plan. For example, if we add the following index:
CREATE INDEX ticker_tstamp_idx ON ticker(symbol, tstamp);
…and now run our original pattern matching query using predicates to filter on symbol for ACME and OSCORP what you will notice is that the SORT keyword within the MATCH_RECOGNIZE line has disappeared even though we are still using an ORDER BY clause on timestamp column. This is because the index is able to provide a correctly ordered row set into MATCH_RECOGNIZE. Now the MATCH_RECOGNIZE line is showing the keyword BUFFER to indicate that no additional sorting is being applied to the data as it flows into the pattern matching process.
Note that the index will actually be used even if we don’t have predicates. We are smart enough to spot that the index is useful and can provide the ordering needed for the MATCH_RECOGNIZE processing, therefore, we use it to provide the correct sort order and the plan switches to using the BUFFER keyword. As shown here:
SELECT symbol, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp, price as price
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ));
Naming blocks - setting a correlation name
One quick final point. It is possible, and probably should be considered best practice, to name the MATCH_RECOGNIZE block. After the final bracket that encloses the MATCH_RECOGNIZE() code you can add an identifier. This allows you to explicitly refer to columns within the MEASURE clause. In the example below, I have identified the MATCH_RECOGNIZE block as “mr” and then labelled the columns in the SELECT clause and the WHERE clause accordingly:
SELECT mr.symbol, mr.first_x, mr.last_z
FROM ticker t
MATCH_RECOGNIZE ( PARTITION BY symbol ORDER BY tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 )) mr
WHERE mr.symbol IN ('ACME', 'OSCORP')
AND mr.tstamp BETWEEN '01-Apr-11' AND '09-Apr-11';
This is not only a simple courtesy for the developer who has to come along later and make changes to your code.
In our above example, mr is the correlation name assigned to the row pattern output table. The benefit to assigning a correlation name is that the correlation name can be used to qualify the column names of the row pattern output table, as in mr.first_x in the preceding example. This is especially important to resolve ambiguous column names if there are other tables in the FROM clause.
Be careful with trying to over-identify your column names. For example, in the following code I have identified the PARTITION BY and ORDER BY column names with the letter “t” which is linked to the table ticker:
SELECT mr.symbol, mr.first_x, mr.last_z
FROM ticker t
MATCH_RECOGNIZE(
PARTITION BY t.symbol ORDER BY t.tstamp
MEASURES FIRST(x.tstamp) AS first_x,
LAST(z.tstamp) AS last_z,
tstamp AS tstamp
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 )) mr
WHERE mr.symbol IN ('ACME', 'OSCORP')
AND mr.tstamp BETWEEN '01-Apr-11' AND '09-Apr-11';
the following error is now signalled when we try to run this statement:
ORA-01748: only simple column names allowed here
01748. 00000 - "only simple column names allowed here"
This error is flagged becase with the SELECT and WHERE clauses it is not possible to reference columns from the source table. In the example above it is not possible to reference t.price, or t.symbol. or t.tstamp within the SELECT or WHERE clause. You can only refer to the visible columns of MATCH_RECOGNIZE (PARTITION BY columns + measure columns when using ONE ROW PER MATCH and all input columns + measure columns when using ALL ROWS PER MARTCH.
Providing a correlation name to MATCH_RECOGNIZE just makes our code a lot more readable, but it is not mandatory.
Summary
Wrapping up this blog post, what have we learnt so far?
- 1st rule of MATCH_RECOGNIZE: if you want consistent results then include ORDER BY. Makes sense, doesn’t it?
- 2nd of MATCH_RECOGNIZE: using ORDER BY does not imply lots of sort operations within the explain plan
- 3rd rule of MATCH_RECOGNIZE: think about how and when your predicates are applied and whether an index might help to reduce the amount of sorting
In part 2 we will look at how to use the built-in debugging functions, MATCH_NUMBER() and CLASSIFIER() to help you work out how your pattern is being interpreted. Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.
Lastly, thanks once again to my development team (Lei Sheng and Chun-Chieh Lin) for your patience and proof-reading.
Technorati Tags: Analytics, Big Data, Data Warehousing, Database 12c, Pattern Matching, SQL, SQL Analytics
Comments
Post a Comment