Wednesday, 18 January 2017

Dealing with very very long string lists using Database 12.2

Lots Of Pieces String

 

Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:

ORA-01489: result of string concatenation is too long.

Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature, however, this required use of 12c Release 1 which was not always available to all DBAs and/or developers.

If you want to replicate the problem and you have access to the sample SH schema then try executing this query:

SELECT
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

All the samples in this post use our sample SH schema. Once we release the on-premise version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I do have a tutorial ready-to-go on LiveSQL, however, there is currently a technical issue with using very very very long strings - essentially, running my LISTAGG workshop generates a ‘Data value out of range’ error so as soon as it’s fixed I will update this blog post with a link to the tutorial.

 

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:

show parameter MAX_STRING_SIZE

In my demo environment this returns the following:

NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD

Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit was raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. Setting MAX_STRING_SIZE = EXTENDED enables the new 32767-byte limit.

ALTER SYSTEM SET max_string_size=extended SCOPE=SPFILE;
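
Note that switching an existing database from STANDARD to EXTENDED is a one-way change and involves more than just the ALTER SYSTEM command. As a rough sketch only (assuming a simple non-CDB instance - check the 12.2 documentation for the exact procedure, particularly for multitenant and RAC environments):

-- Outline of the documented procedure; verify against the documentation before running
SHUTDOWN IMMEDIATE
STARTUP UPGRADE
ALTER SYSTEM SET max_string_size=extended;
@?/rdbms/admin/utl32k.sql    -- prepares the database for the extended (32K) limit
SHUTDOWN IMMEDIATE
STARTUP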

However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets.

What is needed is a richer syntax within the LISTAGG function and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:

  • ON OVERFLOW ERROR
  • ON OVERFLOW TRUNCATE
  • WITH COUNT vs. WITHOUT COUNT

Let’s look a little closer at each of these features….. 

1. Keeping Pre-12.2 functionality

If you want your existing code to continue to return an error if the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit then the standard error will be returned:

ERROR at line xxx:
ORA-01489: result of string concatenation is too long

 

However, where possible I would recommend adding “ON OVERFLOW ERROR” to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

So it’s important to note that by default the truncation features are disabled and you will need to change any existing code if you don’t want an error to be raised.

2. New ON OVERFLOW TRUNCATE… keywords

If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

When truncation occurs we truncate back to the last complete value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots '...' to the string as an indicator that truncation has occurred, but you can override this as follows:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

If you want to keep the existing pre-12.2 behaviour where we return an error if the string is too long then you can either rely on the default behaviour or explicitly state that an error should be returned (always a good idea to avoid relying on default behaviour in my opinion) by using the keywords:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which will now generate the normal error message - i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Of course, you can simply omit the new keywords and get the same behaviour:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which, as before, generates the normal error message - i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

 

3. How many values are missing?

If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords WITH COUNT - this is the default behaviour when a list is truncated. Alternatively, if you don’t want a count at the end of the truncated string you can use the keywords WITHOUT COUNT.

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

 

4. Do we split values when truncation occurs?

No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example that we have been using, which creates a list of customer names within each country, a customer's full name such as “Keith Laker” (i.e. first name + last name) is always treated as a single unit. There has to be enough space to add the complete string (first + last name) to the list, otherwise the whole value “Keith Laker” is removed and the truncation indicator is inserted. It’s not possible for the last value in the string to be only the first name with the last name truncated/removed.

 

5. How do we calculate the overall length of the string values?

The characters that indicate an overflow has occurred are appended to the end of the list of values - in this case the default value of three dots “...”. The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator followed by the user-defined overflow indicator, followed by the output from the WITH COUNT clause, which adds a counter at the end of a truncated string to indicate the number of values that have been removed/truncated from the list.

Summary

With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size of VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically there are several new keywords:

  • ON OVERFLOW TRUNCATE
  • ON OVERFLOW ERROR (default behaviour)
  • WITH COUNT (default behaviour)
  • WITHOUT COUNT

Hopefully this new functionality will mean that all those wonderful workarounds for dealing with “ORA-01489: result of string concatenation is too long“ errors that have been created over the years can now be replaced by standard SQL functionality. 

 


Monday, 16 January 2017

MATCH_RECOGNIZE - What should I include in the MEASURE clause?

Image courtesy of wikipedia

This post is the result of reviewing a post on stackoverflow.com: http://stackoverflow.com/questions/41649178/getting-error-ora-00918-when-using-match-recognize. Here is my version of the code, which includes the same issues/errors as the original; however, I am using the TICKER schema table that I always use for the tutorials that I post on LiveSQL:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY symbol, tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

The above example will not run because of the following error:

ORA-00918: column ambiguously defined
00918. 00000 - "column ambiguously defined"
*Cause:
*Action:
Error at Line: 1 Column: 8

So what is wrong with our code? As MATHGUY pointed out in his reply on stackoverflow.com - quite a lot actually! Let’s start by differentiating between “won’t run” and “wrong”. The ORA-00918 error is easy to resolve if you stare long enough at the ORDER BY clause! It’s pointless to include the SYMBOL column as both the partition key and part of the order by key. If we change the ORDER BY clause as shown here then the code will run:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which returns the following resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

No MEASURE clause

Okay, so our code is running - now what? If you look at the output you will notice that it contains the same rows and columns as the source table. What happens if we omit the MEASURE clause? Well, it’s optional so the code should still run…

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

and sure enough we get the same resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

ALL ROWS PER MATCH vs. ONE ROW

From the above we can assume that there is no need to list the source columns from your input table in the MEASURE clause because they are automatically included in the output. BUT this is ONLY true when you use ALL ROWS PER MATCH. If we change the output control to ONE ROW PER MATCH:

SELECT symbol, tstamp, price
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

you will now get an error:

ORA-00904: "PRICE": invalid identifier

00904. 00000 - "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 19 Column: 24

 

because when using ONE ROW PER MATCH the only columns that are automatically returned are those listed in the PARTITION BY clause. Therefore, we need to use either “SELECT * FROM …..” or “SELECT symbol FROM…” to get a working version of our code. Using “SELECT * FROM…” as follows:

SELECT * 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

 

actually returns only one column (symbol) from the ticker table:

 

Only partition by column returned

So what should we include in the MEASURE clause?

Based on the query that was in the original post I think the following syntax would make it easier to understand what is happening within the pattern matching process and provide useful information about the data that matches the pattern:

SELECT symbol, tstamp, price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(b.tstamp) AS first_date,
LAST(b.tstamp) AS last_date,
FIRST(b.price) AS first_price,
LAST(b.price) AS last_price,
match_number() AS m_n,
classifier() AS classi
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which gives us the following output based on ALL ROWS PER MATCH:

Output from Amended Measure Clause

 

and if we want to switch to using ONE ROW PER MATCH then we need to remove references to the columns tstamp and price and replace them with references to the pattern-variable-specific versions, or we can just remove the references altogether. In this case, as we only have two pattern variables, we can NVL the references to return the required data:

SELECT symbol, o_tstamp, o_price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
nvl(a.tstamp, b.tstamp) as o_tstamp,
nvl(a.price, b.price) as o_price,
FIRST(b.tstamp) as first_date,
LAST(b.tstamp) as last_date,
FIRST(b.price) as first_price,
LAST(b.price) as last_price,
match_number() as m_n,
classifier() as classi
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

 

which generates roughly the same output as the previous statement except that this time we are referencing specific instances of tstamp and price.

 

New Measures For TSTAMP and PRICE

 

Summary

What have we learned:

Point 1: Check your PARTITION BY and ORDER BY clauses to ensure they make sense!

Point 2: There is no need to list the source columns from your input table in the MEASURE clause because they are automatically included, BUT ONLY when you use ALL ROWS PER MATCH.

Point 3: Decide on your output method and match the columns listed in the SELECT clause with those returned by either ALL ROWS PER MATCH or ONE ROW PER MATCH.

Point 4: It is always a good idea to check that your pattern is being applied correctly by using the built-in MATCH_NUMBER() and CLASSIFIER() measures.

 

For more general information about how to get started with MATCH_RECOGNIZE follow these links to previous blog posts:

and check out the growing library of tutorials on liveSQL.oracle.com. Hope this helps to throw some light on the workings of the MEASURE clause within MATCH_RECOGNIZE.


Friday, 13 January 2017

Are approximate answers the best way to analyze big data?

Lots of candy

Image courtesy of pixabay.com

In my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong, which is very strange when you consider how often we interact with approximations as part of our everyday life.

Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal - how many click throughs did an advert generate, how many unique sessions were recorded for a web site etc. The use cases that I outlined provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows.

To support Analytics-of-Things, Database 12c Release 2 (12.2) includes even more approximate functions. In this release we have added approximations for median and percentile computations and support for aggregating approximate results (counts, median and percentiles).

What is a median and percentile?

A quick refresher course….according to wikipedia a percentile is:

a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

Percentiles are perfect for locating outliers in your data set. In the vast majority of cases you can start with the assumption that a data set exhibits a normal distribution. Therefore if you take the data around the 0.13th and 99.87th percentiles (i.e. outside 3 standard deviations from the mean) then you get the anomalies. Percentiles are great for allowing you to quickly eyeball the distribution of a data set so that you can check for skew or bimodalities etc. Probably the most common use case is around monitoring service levels, where these anomalies are the values of most interest.
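
To make that concrete, here is a minimal sketch (my own example, not from the original article) using the new approximate percentile function against the same SH sales data used later in this post to find those outlier boundaries:

SELECT
  TRUNC(APPROX_PERCENTILE(0.0013) WITHIN GROUP (ORDER BY amount_sold), 2) AS lower_bound,
  TRUNC(APPROX_PERCENTILE(0.9987) WITHIN GROUP (ORDER BY amount_sold), 2) AS upper_bound
FROM sales;

Rows with amount_sold outside this range are the candidate anomalies worth a closer look.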

On the other hand, a median is:

the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half. 

Why would you use the median rather than the mean? In other words, what are the use cases that require the median? The median is great at removing the impact of outliers because the data is sorted and then the middle value is extracted. The average is susceptible to being skewed by outliers. A great use case for the median is in resource planning. If you want to know how many staff you should assign to manage your web-store application you might create a metric based on the number of sessions during the year. With a web-store the number of sessions will peak around key dates such as July 4th and Thanksgiving. Calculating the average number of sessions over the year will be skewed by these two dates and you will probably end up with too many staff looking after your application. Using the median removes these two spikes and will return a more realistic figure for the number of sessions per day during the year.
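
As a quick sketch of that resource-planning example (the web_sessions table and its columns are hypothetical, purely for illustration), comparing the mean against the new approximate median makes the difference obvious:

-- Hypothetical table: web_sessions(session_day DATE, session_count NUMBER)
SELECT
  ROUND(AVG(session_count))    AS avg_sessions_per_day,    -- pulled upwards by the July 4th and Thanksgiving spikes
  APPROX_MEDIAN(session_count) AS median_sessions_per_day  -- largely unaffected by the two spikes
FROM web_sessions;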

But before you start to consider where, when, how or even if you want to consider using approximate calculations you need to step back for a moment and think about the accuracy of your existing calculations, which I am guessing you think are 100% accurate!

Is your data accurate anyway?

Most business users work on the assumption that the data set they are using is actually 100% accurate, and for the vast majority of operational sources flowing into the data warehouse this is probably true, although there will always be parentless dimension values and, in some cases, “Other” bucket dimension members to create some semblance of accuracy.

As we start to explore big data related sources pulled from untrusted external sources and IoT sensor streams, which typically are inherently “noisy”, then the level of “accuracy” within the data warehouse starts to become a range rather than a single specific value.

Let’s quickly explore the three key ways that noise gets incorporated into data sets:

1) Human errors

Human input errors: probably the most obvious. They affect both internal and external sources that rely on human input or interpretation of manually prepared data. Free-format fields on forms create all sorts of problems because the answers need to be interpreted. Good examples are insurance claim forms, satisfaction surveys, crime reports, sales returns forms etc.

2) Coding errors

ETL errors: Just about every data source feeding a data warehouse goes through some sort of ETL process. Whilst this is loosely linked to the first group of errors it falls into this group simply because of the number of steps involved in most ETL jobs. There are so many places where errors can be introduced.

Rounding and conversion errors: When an ETL job takes source data, converts it and then aggregates it before pushing it into the warehouse, it will always be difficult to trace the aggregated numbers back to the source data because of inherent rounding errors. When dealing with currency exchange rates it can be a little difficult to tie back source data in one currency to the aggregated data in the common currency due to tiny rounding errors.

3) Data Errors

Missing data points: Data always gets lost in translation somewhere down the line or is simply out of date. In many cases this is the biggest source of errors. For example, one bank recently put together a marketing campaign to stop customer churn. Before they launched the campaign one of their data scientists did some deeper analysis and discovered that the training data for the model included customers who were getting divorced, and this was being flagged as a lost customer. Including this group ended up skewing the results. The data about changes to marital status was not being pushed through to the data warehouse fast enough.

Meaningless or distracting data points: With the growth in interest in the area of IoT it is likely that this type of “noise” will become more prevalent in data sets. Sensor data is rarely 100% accurate, mainly because in many cases it does not need to deliver that level of accuracy. The volume of data being sent from the sensor will allow you to easily remove or flag meaningless or distracting data. With weblogs it is relatively easy to ignore click-events where a user clicks on an incorrect link and immediately clicks the back-button.

In other words, in many situations, getting precise answers is nothing but an illusion: even when you process your entire data, the answer is still an approximate one. So why not use approximation to your computational advantage and in a way where the trade off between accuracy and efficiency is controlled by you? 

Use cases for these new features

There are a lot of really good use cases for these types of approximations but here are my two personal favorites:

Hypothesis testing — a good example of this is A/B testing, which is most commonly used in conjunction with website design and ad design to select the page design or ad that generates the best response. With this type of analysis it is not vital that you have accurate, precise values. What is needed is the ability to reliably compare results, and approximations are normally good enough.
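
For example, a simple A/B comparison could be built on the approximate distinct-count function covered in the first post (the ad_events table and its columns are hypothetical):

-- Hypothetical table: ad_events(ad_variant VARCHAR2(1), event_type VARCHAR2(20), session_id NUMBER)
SELECT
  ad_variant,
  APPROX_COUNT_DISTINCT(session_id) AS approx_sessions,
  APPROX_COUNT_DISTINCT(CASE WHEN event_type = 'CLICK' THEN session_id END) AS approx_clicking_sessions
FROM ad_events
GROUP BY ad_variant;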

Ranking — How does your ISP calculate your monthly usage so they can bill you fairly? They use a percentile calculation where they remove the top 2% - 5% of your bandwidth peaks and then use that information to calculate your bill. By using data below the 95th-98th percentile they can ignore the infrequent peaks when, say, you are downloading the latest update to your Android or iOS device. Again, having precise numbers for this percentile cut-off is not really necessary. A good enough approximation of the 95th percentile is usually sufficient because it implies that approximately 95% of the time your usage is below the data volume identified around that percentile, and conversely, for the remaining 5% of the time your usage creeps above that amount.
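
A sketch of that style of billing calculation using the new function (the usage_samples table is hypothetical): take the 95th percentile of the bandwidth samples per customer per month and ignore everything above it:

-- Hypothetical table: usage_samples(customer_id NUMBER, sample_time DATE, mbps NUMBER)
SELECT
  customer_id,
  TRUNC(sample_time, 'MM') AS billing_month,
  APPROX_PERCENTILE(0.95) WITHIN GROUP (ORDER BY mbps) AS billable_mbps
FROM usage_samples
GROUP BY customer_id, TRUNC(sample_time, 'MM')
ORDER BY customer_id, billing_month;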

Of course all the use cases that we considered for distinct counts in the first posts are also valid:

Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better than waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait more than a couple of seconds for the answer to their query then their level of analytical thinking can be seriously impaired.

Market testing: most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.

Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course you need to incorporate the likelihood of edge cases within your thinking process because there is the danger that the edge values will get lost within the general hashing process.

However, in these examples we usually end up merging or blending the first two use cases with the three above to gain a deeper level of insight. So now let’s look at the new approximate statistical functions introduced in Database 12.2.

Approximate median and percentile

With Database 12c Release 2 we have added two new approximate functions:

APPROX_PERCENTILE(%_number [DETERMINISTIC], [ERROR_RATE|CONFIDENCE]) WITHIN GROUP (ORDER BY expr [ DESC | ASC ])


This function takes up to three arguments. The first argument is a numeric value between 0 and 1 (for example, 0.25 for the 25th percentile). The DETERMINISTIC keyword is optional: if it is provided, deterministic results are required; if it is omitted, deterministic results are not mandatory. The third, also optional, argument is ERROR_RATE or CONFIDENCE, which returns the error rate or confidence interval of the approximation rather than the percentile value itself (as shown in the example below). The input expression for the function is derived from the expr in the ORDER BY clause.

 

The approx_median function has the following syntax:

APPROX_MEDIAN(expr [DETERMINISTIC], [ERROR_RATE|CONFIDENCE])

We can use these functions separately or together as shown here using the SH schema:

SELECT
  calendar_year,
  APPROX_PERCENTILE(0.25) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.25",
  TRUNC(APPROX_PERCENTILE(0.25, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-er",
  TRUNC(APPROX_PERCENTILE(0.25, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-ci",
  APPROX_MEDIAN(amount_sold deterministic) as "p-0.50",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE'),2) as "p-0.50-er",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE'),2) as "p-0.50-ci",
  APPROX_PERCENTILE(0.75 deterministic) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.75",
  TRUNC(APPROX_PERCENTILE(0.75, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-er",
  TRUNC(APPROX_PERCENTILE(0.75, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year;

The results from the above query are shown below

Resultset showing approx median and percentile calculations

Note that for the APPROX_MEDIAN function I have included the keyword “DETERMINISTIC”.  What does this actually mean? 

Due to the nature of computing approximate percentiles and medians it is not possible to provide a specific and constant value for the error rate or the confidence interval. However, when we ran against a large-scale, real-world customer data set (a manufacturing use case) we saw an error range of around 0.1 - 1.0%. Therefore, in broad general terms, accuracy will not be a major concern.

Error rates and confidence intervals

How closely an approximate answer matches the precise answer is gauged by two important statistics:

  • margin of error
  • confidence level.

These two pieces of information tell us how well the approximation represents the precise value. For example, a result may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the analysis were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 runs.

In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is “very good.” The confidence level is cited as 95 percent plus or minus 3 percent. This information means that if the survey were conducted 100 times, the percentage who say service is “very good” will range between 47% and 53% most (95%) of the time (for more information see here: https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/).

Please note that if you search for more information about error rates and confidence levels then a lot of results will talk about sample sizes and working back from typical or expected error rates and confidence levels to determine the sample size needed. With approximate query processing we do not sample the source data: we always read all the source values - there is no sampling!

 

Performance - how much faster is an approximate result?

As a test against a real-world schema we took a simple query from a customer that computed a number of different median calculations:

 

SELECT count(*) FROM (SELECT /*+ NO_GBY_PUSHDOWN */ b15, median(b4000), median(b776), median(e), median(f), median(n), median(z) FROM mdv group by b15);

 

Performance Graph for non-approximate query

As you can see from the real-time monitoring page, the query accessed 105 million rows and the calculations generated 11GB of temp. That’s a lot of data for one query to spill to disk!

Resource usage for non-approximate query

 

Now if we convert the above query to use the APPROX_MEDIAN function and rerun it, we can see below that we get a very different level of resource usage:

Performance graph for approximate query

 

Looking closely at the resource usage you can see that the query is 13x faster, uses considerably less memory (830Kb vs 1GB) but most importantly there is no usage of temp: 

Resource graph for approximate query

 

 Summary

One of the most important take-aways from this post is that we always read all the source data. The approximate functions in Database 12c Release 2 do not use sampling as a way to increase performance. These new features are significantly faster and use fewer resources, which means more resources are available for other queries - allowing you to do more with the same level of resources.


Tuesday, 10 January 2017

SQL Pattern Matching Deep Dive - Part 6, state machines

The obvious way to start this particular post is to pose a couple of simple questions: what is a state machine and why should you care? In general I would say that you don't need to know about or care about state machines. That's the beauty of using SQL for pattern matching. The MATCH_RECOGNIZE clause encapsulates all the deep technical modelling and processing that has to be performed to run pattern matching on a data set. However, there are times when it is useful, probably vital, that you understand what is going on behind the scenes and one of the most obvious situations is when backtracking happens.

Therefore, the content covered in this post is going to be a gentle lead-in to my next post, where I am going to discuss the concept of “backtracking” and the dreaded ORA-30009 error.

Let’s start our voyage of discovery… when you attempt to run a SQL statement containing a MATCH_RECOGNIZE clause, during the compilation phase we generate a finite state machine based on the PATTERN and DEFINE clauses in your statement. What is a finite state machine? According to Wikipedia:

A finite-state machine (FSM)…is a mathematical model of computation....it is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time…changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.

Reference from wikipedia - https://en.wikipedia.org/wiki/Finite-state_machine

 

 

A state machine, which is derived from the PATTERN and DEFINE elements of your MATCH_RECOGNIZE clause, can be represented by a directed graph called a state diagram. This diagram shows each of the possible states for the “machine” and the conditions that force the machine to either remain in its current state or move to the next state. Below is a simple example of a state machine:

State Machine

In the above diagram each state is represented by a node (ellipse), which in this case are marked as “State 1” to “State 4”. The arrows, known as edges, show the transition(s) from one state to another. If you look at states 2 and 4 you will notice that they have two edges, although these edges are shown in different vertical positions on the diagram. When drawing a proper state diagram each edge is labeled with the event (condition) that triggers the transition. Events (conditions) that don’t cause a change of state are represented by a circular arrow returning to the original state, and these can be seen on states 2 and 4.

The precedence for reading the information is to read from top-down. What this means is that when in State 2 the FSM will test to see if State 3 can be achieved and if it can’t it will then test to see if State 2 can be maintained. The reverse is true for State 4 where the FSM will test to see if State 4 can be maintained and if it can’t it will then, in this example, either end having determined that a match has completed or start backtracking to try and complete a match. I am sure you can now see how this is going to link into my next blog post.

State machines are not limited to just pattern matching. They have all sorts of other uses. If you want a gentle diversion to look at state machines in a little more detail then try this article by Enrique Ortiz from the OTN magazine in August 2004: Managing the MIDlet Life-Cycle with a Finite State Machine.

All of this flows directly into keywords that appear (or don’t appear) in the explain plan, which was covered in the post MATCH_RECOGNIZE and the Optimizer from January 2015. As a quick refresher, essentially there are four new keywords that you need to be aware of:

  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO

The first three bullet points are reasonably obvious. The last keyword is linked to the use of a “state machine”. Its appearance, or lack of appearance, affects the way our pattern is applied to our data set, but that is all explained in the earlier blog post. Most of my MATCH_RECOGNIZE examples are based on the stock ticker data set. Let’s assume that we are searching for V-shaped patterns in our data set (https://docs.oracle.com/database/121/DWHSG/pattern.htm#CACHHJJG):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES STRT.tstamp AS start_tstamp,
LAST(DOWN.tstamp) AS bottom_tstamp,
LAST(UP.tstamp) AS end_tstamp
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(down.price),
UP AS price > PREV(up.price)
) MR
ORDER BY MR.symbol, MR.start_tstamp;

 

This is what the state diagram would look like:

 

State Machine graph for V-Shaped pattern

These diagrams can be really helpful when you have more complex patterns and you need to consider the impact of backtracking. This post is all about laying the building blocks for my next post on backtracking and the dreaded ORA-30009 error. If you have managed to read this far then you are guaranteed to be ready for an in-depth look at what happens inside MATCH_RECOGNIZE when we have to move from right to left through our state diagram in an attempt to find a complete match. Now you should know everything you need to know about state machines, and I am going to carry over the “why care” part to the next post…

If you want a recap of where we are in this series of pattern matching deep dive posts here is the full list:

As per usual, if you have any comments or questions then feel free to contact me directly via email: keith.laker@oracle.com 


Thursday, 5 January 2017

Exploring the interfaces for User Defined Aggregates

 

image courtesy of wikipedia

Whilst I was working on the functional specification for the LISTAGG extensions that we implemented in 12c Release 2, I came across Tom Kyte’s stragg function, which uses the User Defined Aggregate API introduced in database 9i. Tom’s comprehensive answer covers the two important areas that need to be considered when using the data cartridge API: 1) a UDA can run as a serial process and 2) a UDA can run as a parallel process. Therefore, you need to code for both these eventualities. Dealing with both scenarios can be a little challenging - as I have discovered over the last few weeks. Having looked at a number of posts, there is a common theme for explaining how the various interfaces for user-defined aggregates actually work. One of the clearest examples is on Tim Hall’s blog: String Aggregation Techniques.

This got me thinking… would it be possible to take the new extensions we made to LISTAGG and incorporate them into a custom string manipulation function built using the UDA interfaces? Essentially, this would provide a pre-12.2 solution to prevent the text-string overflow error “ORA-01489: result of string concatenation is too long”. After some initial coding I managed to get a solution that worked perfectly as long as I did not try to run the query using parallel execution. Eventually I managed to get the parallel execution process coded, but it was not deterministic and the results differed from the results of the serial query.

After revisiting Tom Kyte’s stragg function solution I think I have finally created a fully working solution, and here is what I have learned along the way…

Overview

Before going into the code, let’s explore the data cartridge interface for creating user-defined aggregates. As far as I can tell this is not a very well known or well used feature, in that I have not seen any presentations at user conferences that explain why and how to build these functions. It is a very interesting feature because user-defined aggregate functions can be used in SQL DML statements just like Oracle’s built-in aggregates. Most importantly, they allow you to work with and manipulate complex data types such as multimedia data stored using object types, opaque types, and LOBs.

Each user-defined aggregate function is made up of four mandatory and two optional ODCIAggregate interfaces, or steps, that define the internal operations any aggregate function performs. The four mandatory interfaces are: initialization, iteration, merging, and termination.

  • Initialization is accomplished by the ODCIAggregateInitialize() routine, which is invoked by Oracle to initialize the computation of the user-defined aggregate. The initialized aggregation context is passed back to Oracle as an object type instance.
  • Iteration is performed through the ODCIAggregateIterate() routine, which is repeatedly invoked by Oracle. On each invocation, a new value or a set of new values and the current aggregation context are passed in. The routine processes the new values and returns the updated aggregation context. This routine is invoked for every non-NULL value in the underlying group. NULL values are ignored during aggregation and are not passed to the routine.
  • Merging is performed by ODCIAggregateMerge(), a routine invoked by Oracle to combine two aggregation contexts. This routine takes the two contexts as inputs, combines them, and returns a single aggregation context.
  • Termination takes place when the ODCIAggregateTerminate() routine is invoked by Oracle as the final step of aggregation. The routine takes the aggregation context as input and returns the resulting aggregate value.

This is how these four main functions fit together…

The most important observations I would make are:

  1. You need to think very carefully about where and how you want to process your data. 
  2. There are essentially two options: 1) during the Iterate stage and/or 2) during the Terminate stage. Of course you need to remember that code used in the Iterate stage needs to be replicated in the Merge stage.
  3. It’s tempting to ignore the Merge stage, but if you do then when the function is run in parallel you won’t see any results!

What’s missing….

In the code sample below you will notice that I am missing two interfaces: ODCIAggregateDelete() and ODCIAggregateWrapContext(). Both functions are optional and for the purposes of creating a replacement for the LISTAGG function, these two functions were not needed. But for the sake of completeness below is a brief description of each function:  

ODCIAggregateDelete() removes an input value from the current group. The routine is invoked by Oracle by passing in the aggregation context and the value of the input to be removed. It processes the input value, updates the aggregation context, and returns the context. This is an optional routine and is implemented as a member method.

ODCIAggregateWrapContext() integrates all external pieces of the current aggregation context to make the context self-contained. Invoked by Oracle if the user-defined aggregate has been declared to have external context and is transmitting partial aggregates from slave processes. This is an optional routine and is implemented as a member method.

I have searched the internet for examples of when and how to use these two optional functions and there is not much out there. One example was posted by Gary Myers http://blog.sydoracle.com/2005/09/analytics-with-order-by-and-distinct.html which is another derivation of the STRAGG function. In Gary’s example the objective is to return rows until 5 distinct values of a specific column have been returned. I think it is possible to do this without resorting to implementing this requirement within the ODCIAggregateTerminate function, but I will leave you to ponder that one!

 

Schema - Sales History

For this simple LISTAGG alternative I am using the sample sales history schema and I am going to create a concatenated list of the first name and last name of each customer within each sales region. To make the code examples a little easier to read I have created a view over the CUSTOMERS and COUNTRIES table:

CREATE OR REPLACE VIEW MY_LIST AS
SELECT
g.country_region_id,
c.cust_first_name,
c.cust_last_name
FROM countries g, customers c
WHERE g.country_id = c.country_id
ORDER BY g.country_region_id;

 

 If we try to run the following SQL query we will get the usual LISTAGG overflow error:

SELECT 
country_region_id,
LISTAGG(cust_first_name||' '||cust_last_name) WITHIN GROUP (ORDER BY country_region_id) AS "customers"
FROM MY_LIST
GROUP BY country_region_id
ORDER BY country_region_id;

 

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Now let’s use the User Defined Aggregates framework to resolve this issue…

 

Stage 1 - Building the basic framework

First we need a storage object to hold the results generated during the Iterate stage. Following Tom Kyte’s example I am using an array/table as my storage object as this will ensure that I never hit the limits of the VARCHAR2 object when I am building my list of string values. Here I have checked the maximum size of the concatenated first name and last name combinations which is 21 characters. This means I can set the limit for the varchar2 column at 25 characters (…giving myself a little bit of headroom, just in case…).

CREATE OR REPLACE TYPE string_varray AS TABLE OF VARCHAR2(25);

Note that if we did not use an array we would be forced to use a single VARCHAR2 variable to hold the string values being pieced together. This would mean testing the length of the VARCHAR2 object before adding the next string value within both the Iterate and Merge functions, which causes all sorts of problems! As with most things in life, when working with the user defined aggregate interfaces it pays to keep things as simple as possible and put the processing in the most appropriate place. Therefore, in this case the best solution is to use an array because it makes the code simple and processing more efficient! 

Here is the definition of the interfaces:

CREATE OR REPLACE TYPE t_string_agg AS OBJECT
(
a_string_data string_varray,

STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
value IN VARCHAR2 )
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
returnValue OUT VARCHAR2,
flags IN NUMBER)
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
ctx2 IN t_string_agg)
RETURN NUMBER
);
/
SHOW ERRORS

 

Initially I included some additional variables at the start of the OBJECT definition for managing various counters and the maximum string length. However, it’s not possible to pass parameters into this function except for the string value to be processed. Therefore, I removed the counter definitions from the header and just instantiated my array. This actually makes perfect sense because the counters and maximum string length variables are only needed at one specific point within the process flow - more on this later -  plus this keeps the definition stage nice and simple. 

Now we have the definition for our interfaces which means we can move on to the actual coding associated with each interface.

 

Stage 2 - Coding the Initialization phase

The code for this interface is relatively simple because all we need to do is instantiate the array that will hold our list of string values being passed into the aggregation process. We do this using a call to the object named in the previous section: sctx := t_string_agg(string_varray() ); This is the start of our processing flow…

STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
RETURN NUMBER IS
BEGIN
sctx := t_string_agg(string_varray() );

RETURN ODCIConst.Success;
END;

 

Stage 3 - Coding the Iterate phase

The iterate phase will get called multiple times as we process the string values that need to be aggregated. At this point all we need to do is collect the string values being passed in and insert them into the array object that was instantiated during the initialization phase. We use the extend function to add a new row to the array and assign the string value. The important point to note here is that the Iterate process will also form part of the parallel execution framework, therefore we need to keep the processing as simple as possible.

MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
value IN VARCHAR2)
RETURN NUMBER IS
BEGIN
a_string_data.extend;
a_string_data(a_string_data.count) := value;

RETURN ODCIConst.Success;
END;

Given that we want to try and ensure that we don’t exceed the limits of the VARCHAR2 object and generate an “ORA-01489“ error, it’s tempting to code this logic within the Iterate function. The issue with placing the processing logic within this function is that all the logic will need to be replicated within the ODCIAggregateMerge function to cope with parallel execution, where multiple Iterate functions are executed and the results merged to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function.

 

Stage 4 - Coding the Merge phase

This function gets called during parallel execution and is used to merge results from multiple Iterate processes to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function. The basic logic needs to replicate the logic from the ODCIAggregateIterate function but needs to take account of multiple values coming into the function via the CTX2 instantiation of our string function rather than the single string value being passed into the Iterate function. In essence the CTX2 provides the data from the various parallel execution processes and the self object simply accumulates the results being passed in from the various processes.

MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
ctx2 IN t_string_agg)
RETURN NUMBER IS
BEGIN
FOR i IN 1 .. ctx2.a_string_data.count
LOOP
a_string_data.extend;
a_string_data(a_string_data.count) := ctx2.a_string_data(i);
END LOOP;

RETURN ODCIConst.Success;
END;

If we tried to enforce the maximum string length functionality within the Iterate function then the same processing would need to be enforced within this function as well, and having tried to do it I can tell you that the code very quickly gets very complicated. This is not the right place to do complex processing that is better done in the final, terminate, phase. If you do find yourself putting a lot of code within either or both of the Iterate and/or Merge functions then I would recommend taking a very hard look at whether the code would be better and simpler if it was placed within the Terminate stage.

 

Stage 5 - Coding the Terminate phase

 Given that we want to avoid blowing the limits of the VARCHAR2 object this is the most obvious place to code our processing logic. By placing the code here, we can ensure that our string processing function will work when the query is run in both serial and parallel.  Here is the code:

MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
returnValue OUT VARCHAR2,
flags IN NUMBER)
RETURN NUMBER IS

l_data varchar2(32000);
ctx_len NUMBER;
string_max NUMBER;
BEGIN
ctx_len := 0;
string_max := 100;

FOR x IN (SELECT column_value FROM TABLE(a_string_data) order by 1)
LOOP
IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
l_data := l_data || ',' || x.column_value;
ELSE
ctx_len := ctx_len + 1;
END IF;
END LOOP;

IF ctx_len > 1 THEN
l_data := l_data || '...(' || ctx_len||')';
END IF;

returnValue := LTRIM(l_data, ',');

RETURN ODCIConst.Success;
END;

Note that the function returns the list of string values from the array sorted in alphabetical order: (SELECT column_value FROM TABLE(a_string_data) order by 1). If you don’t need a sorted list then you can remove the ORDER BY clause. Cycling through the ordered list of strings allows us to add a comma to separate each value from the next but first we need to check the length of the list of strings to see if we have reached the maximum string length set by the variable string_max. As soon as we reach this maximum value we start counting the number of values that are excluded from the final list:

    IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
l_data := l_data || ',' || x.column_value;
ELSE
ctx_len := ctx_len + 1;
END IF;

For this code sample the maximum string length is set as 100 characters. Once we reach that value then we simply continue looping through the array but now we increment our counter (ctx_len).  Given that we have sorted the list of string values, the next question is: can we remove duplicate values? The answer is yes! It is a relatively simple change:

 FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1)

There is an obvious cost to pay here because the use of the DISTINCT keyword will require additional processing which will have a performance impact so think very carefully about whether this is actually needed. 

 

Full code sample

Here is the complete definition of the code:

 

CREATE OR REPLACE TYPE BODY t_string_agg IS
STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
RETURN NUMBER IS
BEGIN
sctx := t_string_agg(string_varray() );

RETURN ODCIConst.Success;
END;
MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
value IN VARCHAR2)
RETURN NUMBER IS
BEGIN
a_string_data.extend;
a_string_data(a_string_data.count) := value;

RETURN ODCIConst.Success;
END;
MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
returnValue OUT VARCHAR2,
flags IN NUMBER)
RETURN NUMBER IS

l_data varchar2(32000);
ctx_len NUMBER;
string_max NUMBER;
BEGIN
ctx_len := 0;
string_max := 100;

FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1)
LOOP
IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
l_data := l_data || ',' || x.column_value;
ELSE
ctx_len := ctx_len + 1;
END IF;
END LOOP;

IF ctx_len > 1 THEN
l_data := l_data || '...(' || ctx_len||')';
END IF;

returnValue := LTRIM(l_data, ',');

RETURN ODCIConst.Success;
END;
MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
ctx2 IN t_string_agg)
RETURN NUMBER IS
BEGIN
FOR i IN 1 .. ctx2.a_string_data.count
LOOP
a_string_data.EXTEND;
a_string_data(a_string_data.COUNT) := ctx2.a_string_data(i);
END LOOP;

RETURN ODCIConst.Success;
END;
END;
/
SHOW ERRORS
 

Stage 6 - Coding the actual string function

The last step is to create a function that calls our string processing object and takes a string (VARCHAR2) value as its input:

CREATE OR REPLACE FUNCTION string_agg (p_input VARCHAR2)
RETURN VARCHAR2
PARALLEL_ENABLE AGGREGATE USING t_string_agg;

 

Bringing it all together…

Now let’s run a query using our replacement for LISTAGG()..

 

SELECT 
country_region_id,
string_agg(cust_first_name||' '||cust_last_name) AS s_str_list
FROM MY_LIST
GROUP BY country_region_id
ORDER BY country_region_id;
 

and here is the output:

 

Results from serial execution of query

 

The length of the concatenated string for each region is limited to a maximum of 100 characters and we have a count of the number of values excluded from the list, shown in brackets at the end of the list. Just so you can check that the truncation is occurring correctly, I have included an additional column in the resultset below which shows the length of the final string (including the three dots + the number showing the count of truncated values).

 

Count of final string length

 

 

Conclusion

First a huge vote of thanks to Tom Kyte for providing the original code and ideas behind this particular blog post.  

If you don’t have access to Database 12c Release 2 and the enhancements that we made to LISTAGG to control  “ORA-01489“ errors then here is a “usable” alternative that provides a lot more control over string concatenation. Obviously if you want string concatenation with a count of missing values and without a count then you will need to create two separate functions. If you need a distinct list of values for some queries and not others then you will need to create separate functions to handle the different processing.

This example also provides a great introduction to the area of user defined aggregates, which were introduced in Oracle 9i. A quick Google search returned a lot of similar examples but no “real-world” use cases so I assume that UDAs are a little used gem within the long list of Oracle Database features. UDAs are a very powerful and flexible feature that you should definitely add to your toolbox of skills.

     


Wednesday, 4 January 2017

Simplifying your data validation code with Database 12.2

Data is good

Image courtesy of pixabay.com
 

It doesn’t matter how much testing you do (well, it actually does, but that’s a whole different issue), you can almost guarantee that at some point your beautiful data validation code, which parses data input from a web form or loads data from some external file, will pop up with the error:

SQL Error: ORA-01722: invalid number

01722. 00000 - "invalid number"
*Cause: The specified number was invalid.
*Action: Specify a valid number.

Of course, what is really annoying at this point is that you don’t know which column value of the record failed (assuming that you have more than one numeric column).

Managing conversion errors during data loads

What to do? Of course, the sensible thing is to add lots of data validation checks into your code to try and catch the situations where the wrong type of data arrives from your data source. However, it’s likely that all the additional validation checks will slow down the process of inserting data, which is not a great result.

If your data is arriving via an external file then you can use the BADFILE clause to capture records that cannot be loaded because of data type errors. But what if the data source for your insert statement is a staging table that was populated by an ETL job or a series of values from a web form? 
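For example, here is a minimal external table sketch showing where the BADFILE clause fits; the directory object, file names and column data types are my own assumptions rather than part of the original example:

CREATE TABLE staging_emp_ext (
  empno    NUMBER(6),
  ename    VARCHAR2(10),
  job      VARCHAR2(9),
  mgr      NUMBER(4),
  hiredate DATE,
  sal      NUMBER(7,2),
  comm     NUMBER(7,2),
  deptno   NUMBER(2))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE data_dir:'emp_load.bad'  -- records that fail data type conversion land here
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL)
  LOCATION ('emp_load.csv'))
REJECT LIMIT UNLIMITED;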

How to manage conversion errors during INSERTs

Panic over - Database 12c Release 2 contains important changes to the CAST and TO_xxx functions to manage the most common data conversion errors. The CAST function now has the ability to return a user-specified value if there is a conversion error.
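As a quick standalone illustration of the new syntax (my own sketch, run against DUAL rather than the demo schema):

SELECT
  CAST('SALES' AS NUMBER DEFAULT -1 ON CONVERSION ERROR) AS bad_number,
  CAST('31-FEB-81' AS DATE DEFAULT NULL ON CONVERSION ERROR, 'DD-MON-RR') AS bad_date
FROM dual;

The first expression returns -1 and the second returns NULL instead of raising a conversion error.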

For example, let’s build a simple staging table in the schema:

CREATE TABLE STAGING_EMP
( "EMPNO" VARCHAR2(6),
  "ENAME" VARCHAR2(10),
  "JOB" VARCHAR2(9),
  "MGR" VARCHAR2(4),
  "HIREDATE" VARCHAR2(10),
  "SAL" VARCHAR2(7),
  "COMM" VARCHAR2(9),
  "DEPTNO" VARCHAR2(6));

and let’s insert some data, which includes values that will cause data conversion errors when we try to add the values into our target table:

-- INSERTING DATA INTO STAGING_EMP
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
            values ('GB9369','SMITH','CLERK','7902','17-DEC-80','800',null,'20');

-- INVALID DATE
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9499','ALLEN','SALESMAN','7698','31-FEB-81','1600','300','30');

-- INVALID NUMBER FOR DEPTNO
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9521','WARD','SALESMAN','7698','22-FEB-81','1250','500','SALES');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('US9566','JONES','MANAGER','7839','02-APR-81','2975',null,'20');

Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9782','CLARK','MANAGER','7839','09-JUN-81','2450',null,'10');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('FR9788','SCOTT','ANALYST','7566','19-APR-87','3000',null,'20');

-- INVALID NUMBER FOR MGR KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9839','KING','PRESIDENT','null','17-NOV-81','5000',null,'10');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('DE9844','TURNER','SALESMAN','7698','08-SEP-81','1500',0,'30');

Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9876','ADAMS','CLERK','7788','23-MAY-87','1100',null,'20');

 

Now let’s try inserting the data from our staging table into the EMP table and see what happens:

INSERT INTO scott.emp SELECT * FROM staging_emp;

… and not surprisingly I get the following error:

Error starting at line : 52 in command -
INSERT INTO emp SELECT * FROM staging_emp
Error report -
SQL Error: ORA-01722: invalid number
01722. 00000 - "invalid number"
*Cause: The specified number was invalid.
*Action: Specify a valid number.

I can deal with this situation in a couple of different ways. First, let’s try to discover which rows and columns in my staging table contain values that are likely to cause data conversion errors. To do this I am going to use the new VALIDATE_CONVERSION() function, which identifies problem data that cannot be converted to the required data type. It returns 1 if a given expression can be converted to the specified data type, and 0 if it cannot.
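As a quick sanity check (again my own sketch against DUAL, not part of the original worked example):

SELECT
  VALIDATE_CONVERSION('800' AS NUMBER) AS good_number,
  VALIDATE_CONVERSION('SALES' AS NUMBER) AS bad_number,
  VALIDATE_CONVERSION('31-FEB-81' AS DATE, 'DD-MON-RR') AS bad_date
FROM dual;

which should return 1, 0 and 0 respectively. Applying the same function to every column of the staging table: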

SELECT
  VALIDATE_CONVERSION(empno AS NUMBER) AS is_empno,
  VALIDATE_CONVERSION(mgr AS NUMBER) AS is_mgr,
  VALIDATE_CONVERSION(hiredate AS DATE) AS is_hiredate,
  VALIDATE_CONVERSION(sal AS NUMBER) AS is_sal,
  VALIDATE_CONVERSION(comm AS NUMBER) AS is_comm,
  VALIDATE_CONVERSION(deptno AS NUMBER) AS is_deptno 
FROM staging_emp;

This produces a table where I can easily pick out the rows where the data conversion is going to succeed (column value is 1) or fail (column value is 0):

Results from VALIDATE_CONVERSION function

I could use this information to filter the data in my staging table as I insert it into my EMP table, or I could use the enhanced CAST and TO_xxx functions within the INSERT INTO … SELECT statement.

The CAST function (along with the TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can now return a user-specified value, instead of an error, when data type conversion errors occur. This reduces failures during data transformation and data loading processes.
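The TO_xxx variants accept the same DEFAULT … ON CONVERSION ERROR clause; here is a minimal sketch, with values and defaults chosen purely for illustration:

SELECT
  TO_NUMBER('SALES' DEFAULT 99 ON CONVERSION ERROR) AS deptno_fixed,
  TO_DATE('31-FEB-81' DEFAULT NULL ON CONVERSION ERROR, 'DD-MON-RR') AS hiredate_fixed
FROM dual;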

Therefore, my new 12.2 self-validating SELECT statement looks like this:

INSERT INTO emp
SELECT
  empno,
  ename,
  job,
  CAST(mgr AS NUMBER DEFAULT 9999 ON CONVERSION ERROR),
  CAST(hiredate AS DATE DEFAULT sysdate ON CONVERSION ERROR),
  CAST(sal AS NUMBER DEFAULT 0 ON CONVERSION ERROR),
  CAST(comm AS NUMBER DEFAULT null ON CONVERSION ERROR),
  CAST(deptno AS NUMBER DEFAULT 99 ON CONVERSION ERROR)
FROM staging_emp
WHERE VALIDATE_CONVERSION(empno AS NUMBER) = 1;

This results in five rows being inserted into my EMP table, which means that four rows were rejected during the insert process (rows 1, 4, 6 and 8) because their EMPNO values could not be converted to a number. Here is the data that was loaded:

Data loaded into emp table from staging table

We can see that on row 1 the HIREDATE was invalid so it was replaced by the value of SYSDATE (07-JUL-16). On row 2 the value of DEPTNO is the conversion default of 99, and on row 4 the value for MGR is the conversion default of 9999.
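If you want to double-check the number of filtered rows, a simple count using VALIDATE_CONVERSION (my own addition) should return 4:

SELECT COUNT(*) AS rejected_rows
FROM staging_emp
WHERE VALIDATE_CONVERSION(empno AS NUMBER) = 0;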

Conclusion

The enhanced  CAST function (along with TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can help you deal with data conversion errors without having to resort to complicated PL/SQL code or writing data validation routines within your application code.

The new VALIDATE_CONVERSION() function can be used to help you identify column values that cannot be converted to the required data type.

Hope these two features are useful. Enjoy!

Don’t forget that LiveSQL is now running Database 12c Release 2, so check out all the new tutorials and code samples that have recently been posted. I have posted a tutorial for the features discussed above: https://livesql.oracle.com/apex/livesql/file/tutorial_EDVE861IMHO1W3Q654ES9EQQW.html


Wednesday, 9 November 2016

SQL Pattern Matching Deep Dive - Part 5, SKIP TO where exactly?


Image courtesy of flicker.com

 

So far in this series we have looked at how to ensure query consistency, how to correctly use predicates, how to manage sorting, how to use the built-in measures to help optimise your code, and the impact of different types of quantifiers.

In this post I am going to review what MATCH_RECOGNIZE does after a match has been found i.e. where the search begins for the next match. It might seem obvious, i.e. you start at the next record, but MATCH_RECOGNIZE provides a lot of flexibility in this specific area (as you would expect).

Basic Syntax

We use the AFTER MATCH SKIP clause to determine the precise point to resume row pattern matching after a non-empty match is found. If you don’t supply an AFTER MATCH SKIP clause then the default is AFTER MATCH SKIP PAST LAST ROW.

Of course there are quite a few options available:

  • AFTER MATCH SKIP TO NEXT ROW Resume pattern matching at the row after the first row of the current match.
  • AFTER MATCH SKIP PAST LAST ROW Resume pattern matching at the next row after the last row of the current match.
  • AFTER MATCH SKIP TO FIRST pattern_variable Resume pattern matching at the first row that is mapped to the pattern variable.
  • AFTER MATCH SKIP TO LAST pattern_variable Resume pattern matching at the last row that is mapped to the pattern variable.
  • AFTER MATCH SKIP TO pattern_variable The same as AFTER MATCH SKIP TO LAST pattern_variable.

 

Using Pattern Variables and ORA-62514

Note that you can set the restart point to be linked to a specific pattern variable which allows you to work with overlapping patterns - i.e. where you are searching for “shapes” within your data set such as “W” shaped patterns within our ticker data stream. But what happens if the pattern variable within the SKIP TO clause is not matched? Let’s look at the following example:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES STRT.tstamp AS start_tstamp,
                     LAST(UP.tstamp) as end_tstamp,
          MATCH_NUMBER() AS match_num,
          CLASSIFIER() AS var_match
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO DOWN
 PATTERN (STRT DOWN* UP)
 DEFINE
       DOWN AS DOWN.price < PREV(DOWN.price),
       UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.tstamp;

Here we are stating that we need zero or more matches of the variable DOWN to occur, and once a match has been found we will resume the search for the next pattern at the DOWN event. With this pattern it is possible that DOWN will never be matched, so the AFTER MATCH SKIP TO DOWN cannot happen even though a complete match for the pattern is found. Therefore, the compiler throws an error to let you know that this code will not work:

ORA-62514: AFTER MATCH SKIP TO variable is not bounded in the match found.
62514. 00000 - "AFTER MATCH SKIP TO variable is not bounded in the match found."
*Cause: AFTER MATCH SKIP TO variable was not bound in the match found due
to pattern operators such as |, *, ?, and so on.
*Action: Modify the query and retry the operation

Therefore, you need to change the pattern to search for one or more instances of DOWN (DOWN+) rather than zero or more (DOWN*), as this allows the DOWN event to be matched at least once and therefore be available for AFTER MATCH SKIP TO processing.
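Here is a sketch of the corrected query; the only changes from the version above are the quantifier in the PATTERN clause and, as a result, a DOWN variable that is always bound:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES STRT.tstamp AS start_tstamp,
          LAST(UP.tstamp) AS end_tstamp,
          MATCH_NUMBER() AS match_num,
          CLASSIFIER() AS var_match
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO DOWN       -- now valid: DOWN is always bound in a match
 PATTERN (STRT DOWN+ UP)        -- one or more DOWN rows instead of zero or more
 DEFINE
       DOWN AS DOWN.price < PREV(DOWN.price),
       UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.tstamp;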

Skipping PAST LAST ROW [DEFAULT]

This is the default behaviour and in many circumstances it is the most obvious choice. In these situations, when searching for the next pattern, it makes sense to resume at the row after the last match, since going back over previous rows would only result in more rows than necessary being processed. For example, let’s look at the sessionization example: http://oracle-big-data.blogspot.co.uk/2014/02/sessionization-with-12c-sql-pattern.html and if you want to try the code see the tutorial on the LiveSQL site.

 Looking at the source data for the sessionization example it’s clear that as we walk through the entries in the log file to check if an entry is part of the current session or not, there is no point in stepping backwards to begin searching again once a match has been found.

Source data for sessionization example

 

You can run the code for this sessionization example on LiveSQL.
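For completeness, here is a rough sketch of what such a sessionization query can look like; the table and column names (web_log, user_id, event_time) and the 10-second gap are my own assumptions rather than the exact code from the linked tutorial:

SELECT user_id, session_id, session_start, no_of_events
FROM web_log MATCH_RECOGNIZE (
 PARTITION BY user_id
 ORDER BY event_time
 MEASURES MATCH_NUMBER() AS session_id,
          FIRST(event_time) AS session_start,
          COUNT(*) AS no_of_events
 ONE ROW PER MATCH
 AFTER MATCH SKIP PAST LAST ROW  -- the default, stated explicitly
 PATTERN (strt same_session*)
 DEFINE
       same_session AS event_time <= PREV(event_time) + INTERVAL '10' SECOND
);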

Looking for shapes and controlling skipping

As I previously stated, you might think the obvious position to start searching for the next occurrence of a pattern is the next record after the last row of the current match. But what if there are overlapping patterns where the middle of an earlier match overlaps with the start of the next match? For example if we are looking for a w-shaped pattern within our ticker data set then it is quite possible to have overlapping w-shapes where the next “W” starts within the second down phase of the previous ”W”.

Fortunately MATCH_RECOGNIZE  provides great flexibility in terms of being able to specify the restart point. If we look at the source data for the ACME symbol within our ticker data set then we can see that there are overlapping W-shapes (assuming we allow for the flat-top in the middle of the 2nd w-shape by using the <= and >= tests for each pattern variable!).


Twin W-shapes within source data

 

Let’s use this example to explore the various AFTER MATCH SKIP TO options…starting with the default behaviour:

 

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (STRT x+ y+ w+ z+)
DEFINE
  x AS x.price <= PREV(x.price),
  y AS y.price >= PREV(y.price),
  w AS w.price <= PREV(w.price),
  z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, MR.start_w;

 

returns only one match within the ACME data set:

Only one w-shape found in ACME data set

 

and if we expand the output using ALL ROWS PER MATCH, we can see how the pattern was matched: it starts on 05-Apr-11 with pattern variable STRT and ends on 14-Apr-11 with pattern variable Z.

All rows report for single W shape

 

 

Now let’s change the above code sample so that, after the first pattern has been found, we resume searching at the last row mapped to the Y variable - i.e. row 6 in the report above, 10-Apr-11.

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP TO LAST Y
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
     x AS x.price <= PREV(x.price),
     y AS y.price >= PREV(y.price),
     w AS w.price <= PREV(w.price),
     z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol,start_w;

which now finds two w-shapes with the second W starting on 10-Apr-11 and ending on 18-Apr-11:

Two w shapes within ticker data stream

but what is going on under the covers?

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w,
          classifier() AS pv,
          match_number() AS mn,
          count(*) as row_count
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO LAST Y
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price), 
      z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

 

now shows us that the records for 10-Apr-11 to 14-Apr-11 were actually processed twice:

 

Detailed report for 2 w-shapes

Skip to next row?

What about using the SKIP TO NEXT ROW syntax? How does that affect our results? It is important to remember that this will force MATCH_RECOGNIZE to resume pattern matching at the row after the first row of the current match. Using our ticker data we can see that this would actually increase the number of W-shapes to three!

 

3 W shapes in our ticker data stream

 

In match 2 we have two occurrences of pattern variable x. Therefore, once the second W-shape has been matched, the search process restarts on row 12, i.e. the row after the first row of the current match (row 11, which is mapped to STRT).

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP TO NEXT ROW
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price),
      z AS z.price >= PREV(z.price)
 ) MR
WHERE symbol='ACME'
ORDER BY symbol, mr.start_w;

creates the following output:

Three w shapes within ticker stream

 

and if we change our code to return the more detailed report we can see how the pattern is being matched:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w,
          classifier() AS pv,
          match_number() AS mn,
          count(*) as row_count
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO NEXT ROW
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price),
      z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

 

which produces the following output:

Detailed report showing mapping of 3 w-shapes

 

Note that match two, the 2nd W-shape, starts on line 11 but we began the search for this second match on row 2, i.e. the next row after the first STRT variable. Similarly, the search for the third W-shape began on row 12, the row after the second STRT variable. Given that our original data set for ACME only contains 20 rows, you can see from this example how it is possible to do a lot more processing when you start to fully exploit the power of the AFTER MATCH SKIP syntax.

Just accept the default?

The AFTER MATCH SKIP clause determines the point at which we resume searching for the next match after a non-empty match has been found. The default for the clause is AFTER MATCH SKIP PAST LAST ROW: resume pattern matching at the next row after the last row of the current match. In most examples of using MATCH_RECOGNIZE you will notice that the AFTER MATCH clause is not present and the developer simply assumes that the AFTER MATCH SKIP PAST LAST ROW clause is applied. This obviously does not help the next developer who has to amend the code to fit new business requirements.

Therefore, my recommendation is that you should always clearly state where you want the matching process to start searching for the next match. Never assume the default behaviour will be good enough!

Summary

We are getting near the end of this series of deep dive posts. Hopefully this post has explained the ways in which you can use the AFTER MATCH SKIP… clause to ensure that you capture all of the required patterns/shapes within your data set. It’s always a good idea to explicitly include this clause: if you don’t want to allow for overlapping matches then clearly state this in your code by using the AFTER MATCH SKIP PAST LAST ROW clause. Don’t assume the default will kick in and that the next developer will have time to read all your detailed documentation when making the next round of changes to the code.

Don’t forget to try our pattern matching tutorials and scripts on LiveSQL and all the above code examples are available via the “Skip to where?” tutorial on livesql.oracle.com.

 

What’s next?

In the next post in this series I am going to review the keywords that control the output from MATCH_RECOGNIZE: ALL ROWS vs. ONE ROW. Feel free to contact me if you have any interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

 

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.

 
