Monday, 23 January 2017

How to intelligently aggregate approximations


Simple abacus with coloured beads

The growth of low-cost storage platforms has allowed many companies to actively seek out new external data sets and combine them with internal historical data that goes back over a very long time frame. As both the variety and the volume of data continue to grow, the challenge for many businesses is how to process this ever-expanding pool of data and, at the same time, make timely decisions based on all the available data.
(Image above courtesy of http://taneszkozok.hu/)
In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis, at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated because it is not possible to “reuse” an aggregated approximate result as a basis for aggregating the results to an even higher level across the various dimensions of the original query. To obtain a valid approximate result set requires a query to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level. In many cases you simply cannot roll up aggregations of aggregations.

Cannot aggregate data from existing approximate result sets
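To see why, here is a minimal sketch of the naive roll-up, using a hypothetical weekly_sales table (the table and column names are purely illustrative). Summing the county-level approximate distinct counts double counts any product sold in more than one county, so the resulting “state” figure is not a distinct count at all:

-- Naive roll-up: this does NOT return a distinct count at state level
SELECT state,
 SUM(apx_products) AS not_a_distinct_count
FROM (SELECT state,
  county,
  APPROX_COUNT_DISTINCT(prod_id) AS apx_products
 FROM weekly_sales
 GROUP BY state, county)
GROUP BY state;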

Until now….

With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be “rolled-up” to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how we do it…. Essentially there are three parts that provide the “intelligence”:
  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx
Here is a quick overview of each function:

APPROX_xxxxxx_DETAIL

This function takes a numeric expression and builds a summary result set containing results for all the dimensions in the GROUP BY clause. The output from this function is a column containing BLOB data. As with the other approximate functions, the results can be deterministic or non-deterministic depending on your requirements.

APPROX_xxxxxx_AGG 

This function builds a higher level summary based on the results from the _DETAIL function. This means that it is not necessary to re-query the base fact table in order to derive new aggregate results. As with the _DETAIL function, the results are returned as a BLOB.

TO_APPROX_xxxxxx

Returns the results from the _AGG and _DETAIL functions in a user-readable format.

Three new functions for managing approximate aggregations

A worked example

Let’s build a working example using the sample SH schema. Our product marketing team wants to know within each year the approximate number of unique customers within each product category. Thinking ahead we know that once they have this result set we expect them to do further analysis such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.
SELECT
 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This returns my result set as a BLOB, as shown below. The BLOB contains the various tuple combinations from my GROUP BY clause, which means I can reuse this result set to answer new questions based around higher levels of aggregation.

Aggregated results from query contained within BLOB column

The explain plan for this query shows the new sort keywords (GROUP BY APPROX), which tell us that approximate processing has been used as part of this query.
Explain plan showing new sort keywords

If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:

SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This creates the following results:
shows use of TO_APPROX function to transform blob data
Alternatively, we could create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function as follows:
CREATE TABLE agg_cd AS
SELECT

 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;
Using this table we can simplify our query to return the approximate number of distinct customers directly from the aggregated table:
SELECT
 calendar_year,
 qtr,
 category_desc,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;
which returns the same results as before - as you would expect!

using TO_APPROX function directly against BLOB column

Using the aggregated table as our source, we can now change the levels that we wish to calculate without having to go back to the original source table and scan all the rows again. However, to extract the new aggregations we need to introduce the third function, APPROX_COUNT_DISTINCT_AGG, into our query and wrap it within the TO_APPROX_COUNT_DISTINCT function to see the results:
SELECT
 calendar_year,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;
This returns the following results, based only on the new combination of levels included in the GROUP BY clause:

Using higher level aggregations


Summary

This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:
  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx
 Database 12c Release 2 makes it possible to intelligently aggregate approximations. In the next post I will explore how you can combine approximate processing with existing query rewrite functionality so you can have intelligent approximate query rewrite.


Wednesday, 18 January 2017

Dealing with very very long string lists using Database 12.2


Lots Of Pieces String

Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:
ORA-01489: result of string concatenation is too long.
Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature; however, this requires 12c Release 1, which is not always available to all DBAs and/or developers.

If you want to replicate the problem and you have access to the sample SH schema then try executing this query:
SELECT
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
All the samples in this post use our sample SH schema. Once we release the on-premise version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I have a published a tutorial on LiveSQL so you can play with all the new keywords covered in this post.

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:
show parameter MAX_STRING_SIZE
in my demo environment this returns the following:
NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD
Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit was raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. Setting MAX_STRING_SIZE = EXTENDED enables the new 32767 byte limit.
ALTER SYSTEM SET max_string_size=extended SCOPE= SPFILE;
However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets.

What is needed is a richer syntax within the LISTAGG function and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:
  • ON OVERFLOW ERROR
  • ON OVERFLOW TRUNCATE
  • WITH COUNT vs. WITHOUT COUNT
Let’s look a little closer at each of these features…..


1. Keeping Pre-12.2 functionality
If you want your existing code to continue to return an error if the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit then the standard error will be returned:
ERROR at line xxx:
ORA-01489: result of string concatenation is too long

However, where possible I would recommend adding “ON OVERFLOW ERROR” to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
So it’s important to note that by default the truncation features are disabled and you will need to change any existing code if you don’t want an error to be raised.


2. New ON OVERFLOW TRUNCATE… keywords
If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

When truncation occurs we truncate back to the boundary of the last complete value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots ‘…’ to the string as an indicator that truncation has occurred, but you can override this as follows:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

If you want to keep the existing pre-12.2 behaviour where we return an error if the string is too long then you can either rely on the default behaviour or explicitly state that an error should be returned (always a good idea to avoid relying on default behaviour in my opinion) by using the keywords:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
which will now generate the normal error message - i.e. replicates the pre-12.2 behaviour:
ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.
Of course, you can simply omit the new keywords and get the same behaviour:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
which, as before, generates the normal error message - i.e. replicates the pre-12.2 behaviour:
ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

3. How many values are missing?
If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords ‘WITH COUNT’ - this is the default behaviour. Alternatively, if you don’t want a count at the end of the truncated string, you can use the keywords ‘WITHOUT COUNT’.
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

4. Do we split values when truncation occurs?
No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example that we have been using, which creates a list of customer names within each country, we will always include the customer’s full name “Keith Laker” (i.e. first name + last name). There has to be enough space to add the complete string (first + last name) to the list, otherwise the whole string “Keith Laker” is removed and the truncation indicator is inserted. It is not possible for the last value in the string to be only the first name with the last name truncated/removed.

5. How do we calculate the overall length of the string values?
The characters that indicate an overflow has occurred are appended to the end of the list of values - in this case the default value of three dots “...”. The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator followed by the user-defined overflow indicator, followed by the output from the ‘WITH COUNT’ clause, which adds a counter at the end of the truncated string to indicate the number of values that have been removed/truncated from the list.

Summary
With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size of VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically, there are several new keywords:
  • ON OVERFLOW TRUNCATE
  • ON OVERFLOW ERROR (default behaviour)
  • WITH COUNT (default behaviour)
  • WITHOUT COUNT
Hopefully this new functionality will mean that all those wonderful workarounds for dealing with “ORA-01489: result of string concatenation is too long“ errors that have been created over the years can now be replaced by standard SQL functionality. Don't forget to try the tutorial on livesql - it's a completely free service, all you need to do is register for an account.


Monday, 16 January 2017

MATCH_RECOGNIZE - What should I include in the MEASURES clause?

Image courtesy of wikipedia

This post is the result of reviewing a post on stackoverflow.com: http://stackoverflow.com/questions/41649178/getting-error-ora-00918-when-using-match-recognize. Here is my version of the code, which includes the same issues/errors as the original; however, I am using the TICKER table that I always use for the tutorials that I post on liveSQL:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY symbol, tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

The above example will not run because of the following error:

ORA-00918: column ambiguously defined
00918. 00000 - "column ambiguously defined"
*Cause:
*Action:
Error at Line: 1 Column: 8

So what is wrong with our code? As MATHGUY pointed out in his reply on stackoverflow.com - quite a lot actually! Let’s start by differentiating between “won’t run” and “wrong”. The ORA-00918 error is easy to resolve if you stare long enough at the ORDER BY clause! It is pointless to include the SYMBOL column as both the PARTITION BY key and the ORDER BY key. If we change the ORDER BY clause as shown here then the code will run:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which returns the following resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

No MEASURES clause

Okay, so our code is running - now what? If you look at the output you will notice that it contains the same rows and columns as the source table. What happens if we omit the MEASURES clause? Well, it’s optional so the code should still run…

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

and sure enough we get the same resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

ALL ROWS PER MATCH vs. ONE ROW

From the above we can assume that there is no need to list the source columns from your input table in the MEASURES clause because they are automatically included in the output. BUT this is ONLY true when you use ALL ROWS PER MATCH. If we change the output control to ONE ROW PER MATCH:

SELECT symbol, tstamp, price
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

you will now get an error:

ORA-00904: "PRICE": invalid identifier

00904. 00000 - "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 19 Column: 24

 

because when using ONE ROW PER MATCH the only columns that are automatically returned are those listed in the PARTITION BY clause. Therefore, we need to use either “SELECT * FROM …..” or “SELECT symbol FROM…” and list all the columns we want to be returned to get a working version of our code. Using “SELECT * FROM…” as follows:

SELECT * 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

 

actually returns only one column (symbol) from the ticker table:

 

Only partition by column returned

So what should we include in the MEASURES clause?

Based on the query that was in the original post I think the following syntax would make it easier to understand what is happening within the pattern matching process and provide useful information about the data that matches the pattern:

SELECT symbol, tstamp, price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(b.tstamp) AS first_date,
LAST(b.tstamp) AS last_date,
FIRST(b.price) AS first_price,
LAST(b.price) AS last_price,
match_number() AS m_n,
classifier() AS classi
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which gives us the following output based on ALL ROWS PER MATCH:

Output from Amended Measure Clause

 

and if we want to switch to using ONE ROW PER MATCH then we need to remove the references to the columns tstamp and price and replace them with references to the pattern-variable-specific versions, or we can just remove the references altogether. In this case, as we only have two pattern variables, we can NVL the references to return the required data:

SELECT symbol, o_tstamp, o_price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
nvl(a.tstamp, b.tstamp) as o_tstamp,
nvl(a.price, b.price) as o_price,
FIRST(b.tstamp) as first_date,
LAST(b.tstamp) as last_date,
FIRST(b.price) as first_price,
LAST(b.price) as last_price,
match_number() as m_n,
classifier() as classi
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

 

The above query generates slightly fewer rows compared with the previous statement because we are only returning one row for each match. However, note that this time we are referencing specific instances of tstamp and price.

New Measures For TSTAMP and PRICE

 

Summary

What have we learned:

Point 1: Check your PARTITION BY and ORDER BY clauses to ensure they make sense!

Point 2: There is no need to list the source columns from your input table in the MEASURES clause because they are automatically included, BUT ONLY when you use ALL ROWS PER MATCH.

Point 3: Decide on your output method and match the columns listed in the SELECT clause with those returned by either ALL ROWS PER MATCH or ONE ROW PER MATCH.

Point 4: It is always a good idea to check that your pattern is being applied correctly by using the built-in MATCH_NUMBER() and CLASSIFIER() measures.

 

For more general information about how to get started with MATCH_RECOGNIZE follow these links to previous blog posts:

and check out the growing library of tutorials on liveSQL.oracle.com. Hope this helps to throw some light on the workings of the MEASURES clause within MATCH_RECOGNIZE.


Friday, 13 January 2017

Are approximate answers the best way to analyze big data?

Lots of candy

Image courtesy of pixabay.com

In my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong which is very strange when you consider how often we interact with approximations as part of our everyday life.

Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal - how many click throughs did an advert generate, how many unique sessions were recorded for a web site etc. The use cases that I outlined provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows.

To support Analytics-of-Things, Database 12c Release 2 includes even more approximate functions. In this release we have added approximations for median and percentile computations and support for aggregating approximate results (counts, medians and percentiles).

What is a median and percentile?

A quick refresher course….according to wikipedia a percentile is:

a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

Percentiles are perfect for locating outliers in your data set. In the vast majority of cases you can start with the assumption that a data set exhibits a normal distribution. Therefore, if you take the data outside the 0.13th and 99.87th percentiles (i.e. outside 3 standard deviations from the mean) then you get the anomalies. Percentiles are great for allowing you to quickly eyeball the distribution of a data set so that you can check for skew or bimodalities etc. Probably the most common use case is around monitoring service levels, where these anomalies are the values of most interest.
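As a quick sketch of that idea - assuming a hypothetical service_response table with a response_time column (both names are purely illustrative) - the two 3-standard-deviation cut-offs could be located with a standard percentile function:

SELECT
 PERCENTILE_CONT(0.0013) WITHIN GROUP (ORDER BY response_time) AS lower_cutoff,
 PERCENTILE_CONT(0.9987) WITHIN GROUP (ORDER BY response_time) AS upper_cutoff
FROM service_response;

Any response time that falls outside these two values can then be flagged as an anomaly.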

On the other hand, a median is:

the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half. 

Why would you use median rather than the mean? In other words, what are the use cases that require median? Median is great at removing the impact of outliers because the data is sorted and then the middle value is extracted. The average is susceptible to being skewed by outliers. A great use case for median is in resource planning. If you want to know how many staff you should assign to manage your web-store application you might create a metric based on the number of sessions during the year. With a web-store the number of sessions will peak around key dates such as July 4th and Thanksgiving. Calculating the average number of sessions over the year will be skewed by these two dates and you will probably end up with too many staff looking after your application. Using the median removes the impact of these two spikes and will return a more realistic figure for the number of sessions per day during the year.
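As a simple sketch - assuming a hypothetical daily_sessions table with one row per day and a session_count column - you can see how the two measures behave differently:

SELECT
 AVG(session_count)    AS avg_sessions_per_day,
 MEDIAN(session_count) AS median_sessions_per_day
FROM daily_sessions;

The handful of peak days pull the average upwards, while the median is largely unaffected.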

But before you start to consider where, when, how or even if you want to consider using approximate calculations you need to step back for a moment and think about the accuracy of your existing calculations, which I am guessing you think are 100% accurate!

Is your data accurate anyway?

Most business users work on the assumption that the data set they are using is actually 100% accurate, and for the vast majority of operational sources flowing into the data warehouse this is probably true, although there will always be parentless dimension values and, in some cases, “Other” bucket dimension members used to create some semblance of accuracy.

As we start to explore big data related sources pulled from untrusted external sources and IoT sensor streams, which typically are inherently “noisy”, then the level of “accuracy” within the data warehouse starts to become a range rather than a single specific value.

Let’s quickly explore the three key ways that noise gets incorporated into data sets:

1) Human errors

Human input errors: probably the most obvious. They affect both internal and external sources that rely on human input or interpretation of manually prepared data. Free-format fields on forms create all sorts of problems because the answers need to be interpreted. Good examples are insurance claim forms, satisfaction surveys, crime reports, sales returns forms etc.

2) Coding errors

ETL errors: Just about every data source feeding a data warehouse goes through some sort of ETL process. Whilst this is loosely linked to the first group of errors, it falls into this group simply because of the number of steps involved in most ETL jobs. There are so many places where errors can be introduced.

Rounding and conversion errors: When an ETL job takes source data, converts it and then aggregates it before pushing it into the warehouse, it will always be difficult to trace the aggregated numbers back to the source data because of inherent rounding errors. When dealing with currency exchange rates it can be a little difficult to tie back source data in one currency to the aggregated data in the common currency due to tiny rounding errors.

3) Data Errors

Missing data points: Data always gets lost in translation somewhere down the line or is simply out of date. In many cases this is the biggest source of errors. For example, one bank recently put together a marketing campaign to stop customer churn. Before they launched the campaign one of their data scientists did some deeper analysis and discovered that the training data for the model included customers who were getting divorced, and this was being flagged as a lost customer. Including this group ended up skewing the results. The data about changes to marital status was not being pushed through to the data warehouse fast enough.

Meaningless or distracting data points: with the growth in interest in the area of IoT it is likely that this type of “noise” will become more prevalent in data sets. Sensor data is rarely 100% accurate mainly because in many cases it does not need to deliver that level of accuracy. The volume of data being sent from the sensor will allow you to easily remove or flag meaningless or distracting data. With weblogs it is relatively easy to ignore click-events where a user clicks on an incorrect link and immediately clicks the back-button.

In other words, in many situations, getting precise answers is nothing but an illusion: even when you process your entire data, the answer is still an approximate one. So why not use approximation to your computational advantage and in a way where the trade off between accuracy and efficiency is controlled by you? 

Use cases for these new features

There are a lot of really good use cases for these types of approximations but here are my two personal favorites:

Hypothesis testing — a good example of this is A/B testing, which is most commonly used in conjunction with website design and ad design to select the page design or ad that generates the best response. With this type of analysis it is not vital that you have accurate, precise values. What is needed is the ability to reliably compare results, and approximations are normally good enough.

Ranking — How does your ISP calculate your monthly usage so they can bill you fairly? They use a percentile calculation where they remove the top 2% - 5% of your bandwidth peaks and then use that information to calculate your bill. By using data below the 95th-98th percentile they can ignore the infrequent peaks when, say, you are downloading the latest update to your Android or iOS device. Again, having precise numbers for this percentile cut-off is not really necessary. A good enough approximation of the 95th percentile is usually sufficient because it implies that approximately 95% of the time your usage is below the data volume identified around that percentile. And conversely, the remaining 5% of the time your usage creeps above that amount.
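A rough sketch of that type of calculation - assuming a hypothetical daily_usage table with one row per subscriber per day and a gb_used column - might look like this using the new approximate function:

SELECT
 subscriber_id,
 APPROX_PERCENTILE(0.95) WITHIN GROUP (ORDER BY gb_used) AS billable_usage
FROM daily_usage
GROUP BY subscriber_id;

The infrequent peaks above the 95th percentile are simply ignored for billing purposes.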

Of course all the use cases that we considered for distinct counts in the first posts are also valid:

Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better compared to waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait for the answer to their query for more than a couple of seconds then their level of analytical thinking can be seriously impaired.

Market testing: most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.

Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course, you need to incorporate the likelihood of edge cases within your thinking process because there is the danger that the edge values will get lost within the general hashing process.

However, in these examples we usually end up merging or blending the first two use cases with the three above to gain a deeper level of insight. So now let’s look at the new approximate statistical functions introduced in Database 12.2.

Approximate median and percentile

With Database 12c Release 2 we have added two new approximate functions:

APPROX_PERCENTILE(%_number [DETERMINISTIC], [ERROR_RATE|CONFIDENCE]) WITHIN GROUP (ORDER BY expr [ DESC | ASC ])


This function takes up to three input arguments. The first argument is a numeric value between 0 and 1 that specifies the required percentile. The second argument, DETERMINISTIC, is optional: if it is provided the user requires deterministic results; if it is not provided, deterministic results are not mandatory. The third optional argument, ERROR_RATE or CONFIDENCE, returns the error rate or confidence interval associated with the approximation rather than the percentile value itself. The input expression for the function is taken from the expr in the ORDER BY clause.

 

The approx_median function has the following syntax:

APPROX_MEDIAN(expr [DETERMINISTIC], [ERROR_RATE|CONFIDENCE])

We can use these functions separately or together as shown here using the SH schema:

SELECT
  calendar_year,
  APPROX_PERCENTILE(0.25) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.25",
  TRUNC(APPROX_PERCENTILE(0.25, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-er",
  TRUNC(APPROX_PERCENTILE(0.25, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.25-ci",
  APPROX_MEDIAN(amount_sold deterministic) as "p-0.50",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE'),2) as "p-0.50-er",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE'),2) as "p-0.50-ci",
  APPROX_PERCENTILE(0.75 deterministic) WITHIN GROUP (ORDER BY amount_sold ASC) as "p-0.75",
  TRUNC(APPROX_PERCENTILE(0.75, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-er",
  TRUNC(APPROX_PERCENTILE(0.75, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) as "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year

The results from the above query are shown below

Resultset showing approx median and percentile calculations

Note that for the APPROX_MEDIAN function I have included the keyword “DETERMINISTIC”. What does this actually mean? In simple terms, it means that repeated executions of the function against exactly the same data set are guaranteed to return exactly the same approximate value.

Due to the nature of computing approximate percentiles and medians it is not possible to provide a specific and constant value for the error rate or the confidence interval. However, when we tested against a large scale, real world customer data set (a manufacturing use case) we saw an error range of around 0.1 - 1.0%. Therefore, in broad general terms, accuracy should not be a major concern.

Error rates and confidence intervals

How closely an approximate answer matches the precise answer is gauged by two important statistics:

  • margin of error
  • confidence level.

These two pieces of information tell us how well the approximation represents the precise value. For example, a result may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the analysis were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 runs.

In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is “very good.” The confidence level is cited as 95 percent plus or minus 3 percent. This information means that if the survey were conducted 100 times, the percentage who say service is “very good” will range between 47% and 53% most (95%) of the time (for more information see here: https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/).

Please note that if you search for more information about error rates and confidence levels then a lot of results will talk about sample sizes and working back from typical or expected error rates and confidence levels to determine the sample size needed. With approximate query processing we do not sample the source data. We always read all the source values - there is no sampling!

 

Performance - how much faster is an approximate result?

 As a test against a real world schema we took a simple query from the customer that computed a number of different median calculations:

 

SELECT count(*) FROM (SELECT /*+ NO_GBY_PUSHDOWN */ b15, median(b4000), median(b776), median(e), median(f), median(n), median(z) FROM mdv group by b15);

 

Performance Graph for non-approximate query

As you can see from the real-time monitoring page, the query accessed 105 million rows and the calculations generated 11GB of temp. That’s a lot of data for one query to spill to disk!

Resource usage for non-approximate query

 

Now if we convert the above query to use the approx_median function and rerun it, we can see below that we get very different levels of resource usage:

Performance graph for approximate query

 

Looking closely at the resource usage you can see that the query is 13x faster, uses considerably less memory (830Kb vs 1GB) but most importantly there is no usage of temp: 

Resource graph for approximate query

 

 Summary

One of the most important take-aways from this post relates to the fact that we always read all the source data. The approximate functions in Database 12c Release 2 do not use sampling as a way to increase performance. These new features are significantly faster and use fewer resources, which means more resources are available for other queries - allowing you to do more with the same level of resources.


Tuesday, 10 January 2017

SQL Pattern Matching Deep Dive - Part 6, state machines

The obvious way to start this particular post is to pose a couple of simple questions: what is a state machine and why should you care? In general I would say that you don't need to know about or care about state machines. That's the beauty of using SQL for pattern matching. The MATCH_RECOGNIZE clause encapsulates all the deep technical modelling and processing that has to be performed to run pattern matching on a data set. However, there are times when it is useful, probably vital, that you understand what is going on behind the scenes and one of the most obvious situations is when backtracking happens.

Therefore, the content covered in this post is going to be a gentle lead-in to my next post where I am going to discuss the concept of “backtracking” and the dreaded ORA-30009 error.

Let’s start our voyage of discovery… when you attempt to run a SQL statement containing a MATCH_RECOGNIZE clause, during the compilation phase we generate a finite state machine based on the PATTERN and DEFINE clauses in your statement. What is a Finite State Machine? According to wikipedia:

A finite-state machine (FSM)…is a mathematical model of computation....it is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time…changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.

Reference from wikipedia - https://en.wikipedia.org/wiki/Finite-state_machine

 

 

A state machine, which is the PATTERN and DEFINE elements of your MATCH_RECOGNIZE clause, can be represented by a directed graph called a state diagram. This diagram shows each of the possible states for the “machine” and the conditions that force the machine to either remain in its current state or move to the next state. Below is a simple example of a state machine:  

State Machine

On the above diagram each state is represented by a node (ellipse), which in this case are marked as “State 1” to “State 4”. The arrows, known as edges, show the transition(s) from one state to another. If you look at states 2 and 4 you will notice that they have two edges, although these edges are shown in different vertical positions on the diagram. When drawing a proper state diagram each edge is labeled with the event (condition) that triggers the transition. Events (conditions) that don’t cause a change of state are represented by a circular arrow returning to the original state and these can be seen on states 2 and 4.

The precedence for reading the information is top-down. What this means is that when in State 2 the FSM will test to see if State 3 can be achieved and, if it can’t, it will then test to see if State 2 can be maintained. The reverse is true for State 4, where the FSM will test to see if State 4 can be maintained and, if it can’t, it will then, in this example, either end - having determined that a match has completed - or start backtracking to try and complete a match. I am sure you can now see how this is going to link into my next blog post.

State machines are not limited to just pattern matching. They have all sorts of other uses. If you want a gentle diversion to look at state machines in a little more detail then try this article by Enrique Ortiz from the OTN magazine in August 2004: Managing the MIDlet Life-Cycle with a Finite State Machine.

All of this flows directly into the keywords that appear (or don’t appear) in the explain plan, which was covered in the post MATCH_RECOGNIZE and the Optimizer from January 2015. As a quick refresher… essentially there are four new keywords that you need to be aware of:

  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO

The first three bullet points are reasonably obvious. The last keyword is linked to the use of a “state machine”. Its appearance, or lack of appearance, affects the way our pattern is applied to our data set but that is all explained in the blog post. Most of my MATCH_RECOGNIZE examples are based on the stock ticker data set. Let’s assume that we are searching for V-shaped patterns in our data set (https://docs.oracle.com/database/121/DWHSG/pattern.htm#CACHHJJG):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES STRT.tstamp AS start_tstamp,
LAST(DOWN.tstamp) AS bottom_tstamp,
LAST(UP.tstamp) AS end_tstamp
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(down.price),
UP AS price > PREV(up.price)
) MR
ORDER BY MR.symbol, MR.start_tstamp;

 

this is what the state diagram would look like:

 

State Machine graph for V-Shaped pattern
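To relate the diagram back to the query, here is my own rough textual rendering of the transitions implied by PATTERN (STRT DOWN+ UP+) and the DEFINE clause (a sketch only, not Oracle's internal representation):

 Start --> STRT    (STRT is undefined, so it matches any row)
 STRT  --> DOWN    when price < PREV(price)
 DOWN  --> DOWN    while price < PREV(price)   (the DOWN+ loop)
 DOWN  --> UP      when price > PREV(price)
 UP    --> UP      while price > PREV(price)   (the UP+ loop)
 UP    --> match complete: output the match and AFTER MATCH SKIP TO LAST UP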

These diagrams can be really helpful when you have more complex patterns and you need to consider the impact of backtracking. This post is all about laying the building blocks for my next post on backtracking and the dreaded ORA-30009 error. If you have managed to read this far then you are guaranteed to be ready for an in-depth look at what happens inside MATCH_RECOGNIZE when we move from right to left through our state diagram in an attempt to find a complete match. Now you should know everything you need to know about state machines and I am going to carry over the “why care” part to the next post…

If you want a recap of where we are in this series of pattern matching deep dive posts here is the full list:

As per usual, if you have any comments or questions then feel free to contact me directly via email: keith.laker@oracle.com 


Thursday, 5 January 2017

Exploring the interfaces for User Defined Aggregates

 

image courtesy of wikipedia

Whilst I was working on the functional specification for the LISTAGG extensions that we implemented in 12c Release 2, I came across Tom Kyte’s stragg function which uses the User Defined Aggregate API introduced in database 9i. Tom’s comprehensive answer covers the two important areas that need to be considered when using the data cartridge API: 1) a UDA can run as a serial process and 2) a UDA can run as a parallel process. Therefore, you need to code for both these eventualities. Dealing with both scenarios can be a little challenging - as I have discovered over the last few weeks. Having looked at a number of posts, there is a common theme for explaining how the various interfaces for user defined aggregates actually work. One of the clearest examples is on Tim Hall’s blog: String Aggregation Techniques.

This got me thinking… would it be possible to take the new extensions we made to LISTAGG and incorporate them into a custom string manipulation function built using the UDA interfaces? Essentially providing a pre-12.2 solution to prevent the text-string overflow error “ORA-01489: result of string concatenation is too long”? After some initial coding I managed to get a solution that worked perfectly as long as I did not try and run the query using parallel execution. Eventually I managed to get the parallel execution process coded but it was not deterministic and the results differed from the results from the serial query.

After revisiting Tom Kyte’s stragg function solution I think I have finally created a fully working solution, and here is what I have learned along the way…

Overview

Before going into the code, let’s explore the data cartridge interface for creating user defined aggregates. As far as I can tell this is not a very well known or well used feature, in that I have not seen any presentations at user conferences that explain why and how to build these functions. It is a very interesting feature because user-defined aggregate functions can be used in SQL DML statements just like Oracle’s built-in aggregates. Most importantly, they allow you to work with and manipulate complex data types such as multimedia data stored using object types, opaque types, and LOBs.

Each user-defined aggregate function is made up of four mandatory and two optional ODCIAggregate interfaces, or steps, that define the internal operations any aggregate function performs. The four mandatory interfaces are: initialization, iteration, merging, and termination.

  • Initialization is accomplished by the ODCIAggregateInitialize() routine, which is invoked by Oracle to initialize the computation of the user-defined aggregate. The initialized aggregation context is passed back to Oracle as an object type instance.
  • Iteration is performed through the ODCIAggregateIterate() routine, which is repeatedly invoked by Oracle. On each invocation, a new value or a set of new values and the current aggregation context are passed in. The routine processes the new values and returns the updated aggregation context. This routine is invoked for every non-NULL value in the underlying group. NULL values are ignored during aggregation and are not passed to the routine.
  • Merging is performed by ODCIAggregateMerge(), a routine invoked by Oracle to combine two aggregation contexts - for example, when the query runs in parallel. This routine takes the two contexts as inputs, combines them, and returns a single aggregation context.
  • Termination takes place when the ODCIAggregateTerminate() routine is invoked by Oracle as the final step of aggregation. The routine takes the aggregation context as input and returns the resulting aggregate value.

This is how these four main functions fit together…

The most important observations I would make are:

  1. You need to think very carefully about where and how you want to process your data. 
  2. There are essentially two options: 1) during the Iterate stage and/or 2) during the Terminate stage. Of course you need to remember that code used in the Iterate stage needs to be replicated at the Merge stage.
  3. It’s tempting, but don’t ignore the Merge stage: if you do, then when the function is run in parallel you won’t see any results!

What’s missing….

In the code sample below you will notice that I am missing two interfaces: ODCIAggregateDelete() and ODCIAggregateWrapContext(). Both functions are optional and for the purposes of creating a replacement for the LISTAGG function, these two functions were not needed. But for the sake of completeness below is a brief description of each function:  

ODCIAggregateDelete() removes an input value from the current group. The routine is invoked by Oracle by passing in the aggregation context and the value of the input to be removed. It processes the input value, updates the aggregation context, and returns the context. This is an optional routine and is implemented as a member method.

ODCIAggregateWrapContext() integrates all external pieces of the current aggregation context to make the context self-contained. Invoked by Oracle if the user-defined aggregate has been declared to have external context and is transmitting partial aggregates from slave processes. This is an optional routine and is implemented as a member method.
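I have not implemented them here, but if you did want to experiment with these two optional routines then the declarations added to the t_string_agg type specification would look roughly like this - a sketch based on the pattern used for the mandatory routines, so check the Data Cartridge documentation for the exact signatures:

-- optional routines, added to the member function list of the CREATE TYPE statement
MEMBER FUNCTION ODCIAggregateDelete(self IN OUT t_string_agg,
value IN VARCHAR2)
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateWrapContext(self IN OUT t_string_agg)
RETURN NUMBER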

I have searched the internet for examples of when and how to use these two optional functions and there is not much out there. One example was posted by Gary Myers http://blog.sydoracle.com/2005/09/analytics-with-order-by-and-distinct.html which is another derivation of the STRAGG function. In Gary’s example the objective is to return rows until 5 distinct values of a specific column have been returned. I think it is possible to do this without resorting to implementing this requirement within the ODCIAggregateTerminate function, but I will leave you to work that one out!

 

Schema - Sales History

For this simple LISTAGG alternative I am using the sample sales history schema and I am going to create a concatenated list of the first name and last name of each customer within each sales region. To make the code examples a little easier to read I have created a view over the CUSTOMERS and COUNTRIES table:

CREATE OR REPLACE VIEW CG_LIST AS
SELECT
g.country_region_id,
c.cust_first_name,
c.cust_last_name
FROM countries g, customers c
WHERE g.country_id = c.country_id
ORDER BY g.country_region_id;

 

 If we try to run the following SQL query we will get the usual LISTAGG overflow error:

SELECT 
country_region_id,
LISTAGG(cust_first_name||' '||cust_last_name) WITHIN GROUP (ORDER BY country_region_id) AS "customers"
FROM CG_LIST
GROUP BY country_region_id
ORDER BY country_region_id;

 

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Now let’s use the User Defined Aggregates framework to resolve this issue…

 

Stage 1 - Building the basic framework

First we need a storage object to hold the results generated during the Iterate stage. Following Tom Kyte’s example I am using an array/table as my storage object as this will ensure that I never hit the limits of the VARCHAR2 object when I am building my list of string values. Here I have checked the maximum size of the concatenated first name and last name combinations which is 21 characters. This means I can set the limit for the varchar2 column at 25 characters (…giving myself a little bit of headroom, just in case…).

CREATE OR REPLACE TYPE string_varray AS TABLE OF VARCHAR2(25);

Note that if we did not use an array we would be forced to use a single VARCHAR2 variable to hold the string values being pieced together. This would mean testing the length of the VARCHAR2 object before adding the next string value within both the Iterate and Merge functions, which causes all sorts of problems! As with most things in life, when working with the user defined aggregate interfaces it pays to keep things as simple as possible and put the processing in the most appropriate place. Therefore, in this case the best solution is to use an array because it makes the code simple and processing more efficient! 

Here is the definition of for the interfaces:

CREATE OR REPLACE TYPE t_string_agg AS OBJECT
(
a_string_data string_varray,

STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
value IN VARCHAR2 )
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
returnValue OUT VARCHAR2,
flags IN NUMBER)
RETURN NUMBER,
MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
ctx2 IN t_string_agg)
RETURN NUMBER
);
/
SHOW ERRORS

 

Initially I included some additional variables at the start of the OBJECT definition for managing various counters and the maximum string length. However, it’s not possible to pass parameters into this function except for the string value to be processed. Therefore, I removed the counter definitions from the header and just instantiated my array. This actually makes perfect sense because the counters and maximum string length variables are only needed at one specific point within the process flow - more on this later -  plus this keeps the definition stage nice and simple. 

Now we have the definition for our interfaces which means we can move on to the actual coding associated with each interface.

 

Stage 2 - Coding the Initialization phase

The code for this interface is relatively simple because all we need to do is instantiate the array that will hold our list of string values being passed into the aggregation process. We do this using a call to the object named in the previous section: sctx := t_string_agg(string_varray() ); This is the start of our processing flow…

STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
RETURN NUMBER IS
BEGIN
sctx := t_string_agg(string_varray() );

RETURN ODCIConst.Success;
END;

 

Stage 3 - Coding the Iterate phase

The Iterate phase will get called multiple times as we process the string values that need to be aggregated. At this point all we need to do is collect the string values being passed in and insert them into the array object that was created at the start of stage 1. We use the extend function to add a new row into the array and assign the string value. The important point to note here is that the Iterate process will also form part of the parallel execution framework; therefore, we need to keep the processing as simple as possible.

MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
value IN VARCHAR2)
RETURN NUMBER IS
BEGIN
a_string_data.extend;
a_string_data(a_string_data.count) := value;

RETURN ODCIConst.Success;
END;

Given that we want to try and ensure that we don’t exceed the limits of the VARCHAR2 object and generate an “ORA-01489”, it’s tempting to code this logic within the Iterate function. The issue with placing the processing logic within this function is that all the logic will need to be replicated within the ODCIAggregateMerge function to cope with parallel execution, where multiple Iterate functions are executed and the results merged to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function.

 

Stage 4 - Coding the Merge phase

This function gets called during parallel execution and is used to merge results from multiple Iterate processes to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function. The basic logic needs to replicate the logic from the ODCIAggregateIterate function but needs to take account of multiple values coming into the function via the CTX2 instantiation of our string function rather than the single string value being passed into the Iterate function. In essence the CTX2 provides the data from the various parallel execution processes and the self object simply accumulates the results being passed in from the various processes.

MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
                                   ctx2 IN t_string_agg)
  RETURN NUMBER IS
BEGIN
  FOR i IN 1 .. ctx2.a_string_data.count
  LOOP
    a_string_data.extend;
    a_string_data(a_string_data.count) := ctx2.a_string_data(i);
  END LOOP;

  RETURN ODCIConst.Success;
END;

If we tried to enforce the maximum string length within the Iterate function then the same processing would need to be enforced within this function as well, and having tried it I can tell you that the code very quickly becomes complicated. This is not the right place for complex processing that is better done in the final, terminate, phase. If you do find yourself putting a lot of code within the Iterate and/or Merge functions then I would recommend taking a hard look at whether that code would be better and simpler if it were placed within the Terminate stage.

 

Stage 5 - Coding the Terminate phase

Given that we want to avoid blowing the limits of the VARCHAR2 type, this is the most obvious place to code our processing logic. By placing the code here, we ensure that our string processing function works whether the query runs serially or in parallel. Here is the code:

MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
                                       returnValue OUT VARCHAR2,
                                       flags IN NUMBER)
  RETURN NUMBER IS
  l_data     VARCHAR2(32000);
  ctx_len    NUMBER;
  string_max NUMBER;
BEGIN
  ctx_len    := 0;
  string_max := 100;

  FOR x IN (SELECT column_value FROM TABLE(a_string_data) ORDER BY 1)
  LOOP
    IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
      l_data := l_data || ',' || x.column_value;
    ELSE
      ctx_len := ctx_len + 1;
    END IF;
  END LOOP;

  IF ctx_len > 1 THEN
    l_data := l_data || '...(' || ctx_len || ')';
  END IF;

  returnValue := LTRIM(l_data, ',');

  RETURN ODCIConst.Success;
END;

Note that the function returns the list of string values from the array sorted in alphabetical order: (SELECT column_value FROM TABLE(a_string_data) ORDER BY 1). If you don’t need a sorted list then you can remove the ORDER BY clause. Cycling through the ordered list of strings lets us add a comma to separate each value from the next, but first we need to check the length of the concatenated string to see whether we have reached the maximum length set by the variable string_max. As soon as we reach that maximum we start counting the number of values that are excluded from the final list:

IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
  l_data := l_data || ',' || x.column_value;
ELSE
  ctx_len := ctx_len + 1;
END IF;

For this code sample the maximum string length is set to 100 characters. Once we reach that value we simply continue looping through the array, but now we increment our counter (ctx_len). Given that we have sorted the list of string values, the next question is: can we also remove duplicate values? The answer is yes, and it is a relatively simple change:

 FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1)

There is an obvious cost to pay here: the DISTINCT keyword requires additional processing, which will have a performance impact, so think carefully about whether it is actually needed.

 

Full code sample

Here is the complete definition of the code:

 

CREATE OR REPLACE TYPE BODY t_string_agg IS

  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg)
    RETURN NUMBER IS
  BEGIN
    sctx := t_string_agg(string_varray());

    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg,
                                       value IN VARCHAR2)
    RETURN NUMBER IS
  BEGIN
    a_string_data.extend;
    a_string_data(a_string_data.count) := value;

    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg,
                                         returnValue OUT VARCHAR2,
                                         flags IN NUMBER)
    RETURN NUMBER IS
    l_data     VARCHAR2(32000);
    ctx_len    NUMBER;
    string_max NUMBER;
  BEGIN
    ctx_len    := 0;
    string_max := 100;

    FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) ORDER BY 1)
    LOOP
      IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN
        l_data := l_data || ',' || x.column_value;
      ELSE
        ctx_len := ctx_len + 1;
      END IF;
    END LOOP;

    IF ctx_len > 1 THEN
      l_data := l_data || '...(' || ctx_len || ')';
    END IF;

    returnValue := LTRIM(l_data, ',');

    RETURN ODCIConst.Success;
  END;

  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg,
                                     ctx2 IN t_string_agg)
    RETURN NUMBER IS
  BEGIN
    FOR i IN 1 .. ctx2.a_string_data.count
    LOOP
      a_string_data.extend;
      a_string_data(a_string_data.count) := ctx2.a_string_data(i);
    END LOOP;

    RETURN ODCIConst.Success;
  END;

END;
/
SHOW ERRORS
 

Stage 6 - Coding the actual string function

The last step is to create a function that calls our string processing object and takes a string (VARCHAR2) value as its input:

CREATE OR REPLACE FUNCTION string_agg (p_input VARCHAR2)
RETURN VARCHAR2
PARALLEL_ENABLE AGGREGATE USING t_string_agg;

 

Bringing it all together…

Now let’s run a query using our replacement for LISTAGG():

 

SELECT 
country_region_id,
string_agg(cust_first_name||' '||cust_last_name) AS s_str_list
FROM MY_LIST
GROUP BY country_region_id
ORDER BY country_region_id;
 

and here is the output:

 

Results from serial execution of query

 

The length of the concatenated string for each region is limited to a maximum of 100 characters, and we have a count of the number of values excluded from the list, shown in brackets at the end of the list. Just so you can check that the truncation is occurring correctly, I have included an additional column in the result set below which shows the length of the final string (including the three dots plus the number showing the count of truncated values).
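A minimal sketch of how that extra column can be produced, reusing the MY_LIST query from above (the only addition is a LENGTH wrapper around a second call to string_agg; your exact query may differ):

SELECT
  country_region_id,
  string_agg(cust_first_name||' '||cust_last_name)         AS s_str_list,
  LENGTH(string_agg(cust_first_name||' '||cust_last_name)) AS str_len
FROM my_list
GROUP BY country_region_id
ORDER BY country_region_id;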

 

Count of final string length

 

 

Conclusion

First a huge vote of thanks to Tom Kyte for providing the original code and ideas behind this particular blog post.  

If you don’t have access to Database 12c Release 2 and the enhancements we made to LISTAGG to control ORA-01489 errors, then here is a usable alternative that provides a lot more control over string concatenation. Obviously, if you want one version of string concatenation that reports a count of the missing values and one that doesn’t, you will need to create two separate functions. Likewise, if you need a distinct list of values for some queries and not others, you will need separate functions to handle the different processing; a sketch of what the extra registration might look like is shown below.
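For example (a sketch only; the type name t_string_agg_nocount is hypothetical), a second variant would be registered against its own implementation type, whose Terminate phase simply omits the "...(n)" suffix:

-- Hypothetical second wrapper: t_string_agg_nocount would be a copy of
-- t_string_agg whose ODCIAggregateTerminate does not append the '...(n)' count.
CREATE OR REPLACE FUNCTION string_agg_nocount (p_input VARCHAR2)
RETURN VARCHAR2
PARALLEL_ENABLE AGGREGATE USING t_string_agg_nocount;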

This example also provides a great introduction to user-defined aggregates, which were introduced in Oracle 9i. A quick Google search returned a lot of similar examples but no “real-world” use cases, so I assume that UDAs are a little-used gem within the long list of Oracle Database features. They are a very powerful and flexible feature that you should definitely add to your toolbox of skills.

     


Wednesday, 4 January 2017

Simplifying your data validation code with Database 12.2

Data is good

Image courtesy of pixabay.com
 

It doesn’t matter how much testing you do (well, it actually does, but that’s a whole different issue), you can almost guarantee that at some point your beautiful data validation code, which parses data input from a web form or loads data from some external file, will pop up with the error:

SQL Error: ORA-01722: invalid number

01722. 00000 - "invalid number"
*Cause: The specified number was invalid.
*Action: Specify a valid number.

Of course, what is really annoying at this point is that you don’t know which column value of the record failed (assuming that you have more than one numeric column).

Managing conversion errors during data loads

What to do? Of course, the sensible thing is to add lots of data validation checks to your code to try and catch the situations where the wrong type of data arrives from your data source. However, it’s likely that all those additional validation checks will slow down the process of inserting data, which is not a great result.

If your data is arriving via an external file then you can use the BADFILE clause to capture records that cannot be loaded because of data type errors (a sketch is shown below). But what if the data source for your INSERT statement is a staging table that was populated by an ETL job, or a series of values from a web form?
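As a reminder of how the external-file route looks, here is a minimal external table sketch using the BADFILE clause; rejected records are written to the bad file for later inspection. The directory object data_dir, the file names emp_load.csv and emp_load.bad, and the trimmed-down column list are assumptions for illustration:

CREATE TABLE staging_emp_ext (
  empno  VARCHAR2(6),
  ename  VARCHAR2(10),
  deptno VARCHAR2(6)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE data_dir:'emp_load.bad'
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('emp_load.csv')
)
REJECT LIMIT UNLIMITED;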

How to manage conversion errors during INSERTs

Panic over - Database 12c Release 2 contains important changes to the CAST and TO_xxx functions to manage the most common data conversion errors. The CAST function now has the ability to return a user-specified value if there is a conversion error.
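Here is the idea in its simplest form, using a value that clearly isn’t a number:

SELECT CAST('SALES' AS NUMBER DEFAULT 99 ON CONVERSION ERROR) AS deptno
FROM dual;

-- Instead of raising ORA-01722 this returns the supplied default: 99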

For example, let’s build a simple staging table in our schema:

CREATE TABLE STAGING_EMP
( "EMPNO" VARCHAR2(6),
  "ENAME" VARCHAR2(10),
  "JOB" VARCHAR2(9),
  "MGR" VARCHAR2(4),
  "HIREDATE" VARCHAR2(10),
  "SAL" VARCHAR2(7),
  "COMM" VARCHAR2(9),
  "DEPTNO" VARCHAR2(6));

and let’s insert some data, which includes values that will cause data conversion errors when we try to add the values into our target table:

-- INSERTING DATA INTO STAGING_EMP
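-- INVALID NUMBER FOR EMPNO KEY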
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
            values ('GB9369','SMITH','CLERK','7902','17-DEC-80','800',null,'20');

-- INVALID DATE
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9499','ALLEN','SALESMAN','7698','31-FEB-81','1600','300','30');

-- INVALID NUMBER FOR DEPTNO
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9521','WARD','SALESMAN','7698','22-FEB-81','1250','500','SALES');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('US9566','JONES','MANAGER','7839','02-APR-81','2975',null,'20');

Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9782','CLARK','MANAGER','7839','09-JUN-81','2450',null,'10');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('FR9788','SCOTT','ANALYST','7566','19-APR-87','3000',null,'20');

-- INVALID NUMBER FOR MGR KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9839','KING','PRESIDENT','null','17-NOV-81','5000',null,'10');

-- INVALID NUMBER FOR EMPNO KEY
Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('DE9844','TURNER','SALESMAN','7698','08-SEP-81','1500',0,'30');

Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)
           values ('9876','ADAMS','CLERK','7788','23-MAY-87','1100',null,'20');

 

Now let’s try inserting the data from our staging table into the EMP table and see what happens:

INSERT INTO scott.emp SELECT * FROM staging_emp;

… and not surprisingly I get the following error:

Error starting at line : 52 in command -
INSERT INTO emp SELECT * FROM staging_emp
Error report -
SQL Error: ORA-01722: invalid number
01722. 00000 - "invalid number"
*Cause: The specified number was invalid.
*Action: Specify a valid number.

I can deal with this situation in a couple of different ways. First, let’s try to discover which rows and columns in my staging table contain values that are likely to cause data conversion errors. To do this I am going to use the new VALIDATE_CONVERSION() function, which identifies problem data that cannot be converted to the required data type. It returns 1 if a given expression can be converted to the specified data type, and 0 if it cannot.

SELECT
  VALIDATE_CONVERSION(empno AS NUMBER) AS is_empno,
  VALIDATE_CONVERSION(mgr AS NUMBER) AS is_mgr,
  VALIDATE_CONVERSION(hiredate AS DATE) AS is_hiredate,
  VALIDATE_CONVERSION(sal AS NUMBER) AS is_sal,
  VALIDATE_CONVERSION(comm AS NUMBER) AS is_comm,
  VALIDATE_CONVERSION(deptno AS NUMBER) AS is_deptno 
FROM staging_emp;

This produces a table where I can easily pick out the rows where the data conversion is going to succeed (column value is 1) or fail (column value is 0):

Results from the VALIDATE_CONVERSION function

I could use this information to filter the data in my staging table as I insert it into my EMP table, or I could use the enhanced CAST and TO_xxx functions within my INSERT INTO … SELECT statement.
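For completeness, here is a sketch of the first, filter-only option: only rows where every column passes its conversion check are inserted, so no defaults are substituted and the rules are stricter (for example, KING’s MGR value of 'null' would also be filtered out):

INSERT INTO emp
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
FROM   staging_emp
WHERE  VALIDATE_CONVERSION(empno AS NUMBER) = 1
AND    VALIDATE_CONVERSION(mgr AS NUMBER) = 1
AND    VALIDATE_CONVERSION(hiredate AS DATE) = 1
AND    VALIDATE_CONVERSION(sal AS NUMBER) = 1
AND    VALIDATE_CONVERSION(comm AS NUMBER) = 1
AND    VALIDATE_CONVERSION(deptno AS NUMBER) = 1;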

The CAST function (along with the TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can now return a user-specified value, instead of an error, when data type conversion errors occur. This reduces failures during data transformation and data loading processes.

Therefore, my new 12.2 self-validating INSERT … SELECT statement looks like this:

INSERT INTO emp
SELECT
  empno,
  ename,
  job,
  CAST(mgr AS NUMBER DEFAULT 9999 ON CONVERSION ERROR),
  CAST(hiredate AS DATE DEFAULT sysdate ON CONVERSION ERROR),
  CAST(sal AS NUMBER DEFAULT 0 ON CONVERSION ERROR),
  CAST(comm AS NUMBER DEFAULT null ON CONVERSION ERROR),
  CAST(deptno AS NUMBER DEFAULT 99 ON CONVERSION ERROR)
FROM staging_emp
WHERE VALIDATE_CONVERSION(empno AS NUMBER) = 1;

This results in five rows being inserted into my EMP table, which means that four rows (rows 1, 4, 6 and 8) were filtered out because their EMPNO values could not be converted to a number. Here is the data that was loaded:

Data loaded into emp table from staging table

We can see that on row 1 the HIREDATE was invalid, so it was replaced by the value of SYSDATE (07-JUL-16). On row 2 the value of DEPTNO is the conversion default of 99, and on row 4 the value for MGR is the conversion default of 9999.

Conclusion

The enhanced  CAST function (along with TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can help you deal with data conversion errors without having to resort to complicated PL/SQL code or writing data validation routines within your application code.

The new VALIDATE_CONVERSION() function can be used to help you identify column values that cannot be converted to the required data type.

Hope these two features are useful. Enjoy!

Don’t forget that LiveSQL is now running Database 12c Release 2, so check out all the new tutorials and code samples that have recently been posted. I have posted a tutorial for the features discussed above: https://livesql.oracle.com/apex/livesql/file/tutorial_EDVE861IMHO1W3Q654ES9EQQW.html