Tuesday, 21 March 2017

My query just got faster - brief introduction to 12.2 in-memory cursor duration temp tables


This post covers one of the new SQL performance enhancements that we incorporated into Database 12c Release 2. All of these enhancements are completely automatic, i.e. transparent to the calling app/developer code/query. These features are enabled by default because who doesn’t want their queries running faster with zero code changes?

So in this post I am going to focus on the new In-Memory “cursor duration” temporary table feature. Let’s start by looking at cursor duration temp tables…

Above image courtesy of wikimedia.org

What is a cursor duration temp table?

This is a feature that has been around for quite a long time. Cursor duration temporary tables (CDTs) are used to materialize the intermediate results of a query, either to improve performance or to support complex multi-step query execution. Typical examples are queries that use GROUPING SETS (or CUBE and ROLLUP), star transformations, and WITH clause subqueries that are materialized.

What happens during query execution, assuming CDTs are used to materialize intermediate results, is that a temporary table is created during compilation and the query is rewritten to refer to the temporary table(s). Before the actual query executes, the temporary tables are populated using a direct path load ("Insert Direct Load"). The required temporary segments are allocated when the data is loaded and the segment information is managed within the session for the duration of the query. Obviously, the scope of the data in a CDT is private to the specific query execution. There is no exposed mechanism available today that allows you to view the data within a CDT except through tracing.
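If you want to see this transformation for yourself, here is a minimal sketch (not from the original post) that runs against the SH sample schema. The /*+ MATERIALIZE */ hint is an internal hint that I am only using here to make the WITH clause materialization easy to reproduce; the exact plan operations you see will depend on your database version:

EXPLAIN PLAN FOR
WITH channel_sales AS (
  SELECT /*+ MATERIALIZE */ channel_id, SUM(amount_sold) AS total_sold
  FROM sales
  GROUP BY channel_id
)
SELECT * FROM channel_sales;

-- Look for TEMP TABLE TRANSFORMATION and LOAD AS SELECT in the output
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);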

So what is an In-Memory cursor duration temp table?

An In-Memory cursor duration temp table (IM CDT) is simply a CDT where the data is stored in memory, which results in a "significant" performance boost for queries that make multiple passes over a data set or need to materialize intermediate results. The switch to in-memory tables means that your queries should see a reduction in I/O, since each pass over the data set no longer incurs additional I/O operations to access that data.
There may be times when there is insufficient memory available to hold all the data, so what happens then? In these situations local (private) temp segments are allocated to store the excess data, and when an in-memory cursor duration temp table is queried both the in-memory data and the private temp segments are read to return the complete result set.

Will all my queries use this new feature?

Our internal algorithms determine when and how this new feature is used. At the moment only serial queries that make multiple passes over a data set or queries that need to materialize intermediate results will use this feature. Don’t be concerned if your queries do not end up using these new in-memory cursor duration temp tables. There is nothing you can do to force their use within a query. The point of this post is to simply make you aware of a new term that could, potentially, appear in your explain plans. Obviously, going forward we will explore the possibility of expanding the scope of this feature to cover other types of queries.

Do I need to license the Database In-Memory option?

No. There is no need to license the costed Database In-Memory option. Of course, if you are using the Exadata and Database Cloud services then really useful analytical options such as advanced analytics (data mining, machine learning), spatial and graph analytics and in-memory are included in most of the service prices. If that isn't a good enough reason to move to the Oracle Cloud then I don't know what is!

Where does the memory come from?

In broad terms the memory used by In-Memory "cursor duration" temporary tables comes from the PGA pool. Does this mean you might need to increase the size of your PGA memory area? As usual, "it depends…" on a lot of different factors, including whether you have a lot of queries today that use cursor duration temporary tables and are therefore likely to switch over to the new in-memory version. All I can say is: monitor your PGA usage and increase the size of your PGA if you find you are running out of resources. Of course, if a query cannot allocate sufficient memory to use in-memory cursor duration temporary tables it will simply revert back to using the pre-12.2 on-disk cursor duration temporary tables.
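If you want to keep an eye on PGA consumption, a simple starting point is the v$pgastat view. This is just a rough sketch (the statistics listed are the ones I typically look at; check the view in your own environment):

-- Current and maximum PGA usage compared with the configured target
SELECT name, ROUND(value/1024/1024) AS size_mb
FROM v$pgastat
WHERE name IN ('aggregate PGA target parameter',
               'total PGA allocated',
               'maximum PGA allocated');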
That’s the background stuff all covered so now we can look at a couple of SQL code examples to see how this all works in practice.

Sample Schema

Let's look at a simple query using the sales history (SH) sample schema. To download this sample schema, go to the database downloads page on OTN, find your operating system and click on the "See All" link, then scroll down until you find the download link for the "Oracle Database xxxxx Examples". Finally, follow the installation instructions in the Oracle documentation to install the SH schema into your database instance.
Alternatively you can access the sales history schema using our new web-based LiveSQL. You will need to create an account if you don’t already have one. 

Example: GROUP BY with GROUPING SETS

What we want to find is the total revenue for each product category and promotion during the period 01-Jan-1998 to 31-Dec-2000, along with totals for the other level combinations of channel class, quarter and day. We can do this by using the GROUP BY GROUPING SETS feature (a very big topic for another day). Here's the query we need to run:

SELECT /*+MONITOR */
 p.prod_category, 
 x.promo_id,
 c.channel_class,
 t.calendar_quarter_desc,
 s.time_id,
 SUM(s.amount_sold)
FROM sales s, products p, promotions x, channels c, times t
WHERE s.time_id 
      BETWEEN to_date('1998-01-01', 'YYYY-MM-DD') 
      AND to_date('2000-12-31', 'YYYY-MM-DD')
AND p.prod_id = s.prod_id
AND x.promo_id = s.promo_id
AND c.channel_id = s.channel_id
AND t.time_id = s.time_id
GROUP BY GROUPING SETS 
 ((p.prod_category, x.promo_id), 
  (c.channel_class, t.calendar_quarter_desc), 
  (p.prod_category, t.calendar_quarter_desc), 
  (s.time_id, t.calendar_quarter_desc))
ORDER BY 1,2,3,4,5;
Just to add some additional functionality to this example I have included the /*+MONITOR */ hint, which allows me to access the Real-Time SQL Monitoring report within SQL Developer - for more information about this really great feature read this blog post by ThatJeffSmith: On Real-Time SQL Monitoring and the /*+MONITOR*/ Hint.
The examples below are not going to show huge performance gains, simply because I am using a very small data set (the sales history sample schema) and I am running the queries in two separate VirtualBox machines on my laptop; the Big Data Lite image is also running a lot of other services, such as a complete Hadoop environment. Therefore, just stay focused on the changes in I/O in the examples below.
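As an aside, if you prefer the command line to SQL Developer, a sketch along the following lines will pull the same Real-Time SQL Monitoring report using the DBMS_SQLTUNE package (the sql_id shown is just a placeholder - substitute the one for your own statement):

-- Text version of the SQL Monitor report for a given statement
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => 'your_sql_id_here',   -- placeholder
         type         => 'TEXT',
         report_level => 'ALL') AS report
FROM dual;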

Part A, pre-12.2: using cursor duration temp tables

Using the GROUPING SETS feature we will create totals for each specific combination of product category, promotion, channel class, quarter and day. The output is quite long because there are quite a few permutations of dimension members even in the sales history schema, so here is the first block of rows returned in SQL Developer to give you a sense of the structure of the result set. This query is running in a 12.1 database (I am actually using the current release of the Big Data Lite VirtualBox image, which includes Database 12c Release 1). I have truncated the output but you can see most of the results for the first grouping set of prod_category and promo_id. In total the query returns 1,205 rows.


Output from Query 1

The monitoring report in SQL Developer for the above query looks like this:

SQL Developer Monitoring Report for 12.1 Query


We can see that there are a number of distinct blocks of work but at the top of the plan we can see the TEMP TABLE TRANSFORMATION reference followed by LOAD AS SELECT with no further keywords, which is expected because we are not using 12.2 to run this query. About half-way down the report you can see the additional LOAD AS SELECT statements against the temp table containing the base level information we need to create the various total combinations within the GROUPING SETS clause.
If you want more information about the temp table transformation step then there is an excellent post on the optimizer blog: https://blogs.oracle.com/optimizer/entry/star_transformation. From the plan you can see that our temp table is then reused during the rest of the query to construct the various totals for each combination of dimension members.
We can see that we are incurring I/O during each phase of the query: 684 I/O requests and 172.6MB of I/O bytes. Given that I am running two VMs at the same time I don't see much point in focusing on the actual execution time. So that's our plan running in Database 12c Release 1.

Part B, 12.2: using in-memory cursor duration temp tables

Now let's switch over to using our 12.2 database - I am using the latest developer VirtualBox image posted on OTN. Using this environment we can see that if we re-run our query the result, in terms of rows returned, is the same - which is always great news.


Output from Query 1

Let’s take a look at the monitoring report for the same query now running in Database 12c Release 2:

SQL Developer Monitoring Report for 12-2 Query


The first thing to notice when comparing the two monitoring reports is that we have significantly reduced the amount of I/O in 12.2. In 12.1 our relatively simple grouping sets query generated 684 I/O requests and 172.6MB of I/O. Compare that with the data in the monitoring report for the same query running in 12.2: 40 I/O requests and 9.9MB of I/O. This means that we have managed to improve the overall efficiency of our query simply by upgrading to 12.2.

Obviously your mileage will vary according to the details of the query you are executing, but that is a nice resource efficiency and performance boost that has required zero code changes and it's completely FREE!

In summary, with this GROUPING SETS example we have reduced the amount of I/O and the number of I/O requests through the use of in-memory cursor duration temp tables. As with the previous report, you will see several references to "LOAD AS SELECT"; however, in 12.2 there is an additional set of keywords which identify the use of the new in-memory cursor duration temp tables:
LOAD AS SELECT (CURSOR DURATION MEMORY)
In the bottom half of the report you should notice that the above operation covers two additional plan lines, HASH (GROUP BY) and TABLE ACCESS (FULL), which reference the temp table object; however, there are no I/O operations - which confirms the use of in-memory cursor duration temp tables.

Summary

This post has covered just one of the many new SQL performance enhancements that we incorporated into Database 12c Release 2. I covered the most important features in my presentation at last year's OpenWorld. A lot of these enhancements, including In-Memory "cursor duration" temporary tables, are completely automatic, i.e. transparent to the calling application/query, and they are enabled by default - because who doesn't want their queries running faster with zero code changes? In this release their use is limited to serial queries of the types listed at the start of the post.

Just to be absolutely clear - there are no hints or parameters you can set to force the use of In-Memory "cursor duration" temporary tables. Our internal algorithms determine whether this feature is used within your query. If In-Memory "cursor duration" temporary tables are used then you will see the following line in your explain plan: LOAD AS SELECT (CURSOR DURATION MEMORY)
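A quick, hedged sketch of how you might check for this yourself after running a query (the sql_id is a placeholder for your own statement):

-- Pull the actual plan from the cursor cache and look for
-- LOAD AS SELECT (CURSOR DURATION MEMORY) in the Operation column
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(sql_id => 'your_sql_id_here', format => 'BASIC'));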

As I have outlined above, there are definite efficiency benefits to be gained from this feature due to the reduction in I/O, which should also improve overall query performance - although your mileage will vary depending on your particular environment! If you would like to share your experiences of using this new feature then please contact me via email (keith.laker@oracle.com).



Tuesday, 14 March 2017

Sneak preview of demo for Oracle Code events

I will be presenting at a number of the Oracle Code events over the coming months on the subject of…..(drum roll please) SQL pattern matching. Oracle Code is a great series of conferences dedicated to developers who want to get the absolute maximum benefit from using today's cutting edge technologies. If you want to register for any of the dates listed below then follow this link to the registration page.

North and Latin America

San Francisco
March 1, 2017

Austin
March 8, 2017

New York City
March 21, 2017

Washington DC
March 27, 2017

Toronto
April 18, 2017

Atlanta
June 22, 2017

Sao Paulo
June 27, 2017

Mexico City
June 29, 2017

Europe and Middle East

London
April 20, 2017

Berlin
April 24, 2017

Prague
April 28, 2017

Moscow
May 22, 2017

Brussels
June 6, 2017

Tel Aviv
July 11, 2017

Asia

New Delhi
May 10, 2017

Tokyo
May 18, 2017

Beijing
July 14, 2017

Sydney
July 18, 2017

Seoul
August 30, 2017

Bangalore
August 4, 2017

Back to my session... the actual topic is: Simplified and fast fraud detection. The overall aim of this session is to demonstrate the key benefits of using SQL row pattern matching techniques compared with other programming languages. As part of the presentation I will use live demos to review a specific use case related to fraud detection, walking through the MATCH_RECOGNIZE clause and explaining the key concepts and keywords. The demo uses a simple five-step framework to construct pattern matching queries.

I will aim to show how easy it is to write and amend SQL-based pattern matching queries as requirements change. The slide deck includes a link to the video from OpenWorld 2015 where we showed a sophisticated fraud detection application that used pattern matching and spatial analytics to create an analytical mash-up that processed a data stream in real time.
If you would like a sneak preview of the presentation and the demo then follow these links:
If you are planning to attend any of the above events then please let me know as it would be great to meet up and talk about your experiences with analytic SQL and, especially, pattern matching. Email me at the usual address, keith.laker@oracle.com, or ping me on Twitter: @ASQLBarista.


Tuesday, 7 March 2017

MATCH_RECOGNIZE: Can I use MATCH_NUMBER() as a filter?

Espresso Machine

Recently I spotted a post on OTN that asked the question: Can MATCH_RECOGNIZE skip out of partition? This requires a bit more detail because it raises all sorts of additional questions. Fortunately the post included more information which went something like this:
after a match is found I would like match_recognize to stop searching - I want at most one match per partition. I don’t want to filter by MATCH_NUMBER() in an outer query - that is too wasteful (or, in some cases, I may know in advance that there is at most one match per partition, and I don’t want match_recognize to waste time searching for more matches which I know don't exist).
Can MATCH_RECOGNIZE do this? Short answer is: NO.
Long answer is: Still NO.

Going back to the original question… you could interpret it as asking “is it possible to only return the first match”? The answer to this question is YES, it is possible.

There are a couple of different ways of doing it. Let's use our good old "TICKER" data set. The point of this exercise is simply to show that there are different ways to achieve the same result, but in the end you need to look beyond what is passed back from MATCH_RECOGNIZE to understand what is going on as we process the rows from our TICKER table…

In simple terms, I only want my query to return the first match. Here is my starting query:
SELECT 
 symbol,
 tstamp,
 price,
 match_number,
 classifier,
 first_x,
 last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES
   FIRST(x.tstamp) AS first_x,
   LAST(y.tstamp) AS last_y,
   MATCH_NUMBER() AS match_number,
 CLASSIFIER() AS classifier
 ALL ROWS PER MATCH WITH UNMATCHED ROWS
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (strt X+ Y)
 DEFINE 
   X AS (price <= PREV(price)),
   Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;
This is the output from the above query which shows that for each symbol we have multiple matches of our pattern:

Starting Query


Returning only the 1st match

If we want to return just the first match it is simply a matter of applying a filter on MATCH_NUMBER() as follows:

SELECT 
 symbol,
 tstamp,
 price,
 match_number,
 classifier,
 first_x,
 last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES
   FIRST(x.tstamp) AS first_x,
   LAST(y.tstamp) AS last_y,
   MATCH_NUMBER() AS match_number,
   CLASSIFIER() AS classifier
 ALL ROWS PER MATCH WITH UNMATCHED ROWS
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (strt X+ Y)
 DEFINE 
  X AS (price <= PREV(price)),
  Y AS (price >= PREV(price))
)
WHERE match_number = 1
ORDER BY symbol, match_number, tstamp asc;
which returns the desired results:
Filter Query

BUT have we saved any processing? That is to say: did MATCH_RECOGNIZE stop searching for matches after the first match was found? NO! Checking the explain plan we can see that all 60 rows from our table were processed:
Explain Plan

Anyway, the original post pointed out that simply filtering was not what they wanted, so we can discount using MATCH_NUMBER within the WHERE clause - although it does sort of achieve the result we wanted.

Let’s try an alternative approach. Can we limit the number of rows that are processed by using the exclude syntax within the PATTERN clause?

SELECT 
 symbol,
 tstamp,
 price,
 match_number,
 classifier,
 first_x,
 last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES
   FIRST(x.tstamp) AS first_x,
   LAST(y.tstamp) AS last_y,
   MATCH_NUMBER() AS match_number,
   CLASSIFIER() AS classifier
 ALL ROWS PER MATCH
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (strt X+ Y c*)
 DEFINE 
   X AS (price <= PREV(price)),
   Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;
I have added another pattern variable “c” but made it always true by not providing a definition within the DEFINE clause.

Single Match for each symbol
This is getting close to what we might need because now we have only one match for each symbol. Therefore, if we now use the exclude syntax around the pattern variable c we should be able to remove all matches except the first!
SELECT 
 symbol,
 tstamp,
 price,
 match_number,
 classifier,
 first_x,
 last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES
   FIRST(x.tstamp) AS first_x,
   LAST(y.tstamp) AS last_y,
   MATCH_NUMBER() AS match_number,
   CLASSIFIER() AS classifier
 ALL ROWS PER MATCH
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (strt X+ Y {-c*-})
 DEFINE 
   X AS (price <= PREV(price)),
   Y AS (price >= PREV(price))
)
ORDER BY symbol, match_number, tstamp asc;

which does in fact return exactly the same result as our second query, where we applied a filter on the MATCH_NUMBER column:

Filter Query

but if we check the explain plan we can see that yet again all 60 rows were processed.

Explain Plan

Therefore, we have got the right result but we have not been able to actually halt the MATCH_RECOGNIZE processing after the first match has been found.

Returning only the 2nd match

What if we wanted to return only the 2nd match? For this use case the exclude syntax is not going to work. The only viable solution is to use the match_number column and apply a filter to find the required match - see the sketch below. However, all the rows from the input table will still be processed!
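For completeness, here is a sketch of that approach - it is simply the earlier query with the filter predicate changed:

SELECT 
 symbol,
 tstamp,
 price,
 match_number,
 classifier,
 first_x,
 last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES
   FIRST(x.tstamp) AS first_x,
   LAST(y.tstamp) AS last_y,
   MATCH_NUMBER() AS match_number,
   CLASSIFIER() AS classifier
 ALL ROWS PER MATCH WITH UNMATCHED ROWS
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (strt X+ Y)
 DEFINE 
   X AS (price <= PREV(price)),
   Y AS (price >= PREV(price))
)
WHERE match_number = 2
ORDER BY symbol, match_number, tstamp asc;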

…and the final answer is: Enhancement Request

Let's start with the simple answer to our original problem: "after a match is found I would like match_recognize to stop searching".
Alas, there is definitely no way to stop MATCH_RECOGNIZE processing all the rows passed to it. To make this happen we would need to extend the AFTER MATCH SKIP TO syntax to include phrases that let us call a halt to the pattern matching process. What we would need is something like "AFTER MATCH SKIP TO END"; however, this assumes that only the first match is important.

What if you wanted the first and the second match, or maybe it's the second match that is of most interest? What we really need then is something like "AFTER MATCH 'N' SKIP TO END", where 'N' indicates the maximum number of matches that you want to process before jumping to the end of the partition.

Assuming I can find enough valid use cases I will put this on the "enhancement" list for MATCH_RECOGNIZE. If you have some great use cases for this scenario then please send me the details (keith.laker@oracle.com).


Thursday, 2 March 2017

It's out now - Database 12c Release 2 available for download

Database 12c Release 2 available for download 

Yes, it’s the moment the world has been waiting for: the latest generation of the world’s most popular database, Oracle Database 12c Release 2 (12.2) is now available everywhere - in the Cloud and on-premises. You can download this latest version from the database home page on OTN - click on the Downloads tab.

So What’s New in 12.2 for Data Warehousing?

This latest release provides some incredible new features for data warehousing and big data. If you attended last year's OpenWorld event in San Francisco then you probably already know all about the new features that we have added to 12.2 - check out my blog post from last year for a comprehensive review of #oow16:

Blog: The complete review of data warehousing and big data content from Oracle OpenWorld 2016

If you missed OpenWorld, and you are a data warehouse architect, developer or DBA, then here are the main feature highlights of 12.2 with links to additional content from OpenWorld and my data warehouse blog:

 

General Database Enhancements

1) Partitioning

Partitioning: External Tables
Partitioned external tables provide both the functionality to map partitioned Hive tables into the Oracle Database ecosystem and declarative partitioning on top of any Hadoop Distributed File System (HDFS) based data store.
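To give you a flavour of the syntax, here is a rough sketch of a range-partitioned external table - the directory object, file names and columns are purely illustrative, so treat this as an outline rather than a definitive example:

CREATE TABLE sales_ext (
  time_id     DATE,
  amount_sold NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY sales_dir          -- assumed directory object
  ACCESS PARAMETERS (FIELDS TERMINATED BY ',')
)
REJECT LIMIT UNLIMITED
PARTITION BY RANGE (time_id) (
  PARTITION p2015 VALUES LESS THAN (DATE '2016-01-01') LOCATION ('sales_2015.csv'),
  PARTITION p2016 VALUES LESS THAN (DATE '2017-01-01') LOCATION ('sales_2016.csv')
);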

Partitioning: Auto-List Partitioning
The database automatically creates a separate (new) partition for every distinct partition key value of the table.

Auto-list partitioning removes the management burden from DBAs of manually maintaining list partitions for a large number of distinct key values that require individual partitions. It also automatically copes with unplanned partition key values without the need for a DEFAULT partition.
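A minimal sketch of the syntax (table, column and partition names are just examples): add the AUTOMATIC keyword to a list-partitioned table and define at least one initial partition; new partitions are then created as new key values arrive.

CREATE TABLE regional_orders (
  order_id NUMBER,
  region   VARCHAR2(30),
  amount   NUMBER
)
PARTITION BY LIST (region) AUTOMATIC
( PARTITION p_emea VALUES ('EMEA') );

-- Inserting a previously unseen region value silently creates a new partition
INSERT INTO regional_orders VALUES (1, 'APAC', 100);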

Partitioning: Read-Only Partitions
Partitions and sub-partitions can be individually set to a read-only state. This disables DML operations against those read-only partitions and sub-partitions, and is an extension of the existing read-only table functionality. Read-only partitions and subpartitions enable fine-grained control over DML activity.
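For example, using the regional_orders table sketched above (again, purely illustrative names), an individual partition can be switched to read-only with a simple ALTER TABLE:

-- DML against this partition will now raise an error,
-- while the rest of the table remains fully writable
ALTER TABLE regional_orders MODIFY PARTITION p_emea READ ONLY;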

Partitioning: Multi-Column List Partitioning
List partitioning functionality is expanded to enable multiple partition key columns. Using multiple columns to define the partitioning criteria for list partitioned tables enables new classes of applications to benefit from partitioning.
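A hedged sketch of what this looks like (names and values are illustrative): each partition is defined by one or more value pairs for the two partition key columns.

CREATE TABLE channel_orders (
  order_id NUMBER,
  country  VARCHAR2(2),
  channel  VARCHAR2(10)
)
PARTITION BY LIST (country, channel)
( PARTITION p_us_direct VALUES (('US','DIRECT'), ('US','WEB')),
  PARTITION p_gb_all    VALUES (('GB','DIRECT'), ('GB','WEB')),
  PARTITION p_rest      VALUES (DEFAULT) );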

For more information about partitioning see:

#OOW16 - Oracle Partitioning: Hidden Old Gems and Great New Tricks, by Hermann Baer, Senior Director Product Management

Partitioning home page on OTN

2) Parallel Execution

Parallel Query Services on RAC Read-Only Nodes
Oracle parallel query services on Oracle RAC read-only nodes represent a scalable parallel data processing architecture. The architecture allows for the distribution of a high number of processing engines dedicated to the parallel execution of queries.

For more information about parallel execution see: 

#OOW16 - The Best Way to Tune Your Parallel Statements: Real-Time SQL Monitoring by Yasin Baskan, Senior Principal Product Manager

Parallel Execution home page on OTN

 

Schema Enhancements

Dimensional In-Database analysis with Analytic Views

Analytic views provide a business intelligence layer over a star schema, making it easy to extend the data set with hierarchies, levels, aggregate data, and calculated measures. Analytic views promote consistency across applications. By defining aggregation and calculation rules centrally in the database, the risk of inconsistent results in different reporting tools is reduced or eliminated.

The analytic view feature includes new DDL statements, such as CREATE ATTRIBUTE DIMENSION, CREATE HIERARCHY and CREATE ANALYTIC VIEW, new calculated measure expression syntax, and new data dictionary views. Analytic views allow data warehouse and BI developers to extend the star schema with time series and other calculations, eliminating the need to define calculations within the application. Calculations can be defined in the analytic view and selected simply by including the measure name in the SQL select list.

For more information about Analytic Views see:

#OOW16 - Analytic Views: A New Type of Database View for Simple, Powerful Analytics by Bud Endress, Director, Product Management



SQL Enhancements

Cursor-Duration Temporary Tables Cached in Memory
Complex queries often process the same SQL fragment (query block) multiple times to answer a question. The results of these queries are stored internally, as cursor-duration temporary tables, to avoid the multiple processing of the same query fragment. With this new functionality, these temporary tables can reside completely in memory avoiding the need to write them to disk. Performance gains are the result of the reduction in I/O resource consumption.

Enhancing CAST Function With Error Handling
The existing CAST function is enhanced to return a user-specified value in the case of a conversion error instead of raising an error. This new functionality provides more robust and simplified code development.
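A quick sketch of the new syntax - supply a default value to be returned when the conversion fails:

-- Returns 10-Jan-2017 instead of raising a conversion error
SELECT CAST('not a date' AS DATE
            DEFAULT DATE '2017-01-10' ON CONVERSION ERROR) AS safe_date
FROM dual;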

New SQL and PL/SQL Function VALIDATE_CONVERSION
The new function, VALIDATE_CONVERSION, determines whether a given input value can be converted to the requested data type. The VALIDATE_CONVERSION function provides more robust and simplified code development.
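A brief sketch: the function returns 1 if the value can be converted to the requested data type and 0 if it cannot.

SELECT VALIDATE_CONVERSION('123.45' AS NUMBER)                 AS is_number,  -- 1
       VALIDATE_CONVERSION('31-02-2017' AS DATE, 'DD-MM-YYYY') AS is_date     -- 0
FROM dual;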

Enhancing LISTAGG Functionality
LISTAGG aggregates the values of a column by concatenating them into a single string. New functionality is added for managing situations where the length of the concatenated string is too long. Developers can now control the process for managing overflowing LISTAGG aggregates. This increases the productivity and flexibility of this aggregation function.

Approximate Query Processing
This release extends the area of approximate query processing by adding approximate percentile aggregation. With this feature, the processing of large volumes of data is significantly faster than the exact aggregation. This is especially true for data sets that have a large number of distinct values with a negligible deviation from the exact result.

Approximate query aggregation is a common requirement in today's data analysis. It optimizes the processing time and resource consumption by orders of magnitude while providing almost exact results. Approximate query aggregation can be used to speed up existing processing.
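As an illustration (a hedged sketch against the SH sample schema), an approximate median of the amount sold for each product category might look like this:

SELECT p.prod_category,
       APPROX_PERCENTILE(0.5 DETERMINISTIC)
         WITHIN GROUP (ORDER BY s.amount_sold) AS approx_median
FROM sales s, products p
WHERE p.prod_id = s.prod_id
GROUP BY p.prod_category;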

Parallel Recursive WITH Enhancements
Oracle Database supports recursive queries through the proprietary CONNECT BY clause and the ANSI-compliant recursive WITH clause. The parallel recursive WITH clause enables this type of query to run in parallel mode. These types of queries are typical for graph data found in social graphs, such as Twitter graphs or call records, and are commonly used with transportation networks (for example, flight paths, roadways, and so on).

Recursive WITH enables the efficient computation of the shortest path from a single source node to one or more destination nodes in a graph. Bi-directional searching is used to make this computation efficient: the search starts from both the source and destination nodes and then advances in both directions.
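For reference, here is a minimal sketch of an ANSI recursive WITH query (it assumes the employees table from the standard HR sample schema); in 12.2 this style of query is now eligible for parallel execution:

WITH reports (employee_id, manager_id, lvl) AS (
  SELECT employee_id, manager_id, 1
  FROM employees
  WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.manager_id, r.lvl + 1
  FROM employees e
  JOIN reports r ON e.manager_id = r.employee_id
)
SELECT * FROM reports;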

 

For more information about the new data warehouse SQL enhancements see:

 

In addition to the above features, we have made a lot of enhancements and added new features to the Optimizer; there is a comprehensive review by Nigel Bayliss, senior principal product manager, available on the optimizer blog. Obviously, the above is my take on what you need to know about 12.2 and it is not meant to be an exhaustive list of all the data warehouse and big data features. For the complete list of all the new features in 12.2 please refer to the New Features Guide in the database documentation set.

I would really like to thank my amazing development team for all their hard work on the above list of data warehouse features and for all the time they have spent proof-reading and fact-checking my blog posts on these new features.

Enjoy using this great new release and check out all the 12.2 tutorials and scripts on LiveSQL!

 

 


Monday, 23 January 2017

How to intelligently aggregate approximations


Simple abacus with coloured beads

The growth of low-cost storage platforms has allowed many companies to actively seek out new external data sets and combine them with internal historical data that goes back over a very long time frame. As both the variety and the volume of data continue to grow, the challenge for many businesses is how to process this ever-expanding pool of data and, at the same time, make timely decisions based on all the available data.
(Image above courtesy of http://taneszkozok.hu/)
In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis, at some point there is going to be a need to aggregate a data set to get a higher-level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated because it is not possible to "reuse" an aggregated approximate result as the basis for aggregating to an even higher level across the various dimensions of the original query. To obtain a valid approximate result set, a query has to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level - see the sketch below. In many cases you simply cannot roll up aggregations of aggregations.
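To make the problem concrete, here is a hedged sketch (the weekly_sales table is purely hypothetical): summing county-level distinct counts does not give the state-level distinct count, because the same product can sell in more than one county of the same state.

-- Correct: distinct products per state, computed from the detail rows
SELECT state, COUNT(DISTINCT prod_id) AS prod_count
FROM weekly_sales
GROUP BY state;

-- Wrong: summing county-level distinct counts double counts any product
-- that sold in more than one county within the same state
SELECT state, SUM(prod_count) AS prod_count
FROM (SELECT state, county, COUNT(DISTINCT prod_id) AS prod_count
      FROM weekly_sales
      GROUP BY state, county)
GROUP BY state;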

Cannot aggregate data from existing approximate result sets

Until now….

With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be "rolled up" to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how we do it… Essentially there are three sets of functions that provide the "intelligence":
  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx
Here is a quick overview of each function:

APPROX_xxxxxx_DETAIL

This function takes a numeric expression and builds a summary result set containing results for all the dimensions in the GROUP BY clause. The output from this function is a column containing BLOB data. As with the other approximate functions, the results can be deterministic or non-deterministic depending on your requirements.

APPROX_xxxxxx_AGG 

This function builds a higher-level summary based on the results from the _DETAIL function. This means that it is not necessary to re-query the base fact table in order to derive new aggregate results. As with the _DETAIL function, the results are returned as a BLOB.

TO_APPROX_xxxxxx

This function returns the results from the _AGG and _DETAIL functions in a user-readable format.

Three new functions for managing approximate aggregations

A worked example

Let's build a working example using the sample SH schema. Our product marketing team wants to know, within each year, the approximate number of unique customers within each product category. Thinking ahead, we know that once they have this result set they will want to do further analysis, such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.
SELECT
 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This returns the result set as a BLOB, as shown below, and the BLOB contains the various tuple combinations from my GROUP BY clause. As a result I can reuse this result set to answer new questions based around higher levels of aggregation.

Aggregated results from query contained within BLOB column

The explain plan for this query shows the new sort keywords (GROUP BY APPROX), which tell us that approximate processing has been used as part of this query.
Explain plan showing new sort keywords

If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:

SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This creates the following results:
shows use of TO_APPROX function to transform blob data
Alternatively, we could create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function, as follows:
CREATE TABLE agg_cd AS
SELECT
 t.calendar_year,
 t.calendar_quarter_number AS qtr,
 p.prod_category_desc AS category_desc,
 p.prod_subcategory_desc AS subcategory_desc,
 APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;
Using this table, we can simplify our query to return the approximate number of distinct customers directly from the aggregated table:
SELECT
 calendar_year,
 qtr,
 category_desc,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;
which returns the same results as before - as you would expect!

Using the TO_APPROX function directly against the BLOB column

Using the aggregated table as our source, we can now change the levels that we wish to calculate without having to go back to the original source table and scan all the rows again. However, to extract the new aggregations we need to introduce the third function, APPROX_COUNT_DISTINCT_AGG, to our query and wrap it within the TO_APPROX_COUNT_DISTINCT function to see the results:
SELECT
 calendar_year,
 subcategory_desc,
 TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;
This returns the following results, based only on the new combination of levels included in the GROUP BY clause:

Using higher level aggregations


Summary

This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:
  • APPROX_xxxxxx_DETAIL
  • APPROX_xxxxxx_AGG
  • TO_APPROX_xxxxxx
 Database 12c Release 2 makes it possible to intelligently aggregate approximations. In the next post I will explore how you can combine approximate processing with existing query rewrite functionality so you can have intelligent approximate query rewrite.


Wednesday, 18 January 2017

Dealing with very very long string lists using Database 12.2


Lots Of Pieces String

Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:
ORA-01489: result of string concatenation is too long.
Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time whether the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature; however, this requires 12c Release 1, which was not always available to every DBA and/or developer.

If you want to replicate the problem and you have access to the sample SH schema then try executing this query:
SELECT
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
All the samples in this post use our sample SH schema. Once we release the on-premises version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I have published a tutorial on LiveSQL so you can play with all the new keywords covered in this post.

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:
show parameter MAX_STRING_SIZE
in my demo environment this returns the following:
NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD
Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit was raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. Setting MAX_STRING_SIZE = EXTENDED enables the new 32767 byte limit.
ALTER SYSTEM SET max_string_size=extended SCOPE=SPFILE;
However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets.

What is needed is a richer syntax within the LISTAGG function and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:
  • ON OVERFLOW ERROR
  • ON OVERFLOW TRUNCATE
  • WITH COUNT vs. WITHOUT COUNT
Let’s look a little closer at each of these features…..


1. Keeping Pre-12.2 functionality
If you want your existing code to continue to return an error when the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit the standard error is returned:
ERROR at line xxx:
ORA-01489: result of string concatenation is too long

However, where possible I would recommend adding “ON OVERFLOW ERROR” to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
So it's important to note that by default the truncation features are disabled, and you will need to change any existing code if you do not want an error to be raised.


2. New ON OVERFLOW TRUNCATE… keywords
If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

When truncation occurs we truncate back to the end of the last complete value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots '…' to the string as an indicator that truncation has occurred, but you can override this as follows:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

If you want to keep the existing pre-12.2 behaviour, where we return an error if the string is too long, then you can either rely on the default behaviour or explicitly state that an error should be returned (always a good idea to avoid relying on default behaviour, in my opinion) by using the following keywords:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
which will now generate the normal error message - i.e. replicates the pre-12.2 behaviour:
ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.
Of course you can simply omit the new keywords and get the same behaviour:
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
which, as before, generates the normal error message - i.e. replicates the pre-12.2 behaviour:
ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

3. How many values are missing?
If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords 'WITH COUNT' - this is the default behaviour when truncation is enabled. Alternatively, if you do not want a count at the end of the truncated string, you can use the keywords 'WITHOUT COUNT'.
SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

4. Do we split values when truncation occurs?
No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example we have been using, which creates a list of customer names within each country, we will always include the customer's full name, e.g. "Keith Laker" (first name + last name). There has to be enough space to add the complete string (first + last name) to the list; otherwise the whole string "Keith Laker" is removed and the truncation indicator is inserted. It is not possible for the last value in the string to be only a first name with the last name truncated/removed.

5. How do we calculate the overall length of the string values?
The characters used to indicate that an overflow has occurred are appended at the end of the list of values - in this case the default value of three dots ". . .". The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator followed by the user-defined overflow indicator, followed by the output from the WITH COUNT clause, which adds a counter at the end of a truncated string to indicate the number of values that have been removed/truncated from the list.

Summary
With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size limit for VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically, there are several new keywords:
  • ON OVERFLOW TRUNCATE
  • ON OVERFLOW ERROR (default behaviour)
  • WITH COUNT (default behaviour)
  • WITHOUT COUNT
Hopefully this new functionality means that all those wonderful workarounds for dealing with "ORA-01489: result of string concatenation is too long" errors that have been created over the years can now be replaced by standard SQL functionality. Don't forget to try the tutorial on LiveSQL - it's a completely free service, all you need to do is register for an account.


Monday, 16 January 2017

MATCH_RECOGNIZE - What should I include in the MEASURES clause?

Image courtesy of wikipedia

This post is the result of reviewing a post on stackoverflow.com: http://stackoverflow.com/questions/41649178/getting-error-ora-00918-when-using-match-recognize. Here is my version of the code, which includes the same issues/errors as the original; however, I am using the TICKER schema table that I always use for the tutorials that I post on LiveSQL:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY symbol, tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

The above example will not run because of the following error:

ORA-00918: column ambiguously defined
00918. 00000 - "column ambiguously defined"
*Cause:
*Action:
Error at Line: 1 Column: 8

So what is wrong with our code? As MATHGUY pointed out in his reply on stackoverflow.com - quite a lot actually! Let's start by differentiating between "won't run" and "wrong". The ORA-00918 error is easy to resolve if you stare long enough at the ORDER BY clause! It is pointless to include the SYMBOL column in both the PARTITION BY key and the ORDER BY key. If we change the ORDER BY clause as shown here then the code will run:

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
a.symbol AS a_symbol,
a.tstamp AS a_date,
a.price AS a_price
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which returns the following result set (all 60 rows from our source ticker table):

All rows from the ticker table

 

No MEASURE clause

Okay, so our code is running - now what? If you look at the output you will notice that it contains the same rows and columns as the source table. What happens if we omit the MEASURES clause? Well, it's optional, so the code should still run…

SELECT symbol, tstamp, price 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

and sure enough we get the same resultset (all 60 rows from our source ticker table):

All rows from the ticker table

 

ALL ROWS PER MATCH vs. ONE ROW

From the above we can assume that there is no need to list the source columns from your input table in the MEASURES clause because they are automatically included in the output. BUT this is ONLY true when you use ALL ROWS PER MATCH. If we change the output control to ONE ROW PER MATCH:

SELECT symbol, tstamp, price
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

you will now get an error:

ORA-00904: "PRICE": invalid identifier

00904. 00000 - "%s: invalid identifier"
*Cause:
*Action:
Error at Line: 19 Column: 24

 

because when using ONE ROW PER MATCH the only columns that are automatically returned are those listed in the PARTITION BY clause. Therefore, we need to use either “SELECT * FROM …..” or “SELECT symbol FROM…” and list all the columns we want to be returned to get a working version of our code. Using “SELECT * FROM…” as follows:

SELECT * 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

 

actually returns only one column (symbol) from the ticker table:

 

Only partition by column returned

So what should we include in the MEASURES clause?

Based on the query in the original post, I think the following syntax makes it easier to understand what is happening within the pattern matching process and provides useful information about the data that matches the pattern:

SELECT symbol, tstamp, price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(b.tstamp) AS first_date,
LAST(b.tstamp) AS last_date,
FIRST(b.price) AS first_price,
LAST(b.price) AS last_price,
match_number() AS m_n,
classifier() AS classi
ALL ROWS PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
); 

which gives us the following output based on ALL ROWS PER MATCH:

Output from Amended Measure Clause

 

and if we want to switch to using ONE ROW PER MATCH then we need to remove the references to the columns tstamp and price and replace them with references to the pattern-variable-specific versions, or we can just remove the references altogether. In this case, as we only have two pattern variables, we can NVL the references to return the required data:

SELECT symbol, o_tstamp, o_price, first_date, last_date, first_price, last_price, m_n, classi 
FROM ticker
MATCH_RECOGNIZE(
PARTITION BY symbol
ORDER BY tstamp
MEASURES
nvl(a.tstamp, b.tstamp) as o_tstamp,
nvl(a.price, b.price) as o_price,
FIRST(b.tstamp) as first_date,
LAST(b.tstamp) as last_date,
FIRST(b.price) as first_price,
LAST(b.price) as last_price,
match_number() as m_n,
classifier() as classi
ONE ROW PER MATCH
PATTERN(A B*)
DEFINE
B AS (price < PREV(price))
);

 

The above query generates slightly fewer rows compared with the previous statement because we are only returning one row for each match. However, note that this time we are referencing specific instances of tstamp and price.

New Measures For TSTAMP and PRICE

 

Summary

What have we learned:

Point 1: Check your PARTITION BY and ORDER BY clauses to ensure they make sense!

Point 2: there is no need to list the source columns from your input table in the MEASURES clause because they are automatically included, BUT ONLY when you use ALL ROWS PER MATCH.

Point 3: Decide on your output method and match the columns listed in the SELECT clause with those returned by either ALL ROWS PER MATCH or ONE ROW PER MATCH.

Point 4: It is always a good idea to check that your pattern is being applied correctly by using the built-in MATCH_NUMBER() and CLASSIFIER() measures.

 

For more general information about how to get started with MATCH_RECOGNIZE follow these links to previous blog posts:

and check out the growing library of tutorials on liveSQL.oracle.com. Hope this helps to throw some light on the workings of the MEASURES clause within MATCH_RECOGNIZE.
