How to intelligently aggregate approximations
(Image above courtesy of http://taneszkozok.hu/)In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated because it is not possible to “reuse” an aggregated approximate result as a basis for aggregating the results to an even higher level across the various dimensions of the original query. To obtain a valid approximate result set requires a query to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level. In many cases you cannot just rollup aggregations of aggregations..
Until now….
With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be “rolled-up” to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how e do it….Essentially there are three parts that provide the “intelligence”:- APPROX_xxxxxx_DETAIL
- APPROX_xxxxxx_AGG
- TO_APPROX_xxxxxx
APPROX_xxxxxx_DETAIL
This function takes a numeric expression and builds summary result set containing results for all dimensions in GROUP BY clause. The output from this function is a column containing BLOB data. As with other approximate functions the results can be deterministic or non-deterministic depending on your requirements.
APPROX_xxxxxx_AGG
This function builds a higher level summary table based on results from the _DETAIL function. This means that it is not necessary to re-query base fact table in order to derive new aggregates results. As with the _DETAIL function, the results are returned as a blob.
TO_APPROX_xxxxxx
Returns results from the _AGG and _DETAIL functions in user readable format.
A worked example
Let’s build a working example using the sample SH schema. Our product marketing team wants to know within each year the approximate number of unique customers within each product category. Thinking ahead we know that once they have this result set we expect them to do further analysis such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;
this returns my result set as a blob as shown below and this blob contains the various tuple combinations from my GROUP BY clause. As a result I can reuse this result set to answer new questions based around higher levels of aggregation.
and the explain plan for this query shows the new sort keywords (GROUP BY APPROX) which tells us that approximate processing has been used as part of this query.
If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:
SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;
this creates the following results
alternatively we coult create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function as follows:
CREATE TABLE agg_cd ASusing this table we can simplify our query to return the approximate number of distinct customers directly from the above table:
SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;
SELECTwhich returns the same results as before - as you would expect!
calendar_year,
qtr,
category_desc,
subcategory_desc,
TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;
using the aggregated table as our source we can now change the levels that we wish to calculate without having to go back to the original source table and again scan all the rows. However, to extract the new aggregations we need to introduce the third function APPROX_COUNT_DISTINCT_AGG to our query and wrap this within the TO_APPROX_COUNT_DISTINCT function to see the results:
SELECTwill return the following results based only on the new combination of levels included in the GROUP BY clause:
calendar_year,
subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;
Summary
This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:- APPROX_xxxxxx_DETAIL
- APPROX_xxxxxx_AGG
- TO_APPROX_xxxxxx
Technorati Tags: Analytics, Database 12c, Oracle Database 12c, SQL, SQL Analytics
Comments
Post a Comment