How to create more sophisticated reports with SQL….

SQL Pivot Cube

This is a continuation of a series of posts that cover Oracle's SQL extensions for reporting and analysis. As reviewed in earlier blog posts (Part 1 and Part 2), We have extended SQL's analytical processing capabilities by introducing a family of aggregate and analytic SQL functions. Over the last couple of days I have been exploring how to use some of these SQL features to create tabular-style reports/views. In the past when I worked as a BI Beans product manager I would typically build these types of reports using OLAP cubes and/or Java code. It is been very interesting to work through some of my old report scenarios and transfer my Java aggregation processing back into the database by using SQL to do all the heavy lifting without having to resort to low-level coding.

In my quest to learn about SQL Analytics I started with the idea of creating a simple two dimension cross tabular report as shown below (I am using the SH sample schema if you want to take my code and run it yourself). A typical requirement would be to report total category sales by quarter and compute totals for the full-year (in this case 2001) across all product categories (all the SQL output is shown in scrollable boxes so use your mouse to view the full results):

 

QTR        'Software/Other' 'Hardware' 'Electronics' 'Photo'   'Peripherals and Accessories' 'Cat_Total'
---------- ---------------- ---------- ------------- ---------- ----------------------------- ----------
2001-01           860819.81 1301343.45    1239287.71 1370706.38                    1774940.09 6547097.44
2001-02           872157.38 1557059.59    1144187.90 1563475.51                    1785588.01 6922468.39
2001-03           877630.85 1651454.29    1017536.82 1607315.63                    2042061.04 7195998.63
2001-04           943296.36 1174512.68    1303838.52 1792131.39                    2257118.57 7470897.52
2001-Total        3553904.4 5684370.01    4704850.95 6333628.91                   7859707.71 28136461.98

 

This is typical of the type of the output I used to create to display in reports and dashboards. Of course the formatting could be better but then I would normally sort that out on the client side of the application. Which leaves the question of how to build this type of report? When developing a SQL statement I always try and build it up starting from a simple first stage of just getting the correct dataset. In this case my initial SQL statement was as follows:

SELECT 
t.calendar_quarter_desc,
p.prod_category,
sum(s.amount_sold) as amount_sold
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY calendar_quarter_desc, prod_category
ORDER BY calendar_quarter_desc, prod_category

which generated the following output

CALENDAR_QUARTER_DESC PROD_CATEGORY                                      AMOUNT_SOLD
--------------------- -------------------------------------------------- -----------
2001-01               Electronics                                         1239287.71
2001-01               Hardware                                            1301343.45
2001-01               Peripherals and Accessories                         1774940.09
2001-01               Photo                                               1370706.38
2001-01               Software/Other                                       860819.81
2001-02               Electronics                                         1144187.90
2001-02               Hardware                                            1557059.59
2001-02               Peripherals and Accessories                         1785588.01
2001-02               Photo                                               1563475.51
2001-02               Software/Other                                      872157.38
2001-03               Electronics                                        1017536.82
2001-03               Hardware                                           1651454.29
2001-03               Peripherals and Accessories                        2042061.04
2001-03               Photo                                              1607315.63
2001-03               Software/Other                                      877630.85
2001-04               Electronics                                        1303838.52
2001-04               Hardware                                           1174512.68
2001-04               Peripherals and Accessories                        2257118.57
2001-04               Photo                                              1792131.39
2001-04               Software/Other                                      943296.36

20 rows selected

 

this is my starting point. My next step is to just pivot the data so that I have the time periods in the first column and then separate columns showing total revenue(amount sold) for each product category showing total revenue (amount_sold). To create the new columns of data containing the revenue for each product category I used the PIVOT clause, as highlighted in bold below:

SELECT * FROM
(SELECT
t.calendar_quarter_desc,
p.prod_category,
s.amount_sold
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold)
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics', 'Photo',
'Peripherals and Accessories'))
ORDER BY calendar_quarter_desc;

Notice the first 'SELECT * FROM' statement. This approach/framework is the way we implement some of the SQL analytical features, such as the 12c MATCH_RECOGNIZE clause. This first SQL statement allows us to place another layer of analysis, in this case the pivot operation, over our result set that is derived from the main SQL statement. The above SQL generated the following output:

 

calendar_quarter_desc 'Software/Other' 'Hardware' 'Electronics' 'Photo' 'Peripherals and Accessories'
--------------------- ---------------- ---------- ------------- ---------- -----------------------------
2001-01                      860819.81 1301343.45 1239287.71    1370706.38                    1774940.09
2001-02                      872157.38 1557059.59 1144187.90    1563475.51                    1785588.01
2001-03                      877630.85 1651454.29 1017536.82    1607315.63                    2042061.04
2001-04                      943296.36 1174512.68 1303838.52    1792131.39                    2257118.57

 

You will notice that the PIVOT clause has three components: there is the PIVOT key word which is the main container clause. This is then followed by the aggregation operation - in this case sum(amount_sold) - then the FOR element identifies the dimension or column that is going to be pivoted. Finally the IN element lists the dimension members that will form the individual columns. The way the IN clause is structured does seem to impose an unnecessary restriction in that you have to list the actual dimension members. It is not possible to have a dynamic list of members derived from say a SELECT DISTINCT… FROM statement. If you can overlook this limitation then the PIVOT clause is very powerful.

Now we have our basic cross tabular output, it would be nice if we could add a subtotal for the quarters which would give us the annual sales revenue for each product category. Fortunately, because we are using standard Oracle SQL syntax we can easily introduce other SQL analytic operations into our statement. Using the CUBE() feature we can create annual subtotals as follows:

SELECT * FROM
(SELECT
t.calendar_quarter_desc,
p.prod_category,
sum(s.amount_sold) as amount_sold
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc), prod_category
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold)
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics',
'Photo', 'Peripherals and Accessories'))
ORDER BY calendar_quarter_desc;

this now generates our subtotal, or annual total, for all quarters:

CALENDAR_QUARTER_DESC 'Software/Other' 'Hardware' 'Electronics' 'Photo' 'Peripherals and Accessories'
--------------------- ---------------- ---------- ------------- ---------- -----------------------------
2001-01                      860819.81 1301343.45    1239287.71 1370706.38                    1774940.09
2001-02                      872157.38 1557059.59    1144187.90 1563475.51                    1785588.01
2001-03                      877630.85 1651454.29    1017536.82 1607315.63                    2042061.04
2001-04                      943296.36 1174512.68    1303838.52 1792131.39                    2257118.57
                            3553904.40 5684370.01    4704850.95 6333628.91                    7859707.71
 

 

and you will notice that the last line has a null value for the quarter description. It would be nice if we could add a label to describe this row, something like '2001-Total'. The CUBE command does allows us to track which rows are data rows and which rows have been calculated and this is achieved using the GROUPING statement. We can wrap a DECODE statement around the t.calendar_quarter_desc line and test each row to see if it is a data point (where the GROUPING function will return "0") or a total (where the GROUPING function will return "1") as follows:

SELECT * FROM
(SELECT
DECODE(GROUPING(t.calendar_quarter_desc)
, 0, t.calendar_quarter_desc
, 1, '2001-Total') as calendar_quarter_desc,

p.prod_category,
sum(s.amount_sold) as amount_sold
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc), prod_category
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold)
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics',
'Photo', 'Peripherals and Accessories'))
ORDER BY calendar_quarter_desc;

 

now our null total line shows the value "2001-Total":

CALENDAR_QUARTER_DESC 'Software/Other' 'Hardware' 'Electronics' 'Photo' 'Peripherals and Accessories'
--------------------- ---------------- ---------- ------------- ---------- -----------------------------
2001-01                      860819.81 1301343.45 1239287.71    1370706.38                    1774940.09
2001-02                      872157.38 1557059.59 1144187.90    1563475.51                    1785588.01
2001-03                      877630.85 1651454.29 1017536.82    1607315.63                    2042061.04
2001-04                      943296.36 1174512.68 1303838.52    1792131.39                    2257118.57
2001-Total                  3553904.40 5684370.01 4704850.95    6333628.91                    7859707.71

 

As the CUBE function makes it easy to add totals as part of our report we can repeat this process for the product category dimension so our CUBE clause looks like this,:

GROUP BY CUBE(calendar_quarter_desc, prod_category)

this will now add column totals to our report and a grand total in the bottom right corner of our grid of data. To create the cross tabular report totals for both quarters and product categories we simply extend the CUBE statement to include the product_category and tidy up some of the column labels:

SELECT * FROM
(SELECT
DECODE(GROUPING(t.calendar_quarter_desc)
, 0, t.calendar_quarter_desc
, 1, '2001-Total') as "Time",
DECODE(GROUPING(p.prod_category)
, 0, p.prod_category
, 1, 'Cat_Total') AS prod_category,

sum(s.amount_sold) as amount_sold
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc, prod_category)
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold)
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics',
'Photo', 'Peripherals and Accessories', 'Cat_Total'))
ORDER BY "Time";

The key highlights here are the second DECODE() function which evaluates the prod_category data to determine if the row is a data point or a total line, we have extended the CUBE() function to include the prod_category which will generate the relevant subtotals and finally we have extended the list of categories included in the IN part of the PIVOT clause to pickup our report total, 'Cat_Total'. Now the output looks like this:

Time       'Software/Other' 'Hardware' 'Electronics' 'Photo'    'Peripherals and Accessories''Cat_Total'
---------- ---------------- ---------- ------------- ---------- ----------------------------- ----------
2001-01           860819.81 1301343.45    1239287.71 1370706.38                    1774940.09 6547097.44
2001-02           872157.38 1557059.59    1144187.90 1563475.51                    1785588.01 6922468.39
2001-03           877630.85 1651454.29    1017536.82 1607315.63                    2042061.04 7195998.63
2001-04           943296.36 1174512.68    1303838.52 1792131.39                    2257118.57 7470897.52
2001-Total       3553904.40 5684370.01    4704850.95 6333628.91                   7859707.71 28136461.98

 

As I started to expand this little code sample to include more data columns I ran into a small problem….Let's say we want to know how many orders were placed as well as the amount of revenue for each product category. To do that we need to add an additional column of data which means that we need to expand our PIVOT clause. In the code below I have added a COUNT() function to count the number of transactions (and I am making the assumption that each row in the fact table equates to a single transaction) and then included this column in the PIVOT clause:

 

SELECT * FROM
(SELECT
DECODE(GROUPING(t.calendar_quarter_desc)
, 0, t.calendar_quarter_desc
, 1, '2001-Total') as "Time",
DECODE(GROUPING(p.prod_category)
, 0, p.prod_category
, 1, 'Cat_Total') AS prod_category,
sum(s.amount_sold) as amount_sold,
count(s.amount_sold) as no_of_trans
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc, prod_category)
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold) as Revenue,
sum(no_of_trans) as Transactions
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics',
'Photo', 'Peripherals and Accessories', 'Cat_Total'))
ORDER BY "Time";

Notice that I have added a column alias for each calculated measure in the PIVOT clause as this is best practice and is recommended in the SQL Reference manual, see page 1635) states that "You can optionally provide an alias for each pivot column valueIf there is no alias, the column heading becomes a quoted identifier". See this extract from the 12c documentation: http://docs.oracle.com/cd/E16655_01/server.121/e17209/statements_10002.htm#SQLRF55233

 

However, even adding the column aliases we still generate the following error message:

“ORA-00918 Column ambiguously defined”

the above code is actually correct in terms of its syntax. As we followed the recommended guidance in the documentation does this code generate an ORA-00918 error? 

The reason why the above error message is generated is because we have exceeded the maximum length of column name. During the PIVOT operation we have to rename the columns so that the first pivoted column becomes our first product category,"Software/Other", + the name of our first data column, which is "Revenue".  The second column in our pivot table  becomes "Software/Other" + the name of our second data column, which is "Transactions". This means that the column names can get very long and in fact they get so long that when we get to the last product category in our list , 'Peripherals and Accessories', the automatic naming process generates a column name that is just too long and we get an error message - "ORA-00918  Column ambiguously defined”. This is a slightly misleading because the actual problem is that the column name, which is automatically generated, is too long.

If we remove the 'Peripherals and Accessories' product category from the in-list that is part of the PIVOT clause then the code will work: 

SELECT * FROM
(SELECT
DECODE(GROUPING(t.calendar_quarter_desc)
, 0, t.calendar_quarter_desc
, 1, '2001-Total') as "Time",
DECODE(GROUPING(p.prod_category)
, 0, p.prod_category
, 1, 'Cat_Total') AS prod_category,
sum(s.amount_sold) as amount_sold,
count(s.amount_sold) as no_of_trans
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc, prod_category)
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold) as Revenue,
sum(no_of_trans) as Transactions
FOR prod_category
IN ('Software/Other', 'Hardware', 'Electronics', 'Photo', 'Cat_Total'))
ORDER BY "Time";

and this will now generate the following output:

 

Time 'Software/Other'_REVENUE 'Software/Other'_TRANSACTIONS 'Hardware'_REVENUE 'Hardware'_TRANSACTIONS 'Cat_Total'_REVENUE 'Cat_Total'_TRANSACTIONS
---------- ------------------------ ----------------------------- ------------------ ----------------------- ------------------- ------------------------
2001-01 860819.81 24985 1301343.45 1029 6547097.44 60608
2001-02 872157.38 25469 1557059.59 1281 6922468.39 63292
2001-03 877630.85 25904 1651454.29 1379 7195998.63 65769
2001-04 943296.36 26217 1174512.68 1057 7470897.52 69749
2001-Total 3553904.40 102575 5684370.01 4746 28136461.98 259418

 

 

of course this is not really a workable solution since we have lost a column of data! What is missing from the code (and from the documentation - which will be fixed as soon as possible) is the inclusion of aliases for each of our product category names in the IN() clause. Therefore, if we amend the IN clause as follows:

IN ('Software/Other' as "S", 'Hardware' as "H", 'Electronics' as "E",
'Photo' as "P",  'Peripherals and Accessories' as "P-A", 'Cat_Total'))

then the code would look like this:

SELECT * FROM
(SELECT
DECODE(GROUPING(t.calendar_quarter_desc)
, 0, t.calendar_quarter_desc
, 1, '2001-Total') as "Time",
   DECODE(GROUPING(p.prod_category) 
, 0, p.prod_category
, 1, 'Cat_Total') AS prod_category,
sum(s.amount_sold) as amount_sold,
count(s.amount_sold) as no_of_trans
FROM times t,
products p,
sales s
WHERE t.time_id = s.time_id
AND p.prod_id = s.prod_id
AND s.time_id BETWEEN TO_DATE('01-JAN-2001','dd-MON-yyyy')
AND TO_DATE('31-DEC-2001','dd-MON-yyyy')
GROUP BY CUBE(calendar_quarter_desc, prod_category)
ORDER BY calendar_quarter_desc)
PIVOT(sum(amount_sold) as Revenue,
sum(no_of_trans) as Transactions
FOR prod_category
IN ('Software/Other' as "S", 'Hardware' as "H", 'Electronics' as "E",
'Photo' as "P", 'Peripherals and Accessories' as "P-A", 'Cat_Total'))
ORDER BY "Time";
 

and would generate the following output: 

Time       S_REVENUE  S_TRANSACTIONS H_REVENUE  H_TRANSACTIONS E_REVENUE  E_TRANSACTIONS P_REVENUE  P_TRANSACTIONS P-A_REVENUE P-A_TRANSACTIONS 'Cat_Total'_REVENUE 'Cat_Total'_TRANSACTIONS
---------- ---------- -------------- ---------- -------------- ---------- -------------- ---------- -------------- ----------- ---------------- ------------------- ------------------------
2001-01     860819.81          24985 1301343.45           1029 1239287.71          10063 1370706.38           8004  1774940.09            16527          6547097.44                   60608
2001-02     872157.38          25469 1557059.59           1281 1144187.90          12074 1563475.51           8408  1785588.01            16060          6922468.39                   63292
2001-03     877630.85          25904 1651454.29           1379 1017536.82          11152 1607315.63           8651  2042061.04            18683          7195998.63                   65769
2001-04     943296.36          26217 1174512.68           1057 1303838.52          12859 1792131.39           9724  2257118.57            19892          7470897.52                   69749
2001-Total 3553904.40         102575 5684370.01           4746 4704850.95          46148 6333628.91          34787  7859707.71            71162         28136461.98                  259418
 

 

Now I can see how much revenue each category generated each quarter and for the full year along with the number of orders that were placed. Very nice and very simple!

Apart form finding an obscure but interesting SQL bug, what have I learned here? What is amazing here, at least for me, is the power of SQL analytics. Using features such as CUBE and PIVOT you can push a lot of application processing back inside the database and make your application code a lot simpler. You don't have to create temporary tables or pull data into the middle-tier server to then walk-through the data points to calculate row and column totals. SQL analytics will manage all this for you. Your application code will be simpler and it is very likely that the performance of your system will improve as well.

I hope this is useful and if you have any questions then let me know. 

 

Technorati Tags: , ,

Comments

Popular posts from this blog

The Complete Guide To Data Warehousing and Big Data at Oracle OpenWorld 2016

Your Essential Online Session and Hands-on Lab Calendars for #oow16

Thursday's Top Picks at OpenWorld for Data Warehousing and Big Data