SQL Analytics Part 2- Key Concepts
This post continues on from my first post on analytical SQL "introduction to SQL for reporting and analysis" which looked at the reasons why it makes sense to use analytical SQL in your data warehouse and operational projects. In this post we are going to examine the key processing concepts behind analytical SQL.
One of the main advantages of Oracle's SQL analytics is that the key concepts are shared across all functions - in effect we have created a unified SQL framework for delivering analytics. These concepts build on existing SQL features to provide developers and business users with a framework that is both flexible and powerful in terms of its ability to support sophisticated calculations. There are four key concepts that you need to understand when implementing features and functions relating to SQL analytics:
- Process order
- Result-set Partitions
- Windows
- Current Row
Let's look at each of these topics in more detail.
1) Processing order.
The execution workflow for SQl statements containing analytical SQL is relatively simple: first all the HAVING, GROUP BY and JOIN predicates are processed. The output from this step is then passed to the analytical functions so all the calculations can be applied. This typically involves the use of window functions which are applied based on the partitions that have been defined with analytic functions applied to each row in each partition. Finally the ORDER BY clause is processed to provide control over the final output. It is useful to keep this workflow in your mind when you are building your analytical SQL because it will help you understand the inputs flowing into your analytical functions and the resulting output.
2) Result-set partitions
Oracle's analytic functions allow the input data set to be divided into groups of rows which are referred to as "partitions". It is important to note that in this context the term "partition" is completely unrelated to the table partition feature.
These analytical partitions are created after the groups defined with GROUP BY clauses and are can be used by any analytical aggregate functions such as sums and averages. The partitions can be based on any column that is part of the the input data set and individual partitions can be any size. It is quite possible to create a single partition contain all the rows from the initial query result set or create a small number of very large partitions or a large number of very small partitions where each partition just contains a few rows.
3) Windows
For each row in a partition it is possible to define a window over the data which determines the range of rows used to perform the calculations for the current row (the next section will explain the concept of the "current row")/ The size of a window can be based on either a physical number of rows or a logical interval, which is typically time-based. The window has a starting row and an ending row and depending on how the window is defined it may move at only one end or, in some cases, both ends.
Physical windows
For example a cumulative sum function would have its starting row fixed at the first row in the partition and the ending row would then slide from the starting row all the way to the last row of the partition to create a running total over the rows in the partition.
SELECT Qtrs
, Months
, Channels
, Revenue
, SUM(Revenue) OVER (PARTITION BY Qtrs) AS Qtr_Sales
, SUM(Revenue) OVER () AS Total_Sales
FROM sales_table
Logical windows
f the data set contains a date column then it is possible to use logical windows by taking advantage of Oracle’s built-in time awareness. A good example of window where the start row changes is the calculation of a moving average. In this case both the starting and end points slide so that a constant physical or logical range is maintained during the processing. The example below creates a four-period moving average and the images show the current-row, which is identified by the arrow, and the moving window, which is marked as the pink area :
|
The concept of a "window" is very powerful and provides a lot of flexibility in terms of being able to interact with the data. A window can be set as large as all the rows in a partition. At the other extreme it could be just a single row. Users may specify a window containing a constant number of rows, or a window containing all rows where a column value is in a specified numeric range. Windows may also be defined to hold all rows where a date value falls within a certain time period, such as the prior month.
When using window functions, the current row is included during calculations, so you should only specify (n-1) when you are dealing with n items - see the next section for more information….
4) Current Row
Each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point and during processing it begins at the starting row, moves throw the following rows until the end row of the window is reached. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows, and the following six rows. In the example below the calculation of a running total would be the result of the current row plus the values from the preceding two rows. At the end of the window the running total will be reset. The example shown below creates running totals within a result set showing the total sales for each channel within a product category within year:
SELECT calendar_year
, prod_category_desc
, channel_desc
, country_name
, sales
, units
, SUM(sales) OVER (PARTITION BY calendar_year, prod_category_desc, channel_desc order by country_name) sales_tot_cat_by_channel
FROM . . .
Summary
This post has outlined the four main processing concepts behind analytical SQL. The next series of posts will provide an overview of the key analytical features and functions that use these concepts. In the next blog post we will review the analytical SQL features and techniques that are linked to enhanced reporting which includes: windowing, lag-lead, reporting aggregate functions, pivoting operations and data densification for reporting and time series calculations. Although these topics will be presented in terms of data warehousing, they are actually applicable to any activity needing analysis and reporting.
If you have any questions or comments about analytical SQL then feel free to contact me via this blog.
Technorati Tags: Analytics, Data Warehousing, Database 12c, Oracle Database 12c, SQL Analytics
Comments
Post a Comment