Introduction to SQL for reporting and analysis

I have decided to start a series of blog posts on Oracle's in-database SQL analytics. The aim of this series of posts is to provide some historical background to show how SQL analytics has evolved and then start to explore some of the core concepts behind our SQL analytics followed by an examination of some of the key features and functions. The reason for creating this series of posts is that in talking to customers and partners it is clear to me that many project teams are just not aware of all the SQL features inside the Oracle Database that support enhanced reporting, data sampling, advanced aggregations, statistical analysis, pattern matching and spreadsheet-like data modelling.

Today, data warehouse and operational development teams need to quickly deliver new and innovative applications that provide significant competitive advantage and drive additional revenue streams. The challenge facing many teams is to find the right platform and language to securely and efficiently manage the data and analytical requirements while at the same time supporting the broadest range of tools and applications to maximize the investment in existing skills.

While many teams opt to use external analytical engines and/or use bespoke application code, Oracle believes that the most efficient way to process data is: a)inside the database (taking the analytics to the data) and b)using SQL as the analytical language. The following post provides some historical background to the evolution of our in-database analytics and explains why using SQL as the default development language for analytics delivers so many important benefits.

Flexible and Comprehensive Development Framework

Oracle’s in-database analytical SQL offers an agile analytical development platform that enhances both query performance and developer productivity. It supports all types of processing, ranging from real-time analysis to interactive decision support to operational batch reporting. Using Oracle Database both corporate developers and independent software vendors can take advantage of a rich set of SQL based analytical features within their projects and applications.

History

First some historical background information. Let’s look at how the list of SQL in-database analytics has evolved over time:

SQL A 1

In versions 5, 6 and 7 we laid down the basic foundations for SQL analytics with functions for aggregations and advance data processing. The development of what we today call SQL Analytics really started with the release of Oracle 8i. In this release we we extended the aggregation capabilities by creating the multi-dimensional-like CUBE and ROLLUP features along with analytic window functions (more on this in a future blog post).

As part of Oracle 9i we introduced the concept of grouping sets, hypothetical ranking (the rank or percentile that a row would hypothetically have if inserted into a specified data set which is useful for what-if analysis), first/last functions and the ability to create inverse distribution.
With 10g we added the SQL model clause that brings a spreadsheet-like approach to data processing (more on this in a future post). We added advanced analytical processing in the form of data mining features and functions.

In 11g we added the pivot/unpivot features that allows you to create spreadsheet-like crosstab reports directly from a table using simple SQL. We extended the capabilities of the window functions and we delivered the recursive WITH syntax that provides more flexibility for recursive processing compared to the CONNECT BY clause. And we added more data mining features and functions.

With 12c we have introduced generic top-n filtering along with big data features such as pattern matching. Obviously this is just a high level summary of core areas of functionality that we have delivered over the last 15-20 years. Here is a more detailed view of what is available in Oracle Database 12c - remember all these features and functions are delivered free of charge, ready to use straight out the box:

Ranking functions

rank, dense_rank, cume_dist, percent_rank, ntile

Window Aggregate functions (moving and cumulative)

Avg, sum, min, max, count, variance, stddev, first_value, last_value, Top-N, pivot, unpivot

LAG/LEAD functions

Direct inter-row reference using offsets

Reporting Aggregate functions

Sum, avg, min, max, variance, stddev, count, ratio_to_report, cube, rollup, grouping, grouping sets

Pattern Matching

match_number, classifier

SQL Model clause
Statistical Aggregates

Correlation, linear regression family, covariance

Linear regression

Fitting of an ordinary-least-squares regression
Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

Descriptive Statistics

DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

Correlations

Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

Cross Tabs

Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

Hypothesis Testing

Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

Distribution Fitting

Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

As you can see there is a very large library of free SQL analytical functions available for DBAs and developers to incorporate into their workflows and applications.

What are the advantages of using SQL as the foundation for delivering sophisticated analytics?

The key benefits provided by Oracle’s in-database analytical SQL are:

Enhanced developer productivity
Improved manageability
Minimized learning effort
Investment protection (through ANSI SQL compliance)
Increased performance

1) Enhanced developer productivity

Using the latest built-in analytical SQL capabilities, developers can simplify their application code by replacing complex code – written in SQL and/or other languages - with analytical-SQL that is much clearer and more concise. Tasks that in the past required the use of procedural languages or multiple SQL statements can now be expressed using single, simple SQL statements. This simplified SQL (analytical SQL) is quicker to formulate, maintain and deploy compared to older approaches, resulting in greater developer productivity. The productivity benefits also extend to SQL-literate business users who are now able to write their own reports and workflows and manage their own enhancement requests.

In the example shown below, the left box shows the SQL used to calculate the average wait time between two specific events: 1001 and 1002. It requires the use of multiple select statements to extract the data for each event and create two result sets with the data correctly ordered so the main SQL statement can compute the average wait. The box on the right replaces the multiple select statements with a simple LAG function. This makes the code much cleaner, quicker to develop, simpler to test and it is much easier to understand.

2) Improved Manageability

When computations are centralized close to the data then the inconsistency, lack of timeliness and poor security of calculations scattered across multiple specialized processing platforms completely disappears. The ability to access a consolidated view of all your data is simplified when applications share a common relational environment rather than a mix of calculation engines with incompatible data structures and languages.

Oracle’s approach allows developers to efficiently layer their analysis using SQL because it can support a broad range of business requirements. For example, a requirement to select all the regions contributing more than 50% of total sales and where the salesperson is contributing more than 18% of regional sales can all be expressed in a single, simple SQL statement using analytical SQL:

SELECT person, region, sales FROM

 (SELECT person, region, sales,

         SUM (sales) OVER (PARTITION BY region) r_sales

         SUM (sales) OVER () t_sales

  FROM sales_table)

WHERE r_sales > 0.5 * t_sales AND sales > 0.18 * r_sales;

Some of the key analytical SQL features are shown above in bold and these will be explored in more detail in the following series of blog posts that will rollout over the coming months.

3) Minimized Learning Effort

The amount of effort required to understand analytical SQL is minimized through the use of careful syntax design. Built-in analytical SQL constructs typically leverage existing SQL constructs, such as the aggregate functions SUM and AVG, to extend these well-understood keywords.

Most developers and business users have a reasonable level of proficiency with SQL and can quickly adopt and integrate analytical features, such as pareto-distributions, pattern matching, cube and rollup aggregations into their applications and reports. The amount of time required for enhancements, maintenance and upgrades is minimized: more people will be able to review and enhance the existing SQL code rather than having to rely on a few key people with specialized programming skills.

4) ANSI SQL compliance

Most of Oracle’s analytical SQL is part of the ANSI SQL standard. This ensures broad support for these features and rapid adoption of newly introduced functionality across applications and tools – both from Oracle’s partner network and other independent software vendors. Oracle is continuously working with its many partners to assist them in exploiting the expanding library of analytic functions. Already many independent software vendors have integrated support for the new Database 12c in-database analytic functions into their products.

5) Improved performance

Oracle’s in-database analytical SQL enables significantly better query performance. Not only does it remove the need for specialized data-processing silos but also the internal processing of these purpose-built functions is fully optimized. Using SQL unlocks the full potential of the Oracle database - such as parallel execution - to provide enterprise level scalability unmatched by external specialized processing engines.

In many cases the use of analytical SQL can improve performance by removing the need to create self-joins that result in large fact tables being accessed twice. By using analytical SQL the optimiser can create a much simpler query plan with zero joins, as shown below:

Summary

This section has outlined how Oracle’s in-database analytical SQL provides IT and application development teams with a robust and agile analytical development platform that enhances both query performance and developer productivity while providing investment protection by building on existing standards based skills.

The following sections will explore some of the key concepts behind Oracle’s analytical SQL and provides an overview of the key analytical features and functions that use these key concepts.

Using SQL to solve business problems

These can help you answer typical business questions such as:

Who are the top ten sales-reps in each region?

We can use the Rank function

What is the 90-day moving average of stock price?

We can create a moving average by using the Window feature

What is the percentage growth of Jan-2010 sales over Jan-2009?

We can use a period-over-period comparisons

What are January’s sales as a percentage of the entire year’s?

Compare aggregates on different levels

What is next...

Over the next few months I will be posting a series of articles on the processing concepts behind analytical SQL followed by reviews of the main types in-database functions and features that are part of Oracle Database 12c:

Enhanced reporting
Data Sampling
Advanced aggregations
Statistical analysis
Pattern matching
Spreadsheet-like data modelling

Technorati Tags: Analytics, Database 12c, Oracle Database 12c, SQL Analytics, Statistics

Search This Blog

SQL for Data Warehousing and Analytics