
Friday, 7 August 2015

Must-See Session Guide for Data Warehousing at #oow15

 









Your Must-See Guide for Data Warehousing and Big Data #OOW2015

There's so much to learn at Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To help you get the most from this year's event I have prepared a comprehensive guide which lists all the must-see data warehousing and big data sessions. It is divided into the following chapters:
  • Welcome to OpenWorld
  • Key Highlights from 2014
  • Overview of Oracle Cloud and Oracle Database 12c
  • List of Presenters
  • Must-See Sessions
  • Links to product web pages, blogs, social media sites
  • Information about the #oow15 smartphone app
  • Maps to help you find your way around
This guide is now available in both Apple iBook and PDF formats. The Apple iBook version opens on iPads and Mac computers via the iBooks app - please refer to the Apple website for more information. The PDF version will work on any smartphone, tablet and/or computer.

Please use the following links to download your preferred version of the must-see guide:

If you have any questions about these guides then please feel free to contact me at keith.laker@oracle.com

 


Wednesday, 15 April 2015

Managing overflows in LISTAGG

This is an interesting problem that has come up a few times in discussions (and I think it has been mentioned on the SQL forums as well). When using LISTAGG on very large data sets you can sometimes create a list that is too long and consequently get an ORA-01489: result of string concatenation is too long error.

Obviously, it is possible to determine in advance if the error is going to occur using some bespoke PL/SQL code and then take appropriate action within your application to manage the rows that contain strings that exceed the VARCHAR2 limit.
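For example, a simple pre-check along the following lines could flag the problem groups before LISTAGG runs (a minimal sketch that assumes the standard 4000-byte VARCHAR2 limit, a ';' separator and the SCOTT.EMP data used later in this post):

-- Groups listed here would raise ORA-01489 if fed to LISTAGG with a ';' separator
SELECT
  deptno,
  SUM(LENGTHB(ename) + LENGTHB(';')) - LENGTHB(';') AS estimated_length
FROM emp
GROUP BY deptno
HAVING SUM(LENGTHB(ename) + LENGTHB(';')) - LENGTHB(';') > 4000;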

Many customers have implemented workarounds to overcome the ORA-01489 error; however, these have mostly involved complex code that impacts performance.


Wouldn’t it be great if there were a simple yet elegant way to resolve this issue? Actually there is, and we can use a few of the most recent analytical SQL features. If you are using Database 12c you can make use of the MATCH_RECOGNIZE clause to effectively create chunks of strings that do not exceed the VARCHAR2 limit.

If you are using an earlier version of the Oracle Database then we can make use of the powerful SQL Model clause that was introduced in Database 10g.

In fact the beauty of the workarounds outlined in this blog post is that you can define your own maximum size for the string returned by LISTAGG so you have complete control over the size/length of the result set that is returned.

For example, let’s assume that we have the following statement (to keep things relatively simple let’s use the EMP table in the schema SCOTT):
SELECT
  deptno,
  LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;
This statement will generate the following output:

DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES 
Let’s assume that the above statement fails with ORA-01489 because, for the purposes of this example, each row returned by our LISTAGG function is limited to 15 characters. We can use the Database 12c SQL pattern matching clause, MATCH_RECOGNIZE, to return lists of values that do not exceed 15 characters. The first step is to wrap the processing in a view so that we can then get data from this view to feed our LISTAGG function. Here is the view that contains the MATCH_RECOGNIZE clause:

CREATE OR REPLACE VIEW emp_mr AS
SELECT *
FROM emp MATCH_RECOGNIZE(
   PARTITION BY deptno
   ORDER BY empno
   MEASURES match_number() AS mno
   ALL ROWS PER MATCH
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (S B*)
   DEFINE B AS LENGTHB(S.ename) + SUM(LENGTHB(B.ename) + LENGTHB(';')) <= 15
);
You might well ask: why don’t we put the LISTAGG function inside the MEASURES clause? At the moment it is not possible to include analytical functions such as LISTAGG in the MEASURES clause. Therefore, we have put the LISTAGG function in a separate SQL statement:
SELECT
 deptno,
 LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_mr
GROUP BY deptno, mno;
The above code will produce the following output:

DEPTNO NAMELIST
---------- ----------------------------------------
10         CLARK;KING
10         MILLER
20         SMITH;JONES
20         SCOTT;ADAMS
20         FORD
30         ALLEN;WARD
30         MARTIN;BLAKE
30         TURNER;JAMES
Using 12c MATCH_RECOGNIZE, we are able to get all the values by sending just one query. Now, not everyone who is hitting this ORA-01489 error is running 12c, so what do you do if you are running an earlier version of the database and need an efficient fix for this error? Prior to 12.1, the overflow can be handled using the SQL Model clause. This approach is a little more complicated. First we create a view that returns the row number:
CREATE OR REPLACE VIEW emp_sqm_view AS
SELECT
  emp.*,
  row_number() OVER (PARTITION BY deptno ORDER BY empno) as rn
FROM emp;
The above statement generates the following output, with the row number added for each employee within each department. Thinking about this in Excel terms, we will use this view to feed our SQL Model clause; the view allows us to create a “worksheet” for each department and then list the employees for each department as a row on that “worksheet”:
EMPNO ENAME  JOB       MGR   HIREDATE  SAL  COMM DEPTNO  RN
===== ====== ========= ===== ========= ==== ==== ======= ==
 7782 CLARK  MANAGER    7839 09-JUN-81 2450        10     1
 7839 KING   PRESIDENT       17-NOV-81 5000        10     2
 7934 MILLER CLERK      7782 23-JAN-82 1300        10     3
 7369 SMITH  CLERK      7902 17-DEC-80  800        20     1
 7566 JONES  MANAGER    7839 02-APR-81 2975        20     2
 7788 SCOTT  ANALYST    7566 19-APR-87 3000        20     3
 7876 ADAMS  CLERK      7788 23-MAY-87 1100        20     4
 7902 FORD   ANALYST    7566 03-DEC-81 3000        20     5
 7499 ALLEN  SALESMAN   7698 20-FEB-81 1600  300   30     1
 7521 WARD   SALESMAN   7698 22-FEB-81 1250  500   30     2
 7654 MARTIN SALESMAN   7698 28-SEP-81 1250 1400   30     3
 7698 BLAKE  MANAGER    7839 01-MAY-81 2850        30     4
 7844 TURNER SALESMAN   7698 08-SEP-81 1500    0   30     5
 7900 JAMES  CLERK      7698 03-DEC-81  950        30     6

Now the next step is to calculate the running total of the length of the text string of employee names within each department. The SQL Model clause allows us to calculate the running total and reset the total if it exceeds the number of characters that we have set as the limit - in this case 15 characters. This code is wrapped in the first of the views that we are going to build:

CREATE OR REPLACE VIEW emp_model_1 AS
SELECT *
FROM emp_sqm_view
  MODEL
   PARTITION BY (deptno)
   DIMENSION BY (rn)
   MEASURES (empno, ename, LENGTHB(ename) len, 0 cum_len)
   RULES (
    cum_len[1] = len[1] + LENGTHB(';'),
    cum_len[rn > 1] = CASE WHEN cum_len[cv()-1] + len[cv()] + LENGTHB(';') <= 15
                      THEN cum_len[cv()-1] + len[cv()] + LENGTHB(';')
                      ELSE len[cv()] + LENGTHB(';')
                      END
   );

Next we need an id that we can use to group together the list of names that fit within our boundary of 15 characters. Again we will use the SQL Model clause to compute the break points and increment the counter within each department:
CREATE OR REPLACE VIEW emp_model_2 AS
SELECT *
FROM emp_model_1
MODEL PARTITION BY (deptno) DIMENSION BY (rn)
   MEASURES (empno, ename, len, cum_len, 0 sub_id)
   RULES (
     sub_id[1] = 1,
     sub_id[rn > 1] = CASE WHEN cum_len[cv()] = len[cv()] + LENGTHB(';')
                           THEN sub_id[cv() - 1] + 1
                           ELSE sub_id[cv() - 1]
                           END
   );

Now we have the following output:
DEPTNO RN EMPNO ENAME  LEN CUM_LEN SUB_ID
====== == ===== ====== === ======= ======
    10  2  7839 KING     4      11      1
    10  1  7782 CLARK    5       6      1
    10  3  7934 MILLER   6       7      2
    20  1  7369 SMITH    5       6      1
    20  5  7902 FORD     4       5      3
    20  4  7876 ADAMS    5      12      2
    20  3  7788 SCOTT    5       6      2
    20  2  7566 JONES    5      12      1
    30  1  7499 ALLEN    5       6      1
    30  2  7521 WARD     4      11      1
    30  3  7654 MARTIN   6       7      2
    30  6  7900 JAMES    5      13      3
    30  5  7844 TURNER   6       7      3
    30  4  7698 BLAKE    5      13      2

We can now use the SUB_ID column along with the department number in the GROUP BY clause to finally generate our list of employees within each department, where each row in the NAMELIST column will contain a maximum of 15 characters:

SELECT
  deptno,
  LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_model_2
GROUP BY deptno, sub_id;
The above query creates the following output which matches the final output we achieved using the MATCH_RECOGNIZE clause:

DEPTNO NAMELIST
---------- ----------------------------------------
10         CLARK;KING
10         MILLER
20         SMITH;JONES
20         SCOTT;ADAMS
20         FORD
30         ALLEN;WARD
30         MARTIN;BLAKE
30         TURNER;JAMES

If you are using the LISTAGG function and sometimes getting an ORA-01489: result of string concatenation is too long error, then hopefully the above workarounds are useful.

The code samples shown above were created by Lei Sheng, who is a Principal Member of Technical Staff, Database Development. Lei has worked on many of the really cool analytical SQL features including the SQL Model clause and the new MATCH_RECOGNIZE clause. So big thanks to Lei for sharing these workarounds.


Friday, 27 March 2015

Open World 2015 call for papers - my simple guidelines

 


 

Most of you will already have received an email from the OpenWorld team announcing the call for papers for this year’s conference: https://www.oracle.com/openworld/call-for-proposals.html. Each year, a lot of people ask me how they can increase their chances of getting their paper accepted. Well, I am going to start by stating that product managers have absolutely no influence over which papers are accepted - even mentioning that a product manager will be co-presenting with you will not increase your chances! Yes, sad but true!

So how do you make sure that your presentation title and abstract catches the eye of the selection committee? Well, here is my list of top 10 guidelines for submitting a winning proposal: 

1) Read the "call-for-papers" carefully and follow its instructions - even if you have submitted presentations for lots of Oracle conferences it is always a good idea to carefully read the call for papers and to make sure you follow the instructions. There is an excellent section towards the end of the call-for-papers web page, "Tips and Guidelines".

2) Address the theme of the conference - If this is available when the call for papers is announced then try to address the theme of the conference within your abstract.

3) Address the key data warehouse technical focus areas - for this year’s conference the key focus areas for data warehousing will be: big data information management architectures, the logical data warehouse, partitioning, analytical SQL, pattern matching, parallel execution, workload management, multitenant, in-memory and Big Data SQL. If possible try to include one or more of these focus areas within your abstract.

4) Have a strong biography - You need to use your biography to differentiate and build credibility. This is an important topic because it allows you to differentiate yourself from all the other presenters who are trying to get speaking slots. Your biography must explain why you are an authority on the topic you have chosen for your presentation and why people will want to listen to what you have to say.

5) Have a strong business case - build your presentation around a strong business case, relevant to your industry and/or your target audience (DBAs, developers, architects etc). Try to explain in clear and simple terms the problem you needed to solve, how you solved it using Oracle technology and the direct technical/business benefits.

6) Make the title and abstract interesting - Your title and abstract must be easy to read and make sure you introduce your main idea as early as possible. Review the titles and abstracts from previous conferences as a guide. Ideally make the issue relevant to the delegates attending OOW, get to the point, and make sure it is easy to read.

7) Look at previous presentations - the content catalog for last year’s conference is available online, see here. You can review all the titles and abstracts that were accepted and use them as guidelines for creating your own title and abstract for this year's conference.

8) Write clear outcomes - The majority of the best presentations have clearly stated outcomes. What do you expect that conference attendees will be able do or know at the end of your session? Consider including a sentence at the end of your abstract such as the following: “At the end of this presentation, participants will be able to . . . .”

9) Don’t submit your paper right away - Once you have a title and abstract show it to a few colleagues. Get some feedback. You probably know many people who’d be happy to give you ideas on making your proposal better.

10) Keep the number of submissions low - You do not increase your chances of getting a paper accepted by submitting lots of different sessions.

 

I have collected all the above and posted it in a handy pocket-size PDF format so you can read all these great tips offline. The PDF is available for download from here: https://googledrive.com/host/0B40aDGvPy4WbT0R1T01TUFg4V3c/OOW2015 Call-For-Papers Guidelines.pdf

Obviously I cannot guarantee you success if you follow these guidelines but I hope they prove helpful. Good luck with your submission(s) and I look forward to seeing you at this year’s OpenWorld conference in the beautiful city of San Francisco.


 


Tuesday, 3 March 2015

Why SQL Part 3 - Simple optimization makes life simpler

In the second part of this series of blog posts (Why SQL Part 2 - It has a powerful framework) I explained the underlying framework that SQL is based on. That framework is referred to as “relational algebra” and it was first described in 1970 by E.F. Codd while at IBM. It provides well-founded semantics for both modelling and querying a set of data. Codd's framework established a series of basic principles that govern the construction and execution of a query and these can be summarized as follows:

  1. Projection
  2. Filter
  3. Join
  4. Aggregate

Now that we have examined the basics of the relational model, let’s move on to how the language actually determines the best way to answer a question based on its use of simplified and transparent optimizations. To understand why simplified and transparent optimization is so important it is useful to consider two of the most common programming paradigms: procedural and declarative. Let’s explore these two approaches:

The Procedural approach to programming

This approach aims to break down a specific task into a collection of variables, procedures, functions, corresponding data structures and conditional operators. It uses a list of instructions, wrapped up in procedures and subroutines, to tell a computer what to do step-by-step. A program (or query) written in a procedural language will contain one or more procedures and these procedures can be invoked anywhere within the program hierarchy as well as by other procedures. Examples of procedural programming languages include C/C++, Java, ColdFusion and PASCAL. The example below shows how to return a sorted list of names from a simple table using a procedural language such as Java (this example refers to a data set based on a list of employees):

 

// Imports required by the classes below (each public class would normally
// live in its own .java file).
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Employee implements Comparable<Employee> {
  private int empno;
  private String ename;
  private int hiredate;

  public Employee(int empno, String name, int hiredate) {
    this.empno = empno;
    this.ename = name;
    this.hiredate = hiredate;
  }

  // Sort in descending order by employee name. If the names are the same then
  // sort in descending order by hiredate. If the hiredates are the same then
  // sort in descending order by empno.
  public int compareTo(Employee o) {
    if (o.ename.equals(this.ename)) {
      if (o.hiredate - this.hiredate == 0) {
        return o.empno - this.empno;
      } else {
        return o.hiredate - this.hiredate;
      }
    } else {
      return o.ename.compareTo(this.ename);
    }
  }

  public int getHiredate() {
    return this.hiredate;
  }

  public int getEmpNo() {
    return this.empno;
  }

  public String getEname() {
    return this.ename;
  }
}

public class TestEmployeeSort {

  public static void main(String[] args) {
    // Util.getEmployees() loads the sample data (see the full listing linked below)
    List<Employee> coll = Util.getEmployees();
    Collections.sort(coll); // sort using Employee.compareTo()
    printList(coll);
    System.out.println("---------------------");
    Collections.sort(coll, new SortEmployeeByName()); // sort by name only
    printList(coll);
  }

  private static void printList(List<Employee> list) {
    System.out.println("EmpNo\tName\tHiredate");
    for (Employee e : list) {
      System.out.println(e.getEmpNo() + "\t" + e.getEname() + "\t" + e.getHiredate());
    }
  }
}

public class SortEmployeeByName implements Comparator<Employee> {
  public int compare(Employee o1, Employee o2) {
    return o1.getEname().compareTo(o2.getEname());
  }
}

(This is an example of how to sort a list in Java using a Comparator. The code is available here: https://sites.google.com/site/hadoopandhive/home/dsd)

The code snippet above tells the computer precisely the order of processing in terms of how to sort the data. Other procedures load the data into an array for processing, and output the sorted list of names. The full code listing is available on the Google Hadoop & Hive site: https://sites.google.com/site/hadoopandhive/home/dsd

 

The Declarative approach to programming

The declarative approach allows the developer to simply build the structure and elements of the required program in a way that expresses just the logic of the computation without having to describe the actual control flow. In simple terms it describes “what” the program, or query, should accomplish rather than describing “how” to actually accomplish the desired outcome. The “how” part of the processing is managed transparently by the underlying platform. The most widely used declarative programming language is SQL. Referring back to our problem of returning a sorted list of employees, the declarative SQL code is much simpler and highlights the key differences between SQL’s declarative-based code and more workflow-orientated procedural languages such as Java:

SELECT
  empno,
  ename,
  hiredate
FROM emp
ORDER BY ename DESC, hiredate DESC, empno DESC;

As you can see, the SQL statement simply declares the required output and the sort order; the control flow and the sorting algorithm are managed transparently by the underlying platform. This contrasts with the procedural approach, which breaks the task down into variables, procedures, functions, data structures and conditional operators, and uses a step-by-step list of instructions wrapped up in procedures and subroutines to tell the computer what to do. Many of today’s big data languages adopt this procedural approach.

 

Summary

With declarative languages, such as SQL, the developer only has to understand the relationships between entities, the final output and transformations needed to support the output. From these two code samples it is obvious that the declarative approach is more compact, significantly less complex and much easier to maintain and enhance. This explains another of the key reasons why SQL is becoming the go-to language for data analysis.

Key characteristics:
  • Developer needs to understand:
      - Procedural coding: how to organize and perform tasks, and must track changes in state
      - Declarative coding: the relationships between entities, the final output and the transformations needed to support the output
  • Flow control is implemented using:
      - Procedural coding: looping, conditional branching and developer-defined function calls
      - Declarative coding: function calls and high-level recursion
  • Management of application and system state changes:
      - Procedural coding: has to be controlled and managed by the developer
      - Declarative coding: N/A
  • Order of execution:
      - Procedural coding: very important
      - Declarative coding: N/A

 

 

Simplified Optimizations

The optimization techniques for queries tend to focus on minimizing the level of network traffic, reducing I/O operations, minimizing looping operations and/or eliminating redundant calculations and operations. Each application language has its own unique set of tuning issues, so using lots of different languages to support specific sets of business requirements creates a huge overhead for project teams and reduces the overall agility of the project. Trying to implement and then optimize change requests from the business can become a time-consuming task.

Many big data languages are procedural in their approach, where the developer is responsible for implementing optimization techniques to support their workload. These languages provide a series of structures to manage the resources consumed by a job. For example, each Java-based MapReduce job requires a job configuration file that contains a number of parameters: maximum virtual memory, the cumulative size of the serialization and accounting buffers, the number of segments on disk to be merged at the same time and the percentage of memory relative to the maximum heap-size in which map outputs may be retained during the reduce. Each developer optimizes his/her own specific code/process but does so in complete isolation from the overall workload.

This granular level of control provides tremendous flexibility but can create conflicts when there are multiple applications executing similar queries. The correct level of parallelism for big data maps, within a MapReduce job, seems to be around 10-100 maps per node, although higher numbers for CPU-light map tasks are possible. However, there is nothing to stop the developer from taking even more resources from the pool to run their code. For more information on the various fine-grained parameters for implementing, configuring and tuning MapReduce jobs refer to the MapReduce tutorial on the Apache Hadoop website.

The fact that SQL is a declarative language means the developer is shielded from the complexities of the underlying query optimization techniques. Data discovery and other analytical requirements that are implemented using the Oracle Database as the processing platform have a rich set of optimization techniques covering parallelization, query transformations, indexing and join algorithms.  

Building an Optimal Query Plan

The Oracle query optimizer is at the core of Oracle’s database software and it determines the most efficient method for a SQL statement to access requested data. Due to the many internal statistics and tools within the database, the optimizer is usually in a better position than the developer to determine the best method of statement execution. During the process of determining the best plan, the optimizer examines multiple access methods, such as full table scans or index scans, and different join methods such as nested loops and hash joins before arriving at the “best execution plan”. For this reason, all SQL statements use the Oracle optimizer.

The optimizer attempts to generate the best execution plan for a given SQL statement. An “execution plan” describes a recommended method of execution for a SQL statement. Each plan details the combination of steps that the Oracle Database must use to execute a SQL statement. The “best execution plan” is the plan with the lowest cost among all considered candidate plans. The calculation of the overall cost takes into account key factors of query execution such as I/O, CPU, and communication. With procedural languages all these factors have to be coded, configured and managed by the developer.

Consider a user who queries records for employees who are managers. If the database statistics indicate that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be more efficient than a full table scan. This type of optimization is impractical to implement in MapReduce, since the Java developer would not only have to write the query code, but also an additional set of code to gather and maintain statistics about the underlying data.
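As a quick illustration of how the chosen plan can be inspected (a minimal sketch that simply reuses the EMP table from the earlier examples rather than anything specific to the scenario above), EXPLAIN PLAN and DBMS_XPLAN show which access method the optimizer selected:

EXPLAIN PLAN FOR
  SELECT empno, ename
  FROM emp
  WHERE job = 'MANAGER';

-- Display the plan chosen by the optimizer
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);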

 

 

 


 

 

Parallel Execution

The most complicated area of optimization is typically in the area of parallel execution. The objective of parallel execution is to reduce the total execution time of an operation by using multiple resources concurrently. Resource availability is the most important prerequisite for scalable parallel execution.  For procedural languages developers have to understand a complex set of rules and very carefully consider the best way to parallelize the operations within their code. As mentioned before, this process typically takes place in isolation so that the developer is unaware of other jobs that might be running at the same time and the level of resources that might be available at any given point in time.

The Oracle Database provides a powerful SQL parallel execution engine. It can decide autonomously if and how parallel execution should be enabled for a query. By default the Oracle Database is enabled for parallel execution and when Automatic Degree of Parallelism (Auto DOP) is active, the database decides if a statement should execute in parallel or not and what degree of parallelism (DOP) should be used. The decision when to use parallel execution and the DOP chosen are both based on the resource requirements of a statement and the resources available at the time the query is submitted. 
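For example, Auto DOP can be switched on at the session level and the decision handed entirely to the database (a sketch only - whether the statement actually runs in parallel still depends on its cost and on the resources available at submission time):

ALTER SESSION SET parallel_degree_policy = AUTO;

-- Let the database decide the degree of parallelism for this statement
SELECT /*+ PARALLEL(AUTO) */ deptno, SUM(sal)
FROM emp
GROUP BY deptno;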

After the optimizer determines the fastest execution plan for a statement, the parallel execution coordinator determines the parallel execution method for each operation within the plan. The parallel execution coordinator has to decide whether an operation can be performed in parallel and, if it can, how many parallel execution servers it should use. At this point the database has parsed the query, calculated the overall cost and then determined the optimal DOP. Of course, it is entirely possible for the cheapest plan to be a serial execution process, so that remains an option. Finally, the Oracle Database will determine if the statement can be executed immediately or if it needs to be queued until more system resources become available. The image below, “Optimizer Calculation: Serial or Parallel?”, illustrates this decision making process:

[Image: Optimizer Calculation: Serial or Parallel?]

For more information see the following section in the Oracle Database 12c Data Warehousing Guide: https://docs.oracle.com/database/121/DWHSG/schemas.htm#DWHSG9066.

In general, as more and more statements attempt to run in parallel, it quickly becomes very important to manage the utilisation of the parallel processes available. This means some degree of intelligence is required about when to run a statement in parallel and to verify whether the requested number of parallel processes is available. This is extremely difficult for a developer to do simply because they are working in isolation from the overall workload. The only way to implement this type of intelligent processing is to allow the database to control the whole process.

With SQL all this happens transparently for each SQL statement that is submitted. The developer does not need to get involved in determining the degree of parallelism, which simplifies the overall application logic. It also means that as new optimization techniques become available the application source code can transparently reap the benefits of the new features.


Summary

The initial focus on developer-centric optimization techniques within big data projects is now giving way to a more platform-centric approach as the complexity of analysis demanded by business users and data scientists continues to increase. By letting the database perform the optimisations at source, where the data is located, it is possible to share tuning improvements across a wide spectrum of front-end systems (BI reports, dashboards, applications etc). The ability of the database to “own” the optimization and processing characteristics of how a SQL statement is executed is an important element in the success of SQL as a data analysis language.

By deferring the majority of operations to the database, developers can simplify their application code and transparently inherit new features as they are released.


Thursday, 19 February 2015

Why SQL Part 2 - It has a powerful framework

In the first part (Why SQL is the natural language for data analysis) of this series of blog posts I explained why analytics is important to the business, how the volume of data along with the types of data is continuing to expand and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features, its power, flexibility and the richness of its analytics. This familiarity with SQL means that sometimes we are a bit slow to prevent some of our projects from investing in other big data languages such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered “cool”, it is very easy to end up building new proprietary data silos or investing in a language that is eventually replaced by another open source project or fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis:

  • Powerful framework
  • Transparent optimization
  • Continuous evolution
  • Standards based

The aim of this post is to explore the first of these features: powerful framework  

The reason for collecting data is that it needs to be interrogated: events need to be dissected, customers need to be profiled, product sales evaluated. This interrogation process is typically framed around a set of data rather than a single data point or row of data, and mathematical rules have been created to provide a precise framework for this analytical process. The framework is “relational algebra”, which was first described in 1970 by E.F. Codd while at IBM. It provides well-founded semantics for both modeling and querying a set of data.

Codd's framework established a series of basic principles that govern the construction and execution of a query and these can be summarized as follows:

  1. Projection
  2. Filter
  3. Join
  4. Aggregate

The ideas behind relational algebra (set theory) are not just limited to the points that will be covered in this particular section. These concepts have implications that cascade right across the other unique features. While the elements of this basic framework are constant, the way that these elements are implemented within many of the current big data related SQL-like languages differs based on the approach adopted by each of the open source projects/languages.

Let’s explore each of the four basic principles starting with “Projections”.

1. Projections

When reviewing the concept of a “set of data” in the context of SQL it is important to remember that each source set of data contains two separate sets: a column set and a row set. When interrogating a set of data the first step is to determine which columns within the set are of interest, i.e. projections of the total set of columns. When a database system does a projection, it simply determines which columns are necessary for a given analysis and discards the other columns from the analysis.

The Oracle Database has a comprehensive metadata layer that supports the discovery of columns across a wide range of data sets: relational tables/views, XML documents, JSON documents, spatial objects, image-style objects (BLOBs and CLOBs), semantic networks etc.

The first part of the SELECT clause is used to list the columns of interest within the set of data and this can include all the columns rather than simply a subset. The syntax for selecting all columns within a set, in this case using a table called EMP, is:

SELECT * FROM emp;

The Oracle Database uses its metadata layer to establish the list of all column names associated with the set and then internally expands the query statement to include all the relevant column names.

 


Alternatively the query can specify the required columns within the set such as:

SELECT ename, job, hiredate FROM emp;


This level of sophistication (both metadata and automatic statement expansion) is missing in many data manipulation languages and this requires developers to add additional bespoke code to cope with these basic requirements. SQL also supports the concept of extended projections. This is where new data columns are created within the result set using arithmetic operations. For example, taking the columns SAL and COMM it is possible to compute the rate of commission by dividing COMM by SAL to create a new column:

SELECT
  ename,
  sal,
  comm,
  comm/sal*100 AS comm_rate
FROM emp;

Oracle’s Big Data SQL feature extends this metadata model so that it can encompass data stored inside NoSQL databases, JSON documents, and files stored on HDFS. This comprehensive metadata layer makes it very easy for developers to create dynamic SQL statements containing column projections based around a wide range of data sources.

 

2. Filters

The next stage within the query framework is to specify the rows that are of interest. Since these cannot be identified in the same way as columns, using names, a different approach is used for row-based projections. This approach requires the use of filters – i.e. describing the attributes associated with row sets that are of interest. Within the SQL language row filtering is part of the WHERE clause syntax (these filters are also often called predicates in relational terminology).

The developer can define specific filtering criteria that rows must meet to be included in the result set. For example using the employee data set it is possible to extend our previous example to limit the set of rows returned by a query to just those associated with the job type ‘CLERK’ :

SELECT
  empno,
  ename, 
  job,
  mgr,
  hiredate,
  sal, 
  comm, 
  deptno
FROM emp
WHERE job = 'CLERK';


 

Applying filters can not only reduce the returned data set based on attributes describing the data – such as the job type above – it can also reduce the returned data set to an absolute or relative subset of the result set.

For business-driven queries this process of limiting rows is essential. A query can be structured to return a specified number or percent of rows starting with the first row after the offset. The offset allows for modification of typical questions, so that the question about highest-paid employees might skip the top ten employees and return only those from eleventh to twentieth place in the salary rankings. In a similar manner it is possible to query the employees data set by salary, skip the top ten employees and then return the top 10% of the remaining employees.
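For instance, using the Database 12c row-limiting clause, the “eleventh to twentieth place” example above might look something like this (a simple sketch against the EMP table used throughout this series):

-- Skip the ten highest-paid employees and return the next ten
SELECT empno, ename, sal
FROM emp
ORDER BY sal DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;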

The SQL language supports a wide range of filtering comparison operators to identify specific rows:

Operator              Description
=, !=, <>             Tests for equal to, not equal to, not equal to
>, >=, <, <=          Tests for greater than, greater than or equal to, less than, less than or equal to
BETWEEN ... AND ...   Checks for a range between and including two values
LIKE                  Searches for a match in a string, using the wildcard symbols % (zero or multiple characters) or _ (one character)
IN ( ), NOT IN ( )    Tests for a match, or not a match, in a specified list of values
IS NULL, IS NOT NULL  Checks whether a value is null, is not null

 

While it is possible to filter rows based on values, for some types of application-driven queries this process of limiting rows can be extended further. For application developers this process of limiting rows using the SQL language is extremely flexible. For example, it can be used to aid testing or to provide a pagination feature for passing data to a front-end visualization. The Oracle Database supports a number of techniques to help developers meet this requirement such as:

  • FETCH FIRST N/N-PERCENT
  • RANK
  • ROW_NUMBER()

These features can be used within user interfaces to provide the first few rows of a data set for browsing. The SQL language has a very rich set of filtering techniques that are both simple to implement and to amend as requirements evolve over time.
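As a sketch of the pagination use case mentioned above, ROW_NUMBER() can be used to return an arbitrary “page” of a result set (the page boundaries shown are purely illustrative):

SELECT empno, ename, sal
FROM (
  SELECT
    empno,
    ename,
    sal,
    ROW_NUMBER() OVER (ORDER BY sal DESC) AS rn
  FROM emp
)
WHERE rn BETWEEN 11 AND 20;  -- the second "page" of ten rows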

 

3. Joins

Most query operations require the use of at least two data sets and this necessitates a “join” operation. At a simplistic level, a join is used to combine the fields within two or more data sets in a single query, based on common logical relationships between those various data sets. In our simple data example, suppose that there was a department data set in addition to the employee data set. A query might ‘join’ the department and employee data to return a single result combining data from both data sets. There are a number of ways that data sets can be joined:

  • Inner - returns all rows where there is at least one match in both tables
  • Full Outer - returns all rows from both tables, with matching rows from both sides where available. If there is no match, the missing side will contain null.
  • Outer left - returns all rows from the left table, and matched rows from the right table
  • Outer right - returns all rows from the right table, and matched rows from the left table
  • Cross - returns a Cartesian product of the source sets, i.e. all rows from the left table for each row in the right table

The process of defining and executing a SQL join is simple. The required type of join is specified in the FROM clause using the JOIN ... ON syntax (joins can also be expressed as predicates in the WHERE clause):

SELECT
  d.deptno,
  d.dname,
  e.empno,
  e.ename
FROM dept d
INNER JOIN emp e ON (e.deptno = d.deptno);

Note that there is nothing in this SQL query that describes the actual join process for the two data sets.

In many procedural languages the process of joining two sets of data can be complicated due to the need to explicitly code each join structure and join algorithm for each combination of columns across the various data sets. The level of complication builds as additional data sets are added to the query: different join operations are required for different combinations of data sets, such as those needed to take account of missing values. The join order of multiple data sets has to be decided on and coded procedurally. It is very easy for the amount and complexity of code to increase dramatically as the number of data sets increases. This makes data processing validation and debugging very challenging. Once additional requirements are applied, such as aggregating specific data points, the level of code complexity within procedural languages can escalate very quickly.

For example, when we join the two tables EMP and DEPT there could be departments that contain no employees. Using an inner join, the departments with zero employees will not be returned. If a query needs to return a count of the number of employees in every department, including the ‘empty’ departments containing zero employees, then an outer join is required as shown here:

 

SELECT
  d.deptno,
  COUNT(e.empno)
FROM dept d
LEFT OUTER JOIN emp e ON (e.deptno = d.deptno)
GROUP BY d.deptno
ORDER BY d.deptno;

 


 

Hopefully, if you study the above code samples it is clear that SQL’s join code is easily readable – and code-able. The developer only specifies the semantic join condition and leaves the processing details - such as the order of the joins - to the SQL engine.

 

4. Aggregate

Aggregation is an important step in the process of analyzing data sets. Most operational, strategic and discovery-led queries rely on summarizing detailed level data. According to a TDWI report (“Data Aggregation - Seven Key Criteria to an Effective Aggregation Solution”), up to 90% of all reports contain some level of aggregate information. Therefore, the ability to simply and efficiently aggregate data is a key requirement when selecting a language. If the aggregation process is correctly implemented it can generate significant performance benefits, which creates new opportunities for organizations to boost their overall analysis and reporting capabilities.

The types of aggregation applied to a data set can vary from simple counts to sums to moving averages to statistical analysis such as standard deviations. Therefore, the ability to simply and efficiently aggregate data is a key requirement for any analytical data language. Procedural languages, such as the Java-based MapReduce, are more than capable of taking a data set and aggregating it, or reducing it, to provide a summary result set. This approach is adequate for most simple data discovery style queries, i.e. those that include basic counts. However, adding more complex aggregation requirements quickly increases the amount of code required to manage the computations. SQL has a rich set of data aggregation capabilities that make it easy to work on all rows in a set. For example, it is possible to sum all rows within a single column as follows:

SELECT
  SUM(sal) AS total_salary
FROM emp;

It is easy to extend this query to accommodate new requirements such as a count of the number of employees and the average salary:

SELECT
  COUNT(empno) AS no_of_employees,
  SUM(sal) AS total_salary,
  AVG(sal) As average_salary
FROM emp;

Taking these examples even further, it is a simple step to group rows into specific categories using the GROUP BY clause. The aggregate functions and the GROUP BY clause can be used to group the data and then apply the specific aggregate function(s) to count the number of employees, sum the salary and calculate the average salary in each department within a single query as shown below:

SELECT
  deptno,
  COUNT(empno) AS no_of_employees,
  SUM(sal) AS total_salary,
  AVG(sal) AS average_salary
FROM emp
GROUP BY deptno;

The ANSI SQL:2003 standard (more on this towards the end of this series) extended the process of aggregating data by introducing the concept of analytic functions. These functions divide a data set into groups of rows called partitions, making it possible to calculate aggregates for each partition and for rows within each partition.

Additional keywords define how analytical functions, such as average, sum, min, max etc., will be evaluated within each partition. Using the previous example, the statement below creates analytical reporting totals such as: total salary within each department, moving average salary within each department and average salary within each department:

SELECT
  d.deptno,
  d.dname,
  e.ename,
  e.sal AS sal,
  SUM(e.sal) OVER (PARTITION BY e.deptno) AS dept_sal,
  ROUND(AVG(e.sal) OVER (PARTITION BY e.deptno ORDER BY e.empno)) AS moving_avg_sal,
  ROUND(AVG(e.sal) OVER (PARTITION BY e.deptno)) AS avg_dept_sal
FROM dept d
LEFT OUTER JOIN emp e ON (e.deptno = d.deptno);

 

Compare the simplicity of the above SQL code for computing a moving average with the equivalent MapReduce code posted by Cloudera on GitHub as part of the blog post “Simple Moving Average, Secondary Sort, and MapReduce”. This “simple” example consists of twelve Java program files to perform the moving average calculation, manage the job parameters and specify the framework for the associated workflow. Making changes to the calculations as business requirements evolve will be a significant challenge given the amount of code within this “simple” project.

SQL offers an additional ability to apply restrictions using columns that are returned as part of the result set. The HAVING clause filters results based on calculations in the SELECT clause and/or aggregations derived from the GROUP BY processing, in the same way that the WHERE clause filters individual rows, for example:

SELECT
  deptno AS department,
  COUNT(empno) AS no_employees,
  AVG(sal) AS avg_sal,
  SUM(sal) AS tot_sal
FROM emp
GROUP BY deptno
HAVING AVG(sal) > 2500;

 

Using SQL, developers and DBAs can leverage simple, convenient and efficient data aggregation techniques that require significantly less program code compared to using other analytical programming languages. The simplicity provided by SQL makes it easier and faster to construct, manage and maintain application code and incorporate new business requirements.

 

Summary

Hopefully within this blog post I have explained the primary operators that are part of relational algebra (set theory). Given that most of us use SQL every single day, it is very easy to forget the power and sophistication that is going on under the covers. It all gets taken for granted until you find yourself working with a different language that is built on a completely different framework.

Hopefully, it is clear that SQL offers a simple and efficient way to write queries compared to some of the other newer big data related languages that are emerging through the open source community. This concludes my look at the first of the four key reasons why SQL is one of the most successful languages. In my next post, which will probably appear towards the end of this week, I will explore why SQL’s transparent optimization is such a critical feature.

 


Wednesday, 18 February 2015

Why SQL is the natural language for data analysis

Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of data, especially in “big data environments” - data stored and processed outside of classical relational systems - continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it.

A report by Accenture and GE (How the Industrial Internet is Changing the Competitive Landscape of Industries) found that 87% of enterprises believe big data analytics will redefine the competitive landscape of their industries within the next three years and 89% believe that companies that fail to adopt a big data analytics strategy in the next year risk losing market share and momentum.

Additionally, a recent Cloudera webcast (Pervasive Analytics Gets Real ) noted that, while all businesses understand that analytics drive value, most organizations leverage only an average of 12% of their enterprise data for analytics. This implies that there is a significant amount of business value and opportunity that is being completely missed.

This type of market analysis highlights the huge analytical challenge that many businesses face today. While many companies are willing to invest in the critical area of big data technology to create new “data reservoirs”, for most of them the same level of focus in relation to the analysis of these new data sources is missing. This means that many will struggle to find a meaningful way to analyze and realize the benefits of this vital investment strategy.

Many of the early adopters of big data are managing the analysis of their data reservoirs through the use of specialized programming techniques on the big data ecosystem, such as MapReduce. This is leading to data being locked inside proprietary data silos, making cross-functional cross-data store analysis either extremely difficult or completely impossible. IT teams are struggling to adapt complex data-silo-specific program code to support new and evolving analytical requirements from business users. In many cases these additional requirements force teams to implement yet more data processing languages. Overall, these issues greatly complicate the conversion of big data led discoveries into new business opportunities: driving the development of new products, capturing increased market share and/or launching into completely new markets. Most companies are searching for a single rich, robust, productive, standards driven language that can provide unified access over all their data and drive rich, sophisticated analysis.

Already, many companies are seeing the benefits of using SQL to drive analysis of their big data reservoirs. In fact, SQL is fast becoming the default language for big data analytics. This is because it provides a mature and comprehensive framework for both data access (so projects can avoid creating data silos) and rich data analysis.


The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users.

Why SQL is so successful

Data processing has seen many changes and significant technological advances over the last forty years. However, there has been one technology, one capability that has endured and evolved: the Structured Query Language or SQL. Many other technologies have come and gone but SQL has been a constant. In fact, SQL has not only been a constant, but it has also improved significantly over time. What is it about SQL that is so compelling? What has made it the most successful language? SQL’s enduring success is the result of a number of important and unique factors:

  • Powerful framework
  • Transparent optimization
  • Continuous evolution
  • Standards based

Over the coming weeks I will explore each of these key points in separate posts and explain what makes SQL such a compelling language for data analysis. So stay tuned…..

 


Wednesday, 28 January 2015

January Edition of Oracle DW and Big Data Magazine

 

Oracle Data Warehouse and Big Data Magazine JANUARY Edition for Customers + Partners


The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. It brings together all the most important announcements and videos taken from our data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the JANUARY Edition.


Please share this link http://flip.it/fKOUS with colleagues and via your social media sites.


This magazine is optimized for display on tablets and smartphones using the Flipboard app, which is available from the Apple App Store and Google Play store.