Why SQL Part 3 - Simple optimization makes life simpler

In the second part of this series of posts (Why SQL Part 2 - It has a powerful framework) I explained the underlying framework that SQL is based on. That framework, referred to as "relational algebra", was first described in 1970 by E.F. Codd while he was at IBM. It provides well-founded semantics for both modelling and querying a set of data. Codd's framework established a series of basic principles that govern the construction and execution of a query, and these can be summarized as follows (a combined example is shown after the list):

  1. Projection
  2. Filter
  3. Join
  4. Aggregate
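
As a quick refresher, a single SQL statement can combine all four of these operations. The example below is hypothetical and assumes the classic EMP and DEPT sample tables:

-- Projection (the column list), filter (WHERE), join (EMP to DEPT)
-- and aggregation (GROUP BY with COUNT) in one declarative statement
SELECT d.dname,
       COUNT(*) AS headcount
FROM   emp e
JOIN   dept d ON e.deptno = d.deptno
WHERE  e.hiredate > DATE '1981-01-01'
GROUP BY d.dname;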

Now that we have examined the basics of the relational model, let's move on to how the language determines the best way to answer a question through its use of simplified and transparent optimization. To understand why this matters, it is useful to consider the two most common programming paradigms: procedural and declarative. Let's explore these two approaches.

The Procedural approach to programming

This approach breaks a specific task down into a collection of variables, procedures, functions, corresponding data structures and conditional operators. It uses a list of instructions, wrapped up in procedures and subroutines, to tell the computer what to do step by step. A program (or query) written in a procedural language contains one or more procedures, and these procedures can be invoked anywhere within the program hierarchy as well as by other procedures. Examples of procedural programming languages include C/C++, Java, ColdFusion and Pascal. The example below shows how to return a sorted list of names from a simple table using a procedural language such as Java (the example refers to a data set based on a list of employees):

 

// (Each public class would live in its own source file.)
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Employee implements Comparable<Employee> {
  private int empno;
  private String ename;
  private int hiredate;

  public Employee(int empno, String ename, int hiredate) {
    this.empno = empno;
    this.ename = ename;
    this.hiredate = hiredate;
  }

  // Sort in descending order by employee name. If the names are the same,
  // sort in descending order by hire date; if the hire dates are also the
  // same, sort in descending order by employee number.
  public int compareTo(Employee o) {
    if (o.ename.equals(this.ename))
      if (o.hiredate - this.hiredate == 0)
        return o.empno - this.empno;
      else
        return o.hiredate - this.hiredate;
    else
      return o.ename.compareTo(this.ename);
  }

  public int getHiredate() {
    return this.hiredate;
  }

  public int getEmpNo() {
    return this.empno;
  }

  public String getEname() {
    return this.ename;
  }
}

public class TestEmployeeSort {

  public static void main(String[] args) {
    List<Employee> coll = Util.getEmployees(); // sample data; see the full listing linked below
    Collections.sort(coll); // sort using the natural ordering defined by compareTo
    printList(coll);
    System.out.println("---------------------");
    Collections.sort(coll, new SortEmployeeByName()); // sort by name only
    printList(coll);
  }

  private static void printList(List<Employee> list) {
    System.out.println("EmpNo\tName\tHiredate");
    for (Employee e : list) {
      System.out.println(e.getEmpNo() + "\t" + e.getEname() + "\t" + e.getHiredate());
    }
  }
}

public class SortEmployeeByName implements Comparator<Employee> {
  public int compare(Employee o1, Employee o2) {
    return o1.getEname().compareTo(o2.getEname());
  }
}

The code snippet above (an example of sorting a list in Java using a Comparator) tells the computer precisely how to process the data, step by step, in order to sort it. Other procedures load the data into a list for processing and output the sorted list of names. The full code listing is available on the Google Hadoop & Hive site: https://sites.google.com/site/hadoopandhive/home/dsd

 

The Declarative approach to programming

The declarative approach allows the developer to build the structure and elements of the required program in a way that expresses just the logic of the computation, without having to describe the actual control flow. In simple terms, it describes "what" the program, or query, should accomplish rather than "how" to accomplish it; the "how" is managed transparently by the underlying platform. The most widely used declarative programming language is SQL. Returning to our problem of producing a sorted list of employees highlights the key differences between SQL's declarative code and more workflow-orientated procedural languages such as Java:

SELECT
  empno,
  ename,
  hiredate
FROM emp
ORDER BY ename DESC, hiredate DESC, empno DESC

As you can see, the SQL statement expresses only the logic of the computation: the columns required, the source table and the desired sort order. There are no loops, no data structures and no step-by-step instructions; the control flow is managed transparently by the underlying platform. This contrasts sharply with the procedural approach, which many of today's big data languages adopt.

 

Summary

With declarative languages, such as SQL, the developer only has to understand the relationships between entities, the final output and the transformations needed to support that output. From these two code samples it is obvious that the declarative approach is more compact, significantly less complex and much easier to maintain and enhance. This is another of the key reasons why SQL is becoming the go-to language for data analysis.

Key characteristics | Procedural Coding | Declarative Coding
Developer needs to understand... | how to organize and perform tasks; must track changes in state | relationships between entities, the final output and the transformations needed to support the output
Flow control is implemented using... | looping, conditional branching and developer-defined function calls | function calls and high-level recursion
Management of application and system state changes... | has to be controlled and managed by the developer | N/A
Order of execution... | very important | N/A

 

 

Simplified Optimizations

Query optimization techniques tend to focus on minimizing network traffic, reducing I/O operations, minimizing looping operations and eliminating redundant calculations and operations. Each application language has its own unique set of tuning issues, so using many different languages to support specific sets of business requirements creates a huge overhead for project teams and reduces the overall agility of the project. Trying to implement and then optimize change requests from the business can become a very time-consuming task.

Many big data languages are procedural in their approach: the developer is responsible for implementing optimization techniques to support the workload. These languages provide a series of structures to manage the resources consumed by a job. For example, each Java-based MapReduce job requires a job configuration that sets a number of parameters: the maximum virtual memory, the cumulative size of the serialization and accounting buffers, the number of on-disk segments to be merged at the same time, and the percentage of memory, relative to the maximum heap size, in which map outputs may be retained during the reduce. Each developer optimizes his or her own specific code/process but does so in complete isolation from the overall workload, as the sketch below illustrates.
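
As a minimal sketch of what this looks like in practice (the property names below are standard Hadoop 2.x settings, but the values and the job itself are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Maximum memory allocated to each map and reduce task (MB)
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
    // Cumulative size of the serialization and accounting buffers (MB)
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    // Number of on-disk segments merged at the same time
    conf.setInt("mapreduce.task.io.sort.factor", 64);
    // Percentage of heap in which map outputs may be retained during the reduce
    conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.5f);
    Job job = Job.getInstance(conf, "hand-tuned job");
    // ... mapper, reducer and input/output paths would be configured here ...
  }
}

Every one of these values has to be chosen, and re-chosen, by the developer as the workload changes.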

This granular level of control provides tremendous flexibility but can create conflicts when multiple applications execute similar queries. The correct level of parallelism for maps within a MapReduce job seems to be around 10-100 maps per node, although higher numbers are possible for CPU-light map tasks. However, there is nothing to stop a developer from taking even more resources from the pool to run their code. For more information on the various fine-grained parameters for implementing, configuring and tuning MapReduce jobs, refer to the MapReduce tutorial on the Apache Hadoop website.

The fact that SQL is a declarative language means the developer is shielded from the complexities of the underlying query optimization techniques. Data discovery and other analytical requirements implemented with the Oracle Database as the processing platform benefit from a rich set of optimization techniques covering parallelization, query transformations, indexing and join algorithms.

Building an Optimal Query Plan

The Oracle query optimizer is at the core of Oracle's database software: it determines the most efficient method for a SQL statement to access the requested data. Thanks to the many internal statistics and tools within the database, the optimizer is usually in a better position than the developer to determine the best method of execution. While determining the best plan, the optimizer examines multiple access methods, such as full table scans and index scans, and different join methods, such as nested loops and hash joins, before arriving at the "best execution plan". For this reason, all SQL statements use the optimizer.

The optimizer attempts to generate the best execution plan for a given SQL statement. An "execution plan" describes a recommended method of execution for a SQL statement; each plan details the combination of steps that the Oracle Database must use to execute the statement. The "best execution plan" is simply the plan with the lowest cost among all considered candidate plans, where the cost calculation takes into account key factors of query execution such as I/O, CPU and communication. With procedural languages, all of these factors have to be coded, configured and managed by the developer.

Consider a user who queries records for employees who are managers. If the database statistics indicate that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be more efficient than a full table scan. This type of optimization is impractical to implement in MapReduce, since the Java developer would not only have to write the query code but also an additional set of code to gather and maintain statistics about the underlying data.
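
As a simple illustration, assuming the classic EMP sample schema where the JOB column identifies managers, the chosen plan can be inspected directly; whether it shows a full table scan or an index access depends entirely on the statistics gathered for EMP:

-- Ask the optimizer for its plan for the "managers" query...
EXPLAIN PLAN FOR
  SELECT empno, ename
  FROM   emp
  WHERE  job = 'MANAGER';

-- ...and display the result
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);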

 

 

 

[Image: ExecutionPlan1 - an example execution plan]

 

 

Parallel Execution

The most complicated area of optimization is typically parallel execution. The objective of parallel execution is to reduce the total execution time of an operation by using multiple resources concurrently, and resource availability is the most important prerequisite for scalable parallel execution. With procedural languages, developers have to understand a complex set of rules and very carefully consider the best way to parallelize the operations within their code. As mentioned before, this process typically takes place in isolation: the developer is unaware of other jobs that might be running at the same time and of the level of resources available at any given point in time.

The Oracle Database provides a powerful SQL parallel execution engine that can decide autonomously if and how parallel execution should be enabled for a query. By default the Oracle Database is enabled for parallel execution, and when Automatic Degree of Parallelism (Auto DOP) is active the database decides whether a statement should execute in parallel and what degree of parallelism (DOP) should be used. The decision about when to use parallel execution, and the DOP chosen, are both based on the resource requirements of the statement and the resources available at the time the query is submitted.
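
As a brief sketch (PARALLEL_DEGREE_POLICY and the statement-level PARALLEL hint are standard Oracle features; the degree shown is illustrative only):

-- Let the database decide whether, and at what DOP, a statement runs in
-- parallel (this setting also enables parallel statement queuing)
ALTER SESSION SET parallel_degree_policy = AUTO;

-- Alternatively, explicitly request a degree of parallelism for one statement
SELECT /*+ PARALLEL(4) */ empno, ename, hiredate
FROM   emp
ORDER BY ename DESC;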

After the optimizer determines the fastest execution plan for a statement, the parallel execution coordinator determines the parallel execution method for each operation within the plan. The coordinator has to decide whether an operation can be performed in parallel and, if it can, how many parallel execution servers it should use. At this point the database has parsed the query, calculated the overall cost and determined the optimal DOP. It is, of course, entirely possible for the cheapest plan to be a serial one, so serial execution remains an option. Finally, the Oracle Database determines whether the statement can be executed immediately or needs to be queued until more system resources become available. The image below, "Optimizer Calculation: Serial or Parallel?", illustrates this decision-making process:

[Image: Optimizer Calculation: Serial or Parallel?]

For more information see the following section in the Oracle Database 12c Data Warehousing Guide: https://docs.oracle.com/database/121/DWHSG/schemas.htm#DWHSG9066.

In general, as more and more statements attempt to run in parallel, it quickly becomes very important to manage the utilization of the available parallel processes. This means some degree of intelligence is required about when to run a statement in parallel, together with a check on whether the requested number of parallel processes is available. This is extremely difficult for a developer to do, simply because they are working in isolation from the overall workload. The only way to implement this type of intelligent processing is to allow the database to control the whole process.
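
For example, with Auto DOP active the database queues parallel statements once the pool of parallel servers is busy, and the queue itself can be observed. A hedged sketch, where the target value is illustrative:

-- Cap the number of parallel servers in use before statement queuing begins
ALTER SYSTEM SET parallel_servers_target = 64;

-- Statements waiting for parallel servers are reported as QUEUED
SELECT sql_id, status
FROM   v$sql_monitor
WHERE  status = 'QUEUED';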

With SQL, all of this happens transparently for each SQL statement that is submitted. The developer does not need to get involved in determining the degree of parallelism, which simplifies the overall application logic. It also means that as new optimization techniques become available, the application source code transparently reaps the benefits of the new features.


Summary

The initial focus on developer-centric optimization techniques within big data projects is now giving way to a more platform-centric approach as the complexity of analysis demanded by business users and data scientists continues to increase. By letting the database perform the optimizations at source, where the data is located, it is possible to share tuning improvements across a wide spectrum of front-end systems (BI reports, dashboards, applications etc.). The ability of the database to "own" the optimization and processing characteristics of how a SQL statement is executed is an important element in the success of SQL as a data analysis language.

By deferring the majority of operations to the database, developers can simplify their application code and transparently inherit new features as they are released.
