Connecting Apache Zeppelin to your Oracle Data Warehouse

[Image: Zeppelin NT]

In my last post I provided an overview of the Apache Zeppelin open source project, which is a new style of application called a “notebook”. These notebook applications typically run within your browser, so as an end user there is no desktop software to download and install.

Interestingly, I had a very quick response to this article asking about how to set up a connection within Zeppelin to an Oracle Database. Therefore, in this post I am going to look at how you can install the Zeppelin server and create a connection to your Oracle data warehouse.

The aim of this post is to walk you through the following topics:
  • Installing Zeppelin
  • Configuring Zeppelin
  • What is an interpreter
  • Finding and installing the Oracle JDBC drivers
  • Setting up a connection to an Oracle PDB

Firstly, a quick warning! There are a couple of different versions of Zeppelin available for download. At the moment I am primarily using version 0.6.2, which works really well. Currently, for some reason, I am seeing performance problems with the latest 0.7.x releases. I have discussed this with a few people here at Oracle and we are all seeing the same behaviour: queries will run, they just take 2-3 minutes longer than with earlier, pre-0.7.x versions of Zeppelin.

In the interests of completeness in this post I will cover setting up a 0.6.2 instance of Zeppelin as well as a 0.7.1 instance.

Installing Zeppelin

The first thing you need to decide is where to install the Zeppelin software. You can run it on your own PC, on a separate server or on the same server that is running your Oracle Database. I run all my Linux-based database environments within VirtualBox images, so I always install onto the same virtual machine as my Oracle database. This makes life easier for moving demos around when I am heading off to a user conference.

Step two is to download the software. The download page is here: https://zeppelin.apache.org/download.html.

Simply pick the version you want to run and download the corresponding compressed file. My recommendation, based on my experience, is to stick with version 0.6.2, which was released on Oct 15, 2016. I always download the full application, “Binary package with all interpreters”, just to make life easy. It also gives me access to the full range of connection options which, as you will discover in my next post, is extremely useful.

Installing Zeppelin - Version 0.6.2

After downloading the zeppelin-0.6.2-bin-all.tgz file onto my Linux VirtualBox machine I simply expand the file to create a “zeppelin-0.6.2-bin-all” directory. The resulting directory structure looks like this:

[Screenshot: directory structure of zeppelin-0.6.2-bin-all]


Of course you can rename the folder to something more meaningful, such as “my-zeppelin”, if you wish. Obviously, the underlying folder structure remains the same!

[Screenshot: the same directory structure under my-zeppelin]
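
For anyone who prefers the command line, the unpack and rename steps could be sketched roughly as follows. This is only an illustration: unpack_zeppelin is a made-up helper name, not something that ships with Zeppelin, and it assumes the archive expands into a folder named after the file (as it did for me with zeppelin-0.6.2-bin-all.tgz).

```shell
#!/bin/sh
# Sketch: unpack a Zeppelin binary archive and optionally rename the folder.
# unpack_zeppelin is a hypothetical helper, not part of Zeppelin.
unpack_zeppelin() {
    archive=$1                      # e.g. zeppelin-0.6.2-bin-all.tgz
    new_name=${2:-}                 # optional, e.g. my-zeppelin
    # Assumption: the top-level folder is the archive name minus ".tgz".
    dir_name=$(basename "$archive" .tgz)

    # Expand the archive into the current directory.
    tar -xzf "$archive" || return 1

    # Rename the top-level folder if a new name was supplied.
    if [ -n "$new_name" ]; then
        mv "$dir_name" "$new_name" || return 1
        dir_name=$new_name
    fi

    # Report the name of the resulting folder.
    echo "$dir_name"
}

# Usage (assumes the archive is in the current directory):
#   unpack_zeppelin zeppelin-0.6.2-bin-all.tgz my-zeppelin
```

The same helper should work for a 0.7.x archive, since the download and unzip process is the same.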

Installing Zeppelin - Version 0.7.x

The good news is that if you want to install one of the later versions of Zeppelin then the download and unzip process is exactly the same. At this point in time there are two versions of 0.7. However, both 0.7.0 and 0.7.1 seem to suffer from poor query performance when using the JDBC driver (I have only tested the JDBC driver against Oracle Database but I presume the same performance issues are affecting other types of JDBC-related connections). As with the previous version of Zeppelin you can, if required, change the default directory name to something more suitable.

Now we have our notebook software unpacked and ready to go!

Configuring Zeppelin (0.6.2 and 0.7.x)

This next step is optional. If you have installed the Zeppelin software on the same server or virtual environment that runs your Oracle Database then you will need to tweak the default configuration settings to ensure there are no clashes with the various Oracle Database services. By default, you access the Zeppelin Notebook home page via port 8080. Depending on your database environment this may or may not cause problems. In my case, this port was already being used by APEX, so it was necessary to change the default port…

Configuring the Zeppelin http port

If you look inside the “conf” directory there will be a file named “zeppelin-site.xml.template”. Rename this to “zeppelin-site.xml” and find the following block of tags:
<property>
 <name>zeppelin.server.port</name>
 <value>8080</value>
 <description>Server port.</description>
</property>
The default port setting in the conf file will probably clash with the APEX environment in your Oracle Database. Therefore, you will need to change the port setting to another value, such as:
<property>
 <name>zeppelin.server.port</name>
 <value>7081</value>
 <description>Server port.</description>
</property>
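
If you would rather script the port change than edit the file by hand, something along these lines should do it. Treat it as a sketch: set_zeppelin_port is a hypothetical helper, it copies the template instead of renaming it, and it assumes the value tag sits on the line directly after the zeppelin.server.port name tag, as in the block above.

```shell
#!/bin/sh
# Sketch: create zeppelin-site.xml from the template with a different port.
# set_zeppelin_port is a hypothetical helper; it copies the template rather
# than renaming it, so the original stays available for reference.
set_zeppelin_port() {
    conf_dir=$1     # e.g. ../my-zeppelin/conf
    port=$2         # e.g. 7081

    # Find the zeppelin.server.port <name> line, move to the next line (n)
    # and replace the numeric <value> there with the requested port.
    sed "/<name>zeppelin.server.port<\/name>/{
n
s#<value>[0-9]*</value>#<value>${port}</value>#
}" "$conf_dir/zeppelin-site.xml.template" > "$conf_dir/zeppelin-site.xml"
}

# Usage:
#   set_zeppelin_port ../my-zeppelin/conf 7081
```

Only the value that follows the zeppelin.server.port name is touched, so other port settings in the file keep their defaults.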

Save the file and we are ready to go! It is worth spending some time reviewing the other settings within the conf file that let you use cloud storage services, such as the Oracle Bare Metal Cloud Object Storage service. For my purposes I was happy to accept the default storage locations for managing my notebooks and I have not tried to configure the use of an SSL service to manage client authentication. Obviously, there is a lot more work that I need to do around the basic setup and configuration procedures which hopefully I will be able to explore at some point in time - watch this space!

OK, now we have everything in place: software, check…. port configuration, check. It’s time to start your engine!

Starting Zeppelin

This is the easy part. Within the bin directory there is a shell script to run the Zeppelin daemon:
. ../my-zeppelin/bin/zeppelin-daemon.sh start

There is a long list of command line environment settings that you can use, see here: https://zeppelin.apache.org/docs/0.6.2/install/install.html. In my Virtualbox environment I found it useful to configure the following settings:
  • ZEPPELIN_MEM: amount of memory available to Zeppelin. The default setting is -Xmx1024m -XX:MaxPermSize=512m
  • ZEPPELIN_INTP_MEM: amount of memory available to the Zeppelin Interpreter (connection) engine and the default setting is derived from the setting of ZEPPELIN_MEM
  • ZEPPELIN_JAVA_OPTS: simply lists any additional JVM options
Therefore, my startup script looks like this:
# Export the settings so that the Zeppelin daemon script can see them
export ZEPPELIN_MEM="-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m"
export ZEPPELIN_INTP_MEM="-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m"
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=8g -Dspark.cores.max=16"
. ../my-zeppelin/bin/zeppelin-daemon.sh start
 
 Fingers crossed, once Zeppelin has started the following message should appear on your command line:
 Zeppelin start                                             [  OK  ]

Connecting to Zeppelin

Everything should now be in place to test whether your Zeppelin environment is up and running. Open a browser and type the IP address/host name and port reference, which in my case is http://localhost:7081/#/, and the home page should appear:

[Screenshot: Zeppelin home page]
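
If you are working on a server without a browser, a quick way to confirm that the home page is being served is to poll the port from the command line. The sketch below is just one way of doing that: wait_for_zeppelin is a hypothetical helper and it assumes curl is installed on the machine.

```shell
#!/bin/sh
# Sketch: poll the Zeppelin home page until it responds.
# wait_for_zeppelin is a hypothetical helper; it assumes curl is installed.
wait_for_zeppelin() {
    url=$1              # e.g. http://localhost:7081/
    attempts=${2:-30}   # how many times to try
    delay=${3:-2}       # seconds to wait between tries

    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures; -s: quiet; -o: discard the page
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            echo "Zeppelin is responding at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done

    echo "No response from $url after $attempts attempts" >&2
    return 1
}

# Usage, right after starting the daemon:
#   wait_for_zeppelin http://localhost:7081/
```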

The landing pad interface is nice and simple.

In the top right-hand corner you will see a green light which tells you that the Zeppelin service is up and running. “anonymous” is my user id because I have not enabled client-side authentication. In the main section of the welcome screen you will see links to the help system and the community pages, which is where you can log any issues that you find.

The Notebook section is where all the work is done and this is where I am going to spend the next post exploring in some detail. If you are used to using a normal BI tool then Zeppelin (along with most other notebook applications) will take some getting used to because creating reports is more of a scripting-style process than the wizard-driven click-click process you get with products like Oracle Business Intelligence. Anyway, more on this later.

What is an Interpreter?

To build notebooks in Zeppelin you need to make connections to your data sources. This is done using something called an “Interpreter”. This is a plug-in which not only enables Zeppelin to use a specific query language but also provides access to backend data-processing capabilities. For example, it is possible to include shell scripting code within a Zeppelin notebook by using the %sh interpreter. To access an Oracle Database we use the JDBC interpreter. Obviously, you might want to have lots of different JDBC-based connections: maybe you have an Oracle 11g instance, a 12c R1 instance and a 12c R2 instance. Zeppelin allows you to create new interpreters and define their connection characteristics.

It’s at this point that version 0.6.2 and versions 0.7.x diverge. Each has its own setup and configuration process for interpreters so I will explain the process for each version separately. Firstly, we need to track down some JDBC files…

Configuring your JDBC files

Finally, we have reached the point of this post: connecting Zeppelin to your Oracle data warehouse. But before we dive into setting up connections we need to track down some Oracle-specific JDBC files. You will need to locate one of the following files to use with Zeppelin: ojdbc7.jar (Database 12c Release 1) or ojdbc8.jar (Database 12c Release 2).

You can either copy the relevant file to your Zeppelin server or simply point the Zeppelin interpreter to the relevant directory. My preference is to keep everything contained within the Zeppelin folder structure so I have taken my Oracle JDBC files and moved them to my Zeppelin server. If you want to find the JDBC files that come with your database version then you need to find the jdbc folder within your version-specific folder. In my 12c Release 2 environment this was located in the folder shown below:

[Screenshot: jdbc folder in my 12c Release 2 installation]


Alternatively, I could have copied the files from my local SQL Developer installation:

[Screenshot: JDBC files in the SQL Developer installation]

Take the JDBC file(s) and copy them to the /interpreter/jdbc directory within your Zeppelin installation directory, as shown below:

[Screenshot: Oracle JDBC files in the Zeppelin interpreter/jdbc directory]
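
The choice of driver file and the copy step can be sketched as a couple of small shell functions. Both oracle_jdbc_jar and install_jdbc_jar are hypothetical helpers written for this post. The mapping covers only the two releases mentioned above, and the source folder is passed in as an argument because its location depends on your own installation.

```shell
#!/bin/sh
# Sketch: pick the Oracle JDBC jar for a database release and copy it into
# Zeppelin's interpreter/jdbc folder. Both helpers are hypothetical.
oracle_jdbc_jar() {
    case $1 in
        12.1*) echo "ojdbc7.jar" ;;   # Database 12c Release 1
        12.2*) echo "ojdbc8.jar" ;;   # Database 12c Release 2
        *) echo "unsupported release: $1" >&2; return 1 ;;
    esac
}

install_jdbc_jar() {
    jdbc_lib_dir=$1     # folder that currently holds the Oracle jar
    zeppelin_home=$2    # e.g. ../my-zeppelin
    release=$3          # e.g. 12.2

    # Work out which jar is needed, then copy it next to the jdbc interpreter.
    jar=$(oracle_jdbc_jar "$release") || return 1
    cp "$jdbc_lib_dir/$jar" "$zeppelin_home/interpreter/jdbc/" || return 1

    # Print the final location; this is the path the interpreter needs later.
    echo "$zeppelin_home/interpreter/jdbc/$jar"
}

# Usage (the first argument is whatever folder holds the jar on your system):
#   install_jdbc_jar /path/to/jdbc/lib ../my-zeppelin 12.2
```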

Creating an Interpreter for Oracle Database

At last we are finally ready to create a connection to our Oracle Database! Make a note of the directory containing the Oracle JDBC file because you will need that information during the configuration process. There is a difference between the different versions of Zeppelin in terms of creating a connection to an Oracle database/PDB.

Personally, I think the process in version 0.7.x makes more sense but the performance of JDBC is truly dreadful for some reason. There has obviously been a major change of approach in terms of how connections are managed within Zeppelin and this seems to be causing a few issues. Digging around in the documentation it would appear that version 0.8.x will be available shortly, so I am hoping the version 0.7.x connection issues will be resolved!

Process for creating a connection using version 0.6.2

Start from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous”, which will reveal a pulldown menu. Select “Interpreter” as shown below:

[Screenshot: “anonymous” pulldown menu with the “Interpreter” option]

This will take you to the home page for managing your connections, or interpreters. Each query language and data processing language has its own interpreter and these are all listed in alphabetical order.

[Screenshot: list of interpreters]


Scroll down until you find the entry for jdbc:

[Screenshot: jdbc interpreter entry]

Here you will see that the jdbc interpreter is already configured for two separate connections: postgres and hive. By clicking on the “edit” button on the right-hand side we can add new connection attributes. In this case I have removed the hive and postgres attributes and added the following new attributes:
  • osql.driver
  • osql.password
  • osql.url
  • osql.user
The significance of the “osql.” prefix will become obvious when we start to build our notebooks. Essentially, this will be our reference to these specific connection details. I have added a dependency by including an artefact that points to the location of my JDBC file. In the screenshot below you will see that I am connecting to the example sales history schema owned by user sh, password sh, which I have installed in my pluggable database dw2pdb2. The listener port for my JDBC connection is 1521.

If you have access to SQL Developer then an easy solution for testing your connection details is to set up a new connection and run the test connection routine. If SQL Developer connects to your database/PDB using your JDBC connection string then Zeppelin should also be able to connect successfully. FYI… error messages in Zeppelin are usually messy, long listings of a Java program stack, so it is not easy to work out where the problem actually originates. Therefore, the more you can test outside of Zeppelin the easier life will be. At least that is what I have found!

Below is my enhanced configuration for the jdbc interpreter:

[Screenshot: jdbc interpreter configuration with the osql attributes]

The default.driver is simply the entry point into the Oracle JDBC driver, which is oracle.jdbc.driver.OracleDriver. The last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 1 driver stored in the ../zeppelin/interpreter/jdbc folder.
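
Since the actual values only appear in the screenshot, here is a rough text version of what the four osql attributes might contain. It is a sketch under assumptions: oracle_jdbc_url and print_osql_properties are hypothetical helpers, the URL uses the standard Oracle thin-driver service-name form, and I am assuming the service name matches the PDB name (dw2pdb2) on localhost.

```shell
#!/bin/sh
# Sketch: build the Oracle thin-driver JDBC URL and print the four
# osql.* attributes. oracle_jdbc_url and print_osql_properties are
# hypothetical helpers; the URL form is the standard service-name syntax.
oracle_jdbc_url() {
    host=$1; port=$2; service=$3
    echo "jdbc:oracle:thin:@//${host}:${port}/${service}"
}

print_osql_properties() {
    host=$1; port=$2; service=$3; user=$4; password=$5
    echo "osql.driver=oracle.jdbc.driver.OracleDriver"
    echo "osql.url=$(oracle_jdbc_url "$host" "$port" "$service")"
    echo "osql.user=$user"
    echo "osql.password=$password"
}

# Assumed values: localhost, listener port 1521, PDB dw2pdb2, user sh/sh.
print_osql_properties localhost 1521 dw2pdb2 sh sh
```

The name=value lines are only for reading. In Zeppelin each attribute name and value is typed into the interpreter form, and the same URL can be pasted into SQL Developer to test it first.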

Process for creating a connection using version 0.7.x

As before, start from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous”, which will reveal the pulldown menu shown below:

[Screenshot: “anonymous” pulldown menu with the “Interpreter” option]

Now with version 0.7.0 and 0.7.1 we need to actually create a new interpreter, so just click on the “+Create” button:

[Screenshot: “+Create” button on the interpreters page]

This will bring up the “Create new interpreter” form that will allow you to define the attributes for the new interpreter:

[Screenshot: “Create new interpreter” form]


I will name my new interpreter “osql” and assign it to the JDBC group:

[Screenshot: interpreter named “osql” assigned to the JDBC group]

This will pre-populate the form with the default attributes needed to define a JDBC-type connection such as:
  • default.driver: driver entry point into the Oracle JDBC driver
  • default.password: Oracle user password
  • default.url: JDBC connection string to access the Oracle database/PDB
  • default.user: Oracle username

The initial form will look like this:

[Screenshot: initial “Create new interpreter” form with the default JDBC attributes]

In my case I need to connect to a PDB called dw2pdb2 on the same server, accessed via the listener port 1521. The username is sh and the password is sh. The only non-obvious entry is the default.driver, which is oracle.jdbc.driver.OracleDriver. As before, the last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 2 driver stored in the ../zeppelin/interpreter/jdbc folder.

Once you have entered the configuration settings, hit Save and your form should look like this:

[Screenshot: saved osql interpreter settings]


Testing your new interpreter

To test that your interpreter will successfully connect to your database/PDB and run a SQL statement we need to create a new notebook. Go back to the home page and click on the “Create new note” link in the list on the left side of the screen.

[Screenshot: “Create new note” link on the home page]

Enter a name for your new note:

[Screenshot: note name dialog]

This will bring you to the notebook screen, which is where you write your scripts, in this case SQL statements. This is similar in layout and approach to many worksheet-based tools (SQL Developer, APEX SQL Worksheet, etc.). If you are using version 0.6.x of Zeppelin then you can bypass the following…

If you are using version 0.7.x then we have to bind our SQL interpreter (osql) to this new note which will allow us to run SQL commands against the sh schema. To add the osql interpreter simply click on the gear icon in the top right-hand side of the screen:

[Screenshot: gear icon in the top right-hand side of the note]

This will then show you the list of interpreters which are available to this new note. You can switch interpreters on and off by clicking on them, and for this example I have reduced the number of interpreters to just the following: markup (md), shell scripting (sh), file management (file), our Oracle SH PDB connection (osql) and JDBC connections (jdbc). Once you are done, click on the “Save” button to return to the note.

I will explain the layout of the note interface in my next post. For the purposes of testing the connection to my PDB I need to use the “osql” interpreter and give it a SQL statement to run. This takes two lines of code, as shown here:

[Screenshot: note paragraph with the interpreter line and the SQL statement]

On the right side of the screen there is a triangle icon which will execute or “Run” my SQL statement:
SELECT sysdate FROM dual

[Screenshot: “Run” icon next to the paragraph]

Note that I have not included a semi-colon (;) at the end of the SQL statement! In version 0.6.2, if you include the semi-colon (;) you will get a Java error. Version 0.7.x is a little more tolerant and does not object to having or not having a semi-colon (;).
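
If you paste statements from SQL Developer, where the trailing semi-colon is normal, it is easy to trip over this. The sketch below prints the two-line paragraph text with any trailing semi-colon stripped. zeppelin_paragraph is a hypothetical helper, and the interpreter line is passed in as an argument because its exact form depends on how the interpreter was defined: I am assuming %osql here for the interpreter created in 0.7.x, while in 0.6.2, where osql is a prefix inside the jdbc interpreter, the Zeppelin documentation describes the form %jdbc(osql).

```shell
#!/bin/sh
# Sketch: print a two-line note paragraph (interpreter line, then SQL)
# with any trailing semi-colon removed. zeppelin_paragraph is hypothetical.
zeppelin_paragraph() {
    interpreter=$1      # e.g. osql, or "jdbc(osql)" on 0.6.2
    sql=$2              # e.g. "SELECT sysdate FROM dual;"

    # Drop trailing whitespace and semi-colons from the statement.
    sql=$(printf '%s' "$sql" | sed 's/[[:space:];]*$//')

    # First line selects the interpreter, second line is the statement.
    printf '%%%s\n%s\n' "$interpreter" "$sql"
}

zeppelin_paragraph osql "SELECT sysdate FROM dual;"
```

Running this prints a line containing %osql followed by a line containing SELECT sysdate FROM dual, which can be pasted straight into the note.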

Using my VirtualBox environment, the first time I make a connection to execute a SQL statement the query takes 2-3 minutes to establish the connection to my PDB and then run the query. This is true even for simple queries such as SELECT * FROM dual. Once the first query has completed, all subsequent queries run in the normal expected timeframe (i.e. around the same time as executing the query from within SQL Developer).

Eventually, the result will be displayed. By default, output is shown in tabular layout (as you can see from the list of available icons, graph-based layouts are also available):

[Screenshot: query result in tabular layout]

…and we have now established that the connection to our SH schema is working.

Summary

In this post we have covered the following topics:
  • How to install Zeppelin
  • How to configure and start Zeppelin
  • Finding and installing the correct Oracle JDBC drivers
  • Setting up and testing a connection to an Oracle PDB

As we have seen during this post, there are some key differences between the 0.6.x and 0.7.x versions of Zeppelin in terms of the way interpreters (connections) are defined. We now have a fully working environment: Zeppelin connected to my Oracle 12c Release 2 PDB, which includes the sales history sample schema.

Therefore, in my next post I am going to look at how you can use the powerful notebook interface to access remote data files, load data into a schema, create both tabular and graph-based reports, briefing books and even design simple dashboards. Stay tuned for more information about how to use Zeppelin with Oracle Database.

If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com.

(image at top of post is courtesy of wikipedia)

