Great news came in over the weekend: My session has been accepted by Oracle Open World!
Session ID: 03183
Session Title: An Easy, High-Speed Connect Between NoSQL Databases and Oracle Database
If you are at the conference, feel free to join the session 🙂
There are lots of performance diagnostics and analysis tools on the market, of course including those from Quest Software, the company I work for.
So why do I want to introduce you to another one? One that has only recently launched and is therefore still essentially a 1.0 version?
Well, because this one is probably quite different from the ones you know. And these differences really make a difference!
- It’s free!
- You don’t need to install anything in order to use it (well, at least for parts of it 😉)
- The tool neither decides for you what good or bad performance is, nor do you have to interpret lots of numbers. Instead, it gives you a detailed picture of how your SQL Server is handling its workload, compared to how other SQL Servers around the world are handling the same or similar workloads!
- Did I mention it’s free?
Sounds like magic? But it’s straight from reality: it’s called “Project Lucy”, and it’s live: https://www.projectlucy.com
How does it work?
In a nutshell, it’s a Performance Advisory SaaS (Software-as-a-Service), combined with a data warehouse of anonymized real-world performance data, along with their associated workloads.
Be aware that YOU have to add some value in order to get some value back. Specifically, one way to use it is:
- Collect a SQL Server trace file.
- Upload the trace file in order to get a performance report back.
- Your uploaded trace data will be anonymized and added to the real-world performance data warehouse in order to make future reports even more real-world and accurate.
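If you prefer scripting the collection step over clicking through SQL Server Profiler, a server-side trace does the same job. Here is a minimal sketch; the file path, captured event and size limit are illustrative and not one of Project Lucy’s official templates:

```sql
-- Minimal server-side trace sketch (illustrative values, hypothetical path)
DECLARE @TraceID int;

-- Create the trace: write to the given file, stop at 100 MB
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\lucy_sample', 100;

-- Capture SQL:BatchCompleted (event 12) with TextData (column 1)
-- and Duration (column 13)
EXEC sp_trace_setevent @TraceID, 12, 1, 1;
EXEC sp_trace_setevent @TraceID, 12, 13, 1;

-- Start the trace; later, stop it with status 0 and close it with status 2
EXEC sp_trace_setstatus @TraceID, 1;
```

The resulting .trc file is what you would then upload in step two.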
All of this stuff is web based, so – after successfully logging in – you can access your reports any time from any web browser, and there are also plans for providing detailed trend analysis based on regular uploads.
Trace files are a standard technique that has been around in the SQL Server world for ages. Project Lucy provides suitable trace templates to make this as few clicks as possible for you: https://www.projectlucy.com/Developers/SQLServerTraceTemplates
Also, there is a description of how to fully automate/schedule trace file collection and upload: https://www.projectlucy.com/Developers/AutomatingUploadsToProjectLucy
To be honest, in the current version Project Lucy returns a detailed, browsable analysis of the trace data, but not yet a comparison with the rest of the world. This is planned as soon as a statistically meaningful amount of data is available. So, please keep the traces coming…
Here is a sample view of a trace file analysis report:
Not all performance data can be reasonably collected by a trace, as each trace also creates some overhead and a vast amount of raw data. While a trace fits best for specific performance investigations, other interfaces like dynamic management views are better suited for getting a general overview of SQL Server performance. Again, the challenge is always to interpret the numbers and decide whether it’s expected behaviour or an indication of severe performance issues…
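To illustrate the DMV approach just mentioned, a single query over sys.dm_os_wait_stats already gives a rough overview of where the server spends its wait time; interpreting the numbers remains the hard part:

```sql
-- Top wait types since the last restart (or since the stats were cleared)
SELECT TOP 10
       wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```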
For the first part – collecting performance metrics and counters – Quest Software’s Spotlight on SQL Server has been around for many years now. Especially in its Enterprise Edition, it does a 24-by-7 job of collecting various performance metrics from SQL Server and Windows in order to provide real-time alerting and incident management.
In its latest and greatest version (v8.0), Spotlight offers integration with Project Lucy. What does that mean? Well, Spotlight typically archives its data anyway, so it’s very easy to have it upload selected performance metrics to Project Lucy once a day.
Again, what’s the trick behind it?
- You give some anonymized real-world performance data to the community. Here is a detailed description of which data is uploaded: https://www.projectlucy.com/static/spotlightdataupload
- You get back a report on how your SQL Server is doing and how others with a similar workload are doing.
Here is a sample screenshot of what you get back (among various other things):
And here is another sample:
So, if you are already using Spotlight on SQL Server (either the standalone Enterprise Edition or the Desktop Edition, which is part of the Toad for SQL Server DBA Suite), you can easily leverage this to benefit from Project Lucy. It’s actually one of the main new features of this latest Spotlight version, so make sure you don’t miss it 😉
Configuration inside Spotlight is as easy as the following screenshot:
How to start?
If you want to give it a try, here are some useful links to get you started:
Project Lucy Homepage: https://www.projectlucy.com/
Video on how to explore a workload: http://www.youtube.com/v/yRzeyRKwloQ?version=3&rel=0&autoplay=1
Project Lucy on Twitter: http://twitter.com/qsftprojectlucy
Spotlight on SQL Server Trial Download: http://www.quest.com/spotlight-on-sql-server-enterprise/
It’s finally there: Amazon Web Services now provides Oracle databases as a platform service, optionally including an hourly-rate licensing model! It’s called Amazon RDS for Oracle Database.
Wait a minute… Amazon has been providing Oracle databases in the cloud for quite some time now. What’s so special and new about this RDS (Relational Database Service) thing?
In a nutshell, this is exactly the difference between IaaS (Infrastructure as a Service, Amazon EC2) and PaaS (Platform as a Service, Amazon RDS):
- An EC2 instance is nothing other than a virtual machine instance. It may be running anything, including a database server, maybe an Oracle database. It still needs someone to do all the work like upgrading/patching the OS, applying patch sets/patches/PSUs/CPUs to the Oracle database, as well as caring for things like backup, disaster recovery or replication.
- An RDS instance moves all this infrastructure related stuff into the provider’s responsibility. Specifically, with Oracle on Amazon RDS there is no OS login at all. You can either specify some maintenance intervals for DB upgrades or opt-out of any DB upgrades. Analogously, you specify backup intervals and a backup retention time, but don’t implement the actual backup. And so on… Amazon provides you with an Oracle database platform. That’s PaaS.
If you want to try out Amazon EC2 running an Oracle database, you may use this blog series as a starting point.
If you are interested in the new RDS offering and how this actually works, read on and stay tuned for more posts on this topic to follow.
For today, let’s have a short look into how this can be set up:
Setup an Oracle Database on RDS
If you are new to Amazon Web Services, feel free to use this post as a starting point to register and sign up for RDS.
After you have finished signing up for RDS, go to the AWS Management Console, click on the RDS tab, and choose the region where you want your database running. It should look something like this:
Clicking on the “Launch DB Instance” button starts a wizard. Until a week ago, the only choice here was to launch a MySQL instance. Now it looks a bit different, as entries for Oracle Standard Edition One (SE1), Standard Edition (SE) and Enterprise Edition (EE) have been added (see here for specifications of the different Oracle licensing options). As of now, RDS offers hourly-usage-based licensing for SE1 only. The other editions are available under the BYOL model only, which means “Bring Your Own License”, i.e. you need an appropriate Oracle license.
In the next step, you can
- choose the licensing model (as I just mentioned, for the time being it’s a proper choice only for SE1).
- Database version 11.2.0.2 is the first and currently the only version offered.
- Multi-AZ Deployment would be quite important for disaster recovery purposes, but it’s not yet implemented, so “No” is currently the only choice.
- With “Auto Minor Version Upgrade” you can opt-in or opt-out of having patches applied automatically.
- “Allocated Storage” determines your maximum database size (anything between 10 GB and 1 TB) and – indirectly – your storage costs. You can change this afterwards. It includes data files, temp files, online redo logs and archived redo logs (your database will be in ARCHIVELOG mode).
- “DB Instance Identifier” is the Oracle SID.
- “Master User” is the database user you can use to log in. This user has the DBA role, but the role has been somewhat restricted. More on this in a later post.
The next step asks for
- the database name (server parameter db_name), which is not especially relevant, but best practice is to make it identical to the DB instance identifier (Oracle SID) from the last step.
- the listener port. Currently it offers 3306 as the default, which seems to come from the MySQL world; the Oracle default would normally be 1521. Feel free to choose any port you like.
- As Multi-AZ deployment is not yet available, we have to select one AZ for deployment. I don’t think it makes any difference which one you choose.
- A “DB Parameter Group” is actually a set of init.ora parameters. In a later post I will talk about how to adjust them for an RDS database.
- The “DB Security Group” determines your firewall settings, i.e. which IP addresses (for clients outside EC2) or EC2 Security Groups (for clients inside EC2) can login to your Oracle instance.
The last step determines the backup retention period and the actual time windows for backups as well as other maintenance operations:
The final screen is just a summary. After another click it creates your instance.
Now go back to the overview dashboard, click on the “DB Instances” link to see the list of your RDS instances and look up the “Endpoint”, “Port” and “Instance Name” of your new instance. The endpoint should have the form:
Use these to setup a TNS name like this:
MYRDSDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = <endpoint>)(PORT = <port>))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = <instance name>)
    )
  )
You are ready to connect to your Oracle instance. And in the background it is already being backed up 🙂
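With the TNS name in place, a first login as the master user lets you verify both the connection and the promised ARCHIVELOG mode. This is just a sketch; which views you may query depends on the privileges RDS grants the master user:

```sql
-- Confirm which instance and version we reached
SELECT instance_name, version FROM v$instance;

-- The RDS database should report ARCHIVELOG mode
SELECT log_mode FROM v$database;
```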
More to come on this…
Use cases and samples
If you worked through the last step of this series, you probably ended up with some HBase, SimpleDB or other NoSQL store, attached the Toad Data Hub to it, mapped some data structures into tables and thereby opened it up to the SQL world.
Use case 1: Report data
Now we can just run a query like this:
SELECT p.* FROM myHBase.persons p WHERE p.surname LIKE 'A%'
myHBase is the MySQL database in the Data Hub, pointing to an HBase instance. “persons” is the name of the shadow table, pointing to an HBase table. So we can easily apply a well-known SQL construct to find all persons in our HBase store with a surname starting with “A”.
This query would work on any MySQL client. In addition to that, when using Toad for Cloud Databases, we could demonstrate our laziness by using its graphical Query Builder and drag/drop together queries like this (the table and columns in the screenshots are in German, my apologies for that…):
With the Toad client, all result sets can be exported to standard formats like Excel, CSV, XML, HTML, PDF or as an SQL insert script. It can even generate an Excel sheet with an auto-refresh button in it that pulls in current data when you click on it (so-called Excel Linked Queries).
Use case 2: Move data into NoSQL
We can also import data from most of these formats, i.e. let Toad create INSERT commands that the Data Hub automatically translates into the corresponding NoSQL store’s API calls.
If the data is already in another database, it’s even easier, e.g. if we attached an HBase store as well as a standard Oracle database to the Data Hub, we could just run this:
INSERT INTO myHBase.persons SELECT * FROM myOracle.persons
Use case 3: Get the best from SQL and NoSQL world
NoSQL systems are strong when it comes to massive parallel scaling and map-reduce tasks, but they are often missing adequate reporting or data mining capabilities. Typically, there aren’t even indexes! Relational systems, on the other hand, with their sophisticated data indexing, joining and grouping capabilities, are much better suited for these tasks. Therefore, if we want to report on the data, we need some ETL process to load it over from the NoSQL store into a relational warehouse or reporting system. And it couldn’t be easier, because all we need to do is:
INSERT INTO myOracle.persons SELECT * FROM myHBase.persons
There is also some wizardry for all these data export, import and moving-around functions, e.g. this screenshot shows the two essential steps when moving data from an HBase table into an Oracle table. If we need this regularly, we can save the whole wizard process as a template and easily recall it next time.
Use case 4: Feel free to join
Maybe you won’t need this on a daily basis, but it is possible and very straightforward to even join together data from different data stores. This query joins a persons table in HBase to the corresponding addresses, which are located in Oracle, and reports the number of addresses per person:
SELECT p.id, COUNT(*) FROM myHBase.persons p INNER JOIN myOracle.addresses a ON (p.id = a.persid) GROUP BY p.id
Again, if you prefer the drag/drop style, use Toad’s query builder like this to get the same result:
So it all boils down to writing well-known SQL queries or DML commands. The Data Hub takes care of the whole translation into the corresponding NoSQL API. Toad for Cloud Databases, and especially its Data Hub, act like brokers between the SQL and NoSQL worlds.
If this makes any sense to you, feel free to check it out here.
There are also some very useful demo videos floating around here.
And if you need some more detailed instructions on how to map HBase and Cassandra tables into relational tables, check these out:
Feel free to comment!
Talking SQL to NoSQL – Setting it up
In Part 1, we looked at what Cloud databases and NoSQL data stores are and why they are out there. The actual topic for this part is really Quest Software’s Toad for Cloud Databases which acts as a translator, making it possible to talk SQL to databases like Hadoop, Cassandra, SimpleDB, etc. that have no SQL engine at all.
Toad for Cloud Databases actually consists of two parts (you may also have a look at the architecture diagram):
- A “Data Hub” that does the translation job, i.e. it maps data structures from the NoSQL world to relational tables and translates SQL syntax to the NoSQL store’s specific API calls and vice versa.
- A graphical UI acting as client, connecting to the Data Hub, configuring the data hub and issuing SQL (depicted as a toad in the diagram 😉 )
As you can see, the Data Hub plays a central role in the whole concept. Technically speaking, it is a MySQL server. Why is that? Because MySQL offers an easy way to plug in your own “storage engine”. Well-known and popular storage engines include MyISAM and InnoDB. Toad’s Data Hub uses its own proprietary storage engine named “HUB”.
In this case, the name “storage engine” is a bit misleading because it stores no data at all; it only translates, i.e. it presents the NoSQL data structures as MySQL tables. I tend to call these “shadow tables” because they don’t actually contain any data, but just point to the data.
Querying a shadow table with an SQL SELECT makes the Data Hub pull the data from the NoSQL store, pipe it through the storage engine – thereby putting it into relational format – and return an SQL result set.
Issuing an SQL INSERT against a shadow table makes the Data Hub grab the values you want to insert, translate them into the NoSQL store’s format and execute the appropriate API calls. The same applies to updates and deletes.
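Updates and deletes against a shadow table follow the same pattern; for illustration (the database, table and column names here are hypothetical):

```sql
-- Both statements are translated by the Data Hub into the
-- corresponding NoSQL store's API calls
UPDATE myHBase.persons SET city = 'Berlin' WHERE id = 42;
DELETE FROM myHBase.persons WHERE id = 99;
```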
The Data Hub can be connected to an arbitrary number of different cloud, NoSQL or even conventional databases (for conventional databases it uses a standard ODBC driver). The actual mapping is as follows:
- Each connected data store is mapped to one MySQL database in the data hub MySQL server.
- Each NoSQL data structure can be mapped to a shadow table in the corresponding MySQL database. The mapping can be influenced by the user, e.g. which MySQL table column maps to which HBase column/column family.
Thereby, the whole MySQL syntax is available for querying or manipulating data in the attached data stores (HBase, Cassandra etc.)
The data hub can run on Windows, Linux or inside an Amazon EC2 instance, given of course that it has connectivity to the data stores you want to connect to. If the data stores are in different network segments with no connectivity in between, different data hubs are needed.
If the data stores in question reside in Amazon EC2, e.g. SimpleDB, it may be a good idea to run the Data Hub in EC2 as well, as this keeps network traffic and cost to a minimum (inter-instance network traffic is typically faster than traffic between an EC2 instance and the internet, and Amazon doesn’t charge for inter-instance traffic inside the same EC2 region).
Registering the data stores and mapping the data structures into relational “shadow tables” can be done either with the Windows-based graphical client (“Toad for Cloud Databases”) or with the Toad for Cloud Databases Eclipse plugin. Here is a screenshot of how an HBase data store can be registered:
After finishing the registration and mapping of tables, any MySQL client can be used to access the shadow tables, and thereby the NoSQL data. So you could just use the standard MySQL command line. The Toad for Cloud Databases graphical client offers a couple of useful features beyond pure connectivity and SQL editing, e.g. a graphical query builder, wizards for exporting, importing, comparing and syncing data, as well as reporting tools.
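Since the Data Hub is technically a MySQL server, even the plain MySQL command line reveals the mapping described above (myHBase being a hypothetical data store name):

```sql
-- One MySQL database per attached data store
SHOW DATABASES;

-- The shadow tables mapped inside one data store's database
SHOW TABLES FROM myHBase;
```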
In the next part, I will give some samples and use cases on how we can use this to build bridges between SQL databases and NoSQL stores.
Cloud databases and NoSQL data stores – what’s it about?
What is a cloud database, and why is it supposed to be useful? Actually, the idea behind it is quite similar to the rest of the cloud computing story: providing databases (and hopefully an appropriate SLA specifying availability and service quality) on a self-service basis, provisioning a whole database server in minutes instead of days, and paying per actual usage. The user (let’s call him a consumer) doesn’t necessarily need to care about how the SLA is achieved, e.g. which hardware is used, or things like clustering, replication, database or OS patching and upgrading.
Sometimes this is described as DaaS (Database as a Service), and it is a special case of PaaS (Platform as a Service). Just as a side note: this is a lot different from IaaS (Infrastructure as a Service), e.g. running an Oracle database inside an Amazon EC2 instance. In this case, the DBA still has to do all the mentioned stuff like patching, upgrading etc.
SQL Azure would be an example of a cloud database. It is very similar to SQL Server (actually, it is based on SQL Server, accessed through a gateway that abstracts away all the infrastructure-related stuff; see here for an in-depth article by Kalen Delaney). Another example would be Amazon’s RDS, which offers MySQL instances as a service and – announced for Q2/2011 – also Oracle instances as a service.
That said, what is NoSQL about? Well, another quality that is expected from a cloud is scalability or “elasticity”, i.e. easy scale-up and scale-down. Any application that needs to scale and is running on top of a database will only scale well if the underlying database does so. Imagine a given SLA that can be fulfilled by starting up one standard database instance (Oracle, SQL Server, DB2, Sybase, whatever you like) and one application server instance. Now, if your application becomes a really big success story and you have to meet the same SLA with 200 times the load, you will probably need two things:
- Start up 200 application server instances AND have good load balancing
- Start up 200 database instances AND have good load balancing
The latter point is where NoSQL comes in. Traditional databases are just not well equipped for scaling this much. Their clustering is more about high availability (failover clusters) or – like Oracle RAC – partial load balancing, as it means having several instances, but still one single database.
NoSQL data stores are not prominent for not talking SQL; that is just a side effect (actually, NoSQL is mostly interpreted as “Not Only SQL”). They are prominent because of their virtually unlimited scaling capabilities. You need more power? Start up 100 additional instances. The NoSQL store’s engine will take care of things like replication, redundant storage and load distribution. The bottom line: this is a real compute cluster, or a compute cloud.
Main drivers for NoSQL data stores in the past have been the various social media applications we see today, but also massive parallel online gaming, and other portals with 5-digit up to 9-digit user numbers. Traditional databases just have a very tough time when trying to cover these requirements with an acceptable pricing.
So, from a developer’s point of view, NoSQL data stores are just right for the job. They are specialized, lightweight engines with very simple APIs (typically no complex SQL engine), but with extreme scalability. Each NoSQL store is highly optimized for its respective purpose. Administration effort is supposed to be very low, e.g. some of these systems don’t have a backup concept at all, as they provide highly redundant storage anyway.
- HBase and Cassandra are typically called “Column Stores” or “Wide Column Stores”, where the data is physically stored column-wise, not row-wise like in traditional database systems.
- Amazon’s SimpleDB, Microsoft’s Azure Table Storage and the well-known BerkeleyDB are so-called “Key Value Stores”, i.e. they store simple key-value pairs, similar to Windows registry entries.
- MongoDB and CouchDB specialize in storing documents, specifically those in JSON format, so they are commonly called “Document Stores”.
- “Graph Databases” like Neo4J are not yet that widely adopted, but are optimized for storing and processing graph related data, e.g. for GPS systems.
This site gives a great overview on the NoSQL universe.
In the next part, I will show how to use Quest’s Toad for Cloud Databases to talk SQL to NoSQL data stores. As SQL depends on at least some relational data structure, this is easier for Column Stores and Key-Value Stores than for Document Stores. Therefore the former two types of systems will be our focus.
This post serves as a starting point for a short blog series where I will show you how to talk SQL to NoSQL data stores.
This sounds like a contradiction in itself, because most NoSQL stores don’t have an SQL interface at all (remember: NoSQL stands for “Not Only SQL”, so this is true for most, but not necessarily all, NoSQL systems).
While having no SQL interface may be quite OK or even positive from a developer’s perspective, I (with my DBA affinity) still like and prefer good old SQL, e.g. for quick checks on what’s going on in the database (sorry: it’s called a “data store” when it comes to NoSQL…). Even in the NoSQL world you may want to join or compare data from different sources, or move it around, from NoSQL to a standard SQL database or vice versa. In these cases, SQL comes in handy as it allows you to do all this with SQL joins, INSERT commands and other well-known syntax.
A great tool that works like a translator between SQL and the corresponding NoSQL data store’s API is Toad for Cloud Databases from Quest Software (my employer, actually). You can find it here, and over the next posts I will give an overview of NoSQL and how Toad can connect SQL to NoSQL.
Enjoy and feel free to comment!