Using TPC-H metrics to explain the three-fold benefit of Cascade Analytics System (I)

Zeepabyte’s solution datasheet, as well as the recent press release with IBM Power Systems, described the “three-fold benefit” of using Cascade Engine on different types of server infrastructure, clustered or not, on premises or in the cloud:


  1. Groundbreaking performance

     2-10x more queries per hour means higher productivity: less wait time in business analysts’ schedules, more frequent insights from business operations data, and faster detection of risks or intrusions.

  2. Revolutionary low cost of performance

     When such performance is achieved with 10-20 times fewer processor cores, not only does the cost of infrastructure drop dramatically, but a whole range of smaller devices becomes able to process data locally.

  3. Drastically reduced energy costs

     Extreme efficiency in using the processor’s computing power means more units of “useful analytics work” done by one CPU, which leads to 50-80% savings in the energy costs of powering the hardware infrastructure.


Let’s now back these claims with concrete numbers obtained by following the trusted methodology of the Transaction Processing Performance Council (TPC), as defined in their TPC-H benchmark specification.

The latest TPC Benchmark™ H – Decision Support – Standard Specification Revision 2.17.1 (June 2013) defines the following primary metrics:

  • The TPC-H Composite Query-per-Hour Metric (QphH@Size)
  • The TPC-H Price/Performance metric ($/QphH@Size)
  • The Availability Date of the system (when all system components are Generally Available)

Note: No other TPC-H primary metrics exist. Surprisingly, the TPC-Energy metric, defined as power per performance (Watts/KQphH@Size), is optional.

The TPC-H Composite Query-per-Hour Metric expresses the overall performance of a complete, commercially available analytic system, hardware and software together. The reported values tend to be very large numbers because the specification scales the metric by the Scale Factor (SF), or "Size", of the test. SF is a proxy for the test database size, expressed in GB (i.e., SF = 1 means a test database of approximately 1 GB). The most recent TPC-H Top 10 performance results are clustered in the SF range 1000-10000, that is, test database sizes of 1 TB to 10 TB, so the TPC-H Composite Query-per-Hour Metric runs into the hundreds of thousands and millions.

Note: The maximum size of the test database for a valid performance test is currently set at SF = 100,000 (approximately 100 TB).
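To make the composite metric less abstract, here is a minimal sketch of how the primary metrics fit together. The timing values, stream count, and system price below are invented purely for illustration; only the structure of the formulas follows my reading of the specification (Power@Size from the single-stream power test, Throughput@Size from the multi-stream throughput test, QphH@Size as their geometric mean, and price/performance as total system price divided by QphH@Size).

```python
from math import prod, sqrt

def power_at_size(sf, query_secs, refresh_secs):
    """Power@Size: 3600 * SF divided by the geometric mean of the
    22 query timings and 2 refresh timings from the single-stream power test."""
    timings = list(query_secs) + list(refresh_secs)          # 24 intervals
    geo_mean = prod(timings) ** (1.0 / len(timings))
    return 3600.0 * sf / geo_mean

def throughput_at_size(sf, streams, total_elapsed_secs):
    """Throughput@Size: total queries completed across all streams per hour,
    scaled by SF (each stream runs the 22 queries once)."""
    return (streams * 22 * 3600.0 / total_elapsed_secs) * sf

def qphh_at_size(power, throughput):
    """Composite Query-per-Hour metric: geometric mean of the two tests."""
    return sqrt(power * throughput)

# Illustrative numbers only (not a published result):
sf = 1000                                # ~1 TB test database
power = power_at_size(sf, query_secs=[12.0] * 22, refresh_secs=[30.0, 30.0])
throughput = throughput_at_size(sf, streams=7, total_elapsed_secs=3600.0)
qphh = qphh_at_size(power, throughput)
price_perf = 250_000.0 / qphh            # $/QphH@Size for a hypothetical $250k system
print(f"QphH@{sf} ~ {qphh:,.0f}, price/performance ~ ${price_perf:.2f}")
```

Note how the SF multiplier in both sub-metrics is what pushes published QphH@Size figures into the millions at terabyte scale.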

Vendors such as Dell, Cisco, HP, Lenovo, IBM, Exasol, Huawei, Actian, Microsoft, and Oracle combine their hardware and analytic database software to compete for the highest performance of the overall system. Whatever their internal data organization, database vendors must report the same performance and cost metrics, for the same standardized set of complex business requests, against the same structured datasets containing the same number of rows (from a few million to tens of billions) of randomized and refreshed data.

Cascade Analytics System has consistently delivered better performance at every Scale Factor tested so far, in partnership with different hardware vendors: 2-3 times more queries per hour than the benchmark leaders, and almost 10 times lower price per unit of performance (see the short calculation after the table below).

| Scale Factor (test database, ~GB) | Dataset size (rows) | Clustered servers | Best TPC result, QphH@Size | Cascade System, QphH@Size* | Best TPC result, $/QphH@Size | Cascade System, $/QphH@Size* |
|---|---|---|---|---|---|---|
| 100 | 600M | N | 420,092 | 1,416,786 | 0.11 | 0.01 |
| 100 | 600M | Y | 1,582,736 | 2,855,786 | 0.12 | 0.02 |
| 300 | 1.8B | N | 434,353 | * | 0.24 | * |
| 300 | 1.8B | Y | 2,948,721 | 5,683,743 | 0.12 | 0.02 |
| 1,000 | 6B | N | 717,101 | 4,107,362 | 0.61 | 0.06 |
| 1,000 | 6B | Y | 5,246,338 | 9,366,072 | 0.14 | 0.05 |
| 3,000 | 18B | N | 2,140,307 | 4,262,032 | 0.38 | 0.09 |
| 3,000 | 18B | Y | 7,808,386 | * | 0.15 | * |
| 10,000 | 60B | N | 1,336,109 | * | 0.92 | * |
| 10,000 | 60B | Y | 10,133,244 | * | 0.17 | * |
| 30,000 | 180B | N | 1,056,164 | * | 2.04 | * |
| 30,000 | 180B | Y | 11,223,614 | * | 0.23 | * |
| 100,000 | 600B | N | – | * | – | * |
| 100,000 | 600B | Y | 11,612,395 | * | 0.37 | * |
* Benchmarks for the remaining configurations and higher Scale Factors will be run as soon as the opportunity presents itself and more infrastructure vendors enter into partnerships for such tests.
"Any TPC-H result is comparable to other TPC-H results regardless of the number of query streams used during the test as long as the scale factors chosen for their respective test databases were the same."  (Clause 5.4.5.2 of TPC Benchmark™ H - Decision Support - Standard Specification Revision 2.17.1)

The following graphs show (bold markers) how Cascade Analytics System fared in comparison with the TPC-H Top Ten Price/Performance Results, Version 2, as of 5-Aug-2017 at 3:06 AM [GMT].

Most analytics systems participating in the TPC-H benchmark are traditional non-clustered datacenter systems that scale up by adding faster processors with more cores and larger amounts of RAM.

However, cloudification, mobility, and the geographical distribution of IT infrastructure increase demand for clustered and even elastic systems, which scale out (on demand) by adding nodes to the managed computing infrastructure.

Cascade Engine outperforms other database vendors by far on non-clustered infrastructure and shows signs of a healthy linear scaling trend on clustered infrastructure (more on this behavior in upcoming posts!).

These are very strong results!  But they are not enough.

More practical questions must be answered in relation to these results:

  • How does “almost 10 million queries per hour against a 1 TB TPC-H compliant test database” benefit business analysts whose reports must complete in less than a few seconds, or who need to run ad-hoc queries for quick business insights without any idea of how many computing resources that type of query needs or how long the data refresh will take?
  • Will I get the same performance if my data is on Hadoop, or on my corporate-approved hardware and software?
  • Isn’t the extremely low cost of performance at terabyte dataset sizes due to the analytic system using an open source operating system and ancillary functions?

Just like many other well-known DBMS providers, Cascade Analytics System, as benchmarked on TPC-H metrics, uses open source software components selected for their proven maturity, feature performance, customer footprint, and continuity of community support. Cascade Engine is written in Java and deploys on open source Linux; so far either Ubuntu or CentOS has been used. Cluster operations of Cascade Analytics System are well supported by Apache HBase on top of Hadoop and HDFS, and also by its own capabilities based on the Apache Thrift RPC mechanism. However, Cascade Engine does not need a proprietary cluster operating system or a heavyweight Database Management System (DBMS). Read all about Cascade architecture, functional components, and core features on the product documentation webpage.
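For readers unfamiliar with the HBase-over-Thrift pattern mentioned above, here is a generic sketch of a client talking to HBase through its Thrift gateway using the open source happybase library. It is purely illustrative of the stack components named in this paragraph; it is not Cascade’s API, and the hostname, table, and column names are made up.

```python
import happybase  # open source Python client that talks to the HBase Thrift gateway

# Connect to an HBase Thrift server (hostname and port are placeholders).
connection = happybase.Connection("hbase-thrift.example.internal", port=9090)

table = connection.table("events")  # hypothetical table name

# Write one row: HBase stores values under column-family:qualifier keys.
table.put(b"row-2017-08-05-0001", {
    b"metrics:queries_per_hour": b"1416786",
    b"metrics:scale_factor": b"100",
})

# Scan a key range and print what comes back.
for row_key, columns in table.scan(row_prefix=b"row-2017-08-05"):
    print(row_key, columns)

connection.close()
```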

While Cascade uses many open source software components and runs with very high performance even on the cheapest, lightest computing boards (see our results on the Raspberry Pi 2!), the secret sauce behind Cascade Engine’s extreme performance in speed and cost is its patent-pending data encoding and retrieval methods.

But even with the measured TPC-H numbers in hand, showing how much higher the performance is and how many times lower the price for that performance is, there is still a long way to go before the derived business benefits become clear to our customers and investors.

Why is that?

Data warehouse (DWH) practitioners do not work with metrics such as “Query-per-hour-@-datasize”. Most of them instead work with metrics such as:

(1) data size and its growth per day/month/year

(2) the number of queries business analysts need to run in one day

(3) the type of queries and the type and availability format of data sources

(4) how long people or applications can wait for the analytics system to process certain types of queries.

They may also know business-related constraints such as: (5) how fast the data is refreshed, or (6) the upload time for daily new operational data into the analytics system. These two latter metrics become important as DWH operations must include optimizations for “big data” characteristics, particularly large volume, variety, and velocity, for example by organically growing a lower-cost Hadoop environment as part of the branded Database Management System (DBMS) infrastructure (see an interesting DWH Cost Saving Calculator). One way to write these six operational metrics down so they can later be compared against benchmark figures is sketched below.
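A minimal sketch, assuming nothing beyond the six metrics listed above: capturing them in a small data structure makes explicit what a practitioner actually knows before any benchmark figure enters the conversation. The field names and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DwhWorkloadProfile:
    """The six operational metrics DWH practitioners typically know (items 1-6 above)."""
    data_size_gb: float              # (1) current data size ...
    daily_growth_gb: float           # (1) ... and its growth per day
    queries_per_day: int             # (2) queries business analysts need to run per day
    query_types: list[str]           # (3) type of queries and data-source formats
    max_wait_seconds: float          # (4) how long people/apps can wait for an answer
    refresh_interval_minutes: float  # (5) how fast the data is refreshed
    daily_upload_minutes: float      # (6) upload time for daily new operational data

# Hypothetical example profile:
profile = DwhWorkloadProfile(
    data_size_gb=3_000,
    daily_growth_gb=25,
    queries_per_day=12_000,
    query_types=["star-schema reports", "ad-hoc drill-downs"],
    max_wait_seconds=5,
    refresh_interval_minutes=60,
    daily_upload_minutes=45,
)
print(profile)
```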

How many vendors connect their highly touted TPC-H performance results with these practical attributes of DWH analytic systems? Most technical marketing materials from large analytic system vendors build on use cases: problems, solutions based on the technology, and the cost-benefit of the solution. Fetching a similar use case out of a large portfolio is a vendor’s best bet against the risk of confusing a prospective customer with the metrics of the TPC-H benchmark!

But is this handy similarity – and our very human habit of learning and judging by analogy – still valid at “large scale” (or large numbers)? Even domain experts have difficulty reasoning about very large numbers and sizes, or making projections over very distant time horizons. Extrapolation using simple linear models – the kind that built the foundations of business management (in a static economic environment) as well as of semiconductor electronics (amplifying small signals around a gate’s threshold voltage) – fails at large-scale inputs.

Making decisions about the performance characteristics of a big data analytics system based on use-case analogy is extremely error prone. Making purchase decisions without understanding the nature of all the sources of cost in the analytic system is suicidal.

Using the TPC-H benchmark only for the greatness of its primary metrics, without revealing which system characteristics support and sustain performance at larger and larger data sizes, is a missed opportunity. The next posts will explain how to translate these metrics into real data analytics business metrics and give concrete answers to DWH practitioners’ questions.

Stay tuned and do not forget to download and use the trial version of Cascade Engine!

 

Author: Lucia Gradinariu

On track with our mission to slash the cost of Business Intelligence in near real-time big data analytics use cases

Each member of Zeepabyte’s founding team has worked for more than 15 years with large-scale databases and infrastructure, serving data-driven business intelligence to operational teams and processes. Whether the platform was Essbase, Informix, or Oracle’s warehouses, each of us aligned data from relational and semi-structured data sources, built complex schemas, partitioned data, maintained indices, or dug into the guts of database engine optimizers and network protocols to speed up business reports and execute ad-hoc queries right on time to support decision making.

We knew all too well that our expert efforts added high operating costs to the heavy bills of software and hardware licensing and maintenance. But we were able to solve the “cost of performance” equation and managed to improve response times for our most demanding customers in the airline, financial, and telecommunications industries.

However, our frustration with analytic database technologies grew over time because most of them approached “infinite cost” while query response times plateaued far from near real-time (seconds). More often than we would like, important queries hit incomprehensible, insurmountable snags within increasingly distributed, interdependent, and auto-magically managed software stacks.

The flows of data to be scrutinized for business intelligence multiplied, diversified, and started to exhibit hard-to-tame dynamics. Big data analytics extended and expanded traditional analytic database systems, pushing a stream of technology innovations eager to appear on market analysts’ maps, quadrants, and other visuals for rapid categorization, sorting, and ranking. A good collection can be downloaded here.

The new “Business Intelligence Architect” role is in demand but also in pain. Today, our friends and former brothers in arms, business analysts and expert DBAs, maps in hand, pave the battlefield between the governance guards overseeing enterprise data assets in the enterprise warehouse and the pressing business strategies in need of real-time actionable insights to compete, with increased agility, in crowded and dynamic marketplaces.

“What is the real cost of Business Intelligence and Real-Time Analytics? Let me tell you what it takes to answer a critical business question these days,” said one of my friends who builds OLAP cubes for Product Marketing BI at a major US telecommunications company.

“First I have to find the data sources for the BI report. There are hundreds of databases around the company connected to the Data Warehouse, and almost every time I need access to one of them because the data there is either newer or missing from the Data Warehouse. It takes days to find the owner of a database and get that access.

Sometimes I have to cross the swamp to get to the Data Lake, the innovation lab, and create a new data source from social media, data anonymizing engines, or IoT edge devices.

Then I have to pass the scrutiny of the gatekeeper and, if lucky, filter a data view against which to align the new data or the new dimension. I could remove or reuse some of the old cubes to save storage space, but the time cost of getting the data access back and rebuilding them, in this environment, is prohibitive.

I do all the query optimization work back in my swamp, because there is no tolerance for uncertainty over how long a query will run or how much memory and CPU will be taken out of the daily operating budget of the IT department (on premises or in the cloud). Something unexpected happens the first time I run a new query.”

A BI Architect’s data swamp consists of numerous connectors into the Data Warehouse, thousands of OLAP cubes, hundreds of Data Marts, and one tap into the big Data Lake hanging in the cloud, with torrents of data coming in from uncontrollable edges.

Driving near real-time query performance in this data swamp feels about as easy as speeding up the Orinoco River’s flow in this aerial picture taken in 2001!

Zeepabyte was born from Alex’s innovative idea about how to encode and search data at blazing speeds without burning hundreds of CPUs. But mastering the variables of the cost-of-performance equation across Business Intelligence analytics data swamps? We needed an objective framework to measure expected performance metrics and gauge all sources of cost when running complex business queries against datasets that mirror the nature, organization, and scale of Business Intelligence and IoT analytics use cases.

TPC-H has reigned over benchmarking “the cost of performance” of analytic systems used by business organizations.

“[…]This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.”

 

In less than two years, using TPC-H methodology, tools and metrics, we tested Zeepabyte’s first Cascade Analytics System implementation with large infrastructure partners such as Mellanox Technologies Lab in Silicon Valley and IBM’s Power Development Cloud.

We slashed costs and achieved near real-time performance on the Star Schema Benchmark early on. Recently, the Cascade Zippy Analytics System, running Version 2.0 of Zeepabyte’s Cascade Engine, queried 3 TB of data in less than 3 seconds on IBM Power Systems.

We have packaged version 2.0 of Zeepabyte’s Cascade Engine and published extensive documentation so you can download it and try its three-fold benefit in your own use case.

Let us know how you are doing! The Zeepabyte Cascade Trial Forum is now open to support your experience and gather your feedback on our product.

Author: Lucia Gradinariu

Welcome!

Welcome to Zeepabyte Cascade Technology and Big Data Analytics blog!

Please meet Zeepabyte, a young Silicon Valley startup in the Big Data Analytics domain. The mission of Zeepabyte is to improve the data ecology and the world ecology.

Well, forget global warming for a moment; how about the local warming that hundreds or thousands of computers inevitably create in your data lab? You can battle it with expensive fans and air conditioners and desperately watch your hardware and energy bills grow. Or you can use Zeepabyte’s Cascade Analytics System to solve your data analysis problems in the most efficient manner: you will need less hardware than with other systems, and you will achieve much better performance.

Did you know that Zeepabyte’s Cascade has by far the lowest cost of performance in the industry? This means you can get answers to your questions faster and cheaper than with competing systems.

What does Cascade do, and how is it able to achieve revolutionary improvements in cost of performance in analytic tasks? Revisit this blog often to get a better idea of the technology breakthrough that Zeepabyte is bringing to the Big Data world.

Author: Alex Russakovsky