Using TPC-H metrics to explain the three-fold benefit of Cascade Analytics System (II)

The first blog on this topic introduced the TPC-H Composite Query-per-Hour Performance metric (QphH@Size), a formula that folds in the size of the dataset against which vendors must benchmark their systems (hence the very large numbers), and the TPC-H Price/Performance metric, expressed as the ratio between the total cost of the analytic system and QphH@Size.

TPC’s metrics make it possible to compare systems objectively, by the numbers, as they run the standard set of queries against datasets of the same size and distribution. By those numbers, Cascade Engine showed outstanding performance.

This post drills further into the TPC-H methodology to recover metrics with more practical meaning for BI system users, such as the average query response time of a TPC-H benchmarked analytic system and the cost of the resources needed to sustain performance at larger data scales.

Business analysts care about speed and its sustainability as data size or the number of concurrent sessions goes up. They expect the analytic system to maintain its response speed at scale (“scalability”).

Note: BI analysts have yet another concern – the speed of an “ad-hoc” query, a query that has not been run before and which, no matter how well tuned the query processing engine is, can default to a full scan of the data, the slowest and most expensive operation in terms of time and resource consumption.

The query processing speed of a system, although not a “primary metric”,  is embedded in the performance test required by the TPC-H benchmark methodology.

The performance test consists of two runs of a well-defined, representative set of queries, issued serially within each user session:
1. a "Power test", to measure the raw query execution power of the system when connected with a single active user.
2. a "Throughput test", to measure the ability of the system to process the most queries in the least amount of time, through two or more sessions, one session per query stream, with each stream executing its queries serially (i.e., one after another).

The geometric mean of query execution time is captured by TPC-H Power per hour@SF, a secondary metric. The mean itself, in seconds, is calculated as (3600*SF)/TPC-H Power.
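As a quick sanity check of that formula, here is a minimal Python sketch (the Power value is an illustrative assumption, not a published result):

    SF = 100              # Scale Factor (~100 GB test database)
    tpch_power = 430_000  # illustrative TPC-H Power@SF value (assumption)

    # TPC-H Power = (3600 * SF) / geometric mean of query times, therefore:
    geometric_mean_seconds = (3600 * SF) / tpch_power
    print(f"{geometric_mean_seconds:.2f} s per query (geometric mean)")  # ~0.84 s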

The calculated speed and the computing resources (memory and processing power) supporting it are tabulated below. Cascade System is compared against the same TPC-H Top Ten Price/Performance Results, Version 2 (results as of 25-Aug-2017 at 6:00 AM [GMT]), taking the top cluster and non-cluster performer at each Scale Factor. At one glance, for the Scale Factors at which it was benchmarked, Cascade Engine was twice as fast while using two to five times fewer computing resources, without even counting the data encoding and compression advantage that reduces the amount of RAM needed to keep all data “in-memory”.

Top TPC-H benchmarked systems by average speed of query response time (seconds)

                                      Best results by TPC                    Cascade Engine System
SF (~GB)  DB size (rows)  Cluster     Speed (s)  RAM (GB)  #proc  #cores     Speed (s)  RAM (GB)  #proc  #cores
100       600M            N           0.78       64        2      16         0.20       64        1      8
100       600M            Y           0.28       96        12     120        0.18       192       6      48
300       1.8B            N           2.09       128       2      16         *          *         *      *
300       1.8B            Y           0.47       288       24     240        0.26       256       8      64
1000      6B              N           4.09       512       2      44         0.87       960       2      20
1000      6B              Y           0.82       640       40     400        0.40       384+      10     80
3000      18B             N           5.02       3072      4      96         2.11       960+      2      20
3000      18B             Y           1.59       1792      56     560        *          *         *      *
10000     60B             N           20.27      6144      4      112        *          *         *      *
10000     60B             Y           3.91       5440      68     680        *          *         *      *
30000     180B            N           75.04      12800     8      144        *          *         *      *
30000     180B            Y           10.31      12800     80     800        *          *         *      *
100000    600B            N           _          _         _      _          *          *         *      *
100000    600B            Y           30.63      38400     100    1000       *          *         *      *

+ RAM shortage forced some data to be stored on disk; with all data in memory, the calculated “speed” could be even better (a lower response time).

* The query processing speed (geometric mean) in the table above is calculated as (3600*SF)/TPC-H Power and expressed in seconds. The TPC-H benchmark methodology requires a "power" test, measuring the raw query "execution power" of the system when connected with a single active user; its result is the secondary metric TPC-H Power per hour@SF, from which the geometric mean of query execution time is derived.
The other secondary metric is TPC-H Throughput, measured during the throughput test. Its values demonstrate the ability of the system to process the most queries in the least amount of time (performance). The total elapsed time of the throughput test can be recovered from it as SF*(S*22*3600)/TPC-H Throughput, where S is the number of query streams.
In performance comparisons, geometric means are a more objective representation of system capabilities than arithmetic means.
"The TPC believes that comparisons of TPC-H results measured against different database sizes are misleading and discourages such comparisons."
The TPC-H Composite Query-per-Hour Metric (QphH@Size) is the geometric mean of TPC-H Power and TPC-H Throughput.
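For readers who want to reproduce these derivations, here is a minimal Python sketch (S, Power and Throughput values are illustrative assumptions, not published results):

    from math import sqrt

    SF = 100              # Scale Factor
    S = 5                 # number of query streams in the throughput test (assumed)
    power = 430_000       # TPC-H Power@SF (illustrative)
    throughput = 410_000  # TPC-H Throughput@SF (illustrative)

    # Geometric mean of query execution time (seconds), from the power test
    geom_mean_s = (3600 * SF) / power

    # Total elapsed time of the throughput test, from the throughput metric:
    # Throughput = (S * 22 * 3600 * SF) / elapsed  =>  solve for elapsed
    elapsed_s = (S * 22 * 3600 * SF) / throughput

    # Composite primary metric: geometric mean of the two secondary metrics
    qphh = sqrt(power * throughput)

    print(f"geometric mean query time: {geom_mean_s:.2f} s")
    print(f"throughput test elapsed time: {elapsed_s:.0f} s")
    print(f"QphH@Size: {qphh:,.0f}")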

As we “unbundled” the system speed from the TPC-H metrics, it became clear that Cascade Engine’s outstanding performance is achieved with substantially fewer hardware resources (CPU and RAM). These are also the most expensive components of a server unit; using fewer of them therefore yields the “lowest cost of performance” benefit claimed by Zeepabyte.

Each vendor participating in the TPC-H benchmark publishes the technical specifications of the hardware used. When the TPC-H Composite Performance metric values, “Query-per-Hour@Size” (equivalently, per Scale Factor), are plotted per unit of computation, either processor “core” or “thread”, Cascade Analytic System (Z) demonstrates “sustainable scalability” with data volume and a well-engineered multi-threaded implementation of its data search algorithm.
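For instance, using the published figures from the tables in these two posts (the per-core arithmetic below is ours; the QphH values and core counts come from the SF=100 non-clustered results):

    # QphH per processor core, from the SF=100 non-clustered results
    best_tpc = {"qphh": 420_092, "cores": 16}
    cascade  = {"qphh": 1_416_786, "cores": 8}

    for name, r in (("best TPC result", best_tpc), ("Cascade Engine", cascade)):
        print(f"{name}: {r['qphh'] / r['cores']:,.0f} QphH per core")
    # -> best TPC result: 26,256 QphH per core
    # -> Cascade Engine: 177,098 QphH per core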

Why does “sustainable scalability” matter?

When business analysts are asked to deliver faster reports from more data, the pressure rises on the mid-level managers responsible for business technology to procure more “scalable” analytic solutions. Under tight implementation deadlines there is no time to drill into arcane benchmark results from various vendors and design “scalable solutions” for their own business. Half a century of conditioned IT thinking in “support systems” pushes for expanding access to centralized DWHs by building smaller data warehouses for each department, which then spawn hundreds of data marts and thousands of OLAP cubes.

Not surprisingly, this is an uncanny way to solve the “scalability” problem only temporarily: each of these smaller systems can be expanded with additional storage and computing resources at its own level (and at departmental expense!) up to a point of diminishing returns, where (or when) most of them become too slow and too expensive to expand year over year. That is the point when business analysts start complaining and a new analytic system is procured. At the level of the overall supporting system, this is just another island of temporary performance relief, because the reality is an ever wider “data swamp”.

Data processing in analytic systems is algorithmic, through software. On a large data set, software can process different parts of it independently, on multiple CPUs. It is therefore important to distinguish “scalability” achieved by adding a certain amount of resources to sustain some system performance parameter (e.g. TPC-H QphH@Size) from the efficient use of those computing resources (i.e., the quality of the software implementation).

This idea is brilliantly explained in "Scalability - but at what COST?", a paper presented at the HotOS 2015 conference by three ex-Microsoft researchers who work on distributed computing systems and data parallelism. They surveyed measurements of data-parallel systems touting impressive scalability features, implemented the same data processing algorithms as single-threaded programs, and reproduced the benchmark conditions to determine to what degree the "scalable" parallel implementations truly improve performance, as opposed to merely parallelizing the overheads that they themselves introduce. Defining the "COST" of data processing with a given algorithm as the hardware configuration required before the platform outperforms a competent single-threaded implementation, they found that many systems have a surprisingly large COST; that is, they used hundreds of cores just to match the performance of the single-threaded implementation.
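To make that definition concrete, here is a small Python sketch (the runtimes below are made-up illustrations, not measurements from the paper):

    # Single-threaded baseline runtime for a job, in seconds (assumed)
    single_thread_baseline = 300.0

    # (cores, runtime) measurements for a hypothetical "scalable" parallel system
    parallel_runs = [(16, 950.0), (32, 520.0), (64, 310.0), (128, 240.0), (256, 180.0)]

    # COST = the hardware configuration required before the platform
    # outperforms a competent single-threaded implementation
    cost = next((c for c, t in parallel_runs if t < single_thread_baseline), None)
    print(f"COST = {cost} cores" if cost else "COST is unbounded")  # -> COST = 128 cores

The hypothetical system above does “scale” (its runtime keeps dropping as cores are added), yet it needs 128 cores just to beat one well-used core.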

In a natural analogy, a well-engineered analytics system should behave like a watermill: the water wheel turns faster and yields more power as more water flows through, and an additional wheel is engaged only when the previous one reaches its maximum speed.

Cascade Analytic System, benchmarked in both cluster and non-cluster configurations, has these natural characteristics. The following graph illustrates its behavior on three different types and generations of processors.

The next post will drill further into the benefits of efficient use of computation power translated into real cost advantage at very large data volumes and analytic workloads.

Stay tuned and do not forget to download and use the trial version of Cascade Engine!

Author: Lucia Gradinariu

Using TPC-H metrics to explain the three-fold benefit of Cascade Analytics System (I)

Zeepabyte’s solution datasheet, as well as the recent press release with IBM Power Systems, qualified the “three-fold benefit” of using Cascade Engine on different types of server infrastructure, clustered or not, on premises or in the cloud:


  1. Groundbreaking performance

    2-10x more queries per hour means higher productivity, that is: less wait time in business analysts’ schedules, more frequent insights from business operations data, faster detection of risks or intrusions.

  2. Revolutionary low cost of performance

    When such performance is achieved with 10-20 times fewer processor cores, not only does the cost of infrastructure melt down, but a whole range of smaller devices becomes able to process data locally.

  3. Drastically reduced energy costs

    Extreme efficiency in using the processor’s computing power means more units of “useful analytics work” done by one CPU, which leads to 50-80% savings in the energy costs of powering the hardware infrastructure.


Let’s now back these claims with concrete numbers obtained following the trusted methodology of the Transaction Processing Performance Council (TPC), as per their TPC-H benchmark specification.

The latest TPC Benchmark™ H – Decision Support – Standard Specification Revision 2.17.1 (June 2013) defines the following primary metrics:

The TPC-H Composite Query-per-Hour Metric (QphH@Size)
The TPC-H Price/Performance metric ($/QphH@Size)
The Availability Date of the system (when all system components are Generally Available)

Note: No other TPC-H primary metrics exist. Surprisingly, the TPC_Energy metric, defined as power per performance (Watts/KQphH@Size), is only optional.

The TPC-H Composite Query-per-Hour Metric expresses the overall performance of a complete and commercially available analytic system, hardware and software. The reported values tend to be very large numbers because the specification mandates multiplication by the Scale Factor (SF), or "Size", of the test. SF is a proxy for the test database size, expressed in GB (i.e., SF = 1 means approximately 1 GB for the test database). The most recent TPC-H Top 10 performance results cluster in the SF range 1000-10000, that is, test database sizes of 1-10 TB, hence the TPC-H Composite Query-per-Hour Metric gets into the hundreds of thousands and millions.
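To make the magnitudes concrete, here is a small Python sketch of how the primary metrics relate (all figures below are illustrative assumptions, not published results):

    # Illustrative inputs -- not published benchmark results
    qphh_at_size = 4_000_000        # QphH@Size at SF = 1000 (~1 TB test database)
    total_system_price = 240_000.0  # total system price, USD (assumed)
    power_draw_watts = 2_500.0      # average power draw during the test (assumed)

    # Primary price/performance metric: $/QphH@Size
    price_performance = total_system_price / qphh_at_size

    # Optional TPC_Energy metric: Watts per thousand QphH@Size
    tpc_energy = power_draw_watts / (qphh_at_size / 1000)

    print(f"${price_performance:.2f}/QphH@Size")   # -> $0.06/QphH@Size
    print(f"{tpc_energy:.3f} W/KQphH@Size")        # -> 0.625 W/KQphH@Size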

Note: The maximum size of the test database for a valid performance test is currently set at 100000 (i.e., SF = 100,000).

Vendors such as Dell, Cisco, HP, Lenovo, IBM, Exasol, Huawei, Actian, Microsoft, or Oracle combine their hardware and analytic database software to compete for the highest performance of the overall system. Notwithstanding their own internal data organization, database vendors must report the same performance and cost metrics, for the same standardized set of complex business requests, against the same structured datasets containing the same number (from millions to tens of billions) of rows of randomized and refreshed data.

Cascade Analytics System consistently proved better performance at every SF tested so far, in partnership with different hardware vendors: 2-3 times more queries per hour than the benchmark leaders, and lower cost, with almost 10 times lower price per unit of performance than those leaders.

                                          QphH@Size                              $/QphH@Size
SF (~GB)  Dataset size (rows)  Clustered  Best result by TPC  Cascade System*    Best result by TPC  Cascade System*
100       600M                 N          420,092             1,416,786          0.11                0.01
100       600M                 Y          1,582,736           2,855,786          0.12                0.02
300       1.8B                 N          434,353             *                  0.24                *
300       1.8B                 Y          2,948,721           5,683,743          0.12                0.02
1000      6B                   N          717,101             4,107,362          0.61                0.06
1000      6B                   Y          5,246,338           9,366,072          0.14                0.05
3000      18B                  N          2,140,307           4,262,032          0.38                0.09
3000      18B                  Y          7,808,386           *                  0.15                *
10000     60B                  N          1,336,109           *                  0.92                *
10000     60B                  Y          10,133,244          *                  0.17                *
30000     180B                 N          1,056,164           *                  2.04                *
30000     180B                 Y          11,223,614          *                  0.23                *
100000    600B                 N          _                   *                  _                   *
100000    600B                 Y          11,612,395          *                  0.37                *
* Benchmark tests at higher Scale Factors will occur as soon as the opportunity presents itself and more infrastructure vendors enter into partnerships for such tests.
"Any TPC-H result is comparable to other TPC-H results regardless of the number of query streams used during the test as long as the scale factors chosen for their respective test databases were the same."  (Clause 5.4.5.2 of TPC Benchmark™ H - Decision Support - Standard Specification Revision 2.17.1)

The following graphs show (bold shapes) how Cascade Analytics System fared in comparison with the TPC-H Top Ten Price/Performance Results, Version 2, as of 5-Aug-2017 at 3:06 AM [GMT].

Most analytics systems participating in the TPC-H benchmark are traditional non-clustered datacenter tower systems, which increase their resources by adding faster processors with more cores and larger RAM sizes.

However, the cloudification, mobility and geographical distribution trends of IT infrastructure increase demand for clustered and even elastic systems, which scale out (on demand) by adding nodes to the managed computing infrastructure.

Cascade Engine outperforms other database vendors by far on non-clustered infrastructures and shows signs of a healthy linear trend on clustered ones (more to come on this behavior in the next posts!).

These are very strong results!  But they are not enough.

More practical questions must be answered in relation to these results:

  • How does “almost 10 million queries per hour against a 1TB TPC-H compliant test database” benefit business analysts? All they care about is that their reports take less than a few seconds, or that when they run an ad-hoc query they are not automatically kicked out of the DWH because the query takes up too many IT resources and slows down core business operation processes.
  • Will I get the same breathtaking performance when I use Cascade Engine with my data on Hadoop or on other corporate-approved hardware/software?
  • Isn’t the extremely low cost of performance due to Cascade Analytic System using an open source operating system, or to some deep discount on your software hidden in TPC-H’s complex formulae?

Let’s clarify from the start: just like many other well-known DBMS providers, Cascade Analytics System, as benchmarked on TPC-H metrics, uses open source software components selected for their proven maturity, feature performance, customer footprint and continuity of community support. Cascade Engine is written in Java and deploys on open source Linux; so far either Ubuntu or CentOS has been used. Cluster operations of Cascade Analytic System are well supported by Apache HBase on top of Hadoop and HDFS, and also by its own capabilities based on the Apache Thrift RPC mechanism. However, Cascade Engine does not need a proprietary cluster operating system or a heavy Database Management System (DBMS). Read all about Cascade’s architecture, functional components and core features on the product documentation webpage.

Our customers and multiple trials demonstrated that, while Cascade uses open source OSes and software components, the secret sauce behind Cascade Engine’s extreme performance in speed and cost is its patent-pending data encoding and retrieval methods. The proof will be discussed at length in these posts, with data collected from implementations of Cascade Analytic Systems on branded or unbranded servers, but also on very cheap and light computing boards (see our results on the Raspberry Pi 2!).

Even with the TPC-H numbers in hand to show how much higher the performance is and how many times lower the price of that performance, there is still a long way to go before the derived business benefits become clear to our customers and investors.

Why is that?

Data warehouse (DWH) practitioners do not work with metrics such as “Query-per-Hour@Size”. Most of them work with metrics such as:

(1) data size and its growth per day/month/year

(2) the number of queries business analysts need to run in one day

(3) the type of queries and the type and availability format of data sources

(4) for how long people or applications can wait for the analytics system to process certain types of queries.

They may also know business-related constraints such as: (5) how fast the data is refreshed, or (6) what the upload time is for daily new operational data into the analytics system. These two latter metrics become important as DWH operations must include optimizations for “big data” characteristics, particularly large volume, variety and velocity, for example by organically growing a lower-cost Hadoop environment as part of the branded Database Management System (DBMS) infrastructure (see an interesting DWH Cost Saving Calculator).
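As a first-order bridge between QphH@Size and metrics (1), (2) and (4) above, here is a rough screening sketch in Python (the workload figures are assumptions; a real comparison must also match the Scale Factor to the actual data size and the query mix to TPC-H’s):

    # A practitioner's workload, in their own metrics (assumed figures)
    queries_per_day = 50_000   # (2) queries business analysts run in one day
    work_hours_per_day = 10    # window in which those queries arrive

    qphh_at_size = 1_400_000   # benchmarked QphH at a comparable data size

    required_qph = queries_per_day / work_hours_per_day  # 5,000 queries per hour
    headroom = qphh_at_size / required_qph

    print(f"required: {required_qph:,.0f} q/h; benchmarked: {qphh_at_size:,.0f} q/h")
    print(f"headroom factor: ~{headroom:,.0f}x")  # -> ~280x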

How many vendors have connected their highly touted TPC-H performance results with these practical attributes of DWH analytic systems? Most technical marketing materials of large analytic system vendors build on use cases: problems, solutions based on the technology, and the cost-benefit of the solution. Fetching a similar use case out of a large portfolio is a vendor’s best bet against the risk of a prospective customer getting confused by the metrics of the TPC-H benchmark!

But is this handy similarity – and our very human habit of learning and judging by analogy – still valid at “large scale” (or large numbers)? Even domain experts have difficulty thinking in very large numbers and sizes or making projections at very distant times. Extrapolation using simple linear models, those which built the foundation of business management (in a static economic environment) as well as of semiconductor electronics (amplifying small signals around a gate’s opening voltage), fails at large-scale inputs.

Making decisions about the performance characteristics of a big data analytics system based on use-case analogy is extremely error prone. Making purchase decisions without understanding the nature of all the sources of cost in the analytic system is suicidal.

Using the TPC-H benchmark only for the greatness of its primary metrics, without revealing which system characteristics support and sustain performance at larger and larger data sizes, is a missed opportunity. The next posts will explain how to translate these metrics into real data analytics business metrics and give concrete answers to DWH practitioners’ questions.

Stay tuned and do not forget to download and use the trial version of Cascade Engine!

 

Author: Lucia Gradinariu

On track with our mission to slash the cost of Business Intelligence in near real-time big data analytics use cases

Each member of Zeepabyte’s founding team has worked for more than 15 years with large-scale databases and infrastructure, serving data-driven business intelligence to operational teams and processes. Whether it was Essbase, Informix or Oracle warehouses, each of us aligned data from relational and semi-structured data sources, built complex schemas, partitioned data and maintained indices, or penetrated the guts of database engine optimizers and network protocols to speed up business reports and execute ad-hoc queries right on time to support decision making.

We knew too well that our expert efforts added high operating costs to the heavy bills of software and hardware licensing and maintenance. But we were able to solve the “cost of performance” equation and managed to improve response times for our most demanding customers in the airline, financial and telecommunications industries.

However, our frustration with analytic database technologies grew over time, because most of them approached “infinite costs” while query response time plateaued far from near real-time (seconds). Occasionally, and more often than seldom, important queries hit incomprehensible, insurmountable snags within ever more distributed, interdependent and auto-magically managed software stacks.

The flows of data to be scrutinized for business intelligence multiplied, diversified and started to exhibit hard-to-tame dynamics. Big data analytics extended and expanded traditional analytic database systems, pushing a stream of technology innovations eager to be mapped onto market analysts’ maps, quadrants and other visuals for rapid categorization, sorting and ranking. A good collection can be downloaded here.

The new “Business Intelligence Architect” role is in demand but also in pain. Today our friends and former brothers in arms, business analysts and expert DBAs, holding such maps in their hands, pave the battlefield between the Governance guards overseeing enterprise data assets in the enterprise warehouse and the pressing business strategies in need of real-time actionable insights to compete, with increased agility, in crowded and dynamic marketplaces.

“What is the real cost of Business Intelligence and Real-Time Analytics? Let me tell you what it takes  to answer a critical business question these days” said one of my friends who builds OLAP cubes for Product Marketing BI in a major US telecommunications company.

“First I have to find the data sources for the BI report. There are hundreds of databases around the company connected to the Data Warehouse, and almost all the time I need access to one of them, because either the data there is newer or it is missing from the Data Warehouse. It takes days to find the owner of a database and get that access.

Sometimes I have to cross the swamp to get to the Data Lake, the innovation lab, and create a new data source from social media, data anonymizing engines or IoT edge devices.

Then I have to pass the scrutiny of the gatekeeper and, if lucky, filter a data view with which to align the new data or the new dimension. I could remove or reuse some of the old cubes to save storage space, but the time cost of getting the data access back and rebuilding them, in this environment, is prohibitive.

I do all the query optimization work back in my swamps, because there is no tolerance for uncertainty over how long a query will run and how much memory and CPU it will take out of the daily operating budget of the IT department (on premises or in the cloud). Something unexpected happens the first time I run a new query.”

A BI Architect’s data swamp consists of numerous connectors into the Data Warehouse, thousands of OLAP cubes, hundreds of Data Marts, and one tap into the big Data Lake hanging in the cloud, with torrents of data coming in over uncontrollable edges.

Driving near real-time query performance in this data swamp feels about as easy as speeding up the Orinoco River’s flows in this aerial picture taken in 2001!

Zeepabyte was born of Alex’s innovative idea about how to encode and search data at blazing speeds without burning hundreds of CPUs. But mastering the variables of the cost-of-performance equation across Business Intelligence analytics data swamps? We needed an objective framework to measure expected performance metrics and gauge all sources of cost when running complex business queries against datasets which mirror the nature, organization and scale of Business Intelligence and IoT analytics use cases.

TPC-H has reigned over benchmarking “the cost of performance” of analytic systems used by business organizations.

“[…]This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries. These aspects include the selected database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by multiple concurrent users. The TPC-H Price/Performance metric is expressed as $/QphH@Size.”

 

In less than two years, using TPC-H methodology, tools and metrics, we tested Zeepabyte’s first Cascade Analytics System implementation with large infrastructure partners such as Mellanox Technologies Lab in Silicon Valley and IBM’s Power Development Cloud.

We slashed the costs and achieved near real-time performance on the Star Schema Benchmark early on. Recently, Cascade Zippy Analytics System, running Version 2.0 of Zeepabyte’s Cascade Engine, queried 3TB of data in less than 3 seconds using IBM Power Systems.

We packaged version 2.0 of Zeepabyte’s Cascade Engine and published extensive documentation so you can download and try its three-fold benefit in your own use-case.

Let us know how you are doing! Zeepabyte Cascade Trial Forum is now open to support your experience and get feedback from you on our product.

Author: Lucia Gradinariu