EMC/Dell, IBM, HP – Wake Up!

The Dell/EMC merger, declining storage revenues at IBM and others … traditional IT vendors are under attack! What is going on? What's the bigger picture? How can they recover? I'll try to answer these questions here and offer some less "simplistic" answers.

We see several major trends shaping our lives:

  • Most data is consumed and generated by mobile devices
  • Applications are moving to a SaaS-based data-consumption model
  • Everyone from retailers and banks to governments depends on data-driven intelligence to stay competitive
  • We are on the verge of the IoT revolution, with billions of sensors generating even more data

New companies prefer using Office 365 or Salesforce.com in the cloud rather than installing and maintaining local Exchange servers, ERP, or CRM platforms, and are spared the capital investment in hardware and software. If you are a twenty- or even a hundred-employee firm, there's no reason to own any servers; do it all in the cloud. At the same time, many organizations depend on their (exponentially growing) data and home-grown apps and are not going to ship them to Amazon anytime soon.

Enterprise IT is becoming critical to the business. Let's assume you are the CIO of an insurance company. If you don't provide a slick, responsive mobile app and tiered pricing based on machine learning over driver records, someone else will, and they will disrupt your business, i.e. the "Uber" effect. This forces enterprises to borrow cloud-native and SaaS concepts for on-premises deployments and to design around agile and micro-services architectures. These new methodologies impose totally different requirements on the infrastructure and have caught legacy enterprise software and hardware vendors unprepared.

[Figure: newapps]

The challenge for current IT vendors and managers is how to transform from providing hardware/system solutions for the old application world (Exchange, SharePoint, Oracle, SAP, …), which is now migrating to SaaS models, into being platform/solution providers for the new agile, data-intensive applications that matter most to the business. Some analysts attribute the EMC and IBM storage revenue decline to "Software Defined Storage" or hyper-converged infrastructure. Well … those may have had some impact, but the big strategic challenge these vendors face is the data transformation.

The above trends impact data in three dimensions: locality, scalability, and data awareness. As we go mobile and globally distributed, data locality (silos) becomes a burden; as data grows we must address scalability and complexity and lower overall costs. To serve the business we must be able to find relevant data in huge data piles, draw insights faster, and make sure it is secured, which forces us to build far more data-aware systems.

New solutions have been designed for the cloud and replace traditional enterprise storage, as illustrated in the diagram below.

[Figure: data-aware]

We used to think of the cloud as IaaS (Infrastructure as a Service), and we built our private IaaS clouds using VMware or, more recently, OpenStack. But developers no longer care about infrastructure; they want data/computation services and APIs. Cloud providers acknowledge that, and in the last few years they have invested most of their energy in technologies that are easier to use, more scalable and distributed, and in gaining deeper data awareness.

Google developed new databases like Spanner, which can span globally and handle semi-structured data consistently (ACID), new data-streaming technologies (Dataflow), object storage technologies, etc. Amazon is arming itself with a fully integrated scale-out data-services stack, including S3 (object storage), DynamoDB (NoSQL), Redshift (data warehouse), Kinesis (streaming), Aurora (scale-out SQL), Lambda (event handling), and CloudWatch. And Microsoft is not standing still, with new services like Azure Data Lake, and is shifting developers from MS SQL to scale-out database engines.

What's common to all these new technologies is the design around commodity servers, direct-attached high-density disk enclosures or SSDs, and fast networking. They implement highly efficient, data-aware low-level disk layouts to optimize I/O and search performance, and they don't need much of the SAN, vSAN/HCI, or NAS baggage and features; resiliency, consistency, and virtualization are handled at higher levels.

Cloud data services have self-service portals and APIs and are used directly by application developers, eliminating operational overhead: there is no need to wait for IT to provision infrastructure for new applications; we consume services and pay per use.
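To make the self-service point concrete, here is a minimal sketch using the AWS SDK for Python (boto3): a developer provisions a DynamoDB table and writes to it directly from application code, with no infrastructure ticket in between. The region, table, and attribute names are illustrative assumptions, not a prescribed design.

```python
import boto3

# Region, table name, and attributes are assumptions for this sketch
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')

# Provision a table on demand -- no storage admin involved
table = dynamodb.create_table(
    TableName='driver-scores',
    KeySchema=[{'AttributeName': 'driver_id', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'driver_id', 'AttributeType': 'S'}],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)
table.wait_until_exists()

# Start consuming the service immediately, paying per use
table.put_item(Item={'driver_id': 'd-123', 'risk_score': 87})
```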

The latest stats show that the majority of Big Data projects are done in the cloud, mainly due to the simplicity, and that the majority of on-premises deployments are not considered successful; some even have negative ROI (due to human-resource drain, lots of professional services, and expensive software licenses). So as long as you're not concerned about storing your data in the cloud, it will be simpler and cheaper there.

What EMC, IBM, Dell, and HP have been building is the same non-data-aware legacy storage model, just slightly better. Many of the new entrants in all-flash, scale-out vSAN, scale-out NAS, and hyper-converged infrastructure (HCI) are basically going after the same old resource-intensive application world with somewhat improved efficiency and scalability. But they do not cater to the new world (read my post on cloud-native apps).

HP has Vertica, which is good but a point solution. EMC has Pivotal, with several elements of the stack. Many of these offerings are partial and not very competitive, and they lack overall integration. IBM has the richest stack, including SoftLayer, BigInsights, and the recent Cleversafe acquisition, but not yet in an integrated fashion, and it still needs to prove it can be as agile and cost-effective as the cloud guys.

They can build a platform using open-source projects. The challenge is that those are usually point solutions: not robust, installed and managed individually, and requiring glue logic and dependency wrangling. Most lack the usability, security, compliance, or management features needed by the enterprise, and those gaps must be addressed to make them a viable platform. We need things to be integrated, just as Amazon uses S3 to back up DynamoDB, Kinesis as a front end to S3, Redshift, or DynamoDB, and Lambda as an S3 event handler. Vendors may need to shift some focus from their IaaS and OpenStack efforts (IT orientation) to micro-services, PaaS, and SaaS (DevOps orientation), with modern, self-service, integrated, and managed data services.
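As a concrete illustration of that last integration, here is a minimal, hypothetical sketch of a Lambda function (Python) subscribed to S3 ObjectCreated notifications; it reads each new object and could forward a record to Kinesis, DynamoDB, or Redshift. Bucket names and downstream targets are assumptions, not a prescribed design.

```python
import urllib.parse
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    """Invoked by S3 with one or more ObjectCreated records."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Fetch the newly written object
        obj = s3.get_object(Bucket=bucket, Key=key)
        size = len(obj['Body'].read())

        # From here the data could be pushed onward, e.g. to Kinesis or DynamoDB
        print(f"processed s3://{bucket}/{key} ({size} bytes)")

    return {'status': 'ok'}
```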

There is still a huge market for on-premises IT; many organizations such as banks, healthcare providers, telcos, and large retailers have lots of data and computation needs and will not trust Amazon with them. Many enterprise users are concerned about security and compliance, or are bound by national regulations, and prefer to stay on-premises. And not all Amazon users are happy with the bill they get at the end of the month, especially when they are heavy users of data and services.

But today cloud providers deliver significantly better efficiency than what IT can offer the business units. If that doesn't change, we will see more users flowing to the cloud, or clouds coming to the users: Amazon and Azure are already talking to organizations about building on-premises clouds inside the enterprise, essentially outsourcing the IT function altogether.

Enterprise IT vendors and managers had better wake up soon, take action, and stop whining about public clouds, or some of them will be left out in the cold.

Cloudera Kudu: Yet Another Data Silo?

Recently Cloudera launched a new Hadoop project called Kudu. It is quite aligned with the points I made in my Architecting BigData for Real Time Analytics post, i.e. the need to support random and faster transactions or ingestion alongside batch or scan processing (done by MapReduce or Spark) on a shared repository.

Today, users who need to ingest data from mobile, IoT sensors, or ETL, and at the same time run analytics, have to use multiple data storage solutions and copy the data between them, just because some tools are good for writing (ingestion), some are good for random access, and some are good for reading and analysis. Beyond being rather inefficient, this adds significant delay (copying terabytes of data), which hinders the ability to provide real-time results and adds data consistency and security challenges. Also, not having true update capabilities in Hadoop limits its use in the high-margin data warehouse market.

Cloudera wants to penetrate the enterprise faster by simplifying the stack, and at the same time differentiate itself and maintain leadership. It sees how many native Hadoop projects like Pig, Hive, and Mahout are becoming irrelevant next to Spark and other alternatives, and it must justify its multi-billion-dollar valuation.

But isn't Kudu adding more fragmentation, yet another app-specific data silo? Are there better ways to address the current Hadoop challenges? More on that later …

What is Kudu?

You can read a detailed analysis here. In a nutshell, Kudu is a distributed key/value store with column awareness (i.e. a row's value can be broken into columns, and column data can be compressed), somewhat like the lower layers of Cassandra or Amazon Redshift. Kudu also incorporates caching and can leverage memory or NVRAM. It is implemented in C++ to allow better performance and memory management (vs. HDFS in Java).

It provides a faster alternative to HDFS with Parquet files, plus it allows updating records in place instead of re-writing the entire dataset (updates are not supported in HDFS or Parquet). Note that Hortonworks recently added update capabilities to Hive over HDFS, but in an extremely inefficient way (every update adds a new file that is linked to all the previous version files, due to HDFS limitations).
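To show what "update in place" looks like in practice, here is a minimal sketch using the kudu-python client; the master address, table name, and columns are hypothetical, and error handling is omitted.

```python
import kudu
from kudu.client import Partitioning

# Connect to a Kudu master (the address is an assumption for this sketch)
client = kudu.connect(host='kudu-master', port=7051)

# Columns must be declared explicitly, much like in an RDBMS
builder = kudu.schema_builder()
builder.add_column('device_id').type(kudu.int64).nullable(False).primary_key()
builder.add_column('reading').type(kudu.double)
schema = builder.build()

# Hash-partition rows across tablets by primary key
partitioning = Partitioning().add_hash_partitions(column_names=['device_id'], num_buckets=3)
client.create_table('sensor_readings', schema, partitioning)

table = client.table('sensor_readings')
session = client.new_session()

# Insert a row, then update it in place -- no dataset rewrite as with HDFS/Parquet
session.apply(table.new_insert({'device_id': 1, 'reading': 20.5}))
session.apply(table.new_update({'device_id': 1, 'reading': 21.7}))
session.flush()
```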

[Figure: kudu-hbase-hdfs1 (Source: Cloudera)]

What is it NOT?

Kudu is a point solution. It is not a full replacement for HDFS, since it doesn't support file or object semantics and it is slower for pure sequential reads. It is not a full NoSQL/NewSQL tool either; other solutions like Impala or Spark need to be layered on top to provide SQL. Kudu is faster than HDFS but still measures transaction latency in milliseconds, i.e. it can't substitute for an in-memory DB, and it has higher insert (ingestion) overhead, so it is no better than HBase for write-intensive loads.

Kudu is not "unstructured" or "semi-structured": you must explicitly define all your columns, as in a traditional RDBMS, somewhat against the NoSQL trend.

Kudu is a rather new project, at Alpha or Beta level stability, with very limited or no functionality when it comes to data security, compliance, backups, tiering, etc. It will take quite a bit of time until it matures to enterprise levels, and it is not likely to be adopted by the other Hadoop distributions (MapR or Hortonworks), which are working on their own flavors.

What are the key challenges with this approach?

The APIs of Kudu do not map to any existing abstraction layer, which means Hadoop projects need to be modified quite a bit to make use of it; Cloudera will only support Spark and Impala initially.
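For example, reading a Kudu table from Spark goes through Kudu's own data source rather than the familiar HDFS file APIs. A minimal PySpark sketch (assuming the kudu-spark connector is on the classpath; the master address and table name are hypothetical) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-read-sketch").getOrCreate()

# Kudu is addressed through its own DataSource, not as files on HDFS
readings = (spark.read
            .format("org.apache.kudu.spark.kudu")
            .option("kudu.master", "kudu-master:7051")
            .option("kudu.table", "sensor_readings")
            .load())

# From here on it behaves like a regular DataFrame, so Spark SQL works unchanged
readings.createOrReplaceTempView("sensor_readings")
spark.sql("SELECT device_id, reading FROM sensor_readings WHERE reading > 21").show()
```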

It is not a direct HDFS or HBase replacement, as outlined on Cloudera's web site:

“As Hadoop storage, and the platform as a whole, continues to evolve, we will see HDFS, HBase, and Kudu all shine for their respective use cases.

  • HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
  • For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal with its fast ability to handle updating data.
  • Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers”

This means the community has created yet another point solution for storing data. Now, in addition to allocating physical nodes with disks to HDFS, HBase, Kafka, MongoDB, etc., we need to add physical nodes with disks for Kudu, and each of these data management layers has its own tools to deploy, provision, monitor, secure, and so on. Users who are already confused now need to decide which to use for which use case, and will spend most of their time integrating and debugging open-source projects rather than analyzing data.

What is the right solution IMHO?

The challenge with Hadoop projects is that there is no central governance, layering, or architecture as in the Linux kernel, OpenStack, or any other successful open-source project I have participated in. Every few weeks we hear about a new project which in most cases overlaps with others. In many cases different vendors will support different packages (the ones they contributed). How do you do security in Hadoop? Well, that depends on whether you ask Hortonworks or Cloudera. Which tool has the best SQL? Again, it depends on the vendor; even the file formats are different.

I think it's OK to have multiple solutions for the same problem. Linux has an endless number of file systems, and OpenStack has several overlay network and storage provider implementations, BUT they all adhere to the same interfaces and have the same expected semantics, behavior, and management. Where we see a lot of overlap (as with HDFS, Kudu, and HBase) we should create intermediate layers so components can be shared.

If there is a consensus that the persistent storage layers in Hadoop (HDFS, HBase) are limited or outdated, that we may need new abstractions to support record or column semantics and improve random-access performance, or that security is a mess, the best way is to first define and agree on the new layers and abstractions, and then gradually modify the different projects to match them. If the layers are well defined, different vendors can provide different implementations of the same component while enjoying the rest of the ecosystem. Existing commercial products or established open-source projects can be used with some adaptation and immediately deliver enterprise resiliency and usability. We can grow the ecosystem beyond the current three vendors with better solutions that complement each other rather than compete on the entire stack, and we can add more analytics tools (like Spark) that may want to access the data layer directly without being tied to the entire Hadoop package.

If we seek better enterprise adoption of Hadoop, I believe the right way for Cloudera and the other vendors is to provide an open interface for partners to build solutions around, much like Linux, OpenStack, MongoDB, Redis, or MySQL did alongside their own reference implementations. A better way for them to build value may be to improve overall usability and to focus on pre-integrating vertical analytics applications or recipes on top of the infrastructure.

Adding another point solution like Kudu, with its own singular data model, and working for a few years to make it enterprise grade just makes Hadoop deployments more complex and the life of data analysts even more miserable. Well … on second thought, if most of your business is in professional services, keeping things complicated might be a good idea :)