Apache Spark on Hortonworks? Kubernetes might be your solution.


From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key distributed big data processing frameworks in the world, which is why we are discussing it today.

Let’s have a look at Hortonworks and Cloudera today.

Over the years, Hadoop, the once high-flying open-source platform, gave rise to many companies, and an ecosystem of vendors emerged. It was long believed that some major companies would emerge from the pack, and, sure enough, Hortonworks went public in 2014. Cloudera followed three years later, but the market slowed down, and in October 2018 the two companies announced a merger in a deal worth $5.2 billion, based on the closing share price the day before the announcement.

Part of the problem with Hadoop, even though anyone could download it, was the sheer complexity of it. That’s where companies like Hortonworks and Cloudera came in. They packaged it for IT departments that wanted the advantage of a big data processing platform, but didn’t necessarily want to build Hadoop from scratch.

These companies offered different ways of attacking that complexity, but over time, as cloud-based big data solutions multiplied, rolling your own Hadoop system began to seem futile, even with the help of companies like Cloudera and Hortonworks.

How much does it cost?

The initial Cloudera Enterprise software license costs $180,634. In addition to the initial license cost, there is a recurring annual cost of $7,168 per node for ongoing maintenance and support.

Cloudera's spending isn't translating into growth. While the company's total revenue grew substantially last year thanks to the Hortonworks acquisition, Cloudera's new guidance calls for annualized recurring revenue, a metric that accounts for pre-acquisition Hortonworks contracts, to grow by just 0% to 10%. At the low end of that range, Cloudera's growth will flatline this year despite heavy spending on sales and marketing.

Generally, if five line-of-business (LOB) users use the data warehouse on a public cloud for eight hours a day, five days a week, for one month (four weeks), you pay for the use of the service and the associated cloud hardware resources (compute and storage) for this period. So, for example, assume your data warehouse service costs $0.72 / hour per compute instance, the associated cloud hardware costs $1.44 / hour per compute instance, and storage costs $15 / TB per month. If the five LOB users use ten compute instances and 10 TB of data, you will be billed as follows:

- $1152 for data warehousing software = $0.72 / hour x (8 hours x 5 days x 4 weeks) x 10 instances

- $2304 for the cloud hardware instances = $1.44 / hour x (8 hours x 5 days x 4 weeks) x 10 instances

- $150 for storage use = $15 / TB / month x 10 TB

The bill totals $3606 on a monthly basis for just a small number of users of a moderately sized data warehouse service. Multiply this by the typical number of users (often in the hundreds) of a production-grade data warehouse. Clearly, the data warehouse service has a direct and major impact on the total costs.
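
The arithmetic behind this bill can be reproduced with a few lines of Python. This is only a sketch of the example above: the function name `monthly_bill` and its parameter names are made up for illustration, and the rates are the assumed figures from this article, not any vendor's actual price list.

```python
from decimal import Decimal

# Billable hours in the example month: 8 hours x 5 days x 4 weeks.
BILLABLE_HOURS = 8 * 5 * 4


def monthly_bill(software_rate, hardware_rate, storage_rate_per_tb,
                 instances, storage_tb, hours=BILLABLE_HOURS):
    """Return the monthly charges in dollars as a dict of Decimal values.

    Rates are passed as strings so Decimal keeps them exact
    (no binary floating-point rounding on amounts of money).
    """
    software = Decimal(software_rate) * hours * instances
    hardware = Decimal(hardware_rate) * hours * instances
    storage = Decimal(storage_rate_per_tb) * storage_tb
    return {
        "software": software,
        "hardware": hardware,
        "storage": storage,
        "total": software + hardware + storage,
    }


# The figures assumed in the example: ten instances and 10 TB of data.
bill = monthly_bill("0.72", "1.44", "15", instances=10, storage_tb=10)
for item, amount in bill.items():
    print(f"{item}: ${amount}")
```

Changing `instances` or `hours` shows how quickly the compute share of the bill grows, since both the software and the hardware charges scale with instance-hours while storage stays flat.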

What about Spark on Kubernetes?

Kubernetes is now a proven technology for deploying and distributing modules quickly and efficiently. Many cloud vendors are now offering Hadoop as a service, and companies are moving towards a model where they provision an instance of a service on the fly and use it for analytics. By contrast, it usually takes weeks to provision a production-ready Hadoop cluster. A container-native platform such as Kubernetes can serve as the backbone for all other analytical services.

Pros of using Spark on Kubernetes

If you’re running data science pipelines and machine learning model workflows to get real business value, that probably means you need to process lots and lots of data. To do this in a scalable way and to handle complex computation steps across a large amount of data (without breaking the bank), Kubernetes is becoming an increasingly popular choice for scheduling Spark jobs, compared to YARN.

Kubernetes allows for the unification of infrastructure management under a single type of cluster for multiple workload types. You can run Spark, of course, but you can also run Python or R code, notebooks, and even web apps. In the traditional Spark-on-YARN world, you need a dedicated Hadoop cluster for your Spark processing and something else for Python, R, etc.

Managing dependencies in Hadoop is time-consuming: packages have to exist on all nodes in a cluster, which makes isolation (i.e., making different versions of the same software — like TensorFlow 1.5 and TensorFlow 2.0 — coexist on the same node) difficult and updating environments challenging.
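
On Kubernetes, by contrast, each Spark job can name its own container image, so two jobs with conflicting dependencies can share one cluster. The sketch below only builds the `spark-submit` command lines and does not run them. The helper `spark_submit_command`, the API server address, the registry, the image tags, and the application paths are all hypothetical placeholders. The `k8s://` master prefix, `--deploy-mode cluster`, and the `spark.kubernetes.*` and `spark.executor.instances` settings are standard Spark configuration options.

```python
def spark_submit_command(api_server, image, app_path, name,
                         executors=2, namespace="default"):
    """Build a spark-submit argument list for a Kubernetes cluster.

    The container image carries all of the job's dependencies, so
    nothing has to be pre-installed on the cluster nodes.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


# Hypothetical cluster endpoint and images, for illustration only.
API_SERVER = "https://k8s.example.com:6443"

legacy_job = spark_submit_command(
    API_SERVER,
    image="registry.example.com/spark-tf:1.5",
    app_path="local:///opt/spark/app/score_legacy.py",
    name="score-legacy",
)
new_job = spark_submit_command(
    API_SERVER,
    image="registry.example.com/spark-tf:2.0",
    app_path="local:///opt/spark/app/score_new.py",
    name="score-new",
)

for command in (legacy_job, new_job):
    print(" ".join(command))
```

The two commands differ only in the image and the application they reference. Pointing `API_SERVER` at a cluster on a different cloud would leave the rest of each command unchanged.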

If you’re doing Enterprise AI and want to start moving data storage to the cloud (or have already started doing so), the idea of vendor lock-in can be very scary. That’s why today, more and more businesses are taking a cloud-agnostic approach. Running Spark on Kubernetes means building once and deploying anywhere, which makes a cloud-agnostic approach scalable.

And the cherry on the cake: it’s cheaper.

The nice thing with Kubernetes (not just for Spark, but for everything else) is that all major cloud vendors offer managed services to provide Kubernetes clusters that are built on top of what makes the cloud nice — i.e., a very flexible and elastic infrastructure that can be created, destroyed, and scaled easily to match workload requirements.