Scalable Machine Learning
Spark ML on BD|CESGA platform



Javier Cacheiro / CESGA / Cloudera Certified Developer for Hadoop / @javicacheiro

1.048.576

BD|CESGA

Providing quick access to ready-to-use Big Data solutions

Because Big Data doesn't have to be complicated

Scalable

  • Storage capacity 816TB
  • Aggregated I/O throughput: 30GB/s
  • Aggregated RAM: 2432GB
  • 10GbE connectivity between all nodes
  • 456/912 cores

Hadoop Platform

  • Ready to use Hadoop ecosystem
  • Covers most use cases
  • Fully optimized for Big Data applications
  • Production ready

PaaS Platform

  • When you need something outside the Hadoop ecosystem
  • Includes a catalog of ready-to-use products, e.g. Cassandra, MongoDB, PostgreSQL

Spark

A fast and general engine for large-scale data processing

Speed

Easy

Spark ML

Spark's new machine learning library

Spark ML vs MLlib

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

Why switch?

  • DataFrames provide a more user-friendly API than RDDs
  • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages
  • The RDD-based API is expected to be removed in Spark 3.0

Spark ML

Demo

Summary

Using BD|CESGA and Spark ML you can easily scale your machine learning problems to larger datasets

Take advantage of parallelism!

Q&A

hadoop.cesga.es

Demo Source Code

Stay up to date by subscribing to our mailing list