Scala for (py)spark developers

Joël Vimenet

2020/09/28

Introduction

It is not

  • Language comparison
  • Functional programming course
  • Spark course

Objectives of the presentation

  • Being able to read Spark Scala code
  • Being able to write basic Spark Scala code

Language

Compiled JVM language

  • Can run anywhere a JVM is installed
  • No need to recompile to run on Windows, Mac, Linux, etc…
  • Interoperable with Java
    • Use any Java lib in Scala
    • Call Scala code from Java (not a great idea)

Statically typed

  • Values (constants and variables) have a type that cannot change
  • Type inference
val myValue : Int = 1
val myInferedValue = "is a string"
var mutableValue = 42
mutableValue = "cannot become a string" // won't compile

Strongly typed

Everything has a type

Values, functions and control structures have a type


val myValue = true
def myFunction(param: Boolean) : Int =
    if (param) 1 else -1

val mySeq = Seq(1, 2, 3)

val newSeq : Seq[Int] =
    for (value <- mySeq)
    yield value * 2

Describe everything with a type

  • A value may be absent: Option
  • A computation may fail: Try
  • A result can be one type or another: Either
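
A minimal sketch of how each of these types looks in practice (the values are illustrative):

import scala.util.Try

// A value that may be absent
val maybeName: Option[String] = Some("Joël")
val noName: Option[String] = None

// A computation that may fail with an exception
val parsed: Try[Int] = Try("42".toInt)   // Success(42)
val failed: Try[Int] = Try("oops".toInt) // Failure(NumberFormatException)

// A result that is one type or the other
val result: Either[String, Int] =
    if (parsed.isSuccess) Right(42) else Left("not a number")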

Compiler is your friend

  • Static and strong typing catches a lot of errors
    • Never return null, use Option instead: no null pointer exceptions
    • Don’t throw exceptions, return Try instead: no unhandled exceptions
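
A hedged sketch of what this looks like in practice (findUser and parsePort are made up for the example):

import scala.util.Try

// Instead of returning null for an unknown user, return an Option
def findUser(id: Int): Option[String] =
    if (id == 1) Some("Joël") else None

// Instead of letting the exception escape, wrap the risky call in a Try
def parsePort(raw: String): Try[Int] = Try(raw.toInt)

// Callers must handle both cases explicitly
val greeting = findUser(2).map(name => s"Hello $name").getOrElse("Unknown user")
val port = parsePort("8080").getOrElse(80)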

Object oriented

Traits and classes

  • Inheritance is supported
    • mostly used between traits and classes
    • Deep inheritance is a bad practice
trait Person {
    def sayHello(): String
}

class Adult(name: String, age: Int) extends Person {
    def sayHello(): String =
        s"Hello, my name is $name, I am $age years old"
}

class Child(name: String) extends Person {
    def sayHello(): String =
        s"Hello, my name is $name"
}

val jojo = new Adult("Joël", 36)
val baby = new Child("Baby")

println(jojo.sayHello())

// Hello, my name is Joël, I am 36 years old

case classes

Scala has a special kind of class called a case class. It is designed to be a simple data container with the following characteristics:

  • immutable by default
  • public fields
  • automatic getter, equals, hashCode and copy generation
  • automatic pattern matching support
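
A small sketch illustrating those characteristics (the Contact type is just for the example):

case class Contact(name: String, age: Int)

val c1 = Contact("Joël", 36)
val c2 = Contact("Joël", 36)

c1 == c2                       // true: structural equality for free
c1.name                        // public, read-only field
val older = c1.copy(age = 37)  // transform by copying, c1 is untouched

older match {                  // pattern matching support is generated
    case Contact(name, age) => s"$name is $age"
}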

objects

An object is a singleton used as a static holder for constants and functions


object MyFunctions {
    val myConstant = 42

    def multiply(value: Int) = myConstant * value
}

val myResult = MyFunctions.multiply(2)
// 84, what a beautiful year!

Generics

Types that can be parameterized by other types. For instance, a Seq or an Option can contain objects of any type. The type only has to be implemented once and remains type-safe.

val intSeq : Seq[Int] = Seq(1, 2, 3, 4)
val stringSeq : Seq[String] = Seq("A", "B", "C")

val maybeIntHead : Option[Int] =
    intSeq.headOption // Some(1)
val maybeStringHead : Option[String] =
    stringSeq.headOption // Some("A")

Functional

Functional programming

Functional programming is about describing data and the transformations applied to it.
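
For example, a pipeline reads as a description of the data and of the transformations applied to it (the values are illustrative):

val temperatures = Seq(18.5, 21.0, 23.4, 19.9)

// Describe the transformations, not the loop that performs them
val warmDaysInFahrenheit = temperatures
    .filter(_ > 20.0)
    .map(celsius => celsius * 9 / 5 + 32)
// roughly Seq(69.8, 74.12)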

Functions

Functions are first-class citizens of the language

  • Have a type
  • Can be assigned to a value
  • Can be passed as parameter
  • Pure functions
    • Result depends only on the parameters
    • No side effects
    • Easy to test

val add : (Int, Int) => Int =
    (x, y) => x + y

val result = add(1, 2)

val myList = Seq(1, 2, 3, 4)

val reducedList = myList.reduce(add)

Pattern matching

  • Switch on steroids
  • Typesafe
  • Extract parts of the content to work on it
  • Combined with strong typing, the compiler can check match exhaustiveness (see the sealed trait sketch after the example below)

val maybeValue = functionThatReturnsAnOption(param1, param2)

maybeValue match {
    case Some(result) => valueHandler(result)
    case None         => noValueHandler()
}
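
Exhaustiveness checking shines with sealed hierarchies; a minimal sketch (the Shape types are invented for the example):

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Square(side: Double) extends Shape

def area(shape: Shape): Double = shape match {
    case Circle(r) => math.Pi * r * r
    // removing the Square case triggers a "match may not be exhaustive" warning
    case Square(s) => s * s
}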

Variables deconstruction

In the same spirit as pattern matching, you can deconstruct variables. It works with everything that supports pattern matching:

  • case classes
  • lists
  • tuples
  • etc…
case class Person(
  firstName: String,
  lastName: String,
  age: Int)

val me = Person("Joël", "Vimenet", 36)
val Person(myFirstName, myLastName, myAge) = me

println(s"$myFirstName $myLastName is $myAge years old")
// Joël Vimenet is 36 years old

Immutability

By default everything is immutable

  • val is preferred, var is discouraged
  • case classes are immutable
    • the copy method eases transforming a case class
  • collections are immutable
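
A short sketch of what working with immutable data looks like (the Contact case class is illustrative):

case class Contact(firstName: String, lastName: String, age: Int)

val me = Contact("Joël", "Vimenet", 36)
val olderMe = me.copy(age = 37)   // me is left untouched

val numbers = Seq(1, 2, 3)
val doubled = numbers.map(_ * 2)  // Seq(2, 4, 6); numbers is unchanged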

Implicits

  • Pass parameters implicitly
    • An implicit value declared in the context is bound to the function by the compiler
    • It reduces the number of parameters to pass explicitly
    • for instance, the execution context when writing async code
  • Beware: it can make the code harder to understand
    • Where does this parameter come from?
    • IDEs can help track it down
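
A minimal sketch of an implicit parameter, using the execution context case mentioned above (fetchAnswer is made up for the example):

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.ExecutionContext.Implicits.global

// The execution context is picked up from the implicit scope,
// so callers do not have to pass it explicitly
def fetchAnswer()(implicit ec: ExecutionContext): Future[Int] =
    Future(42)

val answer = fetchAnswer() // the compiler fills in the ExecutionContext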

Implicit conversions

Implicit conversions are used to create extension methods (adding new methods on an existing type).

You can declare a function or class as implicit if it takes only one parameter.


case class Euro(value: Double)

implicit class DoubleToEuro(value: Double) {
    def euros = Euro(value)
}

val twoEuros = 2.euros

Official documentation

The official documentation contains detailed examples of what I described previously and much more.

Language specification

It is relatively short and quick to read.

2.13 Specification

Tooling

SBT

  • Simple Build Tool (sic)
  • build.sbt project description file
    • describes dependencies
    • compiler options
    • plugins
      • sbt-release
      • formatter, linter, etc…
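
A minimal build.sbt sketch (the versions and the Spark dependency are illustrative):

// build.sbt
name := "spark-demo"
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided"
)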

IDE

  • IntelliJ IDEA community edition
  • VSCode + Metals
  • Vim, Emacs, etc…

Spark

Native API

Spark is developed in Scala, so the Scala API is native

Typesafe API

Spark Scala provides the Dataset API

type DataFrame = Dataset[Row]

Dataset programming

  • Describe your data structures as case classes,
  • Define your transformation functions based on your types (Spark independent),
  • Problems are caught at compile time rather than at runtime.

Your code is safer and easier to reason about.
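
A minimal sketch of that workflow (the SparkSession setup is simplified; the case class and data are illustrative):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Passenger(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// The data structure is a case class, the transformation is plain Scala
val passengers: Dataset[Passenger] =
    Seq(Passenger("Joël", 36), Passenger("Baby", 1)).toDS()

def adults(ds: Dataset[Passenger]): Dataset[Passenger] =
    ds.filter(_.age >= 18)

adults(passengers).show()
// referencing a field that does not exist would fail at compile time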

Demo time
