

Start listening to your Spark Application

Warning: this article is more than three years old! The information may no longer be up to date and the links may be broken.

According to Wikipedia, Apache Spark is "an open source cluster computing framework". It aims to make it faster and easier to develop large-scale big data applications. You can use Spark as a cluster manager, a big data application framework, and even as a machine learning or streaming library!

It works on its own but is designed to work with Hadoop, Cassandra, Mesos, Amazon S3 and EC2. You get the idea: the goal of Spark is to interface you, as a developer, with all the big data tooling that already exists.

As I said in my previous article (in French), I worked with Spark this summer as an intern in the research lab of my school, Telecom SudParis. I built a tool to visualize metrics from Spark applications (such as runtime, disk usage, network usage, ...).

The event-driven part of Spark

Spark is composed of many subprojects (such as spark-sql or spark-streaming), and I only worked on its core project. Fortunately, the Spark documentation is really complete and clear! That is how I discovered how Spark handles job scheduling with the DAGScheduler component.

If you have worked with Spark, you probably already know the SparkUI interface used to monitor your applications. All the information you see in that monitoring interface is retrieved and processed by the DAGScheduler, which converts the actions you call on a data set into tasks that can be serialized and sent to executors.
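
To make that concrete, here is a minimal sketch (assuming an existing SparkContext named sc): transformations are lazy, and it is the final action that makes the DAGScheduler build a job, split it into stages and ship one task per partition to the executors.

val numbers = sc.parallelize(1 to 100000, numSlices = 8) // 8 partitions -> 8 tasks
val doubled = numbers.map(_ * 2)                         // transformation: nothing runs yet
val total   = doubled.reduce(_ + _)                      // action: the DAGScheduler builds a job,
                                                         // splits it into stages and submits the
                                                         // tasks to the executors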

DAGScheduler

The DAGScheduler is driven by events (like StageCompleted). Many callbacks are fired in the core of Spark: for example, when an executor sends back its results, the DAGScheduler deserializes the answer and fires an event.

SparkListener

SparkListener is a trait available in the Java and Scala APIs of spark-core (in the scheduler subpackage). This trait is used to define callbacks on Spark events.

The really cool thing is that Spark lets you define your own callbacks: yes, you can register your own listener by extending SparkListener and adding it to your SparkContext. Let's try it!

import org.apache.spark.scheduler.{SparkListenerTaskEnd, SparkListener}

class MyListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // do what you want here at the end of each task, retrieve all the information
    // from the case class SparkListenerTaskEnd
  }
}
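
As an example of what you can do with that callback, here is a sketch of a listener that logs a few timing metrics per task. It is only a sketch: the fields used (taskInfo.duration, taskMetrics.executorRunTime, taskMetrics.jvmGCTime) come from the TaskInfo and TaskMetrics classes as I used them, so check them against the API of your Spark version.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskTimingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    // taskMetrics may be null when a task fails, so wrap it in an Option
    Option(taskEnd.taskMetrics).foreach { metrics =>
      println(s"task ${info.taskId} on ${info.host}: " +
        s"total ${info.duration} ms, executor ${metrics.executorRunTime} ms, " +
        s"GC ${metrics.jvmGCTime} ms")
    }
  }
}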

Okay, we have a new listener; now let's tell Spark about it!

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
                    .setAppName("test listener")
                    .setMaster("yarn-client")
val sc = new SparkContext(sparkConf)
val myListener = new MyListener()

sc.addSparkListener(myListener)

// do your usual Spark application!
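
Once the listener is registered, any action you run on that context fires the callback, once per task. For instance (the input path below is just a placeholder):

val lineCount = sc.textFile("hdfs:///some/input.txt").count() // placeholder path
println(s"lines: $lineCount") // onTaskEnd was invoked for every task of this job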

It's that easy! Be careful though: SparkListener is part of the developer API and can be removed or changed in any update. It comes without any guarantee of stability.

Conclusions

I didn't have enough time to build anything other than a metrics visualization, but I'm sure there are many things that can be done with this feature. It's up to you to use it, even if just to get a better view of the behavior of your application!

Thank you for reading, and have fun with Spark!