
Installation

NOTE

Spark can run without Hadoop, but this page assumes you are installing Spark for an existing Hadoop cluster.

Download Binary
  • Go to the Apache Spark release archive (https://archive.apache.org/dist/spark/) and open the directory for the Spark version you want.
  • Right-click the file named like `spark-<spark version>-bin-hadoop<hadoop version>.tgz` and copy its link.
  • SSH into your Linux server and run the following inside a directory of your choice.
```shell
# Using Spark 3.1.1 pre-built for Hadoop 3.2+
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# This will download the .tgz file into the current directory.
```
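
Once the download completes, you would typically unpack the archive before configuring anything; a minimal sketch, assuming you want Spark under `/opt/spark` (the target directory is an assumption, any location works):

```shell
# Extract the tarball and move it to the chosen install location
# (/opt/spark is an assumption; pick any directory you prefer)
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
sudo mv spark-3.1.1-bin-hadoop3.2 /opt/spark
```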

Prerequisites

  • You will need Hadoop installed for Spark to work (this is required only when installing Spark for a Hadoop cluster).
  • Get the path where the python command is available by running the following (you will need it to set an environment variable for PySpark later; see the sketch below):
```shell
# Path to the python command
which python
```
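
That path is usually exported as the `PYSPARK_PYTHON` environment variable; a minimal sketch, assuming `which python` printed `/usr/bin/python` (substitute your own output):

```shell
# Tell PySpark which interpreter to use (the path below is an assumption;
# use whatever `which python` printed on your machine)
export PYSPARK_PYTHON=/usr/bin/python
```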
  • Optional: install Scala. From Spark 2.x onwards, Spark comes pre-built with Scala, so this isn't required for it to function, but it is still recommended.
```shell
sudo apt install scala
```
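
To verify the Scala install, you can print the version:

```shell
# Confirm Scala installed correctly by printing its version
scala -version
```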