# Installation
> **NOTE:** Spark can run without Hadoop, but this page assumes you are installing Spark for an existing Hadoop cluster.
## Download Binary
- Go to the Apache Spark archive at https://archive.apache.org/dist/spark/ and open the directory for the Spark version you want.
- Right-click the file named like `spark-<spark version>-bin-hadoop<hadoop version>.tgz` and copy the link address.
- SSH into your Linux server and run the following in a directory of your choice:
```shell
# Download Spark 3.1.1 pre-built for Hadoop 3.2+
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# This downloads the .tgz file into the current directory.
```
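The downloaded archive still needs to be unpacked before Spark can be used. A minimal sketch of the extraction step, run here against a stand-in archive so the commands are self-contained (on your server, skip the setup lines and point `tar` at the real `spark-3.1.1-bin-hadoop3.2.tgz` you just downloaded):

```shell
# Setup only for this illustration: fabricate a stand-in archive with the
# same name. On the server, skip these three lines and use the real download.
mkdir -p spark-3.1.1-bin-hadoop3.2
tar -czf spark-3.1.1-bin-hadoop3.2.tgz spark-3.1.1-bin-hadoop3.2
rm -r spark-3.1.1-bin-hadoop3.2

# The actual extraction step: unpacks into ./spark-3.1.1-bin-hadoop3.2/
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
```

The extracted directory is the Spark home you will reference in later configuration.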
## Pre-requisites
- You will need Hadoop installed for Spark to work with your cluster (this is required only when installing Spark for a Hadoop cluster).
- Get the path where the `python` command is available by running the following (you will need this path later to set the environment variable for PySpark):

```shell
# Path to the python command
which python
```
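As a sketch of how that path is typically used later: PySpark reads the interpreter from the `PYSPARK_PYTHON` environment variable. The exact file you persist it in (e.g. `~/.bashrc` or Spark's `spark-env.sh`) depends on your setup, so treat the location as an assumption:

```shell
# Point PySpark at the interpreter found above. Many systems ship only
# `python3`, so fall back to it if a bare `python` is not on the PATH.
export PYSPARK_PYTHON=$(command -v python || command -v python3)
echo "$PYSPARK_PYTHON"
```

Adding the `export` line to your shell profile makes it survive new SSH sessions.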
- Optional: install Scala. From Spark 2.x onwards, Spark comes pre-built with Scala, so this isn't required for Spark to function, but it is recommended:

```shell
sudo apt install scala
```