# Installation
> **NOTE:** Spark can run without Hadoop, but this page assumes you are installing Spark for an existing Hadoop cluster.
## Download Binary
- Go to the Apache Spark archive at https://archive.apache.org/dist/spark/ and open the directory for the Spark version you want.
- Right-click the file named like `spark-<spark version>-bin-hadoop<hadoop version>.tgz` and copy the link address.
- SSH into your Linux server and run the following in a directory of your choice:
```shell
# Download Spark 3.1.1 pre-built for Hadoop 3.2+
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# This downloads the .tgz file into the current directory.
```
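The downloaded archive still needs to be unpacked before Spark can be used. A minimal sketch of the extraction step, run here against a stand-in archive so the commands are self-contained (on your server, skip the setup lines and point `tar` at the real `spark-3.1.1-bin-hadoop3.2.tgz` you just downloaded):

```shell
# Setup only for this illustration: fabricate a stand-in archive with the
# same name. On the server, skip these three lines and use the real download.
mkdir -p spark-3.1.1-bin-hadoop3.2
tar -czf spark-3.1.1-bin-hadoop3.2.tgz spark-3.1.1-bin-hadoop3.2
rm -r spark-3.1.1-bin-hadoop3.2

# The actual extraction step: unpacks into ./spark-3.1.1-bin-hadoop3.2/
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
```

The extracted directory is the Spark home you will reference in later configuration.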
## Pre-requisites
- You will need Hadoop installed for Spark to work with your cluster (this is required only when installing Spark for a Hadoop cluster).
- Get the path where the `python` command is available by running the following (you will need this path later to set the environment variable for PySpark):

```shell
# Path to the python command
which python
```
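As a sketch of how that path is typically used later: PySpark reads the interpreter from the `PYSPARK_PYTHON` environment variable. The exact file you persist it in (e.g. `~/.bashrc` or Spark's `spark-env.sh`) depends on your setup, so treat the location as an assumption:

```shell
# Point PySpark at the interpreter found above. Many systems ship only
# `python3`, so fall back to it if a bare `python` is not on the PATH.
export PYSPARK_PYTHON=$(command -v python || command -v python3)
echo "$PYSPARK_PYTHON"
```

Adding the `export` line to your shell profile makes it survive new SSH sessions.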
- Optional: install Scala. From Spark 2.x onwards, Spark comes pre-built with Scala, so this isn't required for Spark to function, but it is recommended:

```shell
sudo apt install scala
```