Pyspark and Jupyter notebook setup in Ubuntu


As a first step ensure that Python3 is installed in Ubuntu. By default latest version of Ubuntu comes with Python2 and Python3 versions.

Step 1 : Install PIP3

jit@ubuntu:~$ sudo apt install python3-pip

Step 2 : Install Jupyter Notebook

jit@ubuntu:~$ pip3 install jupyter

Step 3 : Install Java / update java runtime.

jit@ubuntu:~$ sudo apt-get update
jit@ubuntu:~$ sudo apt-get install default-jre
jit@ubuntu:~$ java -version # check java version

Step 4 : Install Scala

jit@ubuntu:~$ sudo apt-get install scala  
jit@ubuntu:~$ scala -version # check scala version

Step 5 : Install Py4J(It enables python programs running in a python interpreter to dynamically access Java objects in JVM)

jit@ubuntu:~$ pip3 install py4j

Step 6: Download Spark from http://spark.apache.org/downloads.html

Step 7 : unzip the tar and move to home folder

jit@ubuntu:~/Downloads$ sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz 

Step 8 : Configure Spark and Jupyter environment variables in .profile and source it

PATH="$HOME/bin:$HOME/.local/bin:$PATH"
export SPARK_HOME='~/spark-2.1.1-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3

Step 9 : Set permissions on spark folders

jit@ubuntu:~$ sudo chmod 777 spark-2.1.1-bin-hadoop2.7
              sudo chmod 777 python
              sudo chmod 777 python/pyspark

Instead of adding pyspark folders to path, let us use another module called findspark.

Step 10 : Install findspark

jit@ubuntu:~$ pip3 install findspark

Now our installation is complete and try following steps in a Jupyter notebook.

import findspark
findspark.init('/home/jit/spark-2.1.1-bin-hadoop2.7')
import pyspark

If no errors our Pyspark and Jupyter notebook set up is successful.