PySpark and Jupyter Notebook setup on Mac


Step 1:

Install the latest Python 3 on macOS (if you already have Python 3, that should work perfectly fine too). I prefer the Anaconda distribution because it comes with many of the packages we will need later. You can install the Anaconda distribution from here.

jmac:~ jit$ python
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
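
The same check can be made from inside Python, which helps later in a notebook, where it is less obvious which interpreter is actually running. This is only a minimal sketch; the helper name describe_interpreter is made up for this illustration.

```python
import sys


def describe_interpreter(version_info=None, executable=None):
    """Return (is_python3, summary) for the given or the current interpreter."""
    version_info = sys.version_info if version_info is None else version_info
    executable = sys.executable if executable is None else executable
    major, minor, micro = version_info[0], version_info[1], version_info[2]
    summary = "Python {}.{}.{} at {}".format(major, minor, micro, executable)
    return major >= 3, summary


if __name__ == "__main__":
    ok, summary = describe_interpreter()
    print(summary)
    print("OK: Python 3" if ok else "This walkthrough assumes Python 3")
```

If the printed path does not point into your Anaconda folder, the shell is probably picking up a different Python first.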

Step 2:

Install the latest Java on your Mac. I downloaded the JDK from here, but the latest Java runtime should also do the job. You can download and install Java from here. The output below is from Java 8, which is what I used with Spark 2.1.1.

jmac:~ jit$ java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
jmac:~ jit$ 
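
If you want to run the same Java check from Python (for example from a notebook cell), something like the following should work. It is a sketch, not the only way to do it; the function name java_version_line is my own, and it only relies on the standard library.

```python
import shutil
import subprocess


def java_version_line():
    """Return the first line printed by `java -version`, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    lines = (result.stderr or result.stdout).strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(java_version_line() or "java not found on PATH")
```

A return value of None means no working java command was found, in which case Spark will not start either.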

Step 3:

Download Apache Spark from here to a folder of your choice. I downloaded it to a Dev folder in my home directory.

Now extract the tarball.

jmac:Dev jit$ tar -xzf spark-2.1.1-bin-hadoop2.7.tgz

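Step 4:

Now add the Anaconda and Spark binaries to your PATH in your shell profile (for example ~/.bash_profile), then open a new terminal so the changes take effect.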

# added by Anaconda3 4.4.0 installer
export PATH="/Users/jit/Dev/anaconda/bin:$PATH"
# added manually for Spark
export SPARK_HOME="/Users/jit/Dev/spark-2.1.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
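
A typo in either export is easy to miss, so here is a small sanity check for the variables above. It is a hedged sketch: spark_env_problems is a hypothetical helper, and it only assumes that a Spark download contains a bin/pyspark launcher, as the 2.1.1 tarball does.

```python
import os


def spark_env_problems(environ=None):
    """Return a list of human-readable problems with the Spark environment."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    problems = []
    spark_bin = os.path.normpath(os.path.join(spark_home, "bin"))

    # The pyspark launcher script lives in $SPARK_HOME/bin.
    launcher = os.path.join(spark_bin, "pyspark")
    if not os.path.isfile(launcher):
        problems.append("no pyspark launcher at " + launcher)

    # $SPARK_HOME/bin must be one of the PATH entries.
    entries = [
        os.path.normpath(p)
        for p in environ.get("PATH", "").split(os.pathsep)
        if p
    ]
    if spark_bin not in entries:
        problems.append(spark_bin + " is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in spark_env_problems() or ["Spark environment looks fine"]:
        print(problem)
```

An empty list means SPARK_HOME points at a folder with a pyspark launcher and that folder's bin directory is on PATH.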

Step 5:

Check the installation by launching the PySpark shell.

jmac:~ jit$ pyspark
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/06/11 14:58:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/11 14:58:24 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 3.6.1 (default, May 11 2017 13:04:09)
SparkSession available as 'spark'.
>>> 

Step 6:

Now let us configure Jupyter Notebook for developing PySpark applications. The easiest way to do this is to install the findspark package, which locates the Spark installation and adds PySpark to sys.path at runtime.

jmac:~ jit$ pip install findspark

Now open Jupyter Notebook and let us try a simple PySpark application.
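
As a starting point, a first notebook cell could look like the sketch below. It assumes findspark is installed and SPARK_HOME is set as in Step 4; the function name run_smoke_test and the application name are my own choices, not anything Spark requires.

```python
def run_smoke_test():
    """Start a local SparkSession, sum the integers 0..99, and return the total."""
    # Imported inside the function so that defining it never fails,
    # even on a machine where findspark or Spark is missing.
    import findspark

    # With no arguments, findspark locates Spark through SPARK_HOME
    # and adds PySpark to sys.path.
    findspark.init()

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")              # run locally on all cores
        .appName("notebook-smoke-test")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Distribute the numbers 0..99 as an RDD and add them up.
        total = spark.sparkContext.parallelize(range(100)).sum()
    finally:
        spark.stop()
    return total
```

Calling run_smoke_test() in the next cell should return 4950 if everything is wired up. An ImportError usually means the notebook kernel is not the Python that findspark was installed into.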

Programming is fun. Enjoy!