Step 1: Install the latest Python 3 on macOS (if you already have Python 3, that should work fine too). I prefer the Anaconda distribution since it comes with many of the packages we will need later. You can install the Anaconda distribution from here.
```
jmac:~ jit$ python
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
Step 2: Install the latest Java on your Mac. I am downloading the JDK, but the latest Java runtime should do the job. You can download and install Java from here.
```
jmac:~ jit$ java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
jmac:~ jit$
```
Step 3: Download Apache Spark from here to a preferred folder. I downloaded it to a Dev folder in my home directory.

Now extract the tarball.
```
jmac:Dev jit$ tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
```
Step 4: Now add the Anaconda and Spark binaries to your PATH (for example, by appending the following lines to ~/.bash_profile).
```
# added by Anaconda3 4.4.0 installer
export PATH="/Users/jit/Dev/anaconda/bin:$PATH"
export SPARK_HOME="/Users/jit/Dev/spark-2.1.1-bin-hadoop2.7"
export PATH=$SPARK_HOME/bin:$PATH
```
Step 5: Check the installation.
```
jmac:~ jit$ pyspark
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/06/11 14:58:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/11 14:58:24 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 3.6.1 (default, May 11 2017 13:04:09)
SparkSession available as 'spark'.
>>>
```
Step 6: Now let us configure Jupyter Notebook for developing PySpark applications. The easiest way to do this is to install the findspark package.
```
jmac:~ jit$ pip install findspark
```
Now open a Jupyter notebook and let us try a simple PySpark application.
Programming is fun. Enjoy!