The Java version used for the examples demonstrated on the Spark Shell is Java 8. Please download and install the Java 8 build most suitable for your machine's processor (Intel, Apple Silicon, etc.) from here. If you already have another Java version installed, install Java 8 from the link provided and manage the versions using jenv (you might have to install brew and jenv as mentioned in the link). Once done, make sure that your Java version shows as 1.8.x when you execute `java -version` from your terminal.
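If you end up managing multiple Java versions with jenv, a minimal sketch of the setup and verification is shown below. It assumes Homebrew is installed and uses zsh; the JDK 8 path is only an example, so point `jenv add` at wherever your JDK 8 actually lives (the linked instructions remain the authoritative reference).

```bash
# Install jenv via Homebrew (assumes Homebrew is already installed)
brew install jenv

# One-time shell setup so jenv's shims manage the `java` on your PATH (zsh shown here)
echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(jenv init -)"' >> ~/.zshrc
source ~/.zshrc

# Register your JDK 8 installation (example path; adjust to where your JDK 8 lives)
jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home

# Pin Java 8 globally and confirm the active version reports 1.8.x
jenv global 1.8
java -version
```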
We use Python version 3.9 for this setup. Please install it from this link. You might already have other versions of Python installed on your machine, which is fine; in the next section, we have provided an environment variable that points your setup to Python 3.9.
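As a quick sanity check that the interpreter is available (assuming the installer put a `python3.9` command on your PATH), you can run:

```bash
# Should print something like "Python 3.9.x"
python3.9 --version
```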
Note 1: Before you proceed with the Apache Spark local installation here, please note that the exercises in House 9 of data-derp don't need this setup; they are instead done with the PyCharm Spark installation described in the next section of this README. So if you are facing any challenges with this local Spark setup, we request you to proceed to the PyCharm Spark installation in the next section and complete the exercises of House 9. You can come back to this setup later on. If you have come here after going through the Vanilla Spark videos and would like to practice the examples on the spark-shell, but are facing challenges with this local Spark setup, we request you to get in touch with your tour guide for help.
The Apache Spark version used for the demonstrated examples is 3.0.2. You can set it up on your local machine using the following steps (a consolidated sketch of the shell configuration appears after this list):
- Please download the file named `spark-3.0.2-bin-hadoop2.7.tgz` (it should be ~200 MB) from this link to a preferred location on your machine.
- Extract / untar it to get a folder named "spark-3.0.2-bin-hadoop2.7":

```
tar -xzvf spark-3.0.2-bin-hadoop2.7.tgz
```
- Set up the location of the folder extracted in step 2 as your `SPARK_HOME` in your `.bash_profile` or `.zshrc` file:

```
export SPARK_HOME="<YOUR_PREFERRED_LOCATION>/spark-3.0.2-bin-hadoop2.7"
```
- Add the `bin` folder of SPARK_HOME to the PATH:

```
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
```
- You should be good to go now. Run `echo $SPARK_HOME` from your terminal; it should print the path to your Spark installation location.
- Open a new terminal and type `$SPARK_HOME/bin/spark-shell`. The Spark shell should start with Spark version 3.0.2.
- Before proceeding, set `export PYSPARK_PYTHON=python3.9` so that PySpark points to Python 3.9.
- The same can be done for PySpark with `$SPARK_HOME/bin/pyspark`. Please note that in order for PySpark to work, you need to have Python installed on your machine as mentioned in the "Python Set up instructions" above.
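Putting the environment-variable steps above together, here is a minimal sketch of the relevant lines in a `.zshrc` or `.bash_profile`. The Spark location shown is an assumed example; replace it with wherever you extracted the archive.

```bash
# Assumed example location of the extracted Spark folder; adjust to your machine
export SPARK_HOME="$HOME/spark/spark-3.0.2-bin-hadoop2.7"

# Put the Spark (and Java) binaries on the PATH
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"

# Make PySpark use Python 3.9
export PYSPARK_PYTHON=python3.9
```

After reloading your shell (for example with `source ~/.zshrc`), `echo $SPARK_HOME`, `spark-shell`, and `pyspark` should all work and report Spark version 3.0.2.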
- Please ensure that Python 3.9 is installed from the Python Set up Instructions above.
- Please install PyCharm community edition from this link.
- Clone this repo to a location of your choice on your machine.
- Open the repo in PyCharm.
- Go to your PyCharm Settings, select your project, and select the Python Interpreter option.
- Make sure that you select No Interpreter from the `Python Interpreter` dropdown, click Add Interpreter, and select Add Local Interpreter.
- A dialog box will pop up. Select the Virtualenv Environment option.
- In the options on the right-hand side of the same dialog box, select Environment as New, set Location to any suitable place to save the Python "venv" folder on your local machine, leave the Base Interpreter as the default, and select the Inherit global site-packages checkbox.
- This should set up your Python virtual environment for this repo. Double-check your Python Interpreter from Settings again and make sure that your newly created Python interpreter is selected for the project.
- Now, if your Python interpreter is created and selected as per the instructions above, you should get a message like "Package requirements 'pyspark...' etc. are not installed". Click on the Install requirements link to install the packages required for this repo. These packages are listed in the `requirements.txt` of this repo.
- You are all set now to run your first program of this repo. Open the source file `00 - File Reads.py` from the SRC folder of this repo and run it. It should give you the desired output of the dataframe as shown below.
PS: Please note that you don't need a separate Apache Spark installation on your local machine to run the examples of this repo from PyCharm. The PySpark package you have configured above should be sufficient.
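If you want to verify this from a terminal as well, the following sketch shows roughly what the PyCharm setup does under the hood; `<YOUR_VENV_LOCATION>` is a placeholder for the Location you chose when creating the virtual environment.

```bash
# Activate the virtualenv created by PyCharm (placeholder path; use your own Location)
source <YOUR_VENV_LOCATION>/bin/activate

# Install the repo's dependencies, the same thing the "Install requirements" link does
pip install -r requirements.txt

# Confirm that the pip-installed PySpark imports without a separate Apache Spark installation
python -c "import pyspark; print(pyspark.__version__)"
```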