The Java version used for the examples demonstrated on the Spark Shell is Java 8. Please download and install the Java 8 build most suitable for your machine's processor (Intel, Apple Silicon, etc.) from here. If you already have another Java version installed, install Java 8 from the link provided and manage the versions using jenv (you might have to install brew and jenv as mentioned in the link). Once done, make sure that your Java version shows as 1.8.x when you execute `java -version` from your terminal.
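If you end up managing multiple Java versions with jenv, a minimal sketch of the setup and verification is shown below. It assumes Homebrew is installed and uses zsh; the JDK 8 path is only an example, so point `jenv add` at wherever your JDK 8 actually lives (the linked instructions remain the authoritative reference).

```bash
# Install jenv via Homebrew (assumes Homebrew is already installed)
brew install jenv

# One-time shell setup so jenv's shims manage the `java` on your PATH (zsh shown here)
echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(jenv init -)"' >> ~/.zshrc
source ~/.zshrc

# Register your JDK 8 installation (example path; adjust to where your JDK 8 lives)
jenv add /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home

# Pin Java 8 globally and confirm the active version reports 1.8.x
jenv global 1.8
java -version
```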
We use Python version 3.9 for this setup. Please install it from this link. You might already have other versions of Python installed on your machine, which is fine; in the next section, we have provided an environment variable that points your setup to Python 3.9.
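As a quick sanity check that the interpreter is available (assuming the installer put a `python3.9` command on your PATH), you can run:

```bash
# Should print something like "Python 3.9.x"
python3.9 --version
```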
Note 1: Before you proceed with the Apache Spark local installation here, please note that the exercises in House 9 of data-derp don't need this setup; they are instead done with the PyCharm Spark installation described in the next section of this README. So if you are facing any challenges with this local Spark setup, we request you to proceed to the PyCharm Spark installation in the next section and complete the exercises of House 9. You can come back to this setup later on. If you have come here after going through the Vanilla Spark videos and would like to practice the examples on the spark-shell, but are facing challenges with this local Spark setup, we request you to get in touch with your tour guide for help.
The Apache Spark version used for the demonstrated examples is 3.0.2. You can set it up on your local machine using the following steps (a consolidated sketch of the shell configuration appears after this list):
- Please download the file named `spark-3.0.2-bin-hadoop2.7.tgz` (it should be ~200 MB) from this link to a preferred location on your machine.
- Extract / untar it to get a folder named "spark-3.0.2-bin-hadoop2.7":

```
tar -xzvf spark-3.0.2-bin-hadoop2.7.tgz
```
- Set up the location of the folder extracted in step 2 as your `SPARK_HOME` in your `.bash_profile` or `.zshrc` file:

```
export SPARK_HOME="<YOUR_PREFERRED_LOCATION>/spark-3.0.2-bin-hadoop2.7"
```
- Add the `bin` folder of SPARK_HOME to the PATH:

```
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"
```
- You should be good to go now. Run `echo $SPARK_HOME` from your terminal; it should print the path to your Spark installation location.
- Open a new terminal and type `$SPARK_HOME/bin/spark-shell`. The Spark shell should start with Spark version 3.0.2.
- Before proceeding, set `export PYSPARK_PYTHON=python3.9` so that PySpark points to Python 3.9.
- The same can be done for PySpark with `$SPARK_HOME/bin/pyspark`. Please note that in order for PySpark to work, you need to have Python installed on your machine as mentioned in the "Python Set up instructions" above.
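Putting the environment-variable steps above together, here is a minimal sketch of the relevant lines in a `.zshrc` or `.bash_profile`. The Spark location shown is an assumed example; replace it with wherever you extracted the archive.

```bash
# Assumed example location of the extracted Spark folder; adjust to your machine
export SPARK_HOME="$HOME/spark/spark-3.0.2-bin-hadoop2.7"

# Put the Spark (and Java) binaries on the PATH
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"

# Make PySpark use Python 3.9
export PYSPARK_PYTHON=python3.9
```

After reloading your shell (for example with `source ~/.zshrc`), `echo $SPARK_HOME`, `spark-shell`, and `pyspark` should all work and report Spark version 3.0.2.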
- Please ensure that Python 3.9 is installed from the Python Set up Instructions above.
- Please install PyCharm community edition from this link.
- Clone this repo to a location of your choice on your machine.
- Open the repo in PyCharm.
- Go to your PyCharm Settings, select your project, and select the Python Interpreter option.
- Make sure that you select No Interpreter from the `Python Interpreter` dropdown, click Add Interpreter, and select Add Local Interpreter.
- A dialog box will pop up. Select the Virtualenv Environment option.
- In the options on the right-hand side of the same dialog box, select Environment as New, set Location to any suitable place to save the Python "venv" folder on your local machine, leave the Base Interpreter as the default, and select the Inherit global site-packages checkbox.
- This should set up your Python virtual environment for this repo. Double-check your Python Interpreter from Settings again and make sure that your newly created Python interpreter is selected for the project.
- Now, if your Python interpreter is created and selected as per the instructions above, you should get a message like "Package requirements 'pyspark...' etc. are not installed". Click on the Install requirements link to install the packages required for this repo. These packages are listed in the `requirements.txt` of this repo.
- You are all set now to run your first program of this repo. Open the source file `00 - File Reads.py` from the SRC folder of this repo and run it. It should give you the desired output of the dataframe as shown below.
PS: Please note that you don't need a separate Apache Spark installation on your local machine to run the examples of this repo from PyCharm. The PySpark package you have configured above should be sufficient.
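If you want to verify this from a terminal as well, the following sketch shows roughly what the PyCharm setup does under the hood; `<YOUR_VENV_LOCATION>` is a placeholder for the Location you chose when creating the virtual environment.

```bash
# Activate the virtualenv created by PyCharm (placeholder path; use your own Location)
source <YOUR_VENV_LOCATION>/bin/activate

# Install the repo's dependencies, the same thing the "Install requirements" link does
pip install -r requirements.txt

# Confirm that the pip-installed PySpark imports without a separate Apache Spark installation
python -c "import pyspark; print(pyspark.__version__)"
```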