How to install and work with Spark in notebooks?

🌟📝 Basic commands to work with Spark in notebooks as a standalone cluster

🔗Related content

You can find related posts at:

📀Google Colab

📺YouTube

🐱‍💻GitHub

You can connect with me on:

🧬LinkedIn

Resume 🧾

I will install Spark and use a Python library to write a job that answers the question: how many rows exist for each rating?

Before starting, we set up the environment to run a Spark standalone cluster.

1st – Mount Google Drive 🚠

We will mount Google Drive so we can use its files.

I use the following script:

from google.colab import drive
drive.mount('/content/gdrive')
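
To verify the mount worked, here is a quick check of mine (not in the original post); MyDrive is the usual mount point for your own files:

!ls /content/gdrive/MyDrive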

2nd – Install Spark 🎇

Once you have a Colab notebook up, run the following script to get Spark installed (I apologize for how ugly it is):

%%bash
apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Test for the extracted directory, not the tarball, so the cell is idempotent
if [[ ! -d spark-3.3.1-bin-hadoop3 ]]; then
  echo "Spark hasn't been installed, downloading and installing!"
  wget -q https://downloads.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
  tar xf spark-3.3.1-bin-hadoop3.tgz
  rm -f spark-3.3.1-bin-hadoop3.tgz
fi
pip install -q findspark

You can get a different version if you need one from https://downloads.apache.org/spark/ and then substitute it into the command above.
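
As a quick sanity check of mine (not in the original post), you can confirm the extracted directory exists before moving on:

import os
# My own check: the install cell above should have left this directory behind
assert os.path.isdir("/content/spark-3.3.1-bin-hadoop3"), "Spark missing; re-run the install cell"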

3rd – Setting environment variables 🌐

I use the following command:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
import findspark
findspark.init()
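
To confirm findspark picked up the right installation, another small check of mine:

import findspark
print(findspark.find())  # expected: /content/spark-3.3.1-bin-hadoop3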

4th – Configuring a SparkSession 🚪

I use the following command:

from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .master("local[*]")  # set up as master using all (*) threads
    .appName("BLOG_XLMRIOSX")  # a generic name
    .getOrCreate())
sc = spark.sparkContext
sc

After executing this, we can use several data structures to manage data, such as RDDs and DataFrames (Spark and Pandas).
One more exists, the Dataset, but it is only available in Scala (and Java).
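
As a minimal sketch of how these structures relate (the inline sample values and variable names are my own, assuming the SparkSession created above):

# Tiny illustrative DataFrame (made-up values, not the MovieLens data)
df_tiny = spark.createDataFrame([(1, 10, 4), (2, 20, 5)], ["userID", "movieID", "rating"])
rdd_tiny = df_tiny.rdd                     # Spark DataFrame -> RDD of Row objects
pdf_tiny = df_tiny.toPandas()              # Spark DataFrame -> Pandas DataFrame
df_back = spark.createDataFrame(pdf_tiny)  # Pandas DataFrame -> Spark DataFrame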

5th – Getting a dataset to analyze with Spark 💾

I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them to the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/ml-100k_folder"
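
To check the extraction worked, here is my own quick look at the folder:

!ls /content/ml-100k_folder/ml-100k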

6th – Configuring data to analyze 💿

We create a variable called data that holds the path where the data lives:

data = '/content/ml-100k_folder/ml-100k/u.data'

Then we define a variable called df_spark that holds the contents of data:

df_spark = spark.read.csv(data, inferSchema=True, header=True)

We can inspect the type of the variable df_spark like this:

print(type(df_spark))

We can inspect the data frame held by df_spark like this:

df_spark.show()

We can see the format is incorrect, so we fix it by configuring how the data is read; there is no header row and the separator is a tab:

df_spark = spark.read.csv(data, inferSchema=True, header=False, sep="\t")

7th – Making a query 🙈

To do this we need to know the format of the data, so I infer the following structure:

  • The first column refers to userID.
  • The second column refers to movieID.
  • The third column refers to rating.
  • The fourth column refers to timestamp.

I will answer the question: how many rows exist for each rating…
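
Knowing that, you could give the columns readable names with toDF; this is a sketch of mine (the queries below keep Spark's default _c0…_c3 names):

df_named = df_spark.toDF("userID", "movieID", "rating", "timestamp")
df_named.show(5)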

8th – Making a query with SQL syntax ❔🛢

  • First, create a temporary table from the DataFrame.

df_spark.createOrReplaceTempView("table")

  • Second, we can make the query that answers the question.

sql = spark.sql("SELECT _c2, COUNT(*) FROM table GROUP BY _c2")

To see results:

sql.show()
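
If you want the counts listed in rating order, here is a variant of mine (the aliases rating and n are my own):

spark.sql("SELECT _c2 AS rating, COUNT(*) AS n FROM table GROUP BY _c2 ORDER BY _c2").show()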

9th – Making a query with DataFrame ❔📄

  • It's very easy to make this query with the DataFrame API.

df_spark.groupBy("_c2").count().show()
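
A sorted variant of mine, so the ratings come out in order:

df_spark.groupBy("_c2").count().orderBy("_c2").show()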

10th – Making a query with RDD ❔🧊

  • First, we transform the DataFrame to an RDD.

rdd = df_spark.rdd

  • Second, we make the query with RDD functions.

(rdd
 .groupBy(lambda x: x[2])                     # group rows by the rating column
 .mapValues(lambda values: len(set(values)))  # count the distinct rows per rating
 .collect())
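
A more idiomatic sketch of mine counts with map plus reduceByKey instead of materializing each group:

(rdd
 .map(lambda x: (x[2], 1))         # emit a (rating, 1) pair per row
 .reduceByKey(lambda a, b: a + b)  # sum the ones per rating
 .collect())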

11th – Say thanks, give a like, and share if this has been of help/interest 😁🖖
