Software

2 minute read

🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide

December 8, 2025

Welcome to Day 7 of your Spark Mastery journey!

Today is one of the most practical days because joins, unions, and aggregations are used in almost every pipeline you will ever build — be it feature engineering, building fact tables, or aggregating transactional data.

Let’s master the fundamentals with clarity and real-world examples.

🌟 1. Joins in PySpark — The Heart of ETL Pipelines

A join merges two DataFrames based on keys, similar to SQL.

Basic join:

df.join(df2, df.id == df2.id, "inner")

Join on same column name:

df.join(df2, ["id"], "left")

🔹 Types of Joins in Spark
Join Type – Meaning
inner – Matching rows
left – All rows from left, match from right
right – All rows from right
full – All rows from both
left_anti – Rows in left NOT in right
left_semi – Rows in left WHERE match exists in right
cross Cartesian product

Important
left_semi = existence check left_anti = anti-join / unmatched rows

🌟 2. Union — Stack DataFrames Vertically

Union (same schema, same order)

df.union(df2)

Union by column names:

df.unionByName(df2)

Why important?
Because in real projects you combine:

monthly files
daily ingestion datasets
partitions

🌟 3. GroupBy + Aggregation — Business Logic Layer

This is how reports, fact tables, metrics are built.

Example:

df.groupBy("dept").agg(
sum("salary").alias("total_salary"),
avg("age").alias("avg_age")
)

🔹 count vs countDistinct

df.select(count("id"))
df.select(countDistinct("id"))

🔹 approx_count_distinct (faster!)

df.select(approx_count_distinct("id"))

🌟 4. Real ETL Example — Sales Aggregation

Suppose you have:

sales table
product table

Join them:

df_joined = sales.join(products, "product_id", "left")

Aggregate revenue:

df_agg = df_joined.groupBy("category").agg(
 sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)

This is EXACTLY how business dashboards are built.

🌟 5. Join Performance Optimization

Use Broadcast Join for small lookup tables:

df.join(broadcast(df_small), "id")

Why?
Avoids shuffle → runs much faster.

🚀 Summary of Day 7

Today we learnt:

Joins
Union / UnionByName
GroupBy
Aggregations
broadcast join optimization

Follow for more such content. Let me know if I missed anything in comments. Thank you!!

Want a Clean SAP Go-Live? Fix Data Integrity Before It Breaks You

December 8, 2025

Software

🚩 Red flags series #3: If-else endless tower

December 8, 2025

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Hand-Picked Top-Read Stories

Machine Vision Lighting Solutions for Unwanted Glare

I Fine Tuned an Open Source Model and the Bhagavad Gita Explained It Better Than Any Paper

What STEM Professionals Should Know About EB1A Self-Petition in 2026

Trending Tags

🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide

Leave a Reply Cancel reply

Previous Post

Want a Clean SAP Go-Live? Fix Data Integrity Before It Breaks You

Next Post

🚩 Red flags series #3: If-else endless tower

🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide

Leave a Reply Cancel reply

Previous Post

Next Post

Related Posts