
Must Know PySpark Coding Before Your Next Databricks Interview

Document by – Siddhartha Subudhi


Visit my LinkedIn profile
1. Find the second highest salary in a DataFrame using PySpark.

Scenario: You have a DataFrame of employee salaries and want to find the second highest salary.

from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank

# Rank salaries in descending order; dense_rank assigns tied salaries the same rank
windowSpec = Window.orderBy(col("salary").desc())
df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))

# Rank 2 is the second highest distinct salary
second_highest_salary = df_with_rank.filter(col("rank") == 2).select("salary")
second_highest_salary.show()
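
To test this end to end, a minimal setup sketch (assuming an active SparkSession named spark; the names and salaries are hypothetical):

# Hypothetical sample data for illustration only
data = [("Alice", 5000), ("Bob", 7000), ("Carol", 7000), ("Dave", 6000)]
df = spark.createDataFrame(data, ["employee_name", "salary"])
# With dense_rank, both 7000 rows share rank 1, so 6000 is returned as the second highest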

2. Count the number of null values in each column of a PySpark DataFrame.

Scenario: Given a DataFrame, identify how many null values each column contains.

from pyspark.sql.functions import col, isnan, when, count

# For each column, count rows where the value is NULL (or NaN for numeric columns)
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
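
isnan is only defined for float/double columns and can error out on string or date columns in some Spark versions; a defensive variant (a sketch) applies the NaN check only where the type supports it:

from pyspark.sql.functions import col, isnan, when, count
from pyspark.sql.types import DoubleType, FloatType

exprs = []
for field in df.schema.fields:
    c = field.name
    # Only float/double columns can hold NaN; everything else gets a plain null check
    if isinstance(field.dataType, (DoubleType, FloatType)):
        exprs.append(count(when(col(c).isNull() | isnan(c), c)).alias(c))
    else:
        exprs.append(count(when(col(c).isNull(), c)).alias(c))
df.select(exprs).show()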

3. Calculate the moving average over a window of 3 rows.

Scenario: For a stock price dataset, calculate a moving average over the last 3 days.

from pyspark.sql import Window
from pyspark.sql.functions import avg

# Frame covering the current row and the two preceding rows, ordered by date
windowSpec = Window.orderBy("date").rowsBetween(-2, 0)
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))
df_with_moving_avg.show()
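
If the dataset contains more than one stock, the window would normally also be partitioned; a sketch assuming a hypothetical symbol column:

from pyspark.sql import Window
from pyspark.sql.functions import avg

# Compute the 3-row moving average independently for each symbol
windowSpec = Window.partitionBy("symbol").orderBy("date").rowsBetween(-2, 0)
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))
df_with_moving_avg.show()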

4. Remove duplicate rows based on a subset of columns in a PySpark DataFrame.

Scenario: You need to remove duplicates from a DataFrame based on certain columns.

df = df.dropDuplicates(["column1", "column2"])

df.show()
5. Split a single column with comma-separated values into multiple columns.

Scenario: Your DataFrame contains a column with comma-separated values. You want to split this into multiple
columns.

from pyspark.sql.functions import split

# Split on the comma and pull out individual elements by position
df_split = df.withColumn("new_column1", split(df["column"], ",").getItem(0)) \
             .withColumn("new_column2", split(df["column"], ",").getItem(1))
df_split.show()

6. Group data by a specific column and calculate the sum of another column.

Scenario: Group sales data by "product" and calculate the total sales.

df.groupBy("product").sum("sales").show()

7. Join two DataFrames on a specific condition.

Scenario: You have two DataFrames: one for customer data and one for orders. Join these DataFrames on the
customer ID.

df_joined = df_customers.join(df_orders, df_customers.customer_id == df_orders.customer_id, "inner")

df_joined.show()
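
Because both DataFrames carry a customer_id column, the join above keeps two identically named columns. Joining on a column-name list keeps a single copy (a sketch, assuming the key is spelled the same on both sides):

df_joined = df_customers.join(df_orders, on="customer_id", how="inner")
df_joined.show()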

8. Create a new column based on conditions from existing columns.

Scenario: Add a new column "category" that assigns "high", "medium", or "low" based on the value of the "sales"
column.

from pyspark.sql.functions import when

# Bucket sales into high (> 500), medium (201-500), and low (<= 200)
df = df.withColumn("category", when(df.sales > 500, "high")
                   .when((df.sales <= 500) & (df.sales > 200), "medium")
                   .otherwise("low"))
df.show()

9. Calculate the percentage contribution of each value in a column to the total.

Scenario: For a sales dataset, calculate the percentage contribution of each product's sales to the total sales.

from pyspark.sql.functions import sum, col

# Bring the grand total back to the driver, then use it to derive each row's share
total_sales = df.agg(sum("sales").alias("total_sales")).collect()[0]["total_sales"]
df = df.withColumn("percentage", (col("sales") / total_sales) * 100)
df.show()
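
The same percentages can be computed without collecting the total to the driver by summing over a global window; a sketch (note that an unpartitioned window funnels all rows through one partition, so this suits small to medium data):

from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_, col

w = Window.partitionBy()  # a global window covering every row
df = df.withColumn("percentage", col("sales") / sum_("sales").over(w) * 100)
df.show()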

10. Find the top N records from a DataFrame based on a column.

Scenario: You need to find the top 5 highest-selling products.

from pyspark.sql.functions import col

df.orderBy(col("sales").desc()).limit(5).show()

11. Write PySpark code to pivot a DataFrame.

Scenario: You have sales data by "year" and "product", and you want to pivot the table to show "product" sales by
year.

df_pivot = df.groupBy("product").pivot("year").sum("sales")

df_pivot.show()
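
If the set of years is known in advance, passing it to pivot avoids an extra pass over the data to discover the distinct values; a sketch with hypothetical years:

df_pivot = df.groupBy("product").pivot("year", [2021, 2022]).sum("sales")
df_pivot.show()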

12. Add row numbers to a PySpark DataFrame based on a specific ordering.

Scenario: Add row numbers to a DataFrame ordered by "sales" in descending order.

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# row_number assigns a unique, consecutive number even when sales values tie
windowSpec = Window.orderBy(col("sales").desc())
df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))
df_with_row_number.show()

13. Filter rows based on a condition.

Scenario: You want to filter only those customers who made purchases over ₹1000.

df_filtered = df.filter(df.purchase_amount > 1000)

df_filtered.show()

14. Flatten a JSON column in PySpark.

Scenario: Your DataFrame contains a JSON column, and you want to extract specific fields from it.

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema describing the fields to extract from the JSON string
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", StringType(), True)
])

# Parse the JSON column into a struct, then reference its fields with dot notation
df = df.withColumn("json_data", from_json(col("json_column"), schema))
df.select("json_data.name", "json_data.age").show()

15. Convert a PySpark DataFrame column to a list.

Scenario: Convert a column from your DataFrame into a list for further processing.

column_list = df.select("column_name").rdd.flatMap(lambda x: x).collect()
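
An equivalent that stays in the DataFrame API instead of dropping to the RDD (a sketch):

column_list = [row["column_name"] for row in df.select("column_name").collect()]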

16. Handle NULL values by replacing them with a default value.

Scenario: Replace all NULL values in the "sales" column with 0.

df = df.na.fill({"sales": 0})

df.show()

17. Perform a self-join on a PySpark DataFrame.

Scenario: You have a hierarchy of employees and want to find each employee's manager.

from pyspark.sql.functions import col

# Join the employee table to itself: e1 is the employee, e2 the matching manager row
df_self_join = (df.alias("e1")
                .join(df.alias("e2"), col("e1.manager_id") == col("e2.employee_id"), "inner")
                .select(col("e1.employee_name"), col("e2.employee_name").alias("manager_name")))
df_self_join.show()

18. Write PySpark code to unpivot a DataFrame.

Scenario: You have a DataFrame with "year" columns and want to convert them to rows.

# stack(2, ...) turns the two year columns into (year, sales) rows
df_unpivot = df.selectExpr("id", "stack(2, '2021', sales_2021, '2022', sales_2022) as (year, sales)")
df_unpivot.show()
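
On Spark 3.4 or later, DataFrame.unpivot expresses the same reshape without hand-writing stack; a sketch with the same hypothetical column names (the year column will then hold the original column names unless renamed afterwards):

# ids, value columns, name of the variable column, name of the value column
df_unpivot = df.unpivot("id", ["sales_2021", "sales_2022"], "year", "sales")
df_unpivot.show()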

19. Write PySpark code to group data by multiple columns and apply aggregate functions.

Scenario: Group data by "product" and "region" and calculate the average sales for each group.

df.groupBy("product", "region").agg({"sales": "avg"}).show()

20. Write PySpark code to remove duplicate rows.

Scenario: You want to remove rows that are exact duplicates across all columns.

df_cleaned = df.dropDuplicates()

df_cleaned.show()

21. Write PySpark code to read a CSV file and infer its schema.

Scenario: You need to load a CSV file into a DataFrame, ensuring the schema is inferred.

df = spark.read.option("header", "true").option("inferSchema", "true").csv("path_to_csv")

df.show()

22. Write PySpark code to merge multiple small files into a single file.

Scenario: You have multiple small files in HDFS, and you want to consolidate them into one large file.

df.coalesce(1).write.mode("overwrite").csv("output_path")
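
coalesce(1) funnels the final write through a single task, which is fine for small outputs but can bottleneck large ones. A sketch that also writes a header row:

(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("output_path"))
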
23. Write PySpark code to calculate the cumulative sum of a column.

Scenario: You want to calculate a cumulative sum of sales in your DataFrame.

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# Running frame from the first row up to the current row, ordered by date
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
df_with_cumsum = df.withColumn("cumulative_sum", sum("sales").over(windowSpec))
df_with_cumsum.show()
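
For a running total per group rather than over the whole table, add partitionBy to the window; a sketch assuming a hypothetical product column:

from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

# Cumulative sales per product, ordered by date within each product
windowSpec = (Window.partitionBy("product").orderBy("date")
              .rowsBetween(Window.unboundedPreceding, 0))
df_with_cumsum = df.withColumn("cumulative_sum", sum_("sales").over(windowSpec))
df_with_cumsum.show()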

24. Write PySpark code to find outliers in a dataset.

Scenario: Detect outliers in the "sales" column based on the 1.5 * IQR rule.

from pyspark.sql.functions import col

# Approximate quartiles (relative error 0.01) and the interquartile range
q1 = df.approxQuantile("sales", [0.25], 0.01)[0]
q3 = df.approxQuantile("sales", [0.75], 0.01)[0]
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
df_outliers = df.filter((col("sales") < lower_bound) | (col("sales") > upper_bound))
df_outliers.show()

25. Write PySpark code to convert a DataFrame to a Pandas DataFrame.

Scenario: Convert your PySpark DataFrame into a Pandas DataFrame for local processing.

pandas_df = df.toPandas()
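
toPandas() pulls the entire dataset into driver memory, so it only suits data that fits on a single machine. Enabling Arrow (Spark 3.x, with the pyarrow package installed) typically speeds up the conversion; a sketch:

# Arrow-based conversion; Spark falls back to the slower path if Arrow is unavailable
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandas_df = df.toPandas()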
