Databricks Spark DataFrame API Question Analysis

  • Spark
  • Databricks
  • Certification

posted on 02 Sep 2020

Spark DataFrame API Questions Analysis

Content Outline

  • Subsetting DataFrames (select, filter, etc.)
  • Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
  • String manipulation (Splitting strings, regular expressions)
  • Performance-based operations (repartitioning, shuffle partitions, caching)
  • Combining DataFrames (joins, broadcasting, unions, etc.)
  • Reading/writing DataFrames (schemas, overwriting)
  • Working with dates (extraction, formatting, etc.)
  • Aggregations
  • Miscellaneous (sorting, missing values, typed UDFs, value extraction, sampling)

Candidates should be intimately familiar with applying all of these topics.

Note: The use of “etc.” here indicates that there might be a few more things in those sub-topics, but the items listed should give appropriate context for each sub-topic.

Format: Operation identification

Example

Which of the following operations can be used to create a new DataFrame with a new column and all previously existing columns from an existing DataFrame?

A. DataFrame.withColumn()

B. DataFrame.drop()

C. DataFrame.withColumnRenamed()

D. DataFrame.head()

E. DataFrame.filter()

Analysis

  1. Which parts of the question are interchangeable with other topics and ideas?
  2. How do we use Spark: The Definitive Guide to discover questions that would use the same format?
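
One way to explore the first point is to treat the example question as a template and swap the interchangeable part. The sketch below is plain Python and only an illustration: the template wording is taken from the example above, while the list of alternative effects, the expected answers, and the names TEMPLATE, EFFECTS, and generate_variants are assumptions, not material from the exam.

```python
# Question stem from the example, with the interchangeable part as a placeholder.
TEMPLATE = (
    "Which of the following operations can be used to create a new DataFrame "
    "{effect} from an existing DataFrame?"
)

# Hypothetical mapping: described effect -> operation expected as the answer.
EFFECTS = {
    "with a new column and all previously existing columns": "DataFrame.withColumn()",
    "without one of its columns": "DataFrame.drop()",
    "with one column renamed": "DataFrame.withColumnRenamed()",
    "containing only the rows that satisfy a condition": "DataFrame.filter()",
}


def generate_variants(template, effects):
    """Return (question, expected answer) pairs, one per effect."""
    return [
        (template.format(effect=effect), answer)
        for effect, answer in effects.items()
    ]


variants = generate_variants(TEMPLATE, EFFECTS)
for question, answer in variants:
    print(question)
    print("  expected:", answer)
```

The same approach could be applied to other chapters of a reference book: pick an operation, describe its effect in one phrase, and let the remaining operations serve as distractors.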

Format: Code Block Comparison

Example

Which of the following code blocks returns a DataFrame with a new column aSquared and all previously existing columns from DataFrame df?

A. df.withColumn("aSquared", col("a") * col("a"))

B. df.withColumnRenamed("aSquared", col("a") * col("a"))

C. df.select("aSquared")

D. df.withColumn(col("a") * col("a"), "aSquared")

E. df.withColumnRenamed("aSquared", col("a") * col("a"))

Format: Error Identification

Example

The code block shown below contains an error. The code block is intended to return a DataFrame with a new column aSquared and all previously existing columns from DataFrame df. Identify the error. Code block:

df.withColumn(col("a") * col("a"), "aSquared")

A. The arguments to df.withColumn are provided in reverse order. "aSquared" should be first, and col("a") * col("a") should be second.

B. The df.withColumn() operation does not create new columns. The df.newColumn() operation should be used instead.

C. The argument "aSquared" must be wrapped in the col() function because it is a column name.

D. The withColumn() operation is not a DataFrame method. It should be called on its own with the first argument being df.

E. The df.withColumn() operation does not create new columns. The df.withColumnRenamed() operation should be used instead.

Format: Fill-in-the-blank

Example

The code block shown below should return a DataFrame with a new column aSquared and all previously existing columns from DataFrame df. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.

Code block:

df._1_(_2_, _3_)

A.

1. withColumn
2. "aSquared"
3. col("a") * col("a")

B.

1. withColumnRenamed
2. "aSquared"
3. col("a") * col("a")

C.

1. withColumn
2. col("aSquared")
3. col("a") * col("a")

D.

1. withColumn
2. "aSquared"
3. "a" * "a"

E.

1. withColumnRenamed
2. "aSquared"
3. "a" * "a"
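
Some fill-in options can be ruled out without Spark at all. Assuming the code blocks are PySpark, the third blank in options D and E is an ordinary Python expression that multiplies two strings, and Python rejects it before any DataFrame method is called. The sketch below is plain Python; the variable name error_message is only illustrative.

```python
# Third blank from options D and E, evaluated as plain Python.
try:
    product = "a" * "a"
    error_message = None
except TypeError as exc:
    # Python cannot multiply one string by another string.
    error_message = str(exc)

print(error_message)
```

Wrapping each name in col() turns the operands into Column objects, for which the * operator is defined.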

Format: Ordering lines of code

Example

In what order should the lines of code below be run to return a DataFrame with a new column aSquared and all previously existing columns from DataFrame df?

1. df
2. .withColumn("aSquared", "a" * "a")
3. .withColumn("aSquared", col("a") * col("a"))
4. DataFrame
5. .withColumn(col("aSquared"), col("a") * col("a"))

A. 1,3

B. 1,2

C. 1,5

D. 4,2

E. 4,3