Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS

SELECT customer_id

FROM STREAM(LIVE.customers)

WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

Options:

The STREAM function is not needed and will cause an error.

The table being created is a live table.

The customers table is a streaming live table.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

The data in the customers table has been updated since its last run.

Question 2

Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?

Options:

Cloud-specific integrations

Simplified governance

Ability to scale storage

Ability to scale workloads

Avoiding vendor lock-in

Question 3

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

[Image: command for Question 3, not reproduced in this text]

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

There is no way to indicate whether a table contains PII.

"COMMENT PII"

TBLPROPERTIES PII

COMMENT "Contains PII"

PII

Question 4

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company's travel reimbursement details. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

Options:

CONSTRAINT valid_location EXPECT (location = NULL)

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE

CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL
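
The difference between the two violation policies named in this question can be sketched in plain Python. This is only a simulation under stated assumptions, not the Delta Live Tables API: the function apply_expectation, the UpdateFailed exception, the policy strings, and the sample rows are all hypothetical. The check uses `is not None`, which corresponds to a SQL `IS NOT NULL` test.

```python
# Plain-Python simulation of the two violation policies; this is NOT the DLT API.
# The function name, policy strings, and sample rows are hypothetical.

class UpdateFailed(Exception):
    """Raised to mimic a pipeline update that stops on an invalid record."""


def apply_expectation(rows, predicate, on_violation):
    """Return the rows that survive the expectation under the given policy."""
    kept = []
    for row in rows:
        if predicate(row):
            kept.append(row)
        elif on_violation == "DROP ROW":
            continue  # the invalid record is discarded and processing goes on
        elif on_violation == "FAIL UPDATE":
            raise UpdateFailed(f"expectation violated by {row!r}")
        else:
            raise ValueError(f"unknown policy: {on_violation}")
    return kept


def has_location(row):
    return row["location"] is not None


claims = [
    {"employee": "e1", "location": "Paris"},
    {"employee": "e2", "location": None},  # missing location
]

# DROP ROW: the update completes, minus the invalid record.
dropped = apply_expectation(claims, has_location, "DROP ROW")

# FAIL UPDATE: the first invalid record stops the whole update.
try:
    apply_expectation(claims, has_location, "FAIL UPDATE")
    failed = False
except UpdateFailed:
    failed = True
```

In this sketch, the drop policy silently filters the bad claim, while the fail policy aborts the run as soon as one is seen.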

Question 5

A data engineer wants to reduce costs and optimize cloud spending. The data engineer has decided to use Databricks Serverless for lowering cloud costs while maintaining existing SLAs.

What is the first step in migrating to Databricks Serverless?

Options:

Legacy ingestion pipelines that include ingestion from sources such as APIs, files, and JDBC/ODBC connections

Low-frequency BI dashboarding and ad hoc SQL analytics

A frequently running and efficient Python-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

A frequently running and efficient Scala-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

Question 6

A data engineer is designing an ETL pipeline to process both streaming and batch data from multiple sources. The pipeline must ensure data quality, handle schema evolution, and provide easy maintenance. The team is considering using Delta Live Tables (DLT) in Databricks to achieve these goals. They want to understand the key features and benefits of DLT that make it suitable for this use case.

Why is Delta Live Tables (DLT) an appropriate choice?

Options:

Automatic data quality checks, built-in support for schema evolution, and declarative pipeline development

Manual schema enforcement, high operational overhead, and limited scalability

Requires custom code for data quality checks, no support for streaming data, and complex pipeline maintenance

Supports only batch processing, no data versioning, and high infrastructure costs

Question 7

A data engineer is working in a Python notebook on Databricks to process data, but notices that the output is not as expected. The data engineer wants to investigate the issue by stepping through the code and checking the values of certain variables during execution.

Which tool should the data engineer use to inspect the code execution and variables in real-time?

Options:

Python Notebook Interactive Debugger

Cluster Logs

SQL Analytics

Job Execution Dashboard

Question 8

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Options:

An automated report needs to be refreshed as quickly as possible.

An automated report needs to be made reproducible.

An automated report needs to be tested to identify errors.

An automated report needs to be version-controlled across multiple collaborators.

An automated report needs to be runnable by all stakeholders.

Question 9

A global retail company sells products across multiple categories (e.g., Electronics, Clothing) and regions (e.g., North, South, East, West). The sales team has provided the data engineer with a PySpark DataFrame named sales_df, shown below, and wants the data engineer to analyze the sales data to help them make strategic decisions.

[Image: Question 9 figure, not reproduced in this text]

Options:

Category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))

Category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount")

Category_sales = sales_df.agg(sum("sales_amount").groupBy("category").alias("total_sales_amount"))

Category_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))

Question 10

Which TWO items are characteristics of the Gold Layer?

Choose 2 answers

Options:

Read-optimized

Normalised

Raw Data

Historical lineage

De-normalised

Question 11

Which of the following tools is used by Auto Loader to process data incrementally?

Options:

Checkpointing

Spark Structured Streaming

Data Explorer

Unity Catalog

Databricks SQL

Question 12

Which of the following commands will return the location of database customer360?

Options:

DESCRIBE LOCATION customer360;

DROP DATABASE customer360;

DESCRIBE DATABASE customer360;

ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');

USE DATABASE customer360;

Question 13

A data engineer is writing a script that is meant to ingest new data from cloud storage. In the event of a schema change, the ingestion should fail. It should keep failing until the changes in the downstream source can be found and verified as intended changes.

Which command will meet the requirements?

Options:

addNewColumns

failOnNewColumns

rescue

none

Question 14

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

[Image: code block for Question 14, not reproduced in this text]

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

Options:

processingTime(1)

trigger(availableNow=True)

trigger(parallelBatch=True)

trigger(processingTime="once")

trigger(continuous="once")

Question 15

[Image: Question 15 figure, not reproduced in this text]

Calculate the total sales amount for each region and store the results in a new dataframe called region_sales.

Given the expected result:

[Image: expected result for Question 15, not reproduced in this text]

Which code will generate the expected result?

Options:

region_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))

region_sales = sales_df.sum("sales_amount").groupBy("region").alias("total_sales_amount")

region_sales = sales_df.groupBy("category").sum("sales_amount").alias("total_sales_amount")

region_sales = sales_df.agg(sum("sales_amount").groupBy("region").alias("total_sales_amount"))
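
The aggregation described in this question (one total per region) can be sketched without Spark. The rows below are hypothetical stand-ins for sales_df, whose contents are not shown in the text, and total_by is an illustrative helper, not a PySpark function. In PySpark itself, grouping comes before aggregation: groupBy returns a grouped object, and agg is then called on it.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for sales_df (the real data is not shown).
sales_rows = [
    {"region": "North", "category": "Electronics", "sales_amount": 100.0},
    {"region": "North", "category": "Clothing", "sales_amount": 50.0},
    {"region": "South", "category": "Electronics", "sales_amount": 75.0},
]


def total_by(rows, key):
    """Sum sales_amount per distinct value of `key` (group first, then aggregate)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["sales_amount"]
    return dict(totals)


region_totals = total_by(sales_rows, "region")
category_totals = total_by(sales_rows, "category")
```

Grouping by a different column changes the result, which is why the grouping key matters as much as the aggregate.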

Question 16

A data engineer needs to combine sales data from an on-premises PostgreSQL database with customer data in Azure Synapse for a comprehensive report. The goal is to avoid data duplication and ensure up-to-date information.

How should the data engineer achieve this using Databricks?

Options:

Develop custom ETL pipelines to ingest data into Databricks

Use Lakehouse Federation to query both data sources directly

Manually synchronize data from both sources into a single database

Export data from both sources to CSV files and upload them to Databricks

Question 17

Which SQL keyword can be used to convert a table from a long format to a wide format?

Options:

TRANSFORM

PIVOT

SUM

CONVERT
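
As a rough illustration of what converting a table from long format to wide format means, the plain-Python sketch below turns one row per (store, quarter) into one entry per store with a column per quarter. The data and the to_wide helper are hypothetical, and this is not how a SQL engine implements the operation.

```python
# Long format: one row per (store, quarter) pair. Sample data is hypothetical.
long_rows = [
    ("store_a", "Q1", 10),
    ("store_a", "Q2", 20),
    ("store_b", "Q1", 5),
]


def to_wide(rows):
    """Turn (key, column, value) rows into one dict per key, summing duplicates."""
    wide = {}
    for key, column, value in rows:
        cells = wide.setdefault(key, {})
        cells[column] = cells.get(column, 0) + value  # SUM-style aggregation
    return wide


wide_rows = to_wide(long_rows)
```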

Question 18

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).

Which of the following code blocks creates this SQL UDF?

Options:

Option A18

Option B18

Option C18

Option D18

Option E18

Question 19

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

Options:

INSERT INTO my_table VALUES ('a1', 6, 9.4)

my_table UNION VALUES ('a1', 6, 9.4)

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

UPDATE my_table VALUES ('a1', 6, 9.4)

UPDATE VALUES ('a1', 6, 9.4) my_table
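
Appending a single record with standard SQL can be tried locally. The sketch below assumes an in-memory SQLite database as a stand-in for a Delta table, so types and behavior may differ from Databricks; the pre-existing row 'a0' is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the Delta table; this is an assumption of the sketch.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_table (id TEXT, "rank" INTEGER, rating REAL)')
conn.execute("INSERT INTO my_table VALUES ('a0', 1, 7.5)")  # hypothetical existing row

# Append the new record; existing rows are left untouched.
conn.execute("INSERT INTO my_table VALUES ('a1', 6, 9.4)")

rows = conn.execute('SELECT id, "rank", rating FROM my_table ORDER BY id').fetchall()
conn.close()
```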

Question 20

Which tool is used by Auto Loader to process data incrementally?

Options:

Spark Structured Streaming

Unity Catalog

Checkpointing

Databricks SQL

Question 21

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?