Databricks Certified Data Engineer Professional Exam Questions and Answers

Question 1

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

Question 2

The following table consists of items found in user carts within an e-commerce website.

[Image not reproduced: cart items table]

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

[Image not reproduced: MERGE statement]

How would the following update be handled?

Options:

The update is moved to a separate "restored" column because it is missing a column expected in the target schema.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

The update throws an error because changes to existing columns in the target schema are not supported.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

Question 3

Which distribution does Databricks support for installing custom Python code packages?

Options:

sbt

CRAN

CRAM

npm

Wheels

jars

Question 4

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

Options:

The total count of rows is calculated by scanning all data files

The total count of rows will be returned from cached results unless REFRESH is run

The total count of records is calculated from the Delta transaction logs

The total count of records is calculated from the parquet file metadata

The total count of records is calculated from the Hive metastore
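As a rough illustration of how a row count can be derived from metadata alone, the sketch below sums the `numRecords` statistic that Delta Lake stores with each `add` action in a commit file. The file names and counts are made up, and the sketch assumes a single commit with no `remove` actions; it is not how any particular engine implements `COUNT(*)`.

```python
import json


def add_action(path, num_records):
    """Build one JSON log line shaped like a Delta 'add' action (hypothetical values)."""
    stats = json.dumps({"numRecords": num_records})
    return json.dumps({"add": {"path": path, "stats": stats}})


def count_rows_from_log(log_lines):
    """Sum numRecords over all add actions; other action types are ignored."""
    total = 0
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            # The stats field is itself a JSON-encoded string.
            stats = json.loads(action["add"]["stats"])
            total += stats["numRecords"]
    return total


log_lines = [
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    add_action("part-00000.parquet", 120),
    add_action("part-00001.parquet", 80),
]
row_count = count_rows_from_log(log_lines)
```

No Parquet data file is opened here; only the log entries are read.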

Question 5

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.

[Image not reproduced]

Which solution would improve the performance?

[Image not reproduced: answer options]

Options:

Option A

Option B

Option C

Option D

Question 6

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Answer:

Explanation:

The scenario presented involves inconsistent microbatch processing times in a Structured Streaming job during peak hours, with the need to ensure that records are processed within 10 seconds. The trigger once option is the most suitable adjustment to address these challenges:

Understanding Triggering Options:

Fixed Interval Triggering (Current Setup): The current trigger interval of 10 seconds may contribute to the inconsistency during peak times as it doesn't adapt based on the processing time of the microbatches. If a batch takes longer to process, subsequent batches will start piling up, exacerbating the delays.

Trigger Once: This option allows the job to run a single microbatch for processing all available data and then stop. It is useful in scenarios where batch sizes are unpredictable and can vary significantly, which seems to be the case during peak hours in this scenario.

Implementation of Trigger Once:

Setup: Instead of continuously running, the job can be scheduled to run every 10 seconds using a Databricks job. This scheduling effectively acts as a custom trigger interval, ensuring that each execution cycle handles all available data up to that point without overlapping or queuing up additional executions.

Advantages: This approach allows for each batch to complete processing all available data before the next batch starts, ensuring consistency in handling data surges and preventing the system from being overwhelmed.

Rationale Against Other Options:

Options A and E (Decrease Interval): Decreasing the trigger interval to 5 seconds might exacerbate the problem by increasing the frequency of batch starts without ensuring the completion of previous batches, potentially leading to higher overhead and less efficient processing.

Option B (Increase Interval): Increasing the trigger interval to 30 seconds could lead to latency issues, as the data would be processed less frequently, which contradicts the requirement of processing records in less than 10 seconds.

Option C (Modify Partitions): While increasing parallelism through more shuffle partitions can improve performance, it does not address the fundamental issue of batch scheduling and could still lead to inconsistency during peak loads.

Conclusion:

By using the trigger once option and scheduling the job every 10 seconds, you ensure that each microbatch has sufficient time to process all available data thoroughly before the next cycle begins, aligning with the need to handle peak loads more predictably and efficiently.

References

Structured Streaming Programming Guide - Triggering

Databricks Jobs Scheduling
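To make the scheduling behavior discussed above concrete, here is a simplified model of a fixed-interval trigger. It assumes a batch starts on the next interval boundary, or immediately after the previous batch finishes if that batch overran the boundary. The batch durations are hypothetical and held constant across intervals, which ignores the fact that a shorter interval usually means less data per batch.

```python
def batch_start_times(durations, interval):
    """Start time (seconds) of each micro-batch under a fixed-interval trigger."""
    starts = []
    start = 0
    for duration in durations:
        starts.append(start)
        finish = start + duration
        # Next interval boundary after this batch's start.
        next_boundary = (start // interval + 1) * interval
        # A batch that overruns the boundary delays the next batch until it finishes.
        start = max(finish, next_boundary)
    return starts


# Hypothetical batch durations in seconds: one slow batch during a peak.
durations = [3, 30, 3, 3]
starts_10s = batch_start_times(durations, 10)
starts_5s = batch_start_times(durations, 5)
```

In this model a single 30-second batch pushes every later start time back, whatever the interval is.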

Question 7

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG

post_text STRING

post_id STRING

longitude FLOAT

latitude FLOAT

post_time TIMESTAMP

date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

date

user_id

post_id

post_time

Answer: date

Explanation:

Partitioning a Delta Lake table is a strategy used to improve query performance by dividing the table into distinct segments based on the values of a specific column. This approach allows queries to scan only the relevant partitions, thereby reducing the amount of data read and enhancing performance.

Considerations for Choosing a Partition Column:

Cardinality: Columns with high cardinality (i.e., a large number of unique values) are generally poor choices for partitioning. High cardinality can lead to a large number of small partitions, which can degrade performance.

Query Patterns: The partition column should align with common query filters. If queries frequently filter data based on a particular column, partitioning by that column can be beneficial.

Partition Size: Each partition should ideally contain at least 1 GB of data. This ensures that partitions are neither too small (leading to too many partitions) nor too large (negating the benefits of partitioning).

Evaluation of Columns:

date:

Cardinality: Typically low, especially if data spans over days, months, or years.

Query Patterns: Many analytical queries filter data based on date ranges.

Partition Size: Likely to meet the 1 GB threshold per partition, depending on data volume.

user_id:

Cardinality: High, as each user has a unique ID.

Query Patterns: While some queries might filter by user_id, the high cardinality makes it unsuitable for partitioning.

Partition Size: Partitions could be too small, leading to inefficiencies.

post_id:

Cardinality: Extremely high, with each post having a unique ID.

Query Patterns: Unlikely to be used for filtering large datasets.

Partition Size: Each partition would be very small, resulting in a large number of partitions.

post_time:

Cardinality: High, especially if it includes exact timestamps.

Query Patterns: Queries might filter by time, but the high cardinality poses challenges.

Partition Size: Similar to user_id, partitions could be too small.

Conclusion:

Given the considerations, the date column is the most suitable candidate for partitioning. It has low cardinality, aligns with common query patterns, and is likely to result in appropriately sized partitions.

References: Delta Lake Best Practices; Partitioning in Delta Lake
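The cardinality comparison above can be sketched by counting distinct values per candidate column on a sample. The rows below are invented for illustration; on a real table the same idea would be applied to a representative sample.

```python
from datetime import date, datetime

# Hypothetical sample of posts; only the candidate columns are included.
rows = [
    {"user_id": 1, "post_id": "p1", "post_time": datetime(2024, 1, 1, 9, 0, 5), "date": date(2024, 1, 1)},
    {"user_id": 2, "post_id": "p2", "post_time": datetime(2024, 1, 1, 9, 3, 41), "date": date(2024, 1, 1)},
    {"user_id": 3, "post_id": "p3", "post_time": datetime(2024, 1, 2, 14, 20, 0), "date": date(2024, 1, 2)},
    {"user_id": 1, "post_id": "p4", "post_time": datetime(2024, 1, 2, 18, 45, 12), "date": date(2024, 1, 2)},
]


def cardinality(rows, column):
    """Number of distinct values in a column, i.e. the partitions it would create."""
    return len({row[column] for row in rows})


partition_counts = {
    column: cardinality(rows, column)
    for column in ("date", "user_id", "post_id", "post_time")
}
```

Even on four rows, `post_id` and `post_time` produce one partition per row, while `date` groups several rows together.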

Question 8

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?

Options:

Improves the quality of your data

Validates a complete use case of your application

Troubleshooting is easier since all steps are isolated and tested individually

Yields faster deployment and execution times

Ensures that all steps interact correctly to achieve the desired end result

Question 9

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")
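As a reading aid for the join types in the options, the helper below encodes the stream-static support rules for four join types in Structured Streaming: inner joins are supported, an outer join is supported only when the streaming DataFrame is on the preserved side, and full outer joins are not supported. The function name and the `stream_side` argument are hypothetical, and other join types are deliberately left out.

```python
def stream_static_join_supported(stream_side, how):
    """Return True if a stream-static join of this type is supported.

    stream_side: "left" if the streaming DataFrame calls .join(), else "right".
    how: one of "inner", "left", "right", "outer".
    """
    if stream_side not in ("left", "right"):
        raise ValueError("stream_side must be 'left' or 'right'")
    if how == "inner":
        return True
    if how in ("left", "right"):
        # The preserved (outer) side must be the streaming side.
        return how == stream_side
    if how == "outer":
        return False
    raise ValueError(f"join type not covered by this sketch: {how}")
```

For example, `stream_static_join_supported("right", "right")` describes `static.join(stream, ..., how="right")`.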

Question 10

A nightly job ingests data into a Delta Lake table using the following code:

[Image not reproduced: ingestion code]

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

return spark.readStream.table("bronze")

return spark.readStream.load("bronze")

Option C10

return spark.read.option("readChangeFeed", "true").table("bronze")

Option E10

Question 11

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.

Which limitation will the team face while diagnosing this problem?

Options:

New fields cannot be computed for historic records.

Updating the table schema will invalidate the Delta transaction log metadata.

Updating the table schema requires a default value provided for each file added.

Spark cannot capture the topic and partition fields from the Kafka source.

Question 12

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Question 13

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command

Stop the existing pipeline; use the returned settings in a reset command

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git

Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results, parse it, and use this to create a pipeline
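The "remove the pipeline_id and rename" step mentioned in the options can be sketched as a small JSON transformation. The settings dictionary below is a hypothetical, flattened stand-in for what a get command might return; the real output may be shaped differently (for example, nested under another key), so the field names are assumptions.

```python
import copy
import json


def settings_for_new_pipeline(existing_settings, new_name):
    """Copy pipeline settings, dropping the old ID and applying a new name."""
    settings = copy.deepcopy(existing_settings)
    settings.pop("pipeline_id", None)  # the new pipeline gets its own ID
    settings["name"] = new_name
    return settings


# Hypothetical settings captured from an existing pipeline.
existing = {
    "pipeline_id": "1234-abcd",
    "name": "orders_pipeline",
    "continuous": False,
}
new_settings = settings_for_new_pipeline(existing, "orders_pipeline_copy")

# A stable key order keeps diffs small when the file is versioned in Git.
settings_json = json.dumps(new_settings, indent=2, sort_keys=True)
```

The resulting JSON text is what would be saved to a file and passed to a create command.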

Question 14

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

[Image not reproduced: SQL DDL statement]

Which command allows manual confirmation that these three requirements have been met?

Options:

DESCRIBE EXTENDED dev.pii_test

DESCRIBE DETAIL dev.pii_test

SHOW TBLPROPERTIES dev.pii_test

DESCRIBE HISTORY dev.pii_test

SHOW TABLES dev

Question 15

A developer has successfully configured credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as the changes were pulled.

Options:

Use Repos to merge all differences and make a pull request back to the remote repository.

Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.

Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.

Question 16

A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.

Which approach would allow using DLT expectations to validate all expected records are present in this table?

Options:

Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.

Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table

Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null

Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table

Question 17

Which statement describes Delta Lake Auto Compaction?

Options:

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Question 18

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

[Image not reproduced: job code]

Which statement describes the execution and results of running the above query multiple times?

Options:

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Question 19

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

importlib.resource_path

sys.path

os.path

pypi.path

pylib.source
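For context on the module search path, the standard library exposes it as an ordinary list of directory strings that can be inspected or extended at runtime. The directory added below is a hypothetical example.

```python
import sys

# sys.path is a plain list; entries are searched in order on import.
search_dirs = list(sys.path)

# Hypothetical directory holding project-specific modules.
extra_dir = "/tmp/my_modules"
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```

Appending places the directory last in the search order, so modules found in earlier entries still take precedence.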

Question 20

A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.

Question 21

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark("event_time", "10 minutes")
  .groupBy(
    ________,
    "device_id"
  )
  .agg(
    avg("temp").alias("avg_temp"),
    avg("humidity").alias("avg_humidity")
  )
  .writeStream
  .format("delta")
  .saveAsTable("sensor_avg")

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

window("event_time", "5 minutes").alias("time")

to_interval("event_time", "5 minutes").alias("time")

"event_time"

lag("event_time", "5 minutes").alias("time")
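To illustrate what "non-overlapping five-minute interval" means, the sketch below assigns a timestamp to a tumbling window by flooring it to a multiple of the window width measured from the Unix epoch. This is plain Python rather than Spark code; the function name is hypothetical, and timestamps are treated as naive values.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(event_time, minutes=5):
    """Return the (start, end) of the non-overlapping window containing event_time."""
    width = timedelta(minutes=minutes)
    # Integer division of timedeltas gives the number of whole windows since the epoch.
    windows_elapsed = (event_time - EPOCH) // width
    start = EPOCH + windows_elapsed * width
    return start, start + width


window_start, window_end = tumbling_window(datetime(2024, 1, 1, 12, 7, 30))
```

Every event falls into exactly one window, so averaging per window and device never counts a reading twice.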

Question 22

A table is registered with the following code:

[Image not reproduced: table definition code]

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Question 23

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above line exemplify?

Options:

Integration

Unit

Manual

Functional

Question 24

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

Options:

The job_id is returned in this field.

The job_id and number of times the job has been run are concatenated and returned.

The number of times the job definition has been run in the workspace.

The globally unique ID of the newly triggered run.

Question 25

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
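The interaction between a watermark and deduplication state can be sketched with a simplified, record-at-a-time model. It assumes keys older than the watermark are evicted from state and that records older than the watermark are dropped. In Spark the watermark advances between micro-batches rather than per record, and the times here are hypothetical hours, so this is an approximation only.

```python
def dedupe_with_watermark(records, delay):
    """Deduplicate (key, event_time) records while evicting state older than the watermark."""
    seen = {}        # key -> event time of the first occurrence still held in state
    max_time = None  # largest event time observed so far
    output = []
    for key, event_time in records:
        if max_time is not None:
            watermark = max_time - delay
            if event_time < watermark:
                continue  # record is too late and is dropped
            # State older than the watermark is no longer tracked.
            seen = {k: t for k, t in seen.items() if t >= watermark}
        if key not in seen:
            seen[key] = event_time
            output.append((key, event_time))
        max_time = event_time if max_time is None else max(max_time, event_time)
    return output


# Hypothetical orders keyed by (customer_id, order_id); times are hours.
records = [
    (("c1", "o1"), 0),
    (("c1", "o1"), 1),  # duplicate within the 2-hour delay
    (("c2", "o2"), 4),
    (("c1", "o1"), 5),  # duplicate 5 hours after the first occurrence
]
deduplicated = dedupe_with_watermark(records, delay=2)
```

In this model the duplicate at hour 1 is removed, but the one at hour 5 passes through because the key's state was evicted once the watermark moved past hour 0.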

Question 26

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

Options:

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can import.

Question 27

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

Whenever a database is being created, make sure that the location keyword is used

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

Whenever a table is being created, make sure that the location keyword is used.

When tables are created, make sure that the external keyword is used in the create table statement.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Question 28

Which statement describes the default execution mode for Databricks Auto Loader?

Options:

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.

A webhook triggers a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.

New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.

Question 29

A transactions table has been liquid clustered on the columns product_id, user_id, and event_date.

Which operation lacks support for cluster on write?

Options:

spark.writeStream.format('delta').mode('append')

CTAS and RTAS statements

INSERT INTO operations

spark.write.format('delta').mode('append')

Question 30

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.
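The idea of data skipping from per-file statistics can be sketched as follows: if each file records the minimum and maximum value of a column, a selective equality filter only needs to read files whose range could contain the value. The file names, column name, and statistics below are invented for illustration.

```python
def files_to_scan(file_stats, column, value):
    """Return paths of files whose [min, max] range for column could contain value."""
    return [
        stats["path"]
        for stats in file_stats
        if stats["min"][column] <= value <= stats["max"][column]
    ]


# Hypothetical per-file min/max statistics for one indexed column.
file_stats = [
    {"path": "part-0.parquet", "min": {"device_id": 1}, "max": {"device_id": 100}},
    {"path": "part-1.parquet", "min": {"device_id": 101}, "max": {"device_id": 200}},
    {"path": "part-2.parquet", "min": {"device_id": 201}, "max": {"device_id": 300}},
]
candidates = files_to_scan(file_stats, "device_id", 150)
```

A column without collected statistics gives the reader no range to test, so every file would have to be scanned for filters on it.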

Question 31

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Question 32

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

[Image not reproduced: function with blank]

Which response correctly fills in the blank to meet the specified requirements?

[Image not reproduced: answer options]

Options:

Option A

Option B

Option C

Option D

Option E

Question 33

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code:

[Image not reproduced: original DLT code]

In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.

[Image of the refactored code not included in source]

The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for tables.

How can the data engineer fix this?

Options:

Convert the list of configuration values to a dictionary of table settings, using table names as keys.

Convert the list of configuration values to a dictionary of table settings, using a different input to the for loop.

Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.

Wrap the loop inside another table definition, using generalized names and properties to replace with those from the inner table.
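The code for this question is not reproduced here, so the actual cause cannot be confirmed from this page. As a general sketch of one Python behavior that can make loop-defined functions report the wrong values: a nested function that reads the loop variable looks it up when it is called, not when it is defined. The names below (configs, make_late, make_bound) are hypothetical, and no DLT API is used.

```python
# Hypothetical illustration of Python late binding; no DLT API is used here.
configs = ["t1", "t2", "t3"]  # assumed stand-in for per-table settings


def make_late(values):
    """Define functions in a loop that read the loop variable when called."""
    funcs = []
    for value in values:
        def table_def():
            return value  # looked up at call time, after the loop has finished
        funcs.append(table_def)
    return funcs


def make_bound(values):
    """Pass each value in as a parameter so it is fixed per function."""
    def build(value):
        def table_def():
            return value  # bound to this call's own argument
        return table_def
    return [build(value) for value in values]


late_results = [f() for f in make_late(configs)]
bound_results = [f() for f in make_bound(configs)]
print(late_results)   # every function sees the final loop value
print(bound_results)  # each function keeps its own value
```

Whether this matches the refactored pipeline in the question depends on code that is not shown; the sketch only demonstrates why values captured inside a loop deserve a second look.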

Question 34

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task as roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

Task queueing resulting from improper thread pool assignment.

Spill resulting from attached volume storage being too small.

Network latency due to some cluster nodes being in different regions from the source data.

Skew caused by more data being assigned to a subset of Spark partitions.

Credential validation errors while pulling data from an external system.
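The pattern described in the question (minimum and median close together, maximum far larger) can be checked numerically. The following is a minimal sketch, assuming task durations have already been copied out of the Spark UI into a list of seconds; the function names and the 10x threshold are hypothetical choices, not Spark settings.

```python
from statistics import median


def skew_ratio(durations_s):
    """Return the max task duration divided by the median task duration."""
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    mid = median(durations_s)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(durations_s) / mid


def looks_skewed(durations_s, threshold=10.0):
    """Flag a stage whose slowest task far exceeds the typical task.

    The default threshold of 10x is an arbitrary assumption.
    """
    return skew_ratio(durations_s) >= threshold


# Made-up durations in seconds: most tasks take about 2 s, one takes 200 s.
durations = [2.0, 2.1, 1.9, 2.0, 2.2, 200.0]
print(skew_ratio(durations), looks_skewed(durations))
```

A high ratio only says that one task did far more work than the others; confirming that uneven partition sizes are responsible still requires looking at per-task input sizes.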

Question 35

Which of the following is true of Delta Lake and the Lakehouse?

Options:

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Z-order can only be applied to numeric values stored in Delta Lake tables.

Answer:

Explanation:

https://docs.delta.io/2.0.0/table-properties.html

Delta Lake automatically collects statistics on the first 32 columns of each table by default (configurable through the delta.dataSkippingNumIndexedCols table property), which are leveraged in data skipping based on query filters [1]. Data skipping is a performance optimization that avoids reading irrelevant data from the storage layer [1]. Using per-file statistics such as minimum and maximum values and null counts, Delta Lake can prune files that cannot match a filter from the query plan [1]. This can significantly improve query performance and reduce I/O cost.
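The pruning idea can be sketched in a few lines. This is a simplified model, not Delta Lake's implementation: it assumes one integer column, a closed range filter, and hypothetical names (FileStats, files_to_read) with made-up file statistics.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lower, upper):
    """Keep only files whose [min, max] range can overlap [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Made-up statistics for three data files.
files = [
    FileStats("part-0", 1, 100),
    FileStats("part-1", 101, 200),
    FileStats("part-2", 201, 300),
]
selected = files_to_read(files, 150, 160)
print(selected)
```

The tighter each file's value range is, the more files a filter can skip, which is why the physical layout of the data matters for this optimization.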

The other options are false because:

Parquet compresses data column by column, not row by row [2]. This allows for better compression ratios, especially for repeated or similar values within a column [2].

Views in the Lakehouse do not maintain a cache of the most recent versions of source tables [3]. A view is a logical construct defined by a SQL query on one or more base tables [3]. Views are not materialized by default: they store only the query definition, not data [3]. A view therefore reflects the latest state of its source tables whenever it is queried [3]. If needed, a view's result can be cached with CACHE TABLE or persisted as a separate table with CREATE TABLE AS SELECT.

Primary and foreign key constraints cannot be leveraged to ensure duplicate values are never entered into a dimension table. Delta Lake does not enforce primary and foreign key constraints; where Databricks allows them to be declared, they are informational only. Constraints are logical rules that define the integrity and validity of the data in a table. For uniqueness, Delta Lake relies on application logic or the user.

Z-order can be applied to columns of many types, not only numeric values. Z-ordering optimizes the layout of data files by colocating related values from one or more columns in the same files. This can improve query performance by enabling more efficient data skipping. Z-order can be applied to any column that has a defined ordering, such as numeric, string, date, or boolean values.
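To illustrate how a Z-order curve keeps rows that are close in several columns close in the sort order, here is a generic Morton-code sketch. It is not Delta Lake's implementation; it assumes two small non-negative integer columns, and the function name interleave_bits is hypothetical.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return key


# Made-up (x, y) pairs; sorting by the interleaved key groups nearby points.
points = [(0, 0), (3, 3), (0, 1), (1, 0), (1, 1), (2, 2)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered)
```

Because both columns contribute bits to the key, files written in this order tend to have narrow min/max ranges on both columns at once, which is what makes data skipping effective for filters on either one.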

References: [1] Data Skipping, [2] Parquet Format, [3] Views, Caching, Constraints, Z-Ordering

Question 36

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

[Image of the JSON payload not included in source]

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

Question 37

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited

Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1

Question 38

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

%python

# Databricks notebook source

-- Databricks notebook source

//Databricks notebook source
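A Python notebook exported in source format begins with a comment header in the notebook's own language. The check below is a minimal sketch that assumes the file contents are already available as a string; the function name is hypothetical.

```python
EXPECTED_HEADER = "# Databricks notebook source"


def is_python_notebook_source(text):
    """Return True if the first line of `text` is the Python notebook header."""
    lines = text.splitlines()
    return bool(lines) and lines[0].strip() == EXPECTED_HEADER


# Made-up notebook contents for illustration.
sample = "# Databricks notebook source\nprint('hello')\n"
print(is_python_notebook_source(sample))
```

The comment marker follows the notebook language, so a header that starts with -- or // indicates a SQL or Scala notebook rather than a Python one.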

Question 39

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing of both the code and the data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Options:

Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.

All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.

Question 40

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

Regex

Julia

pyspark.ml.feature

Scala Datasets

C++
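As a sketch of how a regular expression can pull the key fields out of driver log text: the pattern below assumes a "yy/MM/dd HH:mm:ss LEVEL logger: message" layout, which depends on the cluster's log4j configuration and may differ in practice. The sample line and the names LOG_LINE and parse_line are made up for illustration.

```python
import re

# Assumed layout: "yy/MM/dd HH:mm:ss LEVEL logger: message".
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>[^:]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# Made-up log line in the assumed layout.
sample = "23/10/31 12:00:01 WARN TaskSetManager: Lost task 0.0 in stage 1.0"
parsed = parse_line(sample)
print(parsed["level"], parsed["logger"])
```

Returning None for non-matching lines lets a caller skip stack-trace continuation lines, which do not start with a timestamp.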
