Azure Synapse Analytics introduced Intelligent Cache for Apache Spark to improve the performance of repeat queries

Microsoft recently introduced Intelligent Cache for the Apache Spark engine in Azure Synapse Analytics as part of Azure Synapse Caching.

Intelligent Cache for Apache Spark pools in Synapse reduces the total cost of ownership (TCO) by improving read performance for the following remote file types in the data lake:

  • Parquet files: up to 65% on subsequent reads
  • CSV files: 50%

Intelligent Cache for Spark pools in Synapse is currently in Public Preview.

By default this cache is disabled. We need to enable it while creating a new Apache Spark pool or updating an existing one.

Intelligent Cache in Azure Synapse Analytics
Why should we use it? What issue does Intelligent Cache solve? (Azure Synapse Caching)

In general, when we query data from a remote ADLS Gen2 data lake, the Apache Spark engine in Synapse makes a call to storage every time it needs to read data.

So in scenarios where we read the same file frequently, i.e. we repeat the same queries, every call to the remote ADLS Gen2 data lake adds latency to the overall processing time and decreases performance.

The native Apache Spark cache can reduce this latency, but we must set and release it manually, and it can serve stale data if the underlying data changes.

How does Intelligent Cache solve the above issue and improve performance?

To resolve the above issue, once it is enabled:

  1. Behind the scenes, it automatically caches each read from the remote data lake within the allocated cache storage space on each Spark node, which speeds up subsequent reads.
  2. It automatically detects changes to the underlying files and refreshes them in the cache. This ensures we always access the most recent data, i.e. no stale data is served.
  3. It automatically releases the least-read data to make space for more recent data when the cache reaches its size limit.
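
The three behaviours above can be illustrated with a toy in-memory model. This is only a sketch of the idea, not how Synapse implements the cache: the class name, the version numbers used for change detection, and the "least recently read" eviction rule are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyIntelligentCache:
    """Toy model: cache reads, refresh on change, evict least recently read."""

    def __init__(self, capacity_bytes, remote):
        self.capacity_bytes = capacity_bytes
        self.remote = remote            # hypothetical store: path -> (version, data)
        self.entries = OrderedDict()    # path -> (version, data), oldest read first
        self.remote_reads = 0           # counts full fetches from the remote store

    def _used(self):
        return sum(len(data) for _, data in self.entries.values())

    def read(self, path):
        version, _ = self.remote[path]  # cheap change check (assumed mechanism)
        cached = self.entries.get(path)
        if cached is not None and cached[0] == version:
            self.entries.move_to_end(path)  # mark as most recently read
            return cached[1]
        # Miss or stale entry: fetch from the remote store and (re)cache it.
        self.remote_reads += 1
        _, data = self.remote[path]
        self.entries.pop(path, None)
        if len(data) <= self.capacity_bytes:
            while self._used() + len(data) > self.capacity_bytes:
                self.entries.popitem(last=False)  # evict least recently read
            self.entries[path] = (version, data)
        return data


remote = {"a.parquet": (1, b"x" * 60), "b.csv": (1, b"y" * 60)}
cache = ToyIntelligentCache(capacity_bytes=100, remote=remote)
cache.read("a.parquet")               # first read goes to the remote store
cache.read("a.parquet")               # repeat read is served from the cache
remote["a.parquet"] = (2, b"z" * 60)  # the underlying file changes
fresh = cache.read("a.parquet")       # change detected, entry refreshed
cache.read("b.csv")                   # cache is full, so a.parquet is evicted
```

In this run only three of the four reads reach the remote store, and the read after the file change returns the new content rather than the stale copy.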
When should we use Intelligent Cache to improve performance?
  1. If we need to read the same file frequently and the file fits into the cache.
  2. If we use Delta tables, Parquet files, or CSV files.
  3. If we are using Apache Spark v3.1 or higher on Azure Synapse.
How to Enable or Disable the cache?

We can enable it while creating a new Spark pool, or enable/disable it while updating an existing Spark pool.

For new Spark pools:

Open your Synapse account and click ‘+ New Apache Spark pool’, or click ‘Apache Spark pools’ on the left side and then click the + icon on the next page. This opens the ‘New Apache Spark pool’ form for creating the pool.

Creating New Apache Spark pool in Azure Synapse

‘New Apache Spark pool’: Now go to the Additional settings tab, as shown in the image below.

Enable or Disable Intelligent cache in Azure Synapse

In this tab, scroll down until we see a header like “Intelligent cache Reserve space for Synapse storage cache”:

If we do not modify anything, the cache remains disabled by default.

  1. To enable: move the slider from 0 (disabled, at the extreme left) to the required cache size percentage. This value is the cache size as a percentage of the total disk size available for each Apache Spark pool.
  2. To disable: keep the slider at the extreme left (0), or move it back to 0.
  3. To resize: move the slider to the right to increase the cache size, or to the left to decrease it.
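
As a quick illustration of what the percentage means, the sketch below converts a slider value into reserved disk space. The function name and the disk figures are hypothetical; the bounds check only enforces that the value is a valid percentage and says nothing about the actual range the slider offers.

```python
def cache_size_gb(total_disk_gb, cache_percent):
    """Return the disk space reserved for the cache for a given slider percentage."""
    if not 0 <= cache_percent <= 100:
        raise ValueError("cache_percent must be between 0 and 100")
    return total_disk_gb * cache_percent / 100


# Hypothetical figures: 400 GB of disk with the slider at 20%.
reserved = cache_size_gb(400, 20)
print(reserved)  # 80.0

# A slider value of 0 reserves nothing, which corresponds to a disabled cache.
disabled = cache_size_gb(400, 0)
```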
Enabling/Disabling cache for existing Spark pools:

For an existing Apache Spark pool, open ‘Scale settings’, then enable the cache by moving the slider to a value greater than 0, or disable it by moving the slider to 0, and apply.

Updating cache size for existing Spark pools:

If the pool has active sessions, we must force a restart to change the Intelligent Cache size of the pool.
A ‘Force new settings’ check box is shown. Once we select it and click Apply, the session restarts automatically.

Enabling and disabling the cache within the session:

We can also enable and disable the cache within a session by running the following code in a notebook.

We can enable the cache by running the code below:

%pyspark

spark.conf.set('spark.synapse.vegas.useCache', 'true')

We can disable the cache by running the code below:

%pyspark

spark.conf.set('spark.synapse.vegas.useCache', 'false')
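
If we only want to switch the cache on or off for part of a notebook, one option is to wrap the setting in a small context manager that restores the previous value afterwards. This is a sketch only: the helper name `intelligent_cache` is hypothetical, and it assumes `spark` is the session object that the notebook already provides.

```python
from contextlib import contextmanager

CACHE_KEY = "spark.synapse.vegas.useCache"  # setting name used in the snippets above


@contextmanager
def intelligent_cache(spark, enabled):
    """Set the cache flag for the duration of a block, then restore the old value."""
    previous = spark.conf.get(CACHE_KEY, None)  # None if the flag was never set
    spark.conf.set(CACHE_KEY, "true" if enabled else "false")
    try:
        yield
    finally:
        if previous is None:
            spark.conf.unset(CACHE_KEY)
        else:
            spark.conf.set(CACHE_KEY, previous)


# Usage in a notebook cell (the path is a placeholder):
# with intelligent_cache(spark, enabled=False):
#     df = spark.read.parquet("<path-to-file>")
```

Restoring the previous value in `finally` means a failed read inside the block does not leave the session in an unexpected cache state.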

So here we learnt what issue Intelligent Cache resolves and how it resolves it. We also learnt how to enable and disable it in Azure Synapse.

Thanks for reading the article. Please feel free to share your queries or thoughts in the comment section.
