Channel: Simple Talk

How Shortcuts affect Lakehouse’s Maintenance

I have written about lakehouse maintenance before, including maintaining multiple lakehouses, and I have published videos and sample code on the subject.

However, there is one problem: maintenance should never be executed over shortcuts.

Tables require maintenance in their original location. As our solutions grow, we start using shortcuts, often many of them. Our maintenance code should always skip the shortcuts and run maintenance only on the tables the lakehouse actually owns.

In this way, the maintenance in each lakehouse manages the tables stored in that lakehouse and delegates the maintenance of shortcuts to their source location.

The problem: how do we identify which objects in a lakehouse are shortcuts?

This is not a straightforward task: there is no boolean flag or property we can check on the object itself.

We need to use the Fabric REST API to list all the shortcuts in the lakehouse. Once we have that list, we can change the maintenance code to skip them.

This is the original code for lakehouse maintenance:

import os

def cleanTables(delta_file_path, delta_table_name):
    # OPTIMIZE compacts small files (with V-Order); VACUUM removes old, unreferenced files
    spark.sql(f'OPTIMIZE {delta_table_name} VORDER')
    spark.sql(f"VACUUM '{delta_file_path}' RETAIN 0 HOURS")

    print(f'\nTable {delta_file_path}: OPTIMIZE and VACUUM completed successfully')


# Run the maintenance over every table in the default lakehouse

# RETAIN 0 HOURS requires disabling the retention duration safety check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
full_tables = os.listdir('/lakehouse/default/Tables')

for table in full_tables:
    cleanTables('Tables/' + table, table)

We need an additional function to retrieve the list of shortcuts from the lakehouse:

# Shortcuts need to be excluded from the lakehouse maintenance

import sempy.fabric as fabric

def loadShortcuts():
    # datausageWorkspaceId and lakehouseId must already hold the
    # workspace and lakehouse GUIDs
    client = fabric.FabricRestClient()
    url = f"v1/workspaces/{datausageWorkspaceId}/items/{lakehouseId}/shortcuts"
    result = client.get(url)
    data = result.json().get('value', [])

    # Extract each shortcut's name from the response
    shortcuts = [item.get("name") for item in data]
    return shortcuts
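One caveat worth noting: Fabric REST endpoints can page their results, returning a `continuationToken` when more items remain, so a lakehouse with many shortcuts may need more than one request. Below is a minimal sketch of a paginated variant; it is an assumption on my part, not code from the original post, and the `fetch_page` callable is a hypothetical stand-in for the actual REST call so the pagination logic can be shown (and tested) in isolation:

```python
def load_all_shortcuts(fetch_page):
    """Collect shortcut names across all result pages.

    fetch_page(token) is a hypothetical callable expected to return the
    parsed JSON of one List Shortcuts response: a dict with a 'value'
    list and, when more pages remain, a 'continuationToken'.
    """
    shortcuts = []
    token = None
    while True:
        page = fetch_page(token)
        # Accumulate the shortcut names from this page
        shortcuts.extend(item.get("name") for item in page.get("value", []))
        token = page.get("continuationToken")
        if not token:            # no token means this was the last page
            return shortcuts
```

With `FabricRestClient`, `fetch_page` would issue `client.get(url)` for the first page and append the token as a query-string parameter on subsequent calls.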

The shortcuts are loaded as a collection because this makes it easy to filter the table names against it.

We need to change the main code to retrieve the shortcuts and skip them during the maintenance.

# Run the maintenance, skipping any table that is actually a shortcut

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
full_tables = os.listdir('/lakehouse/default/Tables')
shortcuts = loadShortcuts()

for table in full_tables:
    if table not in shortcuts:
        cleanTables('Tables/' + table, table)
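To see the filter in action without touching a real lakehouse, here is a small dry run of the same skip logic. The table and shortcut names are made up for illustration; only the membership test mirrors the loop above:

```python
# Hypothetical folder listing: two local tables plus one shortcut
full_tables = ["sales", "customers", "dim_date"]

# Hypothetical result of loadShortcuts(): dim_date points to another lakehouse
shortcuts = ["dim_date"]

# Only locally-owned tables are kept for maintenance
maintained = [t for t in full_tables if t not in shortcuts]
print(maintained)
```

For lakehouses with many tables, converting `shortcuts` to a `set` first makes each membership test constant-time, though with a handful of shortcuts the list is fine.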

Conclusion

We need to be careful with shortcuts in our maintenance code: each lakehouse should maintain only the tables it owns and leave shortcuts to be maintained at their source.

The post How Shortcuts affect Lakehouse’s Maintenance appeared first on Simple Talk.

