Tuesday, September 22, 2020

Indexing in Azure Cosmos DB

Even though Azure Cosmos DB automatically indexes every property by default, understanding how indexing works is vital for achieving efficient query performance.

In Azure Cosmos DB, every property in our items is indexed by default. This is fantastic for developers, as it means we don't have to spend time managing indexes ourselves.

However, there may be times when we want to customize the indexing policy to suit the requirements of our workloads. The purpose of this article is to show you how indexing works in Azure Cosmos DB, what kinds of indexes are available, and how we can employ different indexing strategies to optimize performance depending on what we're trying to achieve.

How indexing works in Cosmos DB

Azure Cosmos DB persists our items within our containers as JSON documents. We can think of these documents as trees, with each property in our item being a node within that tree.

Say I have a document for a customer, and that customer has multiple addresses associated with them. We could visualize the property paths for this customer like so:


/address/0/addressLine1: "1 Shipping way"
/address/0/addressCity: "Shipville"
/address/1/addressLine1: "2 Billing Road"
/address/1/addressCity: "Billington"

When our customer item is written to our container in Cosmos DB, each property path and its value are indexed.
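Just to make this concrete, here is a quick sketch of writing a customer item like this with the azure-cosmos Python SDK. The endpoint, key and database/container names are placeholders, and I'm assuming the container is partitioned on /id.

# Python sketch: writing a customer item (placeholder endpoint, key and names)
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("SalesDB").get_container_client("Customers")

# Every property path in this item (e.g. /address/0/addressLine1) is indexed on write
container.upsert_item({
    "id": "1",
    "address": [
        {"addressLine1": "1 Shipping way", "addressCity": "Shipville"},
        {"addressLine1": "2 Billing Road", "addressCity": "Billington"}
    ]
})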


Types of Indexes in Cosmos

Cosmos DB supports 3 kinds of indexes:

  1. Range Indexes
  2. Spatial Indexes
  3. Composite Indexes

Range Indexes are used for equality queries (WHERE someProperty = ‘someValue’), range queries (WHERE someProperty > ‘someValue’), ORDER BY queries, JOIN queries and for checking the presence of a property (WHERE IS_DEFINED(someProperty)). We use Range indexes on string or number values.

Spatial Indexes are used for geospatial objects. Currently, Points, LineStrings, Polygons and MultiPolygons are supported. We can use Spatial indexes on GeoJSON objects.
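For reference, a spatial index is declared in the indexing policy under spatialIndexes. The fragment below is just a sketch with a hypothetical /location property holding GeoJSON values:

# Python sketch: indexing policy fragment with a spatial index on a hypothetical /location property
spatial_fragment = {
    "spatialIndexes": [
        {
            "path": "/location/*",
            "types": ["Point", "LineString", "Polygon", "MultiPolygon"]
        }
    ]
}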

Composite indexes are used when we need to increase the efficiency of queries that perform operations on multiple fields. Let’s take a look at a couple of examples.

If we have a query with both a filter and an ORDER BY clause, we can use a Composite index as long as we add the filtered property to the ORDER BY clause:


SELECT * FROM Customer c WHERE c.customerType = "Business" ORDER BY c.customerType, c.customerValue DESC


We can also use a Composite Index for queries that ORDER BY on multiple properties:


SELECT * FROM c ORDER BY c.customerType, c.customerValue DESC


Finally, we can use Composite Indexes for queries that filter on multiple properties, provided that at least ONE property is an equality filter. For example:

SELECT * FROM c WHERE c.customerValue > 10000 AND c.customerType = "Business"
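As a sketch, a composite index covering that last query could look like the fragment below. Note that composite index paths don't use the /? suffix, and a common guideline is to put the equality-filtered property first:

# Python sketch: composite index for the equality filter on customerType
# and the range filter on customerValue from the query above
composite_fragment = {
    "compositeIndexes": [
        [
            {"path": "/customerType", "order": "ascending"},
            {"path": "/customerValue", "order": "ascending"}
        ]
    ]
}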

Include vs Exclude Strategies

Having every property in our items indexed by default can be a significant advantage to us as developers. However, there are situations where we would want to customize which properties are included in or excluded from our indexing policy.

In an ‘Include strategy’, we would explicitly specify which properties we want to index in our containers. This is particularly beneficial in write-heavy situations, as indexing fewer properties improves the latency of those write operations.

In an ‘Exclude strategy’, we would specify which properties in our items we want to exclude from indexing. Let’s take a look at a couple of examples to show the differences between include and exclude indexing policies.

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/customerName/?"
    }
  ],
  "excludedPaths": [
    {
      "path": "/*"
    }
  ]
}

This indexing policy is what we would use for an include strategy. We are saying that all values under the property path /customerName will be indexed and all other paths in our documents will be excluded.

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/customerValue/?"
    }
  ]
}

In this indexing strategy, we are specifying which paths we want to be excluded from indexing. Here we are stating that values under the path /customerValue will not be indexed and everything else in our document will be.

Fairly simple, right?
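If you prefer to create these policies from code rather than the portal, here is a rough sketch using the azure-cosmos Python SDK to create a container with the include strategy shown above. The database name, container name and partition key path are placeholders for the example.

# Python sketch: creating a container with the 'include strategy' policy (placeholder names)
from azure.cosmos import CosmosClient, PartitionKey

include_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/customerName/?"}],
    "excludedPaths": [{"path": "/*"}]
}

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
database = client.get_database_client("SalesDB")
database.create_container(
    id="Customers",
    partition_key=PartitionKey(path="/customerId"),
    indexing_policy=include_policy
)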

Other things to consider for indexing

As well as different types of indexes, Cosmos DB supports two indexing modes (Consistent or None). Consistent indexes are updated synchronously as you add, update or delete items within your container, while None essentially means that indexing is disabled on the container.

You can read more about indexing modes in the Azure Cosmos DB documentation.
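For completeness, a policy that switches indexing off entirely is just a couple of lines. As a sketch, note that when the indexing mode is none, automatic should also be set to false:

# Python sketch: an indexing policy that disables indexing on the container
no_index_policy = {
    "indexingMode": "none",
    "automatic": False
}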

When we specify the property paths that we want to include in or exclude from indexing, we need to define our paths correctly depending on the value. For string or number values, we would define the path like so:

/customerName/?

If we wanted to index a property inside the elements of an array, we would use the [] notation like this:

/customerAddresses/[]/addressCity/?

Finally, if we wanted to match any element underneath our path, we would just need to do this:

/*
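Putting those path forms together, an includedPaths/excludedPaths section using the customer example from earlier might look like this sketch:

# Python sketch: included and excluded paths combining the forms above
paths_fragment = {
    "includedPaths": [
        {"path": "/customerName/?"},                     # a single string value
        {"path": "/customerAddresses/[]/addressCity/?"}  # a value inside array elements
    ],
    "excludedPaths": [
        {"path": "/*"}                                   # everything else
    ]
}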

Bringing it all together

Let’s bring together all of the concepts that I’ve discussed in this article and create an indexing policy that captures them. Say we have a bunch of documents that represent customers and look something like the following:

{
   "customerId":"1",
   "customerName": "John Smith",
   "customerType": "Business",
   "customerAddresses": [
     {
       "addressLine1": "1 Billing Way",
       "addressCity": "Billsville"
     }
   ],
   "customerValue": 500000
}

In our application, let’s say I want to run the following query:

SELECT * FROM Customer c WHERE c.customerValue > 100000 AND c.customerType = "Business" ORDER BY c.customerValue DESC, c.customerType DESC

The following indexing policy would help optimize the performance of the query I wish to run:

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/customerValue/?"
    },
    {
      "path": "/customerType/?"
    }
  ],
  "excludedPaths": [
    {
      "path": "/*"
    }
  ],
  "compositeIndexes": [
    [
      {
        "path": "/customerValue",
        "order": "descending"
      },
      {
        "path": "/customerType",
        "order": "descending"
      }
    ]
  ]
}

In this indexing policy, we are only indexing the values persisted on the customerValue and customerType paths and excluding all other properties in the customer document from being indexed. We then add a composite index on customerValue and customerType, with both ordered descending.
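To apply a policy like this to an existing container from code, a rough sketch with the azure-cosmos Python SDK could look like the following. The database name, container name and partition key path are placeholders, and note that replace_container needs the partition key to be re-specified.

# Python sketch: updating an existing container's indexing policy (placeholder names)
from azure.cosmos import CosmosClient, PartitionKey

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/customerValue/?"}, {"path": "/customerType/?"}],
    "excludedPaths": [{"path": "/*"}],
    "compositeIndexes": [[
        {"path": "/customerValue", "order": "descending"},
        {"path": "/customerType", "order": "descending"}
    ]]
}

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
database = client.get_database_client("SalesDB")
database.replace_container(
    "Customers",
    partition_key=PartitionKey(path="/customerId"),
    indexing_policy=indexing_policy
)

After a policy change, Cosmos DB re-indexes the existing items in the background, so queries may not benefit from the new indexes immediately.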

Conclusion

Indexing is pretty straightforward in Azure Cosmos DB. You can just let the database manage it for you if you really want. However, if you’re looking to optimize performance and decrease RU expenditure, I’d recommend analyzing the types of queries you’re going to run against the items in your container and creating an indexing policy that maximizes the performance of those queries.

I hope you’ve enjoyed reading this article and are ready to experiment with indexing in Cosmos DB to optimize your queries. If you have any questions or comments, let me know and I’ll do my best to answer them!

Thursday, September 3, 2020

Use Azure Key Vault for Azure Databricks

Use Case

I need to use some passwords and keys in my Databricks notebook, but for security reasons I don't want to store them in the notebook. How do I prevent storing sensitive data in Databricks?
Using Azure Key Vault for Azure Databricks

Solution
Let's say you want to connect to an Azure SQL Database with SQL Authentication or an Azure Blob Storage container with an Access key in Databricks. Instead of storing the password or key in the notebook in plain text, we will store it in an Azure Key Vault as a secret. With an extra line of code we will retrieve the secret and use its value for the connection.

The example below will show all individual steps in detail including creating an Azure Key Vault, but assumes you already have an Azure Databricks notebook and a cluster to run its code. The steps to give Databricks access to the Key Vault slightly deviate from Azure Data Factory or Azure Automation Runbook, because the access policy is set from within Databricks itself.

1) Create Key Vault
The first step is creating a Key Vault. If you already have one, you can skip this step.
  • Go to the Azure portal and create a new resource
  • Search for key vault
  • Select Key Vault and click on Create
  • Select your Subscription and Resource Group 
  • Choose a useful name for the Key Vault
  • Select your Region (the same as your other resources)
  • And choose the Pricing tier. We will use Standard for this demo
Creating a new Key Vault

2) Add Secret
Now that we have a Key Vault we can add the password of the SQL Server user or the key of the Azure Storage account. The Key Vault stores three types of items: Secrets, Keys and Certificates. For passwords, account keys or connection strings you need a Secret.
  • Go to the newly created Azure Key Vault
  • Go to Secrets in the left menu
  • Click on the Generate/Import button to create a new secret
  • Choose Manual in the upload options
  • Enter a recognizable and descriptive name. You will use this name later on in Databricks
  • Next, add the secret value, which we will retrieve in Databricks
  • Keep Content type empty and don't use the activation or expiration date for this example
  • Make sure the secret is enabled and then click on the Create button
Adding a new secret to Azure Key Vault

3) Create Secret Scope
Instead of adding Databricks via the Access policies in the Key Vault, we will use a special page in the Databricks workspace editor which creates an application in Azure Active Directory and gives that application access to the Key Vault. At the moment of writing this is still a Public Preview feature, so the layout could still change, and it will probably become a menu item once it reaches GA.
  • Go to your Azure Databricks overview page in the Azure portal
  • Click on the Launch workspace button in the middle (a new tab will be opened)
  • Change the URL of the new tab by adding '#secrets/createScope' after the URL
    Example
    https://westeurope.azuredatabricks.net/?o=1234567890123456
    Becomes
    https://westeurope.azuredatabricks.net/?o=1234567890123456#secrets/createScope
Find 'Create Secret Scope' form

The next step is to fill in the 'Create Secret Scope' form. This will connect Databricks to the Key Vault.
  • Fill in the name of your secret scope. It should be unique within the workspace and will be used in code to retrieve the secret from Key Vault.
  • The Manage Principal for the premium tier can either be Creator (secret scope only for you) or All Users (secret scope for all users within the workspace). For the standard tier you can only choose All Users.
  • The DNS name is the URL of your Key Vault which can be found on the Overview page of your Azure Key Vault which looks like: https://bitools.vault.azure.net/
  • The Resource ID is a path that points to your Azure Key Vault. Within the following path replace the three parts within the brackets:
    /subscriptions/[1.your-subscription]/resourcegroups/[2.resourcegroup_of_keyvault]/providers/Microsoft.KeyVault/vaults/[3.name_of_keyvault]
    1. The guid of your subscription
    2. The name of the resource group that hosts your Key Vault
    3. The name of your Key Vault
      Tip: if you go to your Key Vault in the Azure portal then this path is part of the URL which you could copy
  • Click on the Create button and when creation has finished click on the Ok button
Create Secret Scope

4) Verify Access policies
Next, you want to verify the permissions of Databricks in the Key Vault and probably restrict some of them, because by default it gets a lot of permissions.
  1. Go to your Key Vault in the Azure Portal
  2. Go to Access policies in the left menu
  3. Locate the application with Databricks in its name
  4. Check which permissions you need. When using secrets only, Get and List permissions for secrets are probably enough.
Verify permissions of Databricks in Azure Key Vault

5) Scala code
Now it is time to retrieve the secrets from the Key Vault in your notebook with Scala code (Python code in the next step). The first snippet is for when you have forgotten the name of your secret scope or want to know which ones are available in your workspace.
// Scala code
 
// Get list of all scopes
val mysecrets = dbutils.secrets.listScopes()
// Loop through list
mysecrets.foreach { println }
Scala code to get secret scopes

If you want to get the value of one secret, you can execute the following code. Note that the value will not be shown in your notebook execution result.
// Scala code
 
dbutils.secrets.get(scope = "bitools_secrets", key = "blobkey")
Scala code to retrieve secret from the Azure Key Vault

And if you want to use that code to retrieve the key for your Blob Storage account, mount the container and get a list of files, you can combine it as in the following code. The name of the storage account is 'bitools2'; replace the container name placeholder with your own.
// Scala code
 
// Mount the Blob Storage container using the account key from the Key Vault
// (the container name is a placeholder)
dbutils.fs.mount(
  source = "wasbs://<container-name>@bitools2.blob.core.windows.net",
  mountPoint = "/mnt/bitools2",
  extraConfigs = Map("fs.azure.account.key.bitools2.blob.core.windows.net" -> dbutils.secrets.get(scope = "bitools_secrets", key = "blobkey")))
// List the files in the mounted container
display(dbutils.fs.ls("/mnt/bitools2"))

Scala code to mount Storage and get list of files

6) Python code
Now it is time to retrieve the secrets from the Key Vault in your notebook with Python code. The first snippet is for when you have forgotten the name of your secret scope or want to know which ones are available in your workspace.
# Python code
 
# Get list of all scopes
mysecrets = dbutils.secrets.listScopes()
# Loop through list
for secret in mysecrets:
  print(secret.name)
Python code to get secret scopes

If you want to get the value of one secret, you can execute the following code. Note that the value will not be shown in your notebook execution result.
# Python code
 
dbutils.secrets.get(scope = "bitools_secrets", key = "blobkey")
Python code to retrieve secret from the Azure Key Vault

And if you want to use that code to retrieve the key for your Blob Storage account, mount the container and get a list of files, you can combine it as in the following code. The name of the storage account is 'bitools2'; replace the container name placeholder with your own.
# Python code
 
# Mount the Blob Storage container using the account key from the Key Vault
# (the container name is a placeholder)
dbutils.fs.mount(
  source = "wasbs://<container-name>@bitools2.blob.core.windows.net",
  mount_point = "/mnt/bitools2a",
  extra_configs = {"fs.azure.account.key.bitools2.blob.core.windows.net":dbutils.secrets.get(scope = "bitools_secrets", key = "blobkey")})
# List the files in the mounted container
display(dbutils.fs.ls("/mnt/bitools2a"))

Python code to mount Storage and get list of files
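The same approach works for the Azure SQL Database scenario mentioned in the Solution section. The sketch below is just an illustration: the server, database, table, user name and the 'sqlpassword' secret are placeholders, assuming you stored the SQL password as an extra secret in the same scope.

# Python code (sketch, placeholder server/database/table/user/secret names)
 
# Get the SQL password from the Key Vault backed secret scope
sql_password = dbutils.secrets.get(scope = "bitools_secrets", key = "sqlpassword")

# Read a table from Azure SQL Database with the Spark JDBC reader
df = (spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://bitools.database.windows.net:1433;database=bitoolsdb")
  .option("dbtable", "dbo.Customers")
  .option("user", "bitools_user")
  .option("password", sql_password)
  .load())
display(df)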

Conclusion
In this post you learned how to store sensitive data for your data preparation in Databricks the right way, by creating a Key Vault and using it in your notebook. The feature is still in public preview, which probably means the layout will change slightly before going GA, but the functionality will most likely stay the same. Another point of attention is that you don't have any influence on the name of the Databricks application in AAD or on the default permissions it gets in the Key Vault.

In previous posts we also showed you how to use the same Key Vault in an Azure Data Factory and an Azure Automation Runbook to avoid hardcoded passwords and keys. In a future post we will show you how to use it in other tools like Azure Functions.
