Tuesday, September 22, 2020

Indexing in Azure Cosmos DB

 Even though Cosmos DB automatically indexes every property by default, understanding how indexing works in Cosmos DB is vital for achieving efficient query performance.

In Azure Cosmos DB, every property in our items are indexed by default. This is fantastic for developers, as this means we don’t have to spend time managing indexing ourselves.

However, there may be times where we do want to customize the indexing policy depending on the requirements of our workloads. The purpose of this article is to show you how indexing works in Azure Cosmos DB, what kinds of indexes there are in Cosmos DB and how we can employ different indexing strategies to optimize performance depending on what we’re trying to achieve.

How indexing works in Cosmos DB

Azure Cosmos DB persists our items within our containers as JSON documents. We can think of these documents as trees and each property in our item as a node within that tree.

Say that I have a document for a customer, and that customer has multiple address associated with them, we could visualize our property paths for this customer like so:


/address/0/addressLine1: "1 Shipping way"
/address/0/addressCity: "Shipville"
/address/1/adressLine1: "2 Billing Road"
/address/1/addressCity: "Billington"

When our customer item gets written to our container in Cosmos DB, each property path and the value of the property is indexed.


Types of Indexes in Cosmos

Cosmos DB supports 3 kinds of indexes:

  1. Range Indexes
  2. Spatial Indexes
  3. Composite Indexes

Range Indexes are used for equality queries (WHERE someProperty = ‘someValue’), range queries (WHERE someProperty > ‘someValue’), ORDER BY queries, JOIN queries and for checking the presence of a property (WHERE IS_DEFINED(someProperty)). We use Range indexes on string or number values.

Spatial Indexes are used for geospatial objects. Currently, Points, LineStrings, Polygons and MultiPolygons are supported. We can use Spatial indexes on GeoJSON objects.

Composite indexes are used when we need to increase the efficiency on queries that perform operations on multiple fields. Let’s take a look at a couple of examples.

If we have a query where we have a filter and an ORDER BY statements, we can use a Composite index as long as we add the filtered property to the ORDER BY clause:


SELECT \* FROM Customer c WHERE c.customerType = "Business" ORDER BY c.customerType, c.customerValue DESC


We can also use a Composite Index for queries that ORDER BY on multiple properties:


SELECT \* FROM c ORDER BY c.customerType, c.customerValue DESC


Finally, we can use Composite Indexes for queries that filter on multiple properties provided that at least ONE property is an equality filer. For 

example:

SELECT \* FROM c WHERE c.customerValue > '10000' AND c.customerType = "Business"

Include vs Exclude Strategies

Having every property in our items indexed by default can be a significant advantage to us as developers. However, there are situations where we would customize what properties are either included or excluded from our indexing strategy.

In an ‘Include strategy’, we would explicitly specify which properties we want to index in our containers. This is particularity beneficial in heavy-write situations as this will improve the latency of these operations.

In a ‘Exclude strategy’ we would specify which properties in our items that we want to exclude from indexing. Let’s take a look at a couple of example to show the differences between include and exclude indexing policies.

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/customerName/?"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\*"
    }
  ]
}

This index is what we would we would use for an include strategy. We are saying that all values under the property path /customerName will be indexed and all other paths in our documents will be excluded.

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/\*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/customerValue/?"
    }
  ]
}

In this indexing strategy, we are specifying which paths we want to be excluded from indexing. Here we are stating that values under the path /customerValue will not be indexed and everything else in our document will be.

Fairly simple right?

Other things to consider for indexing

As well as different types of indexes, Cosmos DB supports two kinds of indexing modes ( Consistent or None ). Consistent indexes are updates synchronously as you add, update or delete items within your container, while None essentially means that indexing is disabled on the container.

You can read more about indexing modes here.

When we specify the property path that we want to include or exclude from indexing, we need to define our paths correctly depending on the value. So for string or number values, we would set the property like so:

/customerName/?

If we wanted to index the elements of an array, we would do so like this:

/customerAddresses/[]

Finally, if we wanted to match any element underneath our path, we would just need to do this:

/\*

Bringing it all together

Let’s bring all of the concepts that I’ve discussed in this article and create an indexing policy that captures them all. Say we have a bunch of documents that represent customers that look something like the following:

{
   "customerId":"1",
   "customerName": "John Smith",
   "customerType": "Business",
   "customerAddresses": [
     {
       "addressLine1": "1 Billing Way",
       "addressCity": "Billsville"
     }
   ],
   "customerValue": 500000
}

In our application, let’s say I want to run the following query:

SELECT \* FROM Customer c WHERE c.customerValue > 100000 AND c.customerType = "Business" ORDER BY c.customerValue, c.customerType DESC

The following indexing policy would help optimize the performance of my query that I wish to run:

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/customerValue/?"
    },
    {
      "path": "/customerType/?"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\*"
    }
  ],
  "compositeIndexes": [
    [
      {
        "path": "/customerValue",
        "order": "descending"
      },
      {
        "path": "/customerType",
        "order": "descending"
      }
    ]
  ]
}

In this indexing policy, we are just indexing the values that are persisted on the customerValue and customerType paths and I’m excluding all other properties in my customer document from being indexed. I’m then adding a composite index for customerValue and customerType and ordering them by descending.

Conclusion

Indexing is pretty straightforward in Azure Cosmos DB. You can just let the database manage it for you if you really want. However, if you’re looking to optimize performance and decrease RU expenditure, I’d recommend analyzing what types of queries that you’re going to perform on your items within your container and create your indexing policy that’s going to maximize the performance of those queries.

I hope you’ve enjoyed reading this article and are ready to experiment with indexing in Cosmos DB to optimize your queries. If you have any questions or comments, let me know and I’ll do my best to answer them!

No comments:

Post a Comment

Get files of last hour in Azure Data Factory

  Case I have a Data Factory pipeline that should run each hour and collect all new files added to the data lake since the last run. What is...