Azure Search – Lucene based search engine

Having search capabilities as part of an application is a must these days. A very popular search solution is Elasticsearch, which is built on top of the open source project Lucene and provides both on-premise and SaaS deployment options.

Microsoft has built its own search engine, simply named Azure Search, which is also based on Lucene, wrapping it with a rather easy-to-use REST API and providing connections to the full plethora of data sources available in Azure. Using Lucene enables a rich query language with advanced indexing and searching options. Microsoft has even added its own set of analyzers and tokenizers used in other products such as Bing and Office, which greatly enhance the text processing capabilities of the solution. Using Azure infrastructure enables built-in scalability options, and the Azure Portal provides a visual interface to manage the most commonly used operations.

Kensee, providing a rich database of news articles, signals and insights related to commercial real estate, has search and filtering at its core. With a complex set of SQL DB tables, creating a proper view to use with Azure Search was not trivial.

Working with string collections

For example, here are the steps for mapping a one-to-many relationship to a single field in the search index. We use it when there are multiple selection options, such as the topics of an article or the categories of signals that were found. The user can choose several values and we need to match all documents containing any of the selected values.
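Once the index is in place, such a multi-value match is expressed as an OData filter over the collection field. A minimal sketch, assuming the field is named Topic and the user selected two (hypothetical) topic values:

```
$filter=Topic/any(t: t eq 'Acquisitions' or t eq 'Development')
```

Inside an `any` lambda, Azure Search allows equality comparisons combined with `or`, which is exactly the "match any of the selected values" semantics we need.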

Step 1 – Creating a JSON Array from DB rows

Azure Search can process JSON arrays and knows how to automatically take the array values and map them to the same field in the index. If you're using DocumentDB to store all information as one big document, then you're good to go without any special handling. But if the data is divided into several DB tables, such as an Items table, a Topics table and an ItemToTopic mapping table, things get messy.
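When everything lives in a single DocumentDB document, the topics are already a plain JSON array that Azure Search can pick up directly. An illustrative document (field names are hypothetical):

```json
{
  "id": "42",
  "title": "Office tower sold in central London",
  "Topics": [ "Acquisitions", "Office" ]
}
```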

Azure Search can use as a data source any Azure SQL DB table or view. Here’s the SQL code to transform these tables to one row with a column containing a JSON array:

SELECT i.Id,
       i.LastUpdated,
       i.IsDeleted,
       '[' + STUFF((
           SELECT ',"' + t.Name + '"'
           FROM ItemToTopic it
           INNER JOIN Topics t
               ON i.Id = it.ItemId AND it.TopicId = t.Id
           FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '') + ']'
       AS Topic
FROM Items i

Step 2 – Creating field mapping from JSON array to string collection

The next step is to tell the search index what field should be mapped to the array we created and that the column value should be interpreted as a JSON array and not a regular string.

First, we start by creating the data source:
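A sketch of what the create-data-source call can look like (the service name, data source name, admin key and connection string are placeholders):

```http
POST https://[service name].search.windows.net/datasources?api-version=2016-09-01
Content-Type: application/json
api-key: [admin api-key]

{
  "name": "items-datasource",
  "type": "azuresql",
  "credentials": { "connectionString": "[SQL connection string]" },
  "container": { "name": "[name of the view from step 1]" }
}
```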

Then we continue by creating the index:
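A minimal index definition for our example, with the Topic field declared as a string collection (field list and names are illustrative, not the full production index):

```http
POST https://[service name].search.windows.net/indexes?api-version=2016-09-01
Content-Type: application/json
api-key: [admin api-key]

{
  "name": "items-index",
  "fields": [
    { "name": "Id", "type": "Edm.String", "key": true },
    { "name": "Topic", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true }
  ]
}
```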

And finally combining all parts together with the indexer:
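The indexer simply points the data source at the target index (the names below match the hypothetical ones used above):

```http
POST https://[service name].search.windows.net/indexers?api-version=2016-09-01
Content-Type: application/json
api-key: [admin api-key]

{
  "name": "items-indexer",
  "dataSourceName": "items-datasource",
  "targetIndexName": "items-index"
}
```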

The request body should contain the field mapping definition ("targetFieldName" can be omitted if it's the same as "sourceFieldName"):

"fieldMappings" : [
  { "sourceFieldName" : "Topic", "targetFieldName" : "Topic", "mappingFunction" : { "name" : "jsonArrayToStringCollection" } }
]

Step 3 – Running the indexer

One of the disadvantages of Azure Search is that many operations can only be performed through the API. Some important operations, such as editing existing fields, are missing altogether, so you'll have to drop your index and create it from scratch. For a large index this can take a few minutes, so a good practice is to create the new index side by side with the existing one, switch the application to use the new index, and drop the old one only when everything is working properly. This means you'll have to guarantee there is enough free space available in the selected pricing plan, or you'll have to create a completely new search service.

When a change to a live index is supported by the service, you can easily apply it by resetting the indexer and running it again over the whole corpus. Again, this can take a while, but at least you won't have to drop the index and create it from scratch.

To do that from the portal, choose your search service from the list, click the 'Indexers' tile, click the relevant indexer, and on the blade that opens click 'Reset'; once the operation completes, click 'Run'. The progress indicator does not refresh very frequently, but closing the Indexers blade and reopening it shows the number of documents processed so far, in batches of 10,000.
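The same reset-and-run sequence can be issued through the REST API (a sketch; the service and indexer names are hypothetical):

```http
POST https://[service name].search.windows.net/indexers/items-indexer/reset?api-version=2016-09-01
api-key: [admin api-key]

POST https://[service name].search.windows.net/indexers/items-indexer/run?api-version=2016-09-01
api-key: [admin api-key]
```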


Step 4 – Automatic updates and deletions of documents

The search indexer can automatically detect if a document has been updated or should be deleted from the index.

Both options are controlled from the ‘Data sources’ tile and can be edited even after the index has been created. Don’t forget to Reset and Run the indexer again if existing documents should be re-indexed as well.

For updating a document, define a ‘High watermark column’ from the data source. Each change to the underlying data must increment the value in this column as well.

For certain types of data sources, tracking deletions is done automatically. However, if you’re using a DB view as the data source, a ‘Soft delete column’ should be defined and the value indicating that the row should be removed from the index must be specified.

Looking at the SQL script from step 1, we can see these columns as LastUpdated and IsDeleted.
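In the data source definition, these two policies look roughly like this, using the LastUpdated and IsDeleted columns from the view in step 1 (the marker value "1" assumes IsDeleted is a flag column):

```json
{
  "dataChangeDetectionPolicy" : {
    "@odata.type" : "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName" : "LastUpdated"
  },
  "dataDeletionDetectionPolicy" : {
    "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName" : "IsDeleted",
    "softDeleteMarkerValue" : "1"
  }
}
```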


Geospatial searches

A cool and easy-to-use feature of Azure Search is geospatial search – the ability to find results within a certain distance from a specific point, or results that are contained in a certain geographical polygon.

I created a nice demo for a specific customer of Kensee, allowing them to search for properties in London taken from the Land Registry, by several filters such as the property size and distance from a certain address or landmark.

I used a DocumentDB collection as the data source for the search index. This made it easier to produce the field format that Azure Search requires. DocumentDB has built-in support for geospatial queries, but using a search service enabled a much richer set of functionality and improved performance.

The GeoJSON specification is used to define the set of coordinates for each object, listed in longitude, latitude order. For a single point, the format is as follows:

      { "type": "Point", "coordinates": [ 31.9, -4.8 ] }

For a polygon use:

       { "type": "Polygon", "coordinates": [ [
         [ 31.8, -5 ],
         [ 32, -5 ],
         [ 32, -4.7 ],
         [ 31.8, -4.7 ],
         [ 31.8, -5 ]
       ] ] }

And remember that the last point should match the first one. One additional restriction for searching on a polygon is that points must be listed in counterclockwise order.
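It is easy to get these two ring rules wrong before sending a polygon filter. A small sketch (plain Python, not part of any Azure SDK) that checks both rules, using the shoelace formula for orientation:

```python
def is_valid_search_polygon(ring):
    """Validate a polygon ring for an Azure Search geo.intersects filter:
    the ring must be closed (first point equals last point)
    and its points must be listed in counterclockwise order."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        return False
    # Shoelace formula: a positive signed area means counterclockwise
    area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        area += x1 * y2 - x2 * y1
    return area > 0

# Points are (longitude, latitude), matching the GeoJSON order
ring = [(31.8, -5), (32, -5), (32, -4.7), (31.8, -4.7), (31.8, -5)]
print(is_valid_search_polygon(ring))  # True
```

Reversing the point order, or dropping the repeated closing point, makes the check fail, which is exactly what the service would reject.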

In the index define the corresponding fields as Edm.GeographyPoint, and perform the actual search by using geo.distance and geo.intersects OData filters as part of the query:

$filter=geo.distance(location, geography'POINT(-122.131577 47.678581)') le 10
$filter=geo.intersects(location, geography'POLYGON((-122.031577 47.578581, -122.031577 47.678581, -122.131577 47.678581, -122.031577 47.578581))')

More on the Land Registry project in a future post.

Monitoring the search service

Apart from the Search Explorer feature, provided as part of the search service in the portal, a really cool way to monitor the usage and performance of your search service is the Azure Search content pack for Power BI.


Image taken from the Power BI blog


It requires you to enable Azure Search Traffic Analytics on your Azure Search account, and after connecting the content pack to the service, you have access to four pages of reports: Search, Indexing, Service stats and Service metrics.

