Automatically index Umbraco media file content

ExamineX can now automatically index the contents of your Umbraco media files, with no need for additional indexes.

Requirements

This new feature is specifically targeted at Umbraco media when configured with the UmbracoFileSystemProviders.Azure.Media (>= 2.0.0) package. If you host your Umbraco website on Azure, it is recommended to use blob storage as your media provider. This provides more flexibility when scaling your solution, along with the benefits of CDN support.

With this ExamineX update, your media document files such as PDFs and Microsoft Office docs will automatically have their contents indexed 🎉

Installation

Install, configure and test the UmbracoFileSystemProviders.Azure.Media package for your media.

Install, configure and test ExamineX.

Then install the ExamineX.AzureSearch.Umbraco.BlobMedia NuGet package:

PM> Install-Package ExamineX.AzureSearch.Umbraco.BlobMedia -Pre

That’s it!

Once the ExamineX.AzureSearch.Umbraco.BlobMedia package is installed, any PDF files, MS Office document files, and other file types will automatically be indexed and stored in your corresponding internal/external Umbraco indexes with the field name content.

NOTE: The field name ‘content’ typically cannot be changed. This is a limitation of Azure Search’s field mapping. It is possible to re-map this field with the Azure Search indexer field mappings, but then it’s not possible to also still have a field called ‘content’.

Searching

Searching on this content works exactly the same way as searching any other field in Examine. For example, if you wanted to search for a term within the contents of a media file in the ExternalIndex, you could do:

if (ExamineManager.TryGetIndex("ExternalIndex", out var index))
{
    var searcher = index.GetSearcher();

    // Query on the 'content' field for media
    var results = searcher
        .CreateQuery("media")
        .Field("content", searchTerm)
        .Execute();
}
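
As a follow-up to the query above, here is a minimal sketch of reading the results. It assumes the Examine NuGet package is referenced. The class and method names (MediaSearchResultPrinter, Print) are hypothetical and only for illustration, while TotalItemCount, Id and Score are members of Examine’s ISearchResults and ISearchResult interfaces.

```csharp
using System;
using Examine;

// Hypothetical helper, not part of ExamineX: prints a short summary
// of the results returned by an Examine query.
public static class MediaSearchResultPrinter
{
    public static void Print(ISearchResults results)
    {
        // TotalItemCount is the total number of matches for the query
        Console.WriteLine($"Total matches: {results.TotalItemCount}");

        // ISearchResults is enumerable; each item exposes its Id and relevance Score
        foreach (ISearchResult result in results)
        {
            Console.WriteLine($"{result.Id} (score {result.Score})");
        }
    }
}
```

You could call MediaSearchResultPrinter.Print(results) inside the if block above, right after Execute().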

Breaking changes

There are some minor breaking changes to be aware of in this release:

IndexModifiedEventType.Updating is marked Obsolete and is no longer used. This enum is used in the CreatingOrUpdatingIndex event. IndexModifiedEventType now contains Rebuilding, indicating that an index rebuild is occurring, and FieldsChanging, indicating that index field definitions are being added.
CreatingOrUpdatingIndexEventArgs contains two additional properties: AzureSearchIndexerDefinition, representing the Azure Search Indexer being updated, and SearchServiceClient, which exposes the ISearchServiceClient directly in this event. The CreatingOrUpdatingIndex event is raised whenever an Azure Search index or indexer is updated, which means you will need to check for null on the AzureSearchIndexDefinition and AzureSearchIndexerDefinition properties when handling this event.

What about images?

There is another exciting new feature currently in development: Indexing media image file content!

This utilizes Azure Cognitive Search’s AI engine to scan images and extract information about them. You will be able to configure which information is extracted, but typically this would be OCR (extracting any text found in the image), generated descriptions, face detection, categories, brands, etc.

All of this information can then be added directly to your media index. Once this media document feature is released, we’ll ship a beta for image data extraction.

Feedback

Any feedback on the beta is hugely appreciated. For any bugs and questions, please use the ExamineX public issue tracker on GitHub.