You can now automatically have the contents of your Umbraco media files indexed with ExamineX without the need for additional indexes.

Requirements

This new feature is specifically targeted at Umbraco media when configured with the UmbracoFileSystemProviders.Azure.Media (>= 2.0.0) package. If you are using Azure to host your Umbraco website, it is recommended to use blob storage as your media provider. This gives you more flexibility when scaling your solution, along with the benefits of CDN support.

With this ExamineX update, your media document files such as PDFs and Microsoft Office documents will automatically have their contents indexed 🎉

Installation

Install, configure and test the UmbracoFileSystemProviders.Azure.Media package for your media.

Install, configure and test ExamineX.

Then install the ExamineX.AzureSearch.Umbraco.BlobMedia NuGet package:

PM> Install-Package ExamineX.AzureSearch.Umbraco.BlobMedia -Pre

That’s it!

Once the ExamineX.AzureSearch.Umbraco.BlobMedia package is installed, PDF files, MS Office document files, and other file types will automatically be indexed and stored in your corresponding internal/external Umbraco indexes under the field name fileContent.

Searching

Searching this content works exactly the same way as searching any other field in Examine. For example, if you wanted to search for a term within the contents of a media file in the ExternalIndex, you could do:

if (ExamineManager.TryGetIndex("ExternalIndex", out var index))
{
    var searcher = index.GetSearcher();

    // Query on the fileContent field for media
    var results = searcher
        .CreateQuery("media")
        .Field("fileContent", searchTerm)
        .Execute();
}
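Once you have results, you may want to show a short excerpt of the matched file text. The following is a minimal sketch of one way to do that, not part of ExamineX itself: MediaSnippets and BuildSnippet are hypothetical names, and the helper works on a plain string dictionary, which is the shape of the Values collection on an Examine search result. It also assumes the fileContent value is actually returned with each result, which depends on how the field is stored and is not confirmed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Examine or ExamineX).
public static class MediaSnippets
{
    // Returns a short excerpt of the indexed file text around the first
    // case-insensitive occurrence of `term`, or null when the field is
    // missing, empty, or does not contain the term.
    public static string BuildSnippet(
        IReadOnlyDictionary<string, string> values,
        string term,
        int radius = 40,
        string fieldName = "fileContent") // field name taken from the post
    {
        if (string.IsNullOrEmpty(term)) return null;

        // Assumption: the field value is available on the search result.
        if (!values.TryGetValue(fieldName, out var text) || string.IsNullOrEmpty(text))
            return null;

        var hit = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (hit < 0) return null;

        // Clamp the excerpt window to the bounds of the text.
        var start = Math.Max(0, hit - radius);
        var end = Math.Min(text.Length, hit + term.Length + radius);
        var snippet = text.Substring(start, end - start).Trim();

        // Mark where the excerpt was cut off from the surrounding text.
        return (start > 0 ? "..." : "") + snippet + (end < text.Length ? "..." : "");
    }
}
```

Inside the loop over your results you could then call MediaSnippets.BuildSnippet(result.Values, searchTerm) and skip any result where it returns null.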

What about images?

There is another exciting new feature currently in development: indexing media image file content!

This utilizes Azure Cognitive Search’s AI engine to scan images and extract information about them. You will be able to configure which information is extracted, but typically this would be OCR (extracting any text found in the image), generated descriptions, face detection, categories, brands, etc…

All of this information can then be added directly to your media index. Once this media document feature is released, we’ll ship a beta for image data extraction.

Feedback

Any feedback on the beta is hugely appreciated. For any bugs and questions, please use the ExamineX public issue tracker on GitHub.