You can now automatically have the contents of your Umbraco media files indexed with ExamineX without the need for additional indexes.
Requirements
This new feature is specifically targeted at Umbraco media when configured with the UmbracoFileSystemProviders.Azure.Media (>= 2.0.0) package. If you are using Azure to host your Umbraco website it is recommended to use blob storage as your media provider. This provides more flexibility with scaling your solution along with the benefits of CDN support.
With this ExamineX update, your media document files such as PDFs and Microsoft Office docs will automatically have their contents indexed 🎉
Installation
Install, configure and test the UmbracoFileSystemProviders.Azure.Media package for your media.
Install, configure and test ExamineX.
Then install the ExamineX.AzureSearch.Umbraco.BlobMedia NuGet package:
PM> Install-Package ExamineX.AzureSearch.Umbraco.BlobMedia -Pre
That’s it!
Once the ExamineX.AzureSearch.Umbraco.BlobMedia package is installed, any PDF files, MS Office document files, and other file types will automatically be indexed and stored in your corresponding internal/external Umbraco indexes with the field name content.
NOTE: The field name ‘content’ typically cannot be changed. This is a limitation of Azure Search’s field mapping. It is possible to re-map this field with the Azure Search indexer field mappings, but then it’s not possible to also still have a field called ‘content’.
Searching
Searching on this content works exactly the same way as searching any other field in Examine. For example, if you wanted to search for a term within the contents of a media file in the ExternalIndex, you could do:
if (ExamineManager.TryGetIndex("ExternalIndex", out var index))
{
    var searcher = index.GetSearcher();
    // Query on the 'content' field for media
    var results = searcher
        .CreateQuery("media")
        .Field("content", searchTerm)
        .Execute();
}
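Reading the results works as with any other Examine search. As a minimal sketch, assuming the standard Examine ISearchResults and ISearchResult interfaces, the helper below lists each match. The class name MediaSearchResultPrinter is hypothetical, and nodeName is assumed to be the field that holds the media item's name, as it usually is in Umbraco indexes:

```csharp
using System;
using Examine;

// Hypothetical helper, not part of ExamineX: prints the matches returned by a query
public static class MediaSearchResultPrinter
{
    public static void Print(ISearchResults results)
    {
        // TotalItemCount is the total number of matches for the query
        Console.WriteLine($"Total matches: {results.TotalItemCount}");

        foreach (ISearchResult result in results)
        {
            // Values holds the fields returned for this item;
            // 'nodeName' is an assumption and may be absent, so look it up defensively
            result.Values.TryGetValue("nodeName", out var name);

            Console.WriteLine($"{result.Id} (score {result.Score}): {name ?? "(no name)"}");
        }
    }
}
```

You could call MediaSearchResultPrinter.Print(results) inside the if block above, right after Execute().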
Breaking changes
There are some minor breaking changes to be aware of in this release:
- The IndexModifiedEventType.Updating is marked Obsolete and is no longer used. This enum is used in the CreatingOrUpdatingIndex event. IndexModifiedEventType now contains: Rebuilding, indicating that an index rebuild is occurring, and FieldsChanging, indicating that index field definitions are being added.
- The CreatingOrUpdatingIndexEventArgs contains two additional properties: AzureSearchIndexerDefinition, representing the Azure Search Indexer being updated, and SearchServiceClient, which exposes the ISearchServiceClient directly in this event. The CreatingOrUpdatingIndex event will be raised whenever an Azure Search index or indexer is updated, which means you will need to check for null on the properties AzureSearchIndexDefinition and AzureSearchIndexerDefinition when handling this event.
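The null checks described above could look roughly like the following. This is only a sketch: the class and method names are hypothetical, the handler signature assumes the conventional .NET (sender, args) event pattern, and how you subscribe to CreatingOrUpdatingIndex is not shown because it depends on your ExamineX setup.

```csharp
// Add the using directive for the ExamineX namespace that declares
// CreatingOrUpdatingIndexEventArgs (the namespace is not stated in this post).

// Hypothetical class name, for illustration only
public static class IndexEventHandlers
{
    // Assumed signature: conventional .NET event handler shape
    public static void OnCreatingOrUpdatingIndex(object sender, CreatingOrUpdatingIndexEventArgs e)
    {
        if (e.AzureSearchIndexDefinition != null)
        {
            // An Azure Search index is being created or updated:
            // inspect or adjust the index definition here
        }

        if (e.AzureSearchIndexerDefinition != null)
        {
            // An Azure Search indexer is being updated:
            // inspect or adjust the indexer definition here
        }
    }
}
```

Checking each property separately avoids assuming which of the two is populated when the event is raised.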
What about images?
There is another exciting new feature currently in development: Indexing media image file content!
This utilizes Azure Cognitive Search’s AI engine to scan images and extract information about them. You will be able to configure which information is extracted, but typically this would be OCR (extracting any text found in the image), generated descriptions, face detection, categories, brands, etc…
All of this information can then be added directly to your media index. Once this media document feature is released, we’ll ship a beta for image data extraction.
Feedback
Any feedback on the beta is hugely appreciated. For any bugs and questions, please use the ExamineX public issue tracker on GitHub.