Thank you for your question and for detailing your use case with Azure AI Search.
You're absolutely right: chaining any() filters such as "filter": "LANGUAGES/any(t: t eq 'en') and LANGUAGES/any(t: t eq 'ja') and LANGUAGES/any(t: t eq 'fr')" matches every document that contains 'en', 'ja', and 'fr', not only the documents that contain exactly those values. Documents with additional values like 'de' or 'es' also match, which leads to unintended deletions. The any() and all() operators filter on inclusion criteria, but they aren't sufficient for enforcing strict equality of a collection's contents.
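To make the difference concrete, here is a small plain-Python model of the two semantics. It does not call Azure AI Search; the function names and sample documents are made up for illustration, and the first function only imitates what the chained any() filter checks.

```python
def matches_any_chain(doc_languages, required):
    """Imitate `LANGUAGES/any(t: t eq X)` joined with `and` for each X in required."""
    return all(any(t == r for t in doc_languages) for r in required)


def matches_exact_set(doc_languages, required):
    """The semantics actually wanted: the same values, no more and no fewer."""
    return set(doc_languages) == set(required)


required = ["en", "ja", "fr"]

# Hypothetical sample documents, keyed by id.
docs = {
    "doc1": ["en", "ja", "fr"],
    "doc2": ["en", "ja", "fr", "de"],  # superset: has an extra language
    "doc3": ["en", "ja"],              # subset: missing 'fr'
}

any_chain_hits = [key for key, langs in docs.items() if matches_any_chain(langs, required)]
exact_hits = [key for key, langs in docs.items() if matches_exact_set(langs, required)]

print(any_chain_hits)  # the superset document is included here
print(exact_hits)      # only the exact match remains
```

The chained filter behaves like a "contains all of these" test, so the superset document slips through.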
Azure AI Search (formerly Azure Cognitive Search) does not currently support collection-length predicates (such as length(LANGUAGES) eq 3) or an OData comparison that enforces exact-set equality on a multi-valued field. The most reliable way to enforce "no more, no less" semantics is to index an additional single-valued field that represents the entire collection in a deterministic, canonical form, and then filter on that field. This is often the most practical and performant approach for this specific problem.
1. Add a new field to your index: Let's call it languages_checksum (or similar). This should be a single-valued Edm.String field marked as filterable.
2. Generate a canonical representation: Before indexing, for each document, sort the LANGUAGES array alphabetically and then concatenate the values into a single string.
a. Example: ["ja", "en", "fr"] becomes ["en", "fr", "ja"]
b. Then concatenate: "en_fr_ja" (using a consistent separator like _ or , that never appears inside the values themselves)
c. Alternatively, you could store the canonical string as the only element of a Collection(Edm.String) field and match on it there. However, a plain Edm.String field holding the concatenated string is simpler and usually more robust for exact set matching.
3. Index this languages_checksum field: Store this generated string in your document.
4. Filter on the languages_checksum field: When you want to find documents with exactly ["en", "ja", "fr"], you would construct the canonical string "en_fr_ja" and use an equality filter:
JSON
{"filter": "languages_checksum eq 'en_fr_ja'"}
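The steps above can be sketched in Python as follows. This is a minimal sketch under a few assumptions: the helper names are hypothetical, lowercasing and de-duplicating the values is my own normalization choice (drop it if case matters in your data), and the document shape is only an example. The single-quote doubling follows the OData rule for string literals.

```python
def canonical_languages_key(languages, sep="_"):
    """Build a deterministic key: normalize, de-duplicate, sort, then join."""
    # Assumption: values are compared case-insensitively and duplicates are ignored.
    values = sorted({lang.strip().lower() for lang in languages})
    for value in values:
        if sep in value:
            # A separator inside a value would make two different sets collide.
            raise ValueError(f"value {value!r} contains the separator {sep!r}")
    return sep.join(values)


def exact_set_filter(languages, field="languages_checksum"):
    """Return an OData equality filter on the canonical key field."""
    key = canonical_languages_key(languages)
    escaped = key.replace("'", "''")  # OData escapes a single quote by doubling it
    return f"{field} eq '{escaped}'"


# At indexing time: add the key to each document before uploading it.
doc = {"id": "1", "LANGUAGES": ["ja", "en", "fr"]}
doc["languages_checksum"] = canonical_languages_key(doc["LANGUAGES"])

# At query time: build the filter from the set you want to match exactly.
query = {"filter": exact_set_filter(["en", "ja", "fr"])}
print(doc["languages_checksum"])
print(query)
```

Using the same function for indexing and for querying keeps the two sides consistent, so input order never affects the match.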
Pros:
- Exact match: Provides the precise exact set matching you need.
- Performant: Filtering on a single Edm.String field with equality is very efficient.
- Simple filter query: The filter itself becomes very straightforward.
Cons:
- Pre-processing required: You need to modify your data ingestion pipeline to generate this languages_checksum field.
- Index schema modification: Requires adding a new field to your index.
- Maintainability: Every change to a document's LANGUAGES values must also regenerate languages_checksum, or the two fields drift apart. If the set changes frequently or becomes very large, generating and managing these checksums adds some overhead.