Azure AI Search indexer reported "Could not parse document. Document key cannot be longer than 1024 characters."

Question

桂學文 Kevin Kuei 145

Hi, I have an Azure AI Search service, and I used the "Import and vectorize data" wizard to create the index, indexer, data source, and skillset for indexing content from Blob Storage.

The indexer then reported the error shown under "Error" below.

Could you please advise me on how to resolve this?

I also noticed that some of the blob pathnames in my Blob Storage container are quite long (to make them easier for humans to read). Could this be contributing to the problem?

Thank you for your assistance.

Error
Document Key
localId=aHR0cHM6Ly9zdGlvdDA2MTkuYmxvYi5jb3JlLndpbmRvd3MubmV0L2N0LWlvdC0wNjE5LWttLyVFOCU4OCVBQSVFOSU4MSU4QiVFNSVBRCVBMyVFNSU4OCU4QS9kb2N1bWVudC8lRTglODglQUElRTklODElOEIlRTUlQUQlQTMlRTUlODglOEFfJUU3JTkyJUIwJUU1JUEyJTgzJUU0JUI4JThEJUU3JUEyJUJBJUU1JUFFJTlBJUU3JTlGJUE1JUU4JUE2JUJBJUU1JUIwJThEJUU4JTg4JUFBJUU3JUFFJUExJUU3JUIzJUJCJUU2JTg5JTgwJUU1JUE0JUE3JUU1JUFEJUI4JUU3JTk0JTlGJUU0JUI5JThCJUU3JTk0JTlGJUU2JUI2JUFGJUU5JTgxJUIyJUU3JTk2JTkxJUU1JUJEJUIxJUU5JTlGJUJGJUU2JThFJUEyJUU4JUE4JThFJUUyJTgwJTk0JUU4JTg3JUFBJUU2JTg4JTkxJUU2JTk1JTg4JUU4JTgzJUJEJUU4JTg4JTg3JUU1JUJGJTgzJUU3JTkwJTg2JUU4JUIzJTg3JUU2JTlDJUFDJUU3JTlBJTg0JUU1JUI5JUIyJUU2JTkzJUJFJUU2JTk1JTg4JUU2JTlFJTlDLyVFNyU5MiVCMCVFNSVBMiU4MyVFNCVCOCU4RCVFNyVBMiVCQSVFNSVBRSU5QSVFNyU5RiVBNSVFOCVBNiVCQSVFNSVCMCU4RCVFOCU4OCVBQSVFNyVBRSVBMSVFNyVCMyVCQiVFNiU4OSU4MCVFNSVBNCVBNyVFNSVBRCVCOCVFNyU5NCU5RiVFNCVCOSU4QiVFNyU5NCU5RiVFNiVCNiVBRiVFOSU4MSVCMiVFNyU5NiU5MSVFNSVCRCVCMSVFOSU5RiVCRiVFNiU4RSVBMiVFOCVBOCU4RSVFMiU4MCU5NCVFOCU4NyVBQSVFNiU4OCU5MSVFNiU5NSU4OCVFOCU4MyVCRCVFOCU4OCU4NyVFNSVCRiU4MyVFNyU5MCU4NiVFOCVCMyU4NyVFNiU5QyVBQyVFNyU5QSU4NCVFNSVCOSVCMiVFNiU5MyVCRSVFNiU5NSU4OCVFNiU5RSU5Qy5wZGY1&documentKey=aHR0cHM6Ly9zdGlvdDA2MTkuYmxvYi5jb3JlLndpbmRvd3MubmV0L2N0LWlvdC0wNjE5LWttLyVFOCU4OCVBQSVFOSU4MSU4QiVFNSVBRCVBMyVFNSU4OCU4QS9kb2N1bWVudC8lRTglODglQUElRTklODElOEIlRTUlQUQlQTMlRTUlODglOEFfJUU3JTkyJUIwJUU1JUEyJTgzJUU0JUI4JThEJUU3JUEyJUJBJUU1JUFFJTlBJUU3JTlGJUE1JUU4JUE2JUJBJUU1JUIwJThEJUU4JTg4JUFBJUU3JUFFJUExJUU3JUIzJUJCJUU2JTg5JTgwJUU1JUE0JUE3JUU1JUFEJUI4JUU3JTk0JTlGJUU0JUI5JThCJUU3JTk0JTlGJUU2JUI2JUFGJUU5JTgxJUIyJUU3JTk2JTkxJUU1JUJEJUIxJUU5JTlGJUJGJUU2JThFJUEyJUU4JUE4JThFJUUyJTgwJTk0JUU4JTg3JUFBJUU2JTg4JTkxJUU2JTk1JTg4JUU4JTgzJUJEJUU4JTg4JTg3JUU1JUJGJTgzJUU3JTkwJTg2JUU4JUIzJTg3JUU2JTlDJUFDJUU3JTlBJTg0JUU1JUI5JUIyJUU2JTkzJUJFJUU2JTk1JTg4JUU2JTlFJTlDLyVFNyU5MiVCMCVFNSVBMiU4MyVFNCVCOCU4RCVFNyVBMiVCQSVFNSVBRSU5QSVFNyU5RiVBNSVFOCVBNiVCQSVFNSVCMCU4RCVFOCU4OCVBQSVFNyVBRSVBMSVFNyVCMyVCQiVFNiU4OSU4MCVFNSVBNCVBNyVFNSVBRCVCOCVFNyU5NCU5RiVFNCVCOSU4QiVFNyU
5NCU5RiVFNiVCNiVBRiVFOSU4MSVCMiVFNyU5NiU5MSVFNSVCRCVCMSVFOSU5RiVCRiVFNiU4RSVBMiVFOCVBOCU4RSVFMiU4MCU5NCVFOCU4NyVBQSVFNiU4OCU5MSVFNiU5NSU4OCVFOCU4MyVCRCVFOCU4OCU4NyVFNSVCRiU4MyVFNyU5MCU4NiVFOCVCMyU4NyVFNiU5QyVBQyVFNyU5QSU4NCVFNSVCOSVCMiVFNiU5MyVCRSVFNiU5NSU4OCVFNiU5RSU5Qy5wZGY1

Operation
Target field 'chunk_id' is either not present, doesn't have a value set, or no data could be extracted from the document for it.Failed document: 'https://stiot0619.blob.core.windows.net/ct-iot-0619-km/%E8%88%AA%E9%81%8B%E5%AD%A3%E5%88%8A/document/%E8%88%AA%E9%81%8B%E5%AD%A3%E5%88%8A_%E7%92%B0%E5%A2%83%E4%B8%8D%E7%A2%BA%E5%AE%9A%E7%9F%A5%E8%A6%BA%E5%B0%8D%E8%88%AA%E7%AE%A1%E7%B3%BB%E6%89%80%E5%A4%A7%E5%AD%B8%E7%94%9F%E4%B9%8B%E7%94%9F%E6%B6%AF%E9%81%B2%E7%96%91%E5%BD%B1%E9%9F%BF%E6%8E%A2%E8%A8%8E%E2%80%94%E8%87%AA%E6%88%91%E6%95%88%E8%83%BD%E8%88%87%E5%BF%83%E7%90%86%E8%B3%87%E6%9C%AC%E7%9A%84%E5%B9%B2%E6%93%BE%E6%95%88%E6%9E%9C/%E7%92%B0%E5%A2%83%E4%B8%8D%E7%A2%BA%E5%AE%9A%E7%9F%A5%E8%A6%BA%E5%B0%8D%E8%88%AA%E7%AE%A1%E7%B3%BB%E6%89%80%E5%A4%A7%E5%AD%B8%E7%94%9F%E4%B9%8B%E7%94%9F%E6%B6%AF%E9%81%B2%E7%96%91%E5%BD%B1%E9%9F%BF%E6%8E%A2%E8%A8%8E%E2%80%94%E8%87%AA%E6%88%91%E6%95%88%E8%83%BD%E8%88%87%E5%BF%83%E7%90%86%E8%B3%87%E6%9C%AC%E7%9A%84%E5%B9%B2%E6%93%BE%E6%95%88%E6%9E%9C.pdf'

Message
Could not parse document. Document key cannot be longer than 1024 characters.

Details
Target field 'chunk_id' is either not present, doesn't have a value set, or no data could be extracted from the document for it.Failed document: 'https://stiot0619.blob.core.windows.net/ct-iot-0619-km/%E8%88%AA%E9%81%8B%E5%AD%A3%E5%88%8A/document/%E8%88%AA%E9%81%8B%E5%AD%A3%E5%88%8A_%E7%92%B0%E5%A2%83%E4%B8%8D%E7%A2%BA%E5%AE%9A%E7%9F%A5%E8%A6%BA%E5%B0%8D%E8%88%AA%E7%AE%A1%E7%B3%BB%E6%89%80%E5%A4%A7%E5%AD%B8%E7%94%9F%E4%B9%8B%E7%94%9F%E6%B6%AF%E9%81%B2%E7%96%91%E5%BD%B1%E9%9F%BF%E6%8E%A2%E8%A8%8E%E2%80%94%E8%87%AA%E6%88%91%E6%95%88%E8%83%BD%E8%88%87%E5%BF%83%E7%90%86%E8%B3%87%E6%9C%AC%E7%9A%84%E5%B9%B2%E6%93%BE%E6%95%88%E6%9E%9C/%E7%92%B0%E5%A2%83%E4%B8%8D%E7%A2%BA%E5%AE%9A%E7%9F%A5%E8%A6%BA%E5%B0%8D%E8%88%AA%E7%AE%A1%E7%B3%BB%E6%89%80%E5%A4%A7%E5%AD%B8%E7%94%9F%E4%B9%8B%E7%94%9F%E6%B6%AF%E9%81%B2%E7%96%91%E5%BD%B1%E9%9F%BF%E6%8E%A2%E8%A8%8E%E2%80%94%E8%87%AA%E6%88%91%E6%95%88%E8%83%BD%E8%88%87%E5%BF%83%E7%90%86%E8%B3%87%E6%9C%AC%E7%9A%84%E5%B9%B2%E6%93%BE%E6%95%88%E6%9E%9C.pdf'

my index:

{
  "@odata.etag": "\"0x8DDD01644836E64\"",
  "name": "idx-iot-0619",
  "fields": [
    {
      "name": "chunk_id",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": true,
      "facetable": false,
      "key": true,
      "analyzer": "keyword",
      "synonymMaps": []
    },
    {
      "name": "parent_id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "chunk",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "text_vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "dimensions": 3072,
      "vectorSearchProfile": "idx-iot-0619-azureOpenAi-text-profile",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "semantic": {
    "defaultConfiguration": "idx-iot-0619-semantic-configuration",
    "configurations": [
      {
        "name": "idx-iot-0619-semantic-configuration",
        "flightingOptIn": false,
        "rankingOrder": "BoostedRerankerScore",
        "prioritizedFields": {
          "titleField": {
            "fieldName": "title"
          },
          "prioritizedContentFields": [
            {
              "fieldName": "chunk"
            }
          ],
          "prioritizedKeywordsFields": []
        }
      }
    ]
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "idx-iot-0619-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        }
      }
    ],
    "profiles": [
      {
        "name": "idx-iot-0619-azureOpenAi-text-profile",
        "algorithm": "idx-iot-0619-algorithm",
        "vectorizer": "idx-iot-0619-azureOpenAi-text-vectorizer"
      }
    ],
    "vectorizers": [
      {
        "name": "idx-iot-0619-azureOpenAi-text-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
          "resourceUri": "https://aoai-iot-0619.openai.azure.com",
          "deploymentId": "text-embedding-3-large",
          "apiKey": "<redacted>",
          "modelName": "text-embedding-3-large"
        }
      }
    ],
    "compressions": []
  }
}

my indexer:

{
  "@odata.context": "https://as-iot-0619.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DDD016BB8DE288\"",
  "name": "idx-iot-0619-indexer",
  "description": null,
  "dataSourceName": "idx-iot-0619-datasource",
  "skillsetName": "idx-iot-0619-skillset",
  "targetIndexName": "idx-iot-0619",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "title",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

桂學文 Kevin Kuei 145 Reputation points

2025-08-01T00:13:40.9+00:00

Thank you for your response.

However, I’m still unsure how to resolve this issue.

In fact, I didn’t explicitly specify how the chunk_id value should be obtained (you can see this from my indexer definition). It seems that the “Import and Vectorize Data” wizard automatically uses the base64-encoded blob path as the chunk_id value. Unfortunately, the blob path is inherently too long (this path is provided by the users, so I have no control over shortening it).

My question is: How can I address this issue while still using the “Import and Vectorize Data” wizard?

I considered using a mappingFunction to hash the blob path, but it seems that mappingFunction does not support hashing. :(

桂學文 Kevin Kuei 145 Reputation points

2025-08-01T00:37:16.4333333+00:00

Just to add some context: the reason I must use the “Import and Vectorize Data” wizard is that it automatically creates the index, indexer, data source, and skillset for me.

Manually setting up all of those resources would be far too complicated for me.

Nikhil Jha (Accenture International Limited) 230 Reputation points Microsoft External Staff Moderator

2025-08-05T03:38:12.61+00:00

Hello 桂學文 Kevin Kuei,

I hope this has been helpful! We appreciate hearing from you and would love to help others who may have the same question. Accepting answers helps increase visibility of this question for other members of the Microsoft Q&A community. Thank you for helping to improve Microsoft Q&A!


Answer 1

Hello 桂學文 Kevin Kuei,

You're right that the “Import and Vectorize Data” wizard streamlines onboarding: it auto-generates the index, data source, skillset, and indexer. However, as you've observed, the wizard derives the document key from the blob's full path, encoded in Base64. With long or URL-encoded blob paths, the resulting string can exceed the 1024-character key limit enforced by Azure AI Search.

The wizard currently provides no built-in way to override or transform this default key behavior. In addition, mappingFunction does not support key transformations such as hashing or GUID creation, which limits the options for long or complex blob paths.

As community contributor Divyesh Govaerdhanan rightly suggested (credit to the community here 🙌), a feasible workaround is:

1. Make chunk_id a non-key field

This lets you keep the full blob path for reference without violating the key length constraint.

2. Create a new field such as doc_id or id

Populate it with a short, unique value such as a GUID or a truncated hash.
This must be done outside the wizard in a preprocessing step (e.g., an Azure Function, Logic App, or script).
Set this new field as your index key.
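
As a rough illustration of the "truncated hash" idea in step 2, the sketch below derives a short, deterministic key from a blob URL. The function name `short_doc_key` and the example URL are hypothetical, and using SHA-256 is an assumption rather than anything the wizard or the service prescribes. A hex digest contains only digits and the letters a-f, which fall within the characters allowed in a document key.

```python
import hashlib


def short_doc_key(blob_url: str) -> str:
    """Derive a fixed-length, deterministic key from a blob URL (hypothetical helper)."""
    # A SHA-256 hex digest is always 64 characters long, regardless of input length.
    return hashlib.sha256(blob_url.encode("utf-8")).hexdigest()


# Hypothetical URL; any long blob path yields a key of the same length.
example_url = "https://example.blob.core.windows.net/container/" + "long-folder-name/" * 50 + "file.pdf"
key = short_doc_key(example_url)
print(len(example_url), len(key))
```

Because the same URL always produces the same key, re-running the preprocessing step updates existing documents instead of creating duplicates.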

Currently, the wizard doesn't expose customization hooks for setting or transforming the document key. Users have raised this limitation multiple times, and it is under review by the Azure AI Search product team.

If manual setup seems overwhelming, a middle ground would be to:

1. Use the wizard to generate your pipeline.
2. Export the index definition from the Azure portal or API.
3. Modify the index to use doc_id as the key.
4. Re-index your data using a shorter identifier injected into the documents.
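
The "modify the index" step above could look roughly like the following, working on the exported JSON as a plain dictionary. This is a sketch under assumptions: `rekey_index` is a hypothetical helper, the attributes chosen for the new key field are illustrative, and it writes the result under a new index name because the key field of an existing index cannot be changed in place. It does not address how the wizard-generated skillset would populate the new field.

```python
import copy


def rekey_index(index_def: dict, new_name: str, key_field: str = "doc_id") -> dict:
    """Return a copy of an exported index definition that uses a new key field."""
    new_def = copy.deepcopy(index_def)
    new_def.pop("@odata.etag", None)  # the ETag belongs to the original index
    new_def["name"] = new_name
    for field in new_def["fields"]:
        field["key"] = False  # demote the old key (chunk_id) to a regular field
    # Assumed attributes for the new key; adjust to your own needs.
    new_def["fields"].insert(0, {
        "name": key_field,
        "type": "Edm.String",
        "key": True,
        "searchable": False,
        "filterable": True,
        "retrievable": True,
    })
    return new_def


# Trimmed-down version of the index definition from the question.
exported = {
    "@odata.etag": "\"0x8DDD01644836E64\"",
    "name": "idx-iot-0619",
    "fields": [
        {"name": "chunk_id", "type": "Edm.String", "key": True},
        {"name": "chunk", "type": "Edm.String", "key": False},
    ],
}
rekeyed = rekey_index(exported, new_name="idx-iot-0619-v2")
print([f["name"] for f in rekeyed["fields"] if f["key"]])
```

The original dictionary is left untouched, so the exported definition can be kept as a backup.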

We understand that this adds complexity, and we appreciate your feedback; it helps prioritize future improvements. Let us know if you'd like help exporting the configuration or generating unique keys for your documents.

Answer 2

Hello, and welcome to Microsoft Q&A.

This error occurs because your document key (chunk_id) exceeds the 1024-character limit that Azure AI Search enforces on key fields, which is a hard constraint. The field chunk_id is marked as the key ("key": true) in your index, and key fields can be no longer than 1024 UTF-16 characters. In your case, chunk_id is generated from a long, URL-encoded blob path that is then Base64-encoded, which produces a string beyond that limit.
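
To see why readable CJK path names hit the limit so quickly, the arithmetic can be checked with a small sketch. It assumes standard URL-safe Base64 as an approximation of what the service does; the exact variant used by the indexer is not confirmed here, and `encoded_lengths` is a hypothetical helper.

```python
import base64
from urllib.parse import quote


def encoded_lengths(name: str) -> tuple[int, int]:
    """Return (URL-encoded length, Base64 length) for one path segment."""
    url_encoded = quote(name)  # each CJK character becomes three %XX escapes
    b64 = base64.urlsafe_b64encode(url_encoded.encode("ascii"))
    return len(url_encoded), len(b64)


# "航運季刊" is the first folder name in the failing blob URL.
print(encoded_lengths("航運季刊"))  # (36, 48)
```

Each of the four characters takes 9 characters once URL-encoded and 12 once Base64-encoded. The key shown in the error also appears to contain the encoded path twice (once after localId= and once after documentKey=), so a path of a few dozen CJK characters is enough to pass 1024.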

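You could generate a short, unique key, or, if your documents already carry an identifier that is both short and unique, use that for the key field instead. Note that metadata_storage_path is the full blob URL, which is the value that is too long here, and metadata_storage_name is only the file name, so it is not guaranteed to be unique across folders.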

If you must preserve the long identifiers for any reason:

1. Split metadata and content indexing into two separate indexes.
2. Store the long URI in a non-key field in the secondary index.
3. Link them using a short hashed ID as a join key.
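
A minimal sketch of the two-index idea follows, assuming documents are prepared in your own code before upload. The field names chunk_id, parent_id, and chunk come from the index in the question; `id`, `source_url`, and the helper `split_records` are hypothetical. Azure AI Search does not join indexes for you, so the shared ID would be resolved by a second lookup in the client.

```python
import hashlib


def split_records(blob_url: str, chunk_no: int, chunk: str) -> tuple[dict, dict]:
    """Build a content document and a metadata document that share a short ID."""
    # 32 hex characters (128 bits) of SHA-256 serve as the join key.
    parent_id = hashlib.sha256(blob_url.encode("utf-8")).hexdigest()[:32]
    content_doc = {
        "chunk_id": f"{parent_id}_{chunk_no}",  # short key for the content index
        "parent_id": parent_id,
        "chunk": chunk,
    }
    metadata_doc = {
        "id": parent_id,          # key of the secondary index
        "source_url": blob_url,   # long URI kept in a non-key field
    }
    return content_doc, metadata_doc


long_url = "https://example.blob.core.windows.net/container/" + "a" * 2000 + ".pdf"
content, metadata = split_records(long_url, 0, "first chunk of text")
print(content["chunk_id"], len(metadata["source_url"]))
```

The key length no longer depends on the path length, while the full URL remains retrievable from the secondary index.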

https://learn.microsoft.com/en-us/azure/search/search-what-is-an-index#key-field

Please upvote and accept the answer if it helps!!

Answer 3

Hello 桂學文 Kevin Kuei,

You're right: the “Import and Vectorize Data” wizard simplifies onboarding by generating the index, indexer, skillset, and data source automatically. However, it currently uses the full (encoded) blob path as the document key, which is a problem when the result exceeds the 1024-character limit, as in your case.

At this time, the wizard does not support hashing or otherwise transforming the chunk_id key automatically. Users have reported this limitation in feedback and on Microsoft community forums, and several requests to make the key field logic customizable in the wizard have been raised, but this is still under review.

Would it be possible for you to:

1. Make chunk_id a non-key field
2. Create a new field doc_id (or id) and:
- Populate it with a GUID or truncated hash from a custom app/script
- Set doc_id as your index key

As you've observed, mappingFunction does not support hash() or GUID generation natively, so that must happen outside the wizard.
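
For the GUID option, one possibility is a name-based GUID, so that the same blob always maps to the same key. This is only a sketch: `guid_doc_key` is a hypothetical helper, and choosing UUID version 5 with the URL namespace is an assumption, not something the wizard or the service requires.

```python
import uuid


def guid_doc_key(blob_url: str) -> str:
    """Deterministic GUID (UUID version 5) derived from a blob URL."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, blob_url))


# Hypothetical URL for illustration.
guid_key = guid_doc_key("https://example.blob.core.windows.net/container/report.pdf")
print(len(guid_key))  # 36
```

A random GUID (uuid.uuid4()) would also be short enough, but it changes on every run, so re-indexing the same blob would create a new document rather than update the existing one.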

Please Upvote and accept the answer if it helps!!
