How to get the chunk index with Split Skill in Azure AI Search?

Question

I am new to Azure AI Search. I want to get a chunk index attribute from this skillset so that I know where in the document each chunk is located. After the split, the content of pages looks like this:

{'values': [{'recordId': '0', 'data': {'text': 'sample data 1 '}}, {'recordId': '1', 'data': {'text': 'sample data 1'}}, {'recordId': '2', 'data': {'text': 'sample data 3'}}]}

How can I copy the recordId value into an index field?

{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#3",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    }
  ],
  "@odata.etag": "\"0x8DD029DA50735BD\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          }, // want to add a recordId here
  
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

SplitSkill itself doesn't expose recordId directly. Try adding it as a custom field by creating a projection that extracts recordId in the final index configuration.
– Suresh Chikkam, Commented Nov 12, 2024 at 5:21

Suresh Chikkam · Accepted Answer · 2024-11-12 11:17:25Z

Add a custom skill that assigns a chunkIndex field to each chunk, representing its position. This lets you track the chunk index within the document after splitting with SplitSkill.

The chunkIndex can then be projected into the search index, so you know each chunk's position within the original document.
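
As a rough illustration of what such a custom skill could compute, here is a minimal Python sketch of the request handler only (hosting it behind an HTTP endpoint is left out). The function name assign_chunk_index is hypothetical. It assumes the chunks arrive in the payload shape shown in the question and that recordId is a numeric string reflecting chunk order within the batch. That ordering is an assumption taken from the sample payload, not something this sketch can guarantee.

```python
import json
from typing import Any, Dict, List


def assign_chunk_index(request_body: Dict[str, Any]) -> Dict[str, Any]:
    """Build a custom-skill style response that echoes each record's
    recordId and adds a numeric chunkIndex (hypothetical helper)."""
    results: List[Dict[str, Any]] = []
    for position, record in enumerate(request_body.get("values", [])):
        record_id = str(record["recordId"])
        # Assumption: recordId is a numeric string that mirrors chunk order.
        # If it is not numeric, fall back to the position in the batch.
        chunk_index = int(record_id) if record_id.isdigit() else position
        results.append(
            {
                "recordId": record_id,  # must match the incoming recordId
                "data": {"recordId": record_id, "chunkIndex": chunk_index},
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample = {
        "values": [
            {"recordId": "0", "data": {"text": "sample data 1"}},
            {"recordId": "1", "data": {"text": "sample data 2"}},
        ]
    }
    print(json.dumps(assign_chunk_index(sample), indent=2))
```

The two keys under data are named recordId and chunkIndex so that they line up with the output names a skill definition would map into the enrichment tree.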

{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents, assign a recordId and chunkIndex to each chunk, and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "documentChunkingSkill",
      "description": "Splits document into chunks",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "generateRecordIdAndChunkIndexSkill",
      "description": "Custom Web API skill that returns a recordId and chunkIndex for each chunk",
      "uri": "https://<your-custom-skill-endpoint>",
      "httpMethod": "POST",
      "context": "/document/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        },
        {
          "name": "chunkIndex",
          "targetName": "chunkIndex"
        }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          },
          {
            "name": "recordId",
            "source": "/document/pages/*/recordId"
          },
          {
            "name": "chunkIndex",
            "source": "/document/pages/*/chunkIndex"
          },
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

In the skillset above, I added a custom skill that processes each chunk and assigns it a unique recordId.

The skillset splits the document into chunks based on the specified page length.
It generates a unique recordId for each chunk.
It maps the split content and recordId to the final index.

A recordId was generated for each chunk as expected.
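
For the mappings to land, the target index needs a field for every projected name. The sketch below writes one possible set of field definitions as a Python structure and checks it against the mapping names used in the skillset above. The key field name chunk_id, the chosen field types and attributes, and the helper missing_fields are assumptions for illustration; they are not taken from the original index.

```python
import json
from typing import Any, Dict, List

# Hypothetical field list for the target index. Only parent_id, content,
# recordId, chunkIndex, and metadata_title appear in the skillset above;
# chunk_id and all types/attributes are illustrative assumptions.
INDEX_FIELDS: List[Dict[str, Any]] = [
    # Assumption: a string key field using the keyword analyzer.
    {"name": "chunk_id", "type": "Edm.String", "key": True,
     "searchable": True, "analyzer": "keyword"},
    # Matches parentKeyFieldName in the selector.
    {"name": "parent_id", "type": "Edm.String", "filterable": True},
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "recordId", "type": "Edm.String", "filterable": True},
    {"name": "chunkIndex", "type": "Edm.Int32",
     "filterable": True, "sortable": True},
    {"name": "metadata_title", "type": "Edm.String", "searchable": True},
]

# Mapping names copied from the indexProjections selector.
PROJECTED_NAMES = ["content", "recordId", "chunkIndex", "metadata_title"]


def missing_fields(fields: List[Dict[str, Any]],
                   mapping_names: List[str],
                   parent_key: str) -> List[str]:
    """Return the projected names (plus the parent key) that the
    index field list does not define."""
    defined = {field["name"] for field in fields}
    required = [parent_key, *mapping_names]
    return [name for name in required if name not in defined]


if __name__ == "__main__":
    print(json.dumps(INDEX_FIELDS, indent=2))
    print("missing:", missing_fields(INDEX_FIELDS, PROJECTED_NAMES, "parent_id"))
```

Making chunkIndex a sortable integer is a design choice in this sketch: it would let the chunks of one parent document be returned in their original order.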

I am getting this error: One or more skills are invalid. Details: Error in skill 'generateRecordIdAndChunkIndexSkill': Inputs are not supported by skill: content. Supported inputs: file_data
You need to pass the data in a format that the skill can process.
