
I am new to Azure AI Search. I want to get a chunk index attribute from this skillset so that I know at which position in the document each chunk is located. After the split, the content of pages looks like this:

{'values': [{'recordId': '0', 'data': {'text': 'sample data 1 '}}, {'recordId': '1', 'data': {'text': 'sample data 1'}}, {'recordId': '2', 'data': {'text': 'sample data 3'}}]}

How can I copy the recordId value into a field of the index?

{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#3",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    }
  ],
  "@odata.etag": "\"0x8DD029DA50735BD\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          }, // want to add a recordId here
  
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
  • SplitSkill itself doesn't expose recordId directly. Try adding it as a custom field by creating a projection that extracts recordId in the final index configuration. Commented Nov 12, 2024 at 5:21

1 Answer


How to get the chunk index with Split Skill in Azure AI Search?

Add a custom skill that assigns a chunkIndex field to each chunk, representing its position. This lets you track each chunk's index within the document after splitting with SplitSkill.

  • chunkIndex can then be projected into the search index, enabling you to know each chunk’s exact position within the original document.
{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents, assign a recordId and chunkIndex to each chunk, and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "documentChunkingSkill",
      "description": "Splits document into chunks",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "generateRecordIdAndChunkIndexSkill",
      "description": "Generates a unique recordId and chunkIndex for each chunk",
      "context": "/document/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*/text"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        },
        {
          "name": "chunkIndex",
          "targetName": "chunkIndex"
        }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*/text"
          },
          {
            "name": "recordId",
            "source": "/document/pages/*/recordId"
          },
          {
            "name": "chunkIndex",
            "source": "/document/pages/*/chunkIndex"
          },
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

In the skillset above, I added a custom skill that processes each chunk and gives it a unique recordId.

  • It splits the document into chunks based on a specified page length.

  • It generates a unique recordId for each chunk.

  • It maps the split content and recordId to the final index.

The recordId was generated for each chunk as expected.
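For the custom skill itself, here is a minimal sketch of what the handler logic could look like, assuming it is hosted behind a custom Web API skill (`#Microsoft.Skills.Custom.WebApiSkill`) that runs at the `/document` context and receives the whole `/document/pages` array. The function name `add_chunk_indexes`, the input name `pages`, and the output name `chunks` are hypothetical choices for illustration, not names defined by Azure AI Search. The request and response follow the custom skill contract: a `values` array whose records each carry a `recordId` and a `data` object.

```python
import json
from typing import Any


def add_chunk_indexes(request_body: dict[str, Any]) -> dict[str, Any]:
    """Build a custom-skill response that pairs every chunk with its position.

    Assumption: each record's data holds the SplitSkill output under the
    hypothetical input name "pages" (a list of strings).
    """
    results = []
    for record in request_body.get("values", []):
        pages = record.get("data", {}).get("pages") or []
        # enumerate() yields the zero-based position of each chunk in the document
        chunks = [
            {"chunkIndex": index, "text": text}
            for index, text in enumerate(pages)
        ]
        results.append(
            {
                "recordId": record["recordId"],  # echo the incoming recordId back
                "data": {"chunks": chunks},      # "chunks" is a hypothetical output name
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample = {
        "values": [
            {"recordId": "0", "data": {"pages": ["first chunk", "second chunk"]}}
        ]
    }
    print(json.dumps(add_chunk_indexes(sample), indent=2))
```

Under these assumptions, the skill output would be mapped to something like `/document/chunks`, and the index projection would use `/document/chunks/*` as its `sourceContext`, with `/document/chunks/*/text` and `/document/chunks/*/chunkIndex` as mapping sources. Note that the `recordId` in the payload only identifies the record within one skill call, so the chunk position has to come from the enumeration rather than from `recordId`.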



2 Comments

I am getting this error: One or more skills are invalid. Details: Error in skill 'generateRecordIdAndChunkIndexSkill': Inputs are not supported by skill: content. Supported inputs: file_data
You need to pass the data in a format that the skill can process.
