I am moving from a manual setup of my RAG solution in Azure to setting everything up programmatically with the Azure Python SDK. I have a container with a single PDF. With the manual setup, I see that the document count under the created index is 401 when the chunk size is set to 256. When using my custom skillset:
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SplitSkill,
)

split_skill = SplitSkill(
    name="split",
    description="Split skill to chunk documents",
    context="/document",
    text_split_mode="pages",
    default_language_code="en",
    maximum_page_length=300,  # why can't this be set to 256 if I can do that with the manual setup?
    page_overlap_length=30,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages"),
    ],
)
I get 271. I want to mimic my manual chunking setup as much as possible as I already have good performance. What am I missing? Alternatively, could somebody point me to the default setup for chunking when it is performed by hand?
22 FEB EDIT
Answering @JayashankarGS, who commented:
"According to this doc, the minimum value you need to give is 300. learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skillset."
To me it looks like maximum_page_length is exactly chunking_size. But you are right, as of today, there is nothing to do regarding selecting a chunk size of less than 300 using SplitSkill...
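For reference, one workaround would be to chunk the text client-side before indexing. Below is a minimal sketch of fixed-size chunking with overlap, assuming sizes are measured in characters. The function name `chunk_text` and its parameters are hypothetical and not part of the Azure SDK, and since it cuts on raw character offsets it will not necessarily reproduce the chunk boundaries or counts of either the manual setup or SplitSkill.

```python
def chunk_text(text: str, max_length: int = 256, overlap: int = 30) -> list[str]:
    """Split text into character windows of at most max_length.

    Consecutive windows share `overlap` characters. This is an
    illustrative helper, not the algorithm used by Azure AI Search.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    step = max_length - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        # Stop once the current window already reaches the end of the text.
        if start + max_length >= len(text):
            break
        start += step
    return chunks
```

Calling `chunk_text(content, max_length=256, overlap=30)` on the extracted PDF text would give a list of strings that could then be uploaded as individual documents; whether that gets close to the 401 documents from the manual setup would have to be checked empirically.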
