
I am going from a manual setup of my RAG solution in Azure to setting everything up programmatically using the Azure Python SDK. I have a container with a single PDF. When setting up manually, I see that the document count under the created index is 401 when setting the chunk size to 256. When using my custom skillset:

from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SplitSkill,
)

split_skill = SplitSkill(
    name="split",
    description="Split skill to chunk documents",
    context="/document",
    text_split_mode="pages",
    default_language_code="en",
    maximum_page_length=300,  # why can't this be set to 256, as in the manual setup?
    page_overlap_length=30,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages")
    ],
)

I get 271. I want to mimic my manual chunking setup as closely as possible, since I already get good performance with it. What am I missing? Alternatively, could somebody point me to the default chunking setup used when it is configured by hand?

22 FEB EDIT

Answering @JayashankarGS, who commented: "According to this doc, the minimum value you need to give is 300: learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skillset."

To me it looks like maximum_page_length is exactly the chunk size. But you are right: as of today, there is no way to select a chunk size of less than 300 using SplitSkill...


  • According to this doc, the minimum value you need to give is 300: learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skillset. Commented Feb 19, 2024 at 4:35
  • How can I achieve chunking in the same way as when it is set up manually? There is no chunking parameter for SplitSkill... Commented Feb 19, 2024 at 10:38
  • Yes, there is no chunking parameter. All you have is maximumPageLength. Commented Feb 19, 2024 at 11:06
  • Do I need to write my own chunking function? Commented Feb 19, 2024 at 11:59
  • Even if you do your own chunking, the split skill will again take the default length of 300 for your documents. Commented Feb 22, 2024 at 11:23

1 Answer


You can mimic the splitting; however, the text split skill has a minimum length limit of 300, which is not the case in your manual setup.
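For intuition, a rough sketch (not the actual Azure implementation, and the document size below is hypothetical): with sliding-window chunking, the window advances by page length minus overlap, so a larger window produces fewer chunks. That is directionally consistent with the 300/30 skill settings yielding fewer documents (271) than the manual 256 setup (401).

```python
def estimate_chunks(total_chars: int, page_length: int, overlap: int) -> int:
    """Approximate chunk count for a window of page_length characters
    that advances by (page_length - overlap) characters per step."""
    if total_chars <= page_length:
        return 1
    step = page_length - overlap
    # First window covers page_length chars; ceil-divide the remainder by step.
    return 1 + -(-(total_chars - page_length) // step)

doc_size = 70_000  # hypothetical PDF size in characters
print(estimate_chunks(doc_size, 256, 0))   # smaller window -> more chunks
print(estimate_chunks(doc_size, 300, 30))  # 300/30 -> fewer chunks
```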

Since the text split skill doesn't accept a maximum_page_length of less than 300, you can split your documents using the LLM_RAG_CRACK_AND_CHUNK_AND_EMBED built-in component found in the Azure ML registry, and then create an index on the dataset that component produces.

Refer to this Stack Overflow solution regarding LLM_RAG_CRACK_AND_CHUNK_AND_EMBED.
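If you would rather not pull in the Azure ML component, another option is to chunk the text yourself before indexing and upload the chunks as individual documents. A minimal sketch of a hand-rolled 256-character sliding-window chunker (the sizes and the 25-character overlap are illustrative assumptions, not Azure defaults):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 25) -> list[str]:
    """Slide a window of chunk_size characters over the text,
    stepping forward by (chunk_size - overlap) each time."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

pages = chunk_text("lorem ipsum dolor sit amet " * 100)
```

Each resulting chunk could then be uploaded as its own search document (e.g. via the SDK's upload_documents call), bypassing the split skill's 300-character floor entirely.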


3 Comments

This solution is taken from the Stack Overflow thread linked above, shared so that it helps the community find better solutions.
Are there any plans to change this? I see no reason for a minimum of 300.
It may change in the future.
