
I am going from a manual setup of my RAG solution in Azure to setting everything up programmatically using the Azure Python SDK. I have a container with a single PDF. When setting up manually, I see that the document count under the created index is 401 when setting the chunk size to 256. When using my custom skillset:

from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SplitSkill,
)

split_skill = SplitSkill(
    name="split",
    description="Split skill to chunk documents",
    context="/document",
    text_split_mode="pages",
    default_language_code="en",
    maximum_page_length=300,  # why can't this be set to 256, as in the manual setup?
    page_overlap_length=30,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages")
    ],
)

I get 271. I want to mimic my manual chunking setup as closely as possible, since I already get good performance with it. What am I missing? Alternatively, could somebody point me to the default chunking setup used when it is configured by hand?

22 FEB EDIT

Answering @JayashankarGS, who commented: "According to this doc, the minimum value you need to give is 300: learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skillset."

To me it looks like maximum_page_length is exactly the chunk size. But you are right: as of today, there is no way to select a chunk size of less than 300 using SplitSkill...


  • According to this doc, the minimum value you need to give is 300: learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skillset. Commented Feb 19, 2024 at 4:35
  • How can I achieve chunking in the same way as when it is set up manually? There is no chunking parameter for SplitSkill... Commented Feb 19, 2024 at 10:38
  • Yes, there is no chunking parameter. All you have is maximumPageLength. Commented Feb 19, 2024 at 11:06
  • Do I need to write my own chunking function? Commented Feb 19, 2024 at 11:59
  • Even if you do your own chunking, the split skill will again take the default length of 300 for your documents. Commented Feb 22, 2024 at 11:23

1 Answer


You can mimic the splitting; however, the text split skill has a minimum length limit of 300, which is not the case in your manual setup.
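For intuition, a rough sketch (not the actual Azure implementation, and the document size below is hypothetical): with sliding-window chunking, the window advances by page length minus overlap, so a larger window produces fewer chunks. That is directionally consistent with the 300/30 skill settings yielding fewer documents (271) than the manual 256 setup (401).

```python
def estimate_chunks(total_chars: int, page_length: int, overlap: int) -> int:
    """Approximate chunk count for a window of page_length characters
    that advances by (page_length - overlap) characters per step."""
    if total_chars <= page_length:
        return 1
    step = page_length - overlap
    # First window covers page_length chars; ceil-divide the remainder by step.
    return 1 + -(-(total_chars - page_length) // step)

doc_size = 70_000  # hypothetical PDF size in characters
print(estimate_chunks(doc_size, 256, 0))   # smaller window -> more chunks
print(estimate_chunks(doc_size, 300, 30))  # 300/30 -> fewer chunks
```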

Since the text split skill doesn't accept a maximum_page_length of less than 300, you can split your documents using the LLM_RAG_CRACK_AND_CHUNK_AND_EMBED built-in component found in the Azure ML registry, and then create an index on the dataset that component produces.

Refer to this Stack Overflow solution regarding LLM_RAG_CRACK_AND_CHUNK_AND_EMBED.
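If you would rather not pull in the Azure ML component, another option is to chunk the text yourself before indexing and upload the chunks as individual documents. A minimal sketch of a hand-rolled 256-character sliding-window chunker (the sizes and the 25-character overlap are illustrative assumptions, not Azure defaults):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 25) -> list[str]:
    """Slide a window of chunk_size characters over the text,
    stepping forward by (chunk_size - overlap) each time."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

pages = chunk_text("lorem ipsum dolor sit amet " * 100)
```

Each resulting chunk could then be uploaded as its own search document (e.g. via the SDK's upload_documents call), bypassing the split skill's 300-character floor entirely.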


3 Comments

This solution is taken from the Stack Overflow thread linked above, shared so that it helps the community find better solutions.
Are there any plans to change this? I see no reason for a minimum of 300.
It may change in the future.
