In a project I'm currently working on, I have an information extraction task to perform with an LLM that requires a large set of instructions. These instructions contain an object schema (plus the schemas of that object's components) and input/output examples for the object and its components, generated with the Kor library in Python. The instruction set is large, about 5,000 tokens before the input text is added. To perform the task, I send these instructions to an LLM together with the text from which the information needs to be extracted.
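For context, this is roughly what the setup looks like (a minimal sketch: the schema, attribute names, and examples are simplified placeholders; my real schema has nested component objects and many more examples, which is how the generated instructions reach ~5k tokens):

```python
from langchain_openai import AzureChatOpenAI
from kor import create_extraction_chain, Object, Text

# Placeholder schema standing in for my real one, which has nested
# component objects and many more input/output examples.
schema = Object(
    id="invoice",
    description="An invoice described in free text.",
    attributes=[
        Text(id="customer", description="Name of the customer"),
        Text(id="total", description="Total amount due"),
    ],
    examples=[
        (
            "Invoice for ACME Corp, total due: 1200 EUR",
            [{"customer": "ACME Corp", "total": "1200 EUR"}],
        ),
    ],
)

# Endpoint, API key and API version come from environment variables;
# "my-deployment" is a placeholder deployment name.
llm = AzureChatOpenAI(azure_deployment="my-deployment", temperature=0)
chain = create_extraction_chain(llm, schema)

# Each call re-sends the full Kor-generated instructions (~5k tokens)
# in addition to the input text. Exact invocation may differ slightly
# depending on the Kor/LangChain versions.
result = chain.invoke("Some text to extract the information from...")
```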
I would like to deploy a model for my web app, in which multiple users may call this function at the same time. I currently use a model deployed on Azure AI Foundry, but the problem is that every time the model is called, I have to send it the complete set of instructions, consuming an additional ~5k tokens per call.
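At the API level, every request therefore looks roughly like this (a sketch; the endpoint, key, and deployment name are placeholders), with the full instruction block repeated as the system message each time:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<my-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                      # placeholder
    api_version="2024-06-01",
)

KOR_INSTRUCTIONS = "..."  # the ~5k-token instruction prompt generated by Kor

def extract(text: str) -> str:
    # The chat completions API is stateless, so the full instruction
    # prompt is sent (and billed) again on every single call.
    response = client.chat.completions.create(
        model="my-deployment",  # placeholder Azure deployment name
        messages=[
            {"role": "system", "content": KOR_INSTRUCTIONS},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```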
Is there any way to deploy a model with this set of instructions already "in memory", without having to fine-tune a model? I would like to avoid having to send these instructions every time I want to perform this task.
I could probably open multiple conversations with the model and reuse them, but the texts previously sent to the model would then be added to the context, which is not what I want here. Each call to this function is independent of any previous or future call. So keeping 5-10 conversations open and routing each request to whichever conversation is available (one that may already have been used) is a possible workaround, but it could create problems with the context window and the initial instructions.
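To make that last idea concrete, here is a rough sketch of the conversation pool I have in mind (hypothetical names, same placeholder client as above); it also shows why the context keeps growing:

```python
import queue
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<my-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                      # placeholder
    api_version="2024-06-01",
)

KOR_INSTRUCTIONS = "..."  # the ~5k-token instruction prompt generated by Kor
POOL_SIZE = 5

# Each "conversation" starts with the instructions already in its history.
pool: "queue.Queue[list[dict]]" = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put([{"role": "system", "content": KOR_INSTRUCTIONS}])

def extract(text: str) -> str:
    history = pool.get()  # take whichever conversation is free
    history.append({"role": "user", "content": text})
    response = client.chat.completions.create(
        model="my-deployment",  # placeholder deployment name
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    pool.put(history)  # make the conversation available again
    # Problem: every unrelated text/answer pair accumulates in `history`,
    # so the context window fills up over time even though each call
    # should be independent of the previous ones.
    return answer
```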