Using OpenVINO GenAI in Chat Scenario

For chat applications, OpenVINO GenAI provides dedicated optimizations that maintain conversation context and improve performance by reusing the KV-cache across turns.

Refer to the Stateful Models vs Stateless Models page for more information about the KV-cache.

tip

Use start_chat() and finish_chat() to properly manage the chat session's KV-cache. This improves performance by reusing context between messages.
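For instance, here is a minimal two-turn sketch (assuming model_path points to a folder with an exported OpenVINO model) in which the second question can only be answered because the first turn's context is kept in the session's KV-cache:

import openvino_genai as ov_genai

# model_path is assumed to point to a folder with an exported OpenVINO model
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

pipe.start_chat()
# First turn: establishes context that is kept in the KV-cache
print(pipe.generate('My name is Alice. What is the capital of France?', max_new_tokens=64))
# Second turn: can refer back to the first turn because the session state is reused
print(pipe.generate('What is my name?', max_new_tokens=64))
pipe.finish_chat()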

info

Chat mode is supported for both LLMPipeline and VLMPipeline.

A simple chat example (with grouped beam search decoding):

import openvino_genai as ov_genai

# model_path points to a folder with an exported OpenVINO model
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

# Grouped beam search decoding parameters
config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    answer = pipe.generate(prompt)
    print('answer:\n')
    print(answer)
    print('\n----------\n')
pipe.finish_chat()
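
The same start_chat()/finish_chat() pattern applies to VLMPipeline. The sketch below follows the visual-language chat sample; the image path is a placeholder, and the exact image preprocessing and the image= keyword argument may differ between releases:

import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

# model_path is assumed to point to a folder with an exported visual-language model
pipe = ov_genai.VLMPipeline(model_path, 'CPU')

# Load an image (placeholder path) and wrap it into an openvino.Tensor (uint8, NHWC)
image = Image.open('cat.png').convert('RGB')
image_tensor = ov.Tensor(np.array(image)[None])

pipe.start_chat()
# The first turn passes the image together with the prompt
print(pipe.generate('Describe this image.', image=image_tensor, max_new_tokens=100))
# Follow-up turns can refer to the image through the cached conversation state
print(pipe.generate('What colors dominate it?', max_new_tokens=100))
pipe.finish_chat()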
info

For more information, refer to the Python and C++ chat samples.