Using OpenVINO GenAI in Chat Scenario
For chat applications, OpenVINO GenAI provides optimizations that maintain conversation context and improve performance by using the KV-cache.
Refer to Stateful Models vs Stateless Models for more information about the KV-cache.
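To make the benefit of reusing context concrete, here is a deliberately simplified toy model. It is not how OpenVINO GenAI is implemented: it only counts how many tokens would have to be processed over a conversation when the whole history is re-processed every turn, compared with when previously processed tokens are kept in a cache. The function name and the token counts are hypothetical.

```python
def tokens_processed(turn_lengths, reuse_cache):
    """Count tokens processed over a whole conversation (toy model).

    turn_lengths: number of new tokens added to the history at each turn.
    reuse_cache:  if True, only the new tokens are processed each turn;
                  if False, the entire history is re-processed each turn.
    """
    total = 0
    history = 0
    for new_tokens in turn_lengths:
        history += new_tokens
        total += new_tokens if reuse_cache else history
    return total


turns = [20, 30, 25, 40]  # hypothetical token counts per turn
without_cache = tokens_processed(turns, reuse_cache=False)
with_cache = tokens_processed(turns, reuse_cache=True)
print(f"without cache: {without_cache}, with cache: {with_cache}")
```

In this toy model the work without a cache grows roughly quadratically with the number of turns, while with a cache it grows linearly with the total conversation length.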
tip
Use start_chat() and finish_chat() to properly manage the chat session's KV-cache. This improves performance by reusing context between messages.
info
Chat mode is supported for both LLMPipeline and VLMPipeline.
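One way to make sure finish_chat() always runs, even if an error interrupts the conversation, is to wrap the session in a context manager. The sketch below is a minimal illustration, not part of the OpenVINO GenAI API: chat_session is a hypothetical helper, and it assumes only that the pipeline object exposes start_chat() and finish_chat(). FakePipeline is a stand-in so the snippet runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def chat_session(pipe):
    """Hypothetical helper: open a chat session and always close it."""
    pipe.start_chat()
    try:
        yield pipe
    finally:
        # Runs on normal exit and when an exception is raised.
        pipe.finish_chat()


class FakePipeline:
    """Stand-in for a real pipeline; records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start_chat(self):
        self.calls.append("start_chat")

    def generate(self, prompt):
        self.calls.append("generate")
        return f"echo: {prompt}"

    def finish_chat(self):
        self.calls.append("finish_chat")


demo = FakePipeline()
with chat_session(demo) as chat:
    reply = chat.generate("hello")
print(demo.calls)
```

With a real pipeline, the same helper would be used as `with chat_session(pipe) as chat:`, replacing the explicit start_chat() and finish_chat() calls.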
A simple chat example (with grouped beam search decoding):
Python:

    import openvino_genai as ov_genai

    # model_path must be set beforehand to the path of the model directory.
    pipe = ov_genai.LLMPipeline(model_path, 'CPU')

    # Grouped beam search: 15 beams split across 3 groups.
    config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
    pipe.set_generation_config(config)

    pipe.start_chat()
    while True:
        try:
            prompt = input('question:\n')
        except EOFError:
            # End of input (for example, Ctrl+D) ends the chat.
            break
        answer = pipe.generate(prompt)
        print('answer:\n')
        print(answer)
        print('\n----------\n')
    pipe.finish_chat()
C++:

    #include "openvino/genai/llm_pipeline.hpp"
    #include <iostream>
    #include <string>

    int main(int argc, char* argv[]) {
        // The model directory is expected as the first argument.
        if (argc < 2) {
            std::cerr << "Usage: " << argv[0] << " <MODEL_DIR>\n";
            return 1;
        }
        std::string prompt;
        std::string model_path = argv[1];
        ov::genai::LLMPipeline pipe(model_path, "CPU");

        // Grouped beam search: 15 beams split across 3 groups.
        ov::genai::GenerationConfig config;
        config.max_new_tokens = 100;
        config.num_beam_groups = 3;
        config.num_beams = 15;
        config.diversity_penalty = 1.0f;

        pipe.start_chat();
        std::cout << "question:\n";
        // Read prompts until end of input.
        while (std::getline(std::cin, prompt)) {
            std::cout << "answer:\n";
            auto answer = pipe.generate(prompt, config);
            std::cout << answer << std::endl;
            std::cout << "\n----------\n"
                         "question:\n";
        }
        pipe.finish_chat();
    }
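Both examples use 15 beams in 3 groups. In grouped (diverse) beam search, beams are normally split evenly across groups, so it can be worth checking the two values before passing them to the pipeline. The helper below is a hypothetical sketch under that assumption; it is not an OpenVINO GenAI function, and the library may perform its own validation.

```python
def beams_per_group(num_beams, num_beam_groups):
    """Hypothetical check: return the number of beams in each group.

    Assumes beams are split evenly across groups, so num_beams must be
    a positive multiple of num_beam_groups.
    """
    if num_beams < 1 or num_beam_groups < 1:
        raise ValueError("num_beams and num_beam_groups must be positive")
    if num_beams % num_beam_groups != 0:
        raise ValueError("num_beams must be divisible by num_beam_groups")
    return num_beams // num_beam_groups


# Values taken from the chat examples above.
group_size = beams_per_group(num_beams=15, num_beam_groups=3)
print(group_size)
```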