Use OpenVINO GenAI in Chat Scenario

For chat applications, OpenVINO GenAI provides special optimizations to maintain conversation context and improve performance using KV-cache.

Refer to the How It Works for more information about KV-cache.

tip

Use start_chat() and finish_chat() to properly manage the chat session's KV-cache. This improves performance by reusing context between messages.

info

Chat mode is supported for both LLMPipeline and VLMPipeline.

A simple chat example (with grouped beam search decoding):

Python
C++

import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    answer = pipe.generate(prompt)
    print('answer:\n')
    print(answer)
    print('\n----------\n')
pipe.finish_chat()

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::string prompt;

    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_beam_groups = 3;
    config.num_beams = 15;
    config.diversity_penalty = 1.0f;

    pipe.start_chat();
    std::cout << "question:\n";
    while (std::getline(std::cin, prompt)) {
        std::cout << "answer:\n";
        auto answer = pipe.generate(prompt, config);
        std::cout << answer << std::endl;
        std::cout << "\n----------\n"
            "question:\n";
    }
    pipe.finish_chat();
}

info

For more information, refer to the Python and C++ chat samples.