Text Generation Using LLMs
Convert and Optimize Model
Download a model (e.g. TinyLlama/TinyLlama-1.1B-Chat-v1.0) from Hugging Face and convert it to OpenVINO format:
optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --trust-remote-code TinyLlama_1_1b_v1_ov
See all supported Large Language Models.
Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.
Run Model Using OpenVINO GenAI
LLMPipeline is the main object used for decoding. You can construct it straight away from the folder with the converted model. It will automatically load the main model, tokenizer, detokenizer and default generation configuration.
- Python (CPU)
import openvino_genai as ov_genai

model_path = "TinyLlama_1_1b_v1_ov"  # folder produced by the export step above
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
- Python (GPU)
import openvino_genai as ov_genai

model_path = "TinyLlama_1_1b_v1_ov"  # folder produced by the export step above
pipe = ov_genai.LLMPipeline(model_path, "GPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
- C++ (CPU)
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];  // folder with the converted model
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    std::cout << pipe.generate("What is OpenVINO?", ov::genai::max_new_tokens(100)) << '\n';
}
- C++ (GPU)
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];  // folder with the converted model
    ov::genai::LLMPipeline pipe(model_path, "GPU");
    std::cout << pipe.generate("What is OpenVINO?", ov::genai::max_new_tokens(100)) << '\n';
}
Use CPU or GPU as devices without any other code change.
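Because the device is only a string passed to the constructor, it can be chosen at run time. The following is a minimal sketch, not part of the OpenVINO GenAI API: parse_args and run are hypothetical helper names, and run expects to receive openvino_genai.LLMPipeline (or any class with the same constructor and generate() signature) as its first argument.

```python
import argparse


def parse_args(argv=None):
    """Parse a model folder and a target device from the command line."""
    parser = argparse.ArgumentParser(description="Run text generation on a chosen device")
    parser.add_argument("model_dir", help="folder with the converted OpenVINO model")
    parser.add_argument("--device", default="CPU", choices=["CPU", "GPU"],
                        help="device name passed to the pipeline constructor")
    return parser.parse_args(argv)


def run(pipeline_cls, argv=None):
    """Build a pipeline for the requested device and generate a short answer.

    pipeline_cls is assumed to be openvino_genai.LLMPipeline; it is passed in
    so that the device selection logic does not depend on the runtime itself.
    """
    args = parse_args(argv)
    pipe = pipeline_cls(args.model_dir, args.device)
    return pipe.generate("What is OpenVINO?", max_new_tokens=100)
```

With OpenVINO GenAI installed, a call such as print(run(openvino_genai.LLMPipeline)) would use the device given by --device and fall back to CPU when the flag is omitted.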
Additional Usage Options
Use Different Generation Parameters
Fine-tune your LLM's output by adjusting various generation parameters. OpenVINO GenAI supports multiple sampling strategies and generation configurations to help you achieve the desired balance between deterministic and creative outputs.
Generation Configuration Workflow
- Get the model default config with get_generation_config()
- Modify parameters
- Apply the updated config using one of the following methods:
  - Use set_generation_config(config)
  - Pass config directly to generate() (e.g. generate(prompt, config))
  - Specify options as inputs in the generate() method (e.g. generate(prompt, max_new_tokens=100))
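The three ways of applying a configuration can be sketched side by side. This is an illustration under assumptions rather than a reference implementation: apply_config_three_ways is a hypothetical helper name, and pipe is assumed to behave like an LLMPipeline, exposing get_generation_config(), set_generation_config() and generate() as described in the list above.

```python
def apply_config_three_ways(pipe, prompt):
    """Run the same prompt with settings applied in three different ways.

    pipe is assumed to behave like openvino_genai.LLMPipeline.
    """
    # 1. Fetch the defaults, modify them, and store them back on the pipeline.
    #    Later generate() calls without an explicit config use these settings.
    config = pipe.get_generation_config()
    config.max_new_tokens = 100
    pipe.set_generation_config(config)
    stored = pipe.generate(prompt)

    # 2. Pass a config object that applies to this call only.
    config.max_new_tokens = 50
    per_call = pipe.generate(prompt, config)

    # 3. Pass individual options as keyword arguments.
    inline = pipe.generate(prompt, max_new_tokens=20)

    return stored, per_call, inline
```

The first form suits settings shared by every request, while the other two are convenient for one-off overrides.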
Basic Generation Configuration
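- Python
import openvino_genai as ov_genai

model_path = "TinyLlama_1_1b_v1_ov"  # folder with the converted model
prompt = "What is OpenVINO?"
pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Get default configuration
config = pipe.get_generation_config()

# Modify parameters
config.max_new_tokens = 100
config.do_sample = True  # enable sampling so temperature, top_k and top_p take effect
config.temperature = 0.7
config.top_k = 50
config.top_p = 0.9
config.repetition_penalty = 1.2

# Generate text with custom configuration
output = pipe.generate(prompt, config)
print(output)
- C++
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];  // folder with the converted model
    std::string prompt = argv[2];
    ov::genai::LLMPipeline pipe(model_path, "CPU");

    // Get default configuration
    auto config = pipe.get_generation_config();

    // Modify parameters
    config.max_new_tokens = 100;
    config.do_sample = true;  // enable sampling so temperature, top_k and top_p take effect
    config.temperature = 0.7f;
    config.top_k = 50;
    config.top_p = 0.9f;
    config.repetition_penalty = 1.2f;

    // Generate text with custom configuration
    auto output = pipe.generate(prompt, config);
    std::cout << output << '\n';
}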
- max_new_tokens: The maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.
- temperature: Controls the level of creativity in AI-generated text:
  - Low temperature (e.g. 0.2) leads to more focused and deterministic output, choosing tokens with the highest probability.
  - Medium temperature (e.g. 1.0) maintains a balance between creativity and focus, selecting tokens based on their probabilities without significant bias.
  - High temperature (e.g. 2.0) makes output more creative and adventurous, increasing the chances of selecting less likely tokens.
- top_k: Limits token selection to the k most likely next tokens. Higher values allow more diverse outputs.
- top_p: Selects from the smallest set of tokens whose cumulative probability exceeds p. Helps balance diversity and quality.
- repetition_penalty: Reduces the likelihood of repeating tokens. Values above 1.0 discourage repetition.
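To make the effect of temperature, top_k and top_p concrete, here is a simplified, library-independent sketch of how they reshape a next-token distribution. It is not OpenVINO GenAI's implementation: filter_distribution and _normalise are hypothetical names, the order of the steps is an assumption, a top_k of 0 is used here to mean "no limit", and repetition_penalty is left out.

```python
import math


def _normalise(pairs):
    """Rescale (token, weight) pairs so that the weights sum to 1."""
    total = sum(weight for _, weight in pairs)
    return [(tok, weight / total) for tok, weight in pairs]


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax. Values below 1 sharpen
    # the distribution, values above 1 flatten it.
    peak = max(logits.values())  # subtracted for numerical stability
    weights = [(tok, math.exp((logit - peak) / temperature))
               for tok, logit in logits.items()]
    ranked = sorted(_normalise(weights), key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = _normalise(ranked[:top_k])

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so they form a distribution again.
    return dict(_normalise(kept))
```

For the logits {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}, top_k=2 or top_p=0.8 both leave only "a" and "b" as candidates, and lowering the temperature from 2.0 to 0.2 raises the probability of "a". A sampler would then draw the next token from the returned distribution.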
For the full list of generation parameters, refer to the Generation Config API.
Optimizing Generation with Grouped Beam Search
Beam search helps explore multiple possible text completions simultaneously, often leading to higher quality outputs.
- Python
import openvino_genai as ov_genai

model_path = "TinyLlama_1_1b_v1_ov"  # folder with the converted model
prompt = "What is OpenVINO?"
pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Get default generation config
config = pipe.get_generation_config()

# Modify parameters
config.max_new_tokens = 256
config.num_beams = 15
config.num_beam_groups = 3
config.diversity_penalty = 1.0

# Generate text with custom configuration
output = pipe.generate(prompt, config)
print(output)
- C++
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string model_path = argv[1];  // folder with the converted model
    std::string prompt = argv[2];
    ov::genai::LLMPipeline pipe(model_path, "CPU");

    // Get default generation config
    ov::genai::GenerationConfig config = pipe.get_generation_config();

    // Modify parameters
    config.max_new_tokens = 256;
    config.num_beams = 15;
    config.num_beam_groups = 3;
    config.diversity_penalty = 1.0f;

    // Generate text with custom configuration
    auto output = pipe.generate(prompt, config);
    std::cout << output << '\n';
}
- max_new_tokens: The maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.
- num_beams: The number of beams for beam search. 1 disables beam search.
- num_beam_groups: The number of groups to divide num_beams into in order to ensure diversity among different groups of beams.
- diversity_penalty: The value subtracted from a beam's score if it generates the same token as any beam from another group at a particular time step.
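The role of diversity_penalty can be illustrated with a toy, single-step sketch. It is deliberately simplified and is not how OpenVINO GenAI implements grouped beam search: each group holds just one beam, the groups choose in turn, and the function names are hypothetical. In the configuration above, 15 beams over 3 groups would give 5 beams per group.

```python
def penalise_for_diversity(scores, tokens_from_other_groups, diversity_penalty):
    """Lower the score of each candidate token already picked by another group."""
    taken = set(tokens_from_other_groups)
    return {tok: score - diversity_penalty if tok in taken else score
            for tok, score in scores.items()}


def pick_tokens_per_group(scores, num_beam_groups, diversity_penalty):
    """Pick one token per group for a single time step.

    scores maps candidate tokens to log-probabilities. Groups choose in
    order, and each group sees the choices made by the groups before it.
    """
    chosen = []
    for _ in range(num_beam_groups):
        adjusted = penalise_for_diversity(scores, chosen, diversity_penalty)
        chosen.append(max(adjusted, key=adjusted.get))
    return chosen


step_scores = {"the": -0.5, "a": -1.0, "this": -2.5}
print(pick_tokens_per_group(step_scores, 3, 0.0))  # every group picks "the"
print(pick_tokens_per_group(step_scores, 3, 2.5))  # groups pick "the", "a", "this"
```

With no penalty all groups converge on the same most likely token, while a large enough penalty pushes later groups toward alternatives, which is the diversity the parameter is meant to provide.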
For the full list of generation parameters, refer to the Generation Config API.
Using OpenVINO GenAI in Chat Scenario
Refer to the Chat Scenario guide for more information on using OpenVINO GenAI in chat applications.
Streaming the Output
Refer to the Streaming guide for more information on streaming the output with OpenVINO GenAI.
Working with LoRA Adapters
LoRA adapters can be used to customize LLM outputs for specific tasks or styles. In text generation, adapters can help models perform better at particular activities like coding, creative writing, or domain-specific knowledge.
Refer to the LoRA Adapters guide for more details on working with LoRA adapters.
Accelerate Generation via Speculative Decoding
Speculative decoding (or assisted generation in Hugging Face terminology) is a technique that speeds up token generation by using an additional, smaller draft model alongside the main model. This reduces the number of inference requests to the main model, which improves performance.
How Speculative Decoding Works
The draft model predicts the next K tokens one by one in an autoregressive manner, while the main model validates these predictions and corrects them if necessary. We go through each predicted token, and if a difference is detected between the draft and main model, we stop and keep the last token predicted by the main model. Then the draft model gets the latest main prediction and again tries to predict the next K tokens, repeating the cycle.
This approach reduces the number of inference requests to the main model, which improves performance. For instance, in more predictable parts of the text, the draft model can, in the best case, generate K tokens that exactly match the main model's output. In that case they are validated in a single inference request to the main model (which is bigger and more accurate but slower) instead of K sequential requests.
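The draft-and-verify cycle described above can be sketched with toy next-token functions in place of real models. This is an illustration only, under several assumptions: decoding is greedy, speculative_generate, main_next and draft_next are hypothetical names, and the single batched validation pass of a real model is simulated with a loop and counted as one pass.

```python
def speculative_generate(main_next, draft_next, prompt, max_new_tokens, k):
    """Greedy speculative decoding over toy next-token functions.

    main_next and draft_next map a token list to the next token.
    Returns (generated_tokens, number_of_main_model_passes).
    """
    tokens = list(prompt)
    main_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # The draft model proposes k tokens one by one.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # One main-model pass validates all proposed positions. A real model
        # scores them in a single forward pass; the toy uses a loop instead.
        main_passes += 1
        accepted = []
        for proposed in draft:
            expected = main_next(tokens + accepted)
            accepted.append(expected)  # the main model's token is always kept
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new_tokens], main_passes


TARGET = list("abcdefgh")


def main_model(seq):
    """Toy main model: always continues the alphabet sequence."""
    return TARGET[len(seq) % len(TARGET)]


def draft_model(seq):
    """Toy draft model: agrees with the main model except at one position."""
    return "x" if len(seq) == 3 else main_model(seq)


print(speculative_generate(main_model, draft_model, ["a"], 6, 3))
```

In this toy run the six new tokens are produced with two main-model passes instead of six, and the output is identical to what the main model alone would generate, since every mismatch is replaced by the main model's own token.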
More details can be found in the original papers:
- Python
import openvino_genai

def streamer(subword):
    # Print each decoded chunk as soon as it is available
    print(subword, end='', flush=True)
    return openvino_genai.StreamingStatus.RUNNING

def infer(model_dir: str, draft_model_dir: str, prompt: str):
    main_device = 'CPU'  # GPU can be used as well.
    draft_device = 'CPU'

    # Configure cache for better performance
    scheduler_config = openvino_genai.SchedulerConfig()
    scheduler_config.cache_size = 2  # in GB

    # Initialize draft model
    draft_model = openvino_genai.draft_model(
        draft_model_dir,
        draft_device
    )

    # Create pipeline with draft model
    pipe = openvino_genai.LLMPipeline(
        model_dir,
        main_device,
        scheduler_config=scheduler_config,
        draft_model=draft_model
    )

    # Configure speculative decoding
    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 100
    config.num_assistant_tokens = 5  # Number of tokens to predict speculatively

    pipe.generate(prompt, config, streamer)
- C++
#include <iostream>
#include <stdexcept>
#include <string>

#include <openvino/openvino.hpp>
#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) {
    if (4 != argc) {
        throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> <DRAFT_MODEL_DIR> '<PROMPT>'");
    }

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_assistant_tokens = 5;  // Number of tokens to predict speculatively

    std::string main_model_path = argv[1];
    std::string draft_model_path = argv[2];
    std::string prompt = argv[3];
    std::string main_device = "CPU", draft_device = "CPU";

    // Configure cache for better performance
    ov::genai::SchedulerConfig scheduler_config;
    scheduler_config.cache_size = 5;  // in GB

    // Create pipeline with draft model
    ov::genai::LLMPipeline pipe(
        main_model_path,
        main_device,
        ov::genai::draft_model(draft_model_path, draft_device),
        ov::genai::scheduler_config(scheduler_config));

    // Print each decoded chunk as soon as it is available
    auto streamer = [](std::string word) {
        std::cout << word << std::flush;
        return ov::genai::StreamingStatus::RUNNING;
    };

    pipe.generate(prompt, config, streamer);
}