Image Processing Using VLMs
Convert and Optimize Model
Download and convert a model (e.g. openbmb/MiniCPM-V-2_6) from Hugging Face to OpenVINO format:
optimum-cli export openvino --model openbmb/MiniCPM-V-2_6 --weight-format int4 --trust-remote-code MiniCPM_V_2_6_ov
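The same export can also be scripted from Python with optimum-intel. The snippet below is a minimal sketch, assuming your installed optimum-intel version exposes the OVModelForVisualCausalLM and OVWeightQuantizationConfig classes; adapt it to your environment:

from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

# Assumption: optimum-intel supports exporting this VLM with int4 weight compression.
model = OVModelForVisualCausalLM.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    export=True,
    trust_remote_code=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("MiniCPM_V_2_6_ov")  # same on-disk layout as the optimum-cli output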
See all supported Visual Language Models.
Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.
Run Model Using OpenVINO GenAI
OpenVINO GenAI introduces VLMPipeline for inference of multimodal text-generation Vision Language Models (VLMs).
It generates text from a text prompt and one or more images as inputs.
- Python
- C++
- CPU
- GPU
import openvino_genai as ov_genai
import openvino as ov
from PIL import Image
import numpy as np
from pathlib import Path

def read_image(path: str) -> ov.Tensor:
    # Load an image as an RGB array with a leading batch dimension.
    pic = Image.open(path).convert("RGB")
    image_data = np.array(pic)[None]
    return ov.Tensor(image_data)

def read_images(path: str) -> list[ov.Tensor]:
    # Accept either a single image file or a directory of images.
    entry = Path(path)
    if entry.is_dir():
        return [read_image(str(file)) for file in sorted(entry.iterdir())]
    return [read_image(path)]

images = read_images("./images")

pipe = ov_genai.VLMPipeline(model_path, "CPU")

result = pipe.generate(prompt, images=images, max_new_tokens=100)
print(result.texts[0])
import openvino_genai as ov_genai
import openvino as ov
from PIL import Image
import numpy as np
from pathlib import Path

def read_image(path: str) -> ov.Tensor:
    pic = Image.open(path).convert("RGB")
    image_data = np.array(pic)[None]
    return ov.Tensor(image_data)

def read_images(path: str) -> list[ov.Tensor]:
    entry = Path(path)
    if entry.is_dir():
        return [read_image(str(file)) for file in sorted(entry.iterdir())]
    return [read_image(path)]

images = read_images("./images")

pipe = ov_genai.VLMPipeline(model_path, "GPU")

result = pipe.generate(prompt, images=images, max_new_tokens=100)
print(result.texts[0])
- CPU
- GPU
#include "openvino/genai/visual_language/pipeline.hpp"
#include "load_image.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string models_path = argv[1], images_path = argv[2];;
std::vector<ov::Tensor> images = utils::load_images(images_path);
ov::genai::VLMPipeline pipe(models_path, "CPU");
ov::genai::VLMDecodedResults result = pipe.generate(
prompt,
ov::genai::images(images),
ov::genai::max_new_tokens(100)
);
std::cout << result.texts[0] << std::endl;
}
#include "openvino/genai/visual_language/pipeline.hpp"
#include "load_image.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string models_path = argv[1], images_path = argv[2];;
std::vector<ov::Tensor> images = utils::load_images(images_path);
ov::genai::VLMPipeline pipe(models_path, "GPU");
ov::genai::VLMDecodedResults result = pipe.generate(
prompt,
ov::genai::images(images),
ov::genai::max_new_tokens(100)
);
std::cout << result.texts[0] << std::endl;
}
Use CPU or GPU as the device without any other code changes.
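To see which devices are available on a given machine, you can query OpenVINO's Core object; a minimal sketch:

import openvino as ov

# Lists the devices OpenVINO can target on this machine, e.g. ['CPU', 'GPU'].
core = ov.Core()
print(core.available_devices)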
Additional Usage Options
Use Different Generation Parameters
Similar to text generation, VLM pipelines support various generation parameters to control the text output.
Generation Configuration Workflow
- Get the model default config with get_generation_config()
- Modify parameters
- Apply the updated config using one of the following methods:
  - Use set_generation_config(config) (shown in a sketch after the basic configuration example below)
  - Pass config directly to generate() (e.g. generate(prompt, config))
  - Specify options as inputs in the generate() method (e.g. generate(prompt, max_new_tokens=100))
Basic Generation Configuration
- Python
- C++
import openvino_genai as ov_genai
pipe = ov_genai.VLMPipeline(model_path, "CPU")
# Get default configuration
config = pipe.get_generation_config()
# Modify parameters
config.max_new_tokens = 100
config.temperature = 0.7
config.top_k = 50
config.top_p = 0.9
config.repetition_penalty = 1.2
# Generate text with custom configuration
output = pipe.generate(prompt, images, config)
#include "openvino/genai/visual_language/pipeline.hpp"

int main() {
    ov::genai::VLMPipeline pipe(model_path, "CPU");

    // Get default configuration
    auto config = pipe.get_generation_config();

    // Modify parameters
    config.max_new_tokens = 100;
    config.temperature = 0.7f;
    config.top_k = 50;
    config.top_p = 0.9f;
    config.repetition_penalty = 1.2f;

    // Generate text with custom configuration
    auto output = pipe.generate(prompt, images, config);
}
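As an alternative to passing the config directly to generate(), the updated config can be stored on the pipeline with set_generation_config(), as listed in the workflow above. A short sketch (model_path, prompt, and images are placeholders, as in the examples above):

import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline(model_path, "CPU")

# Store the updated configuration on the pipeline so that
# subsequent generate() calls pick it up automatically.
config = pipe.get_generation_config()
config.max_new_tokens = 100
pipe.set_generation_config(config)

output = pipe.generate(prompt, images=images)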
- max_new_tokens: The maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.
- temperature: Controls the level of creativity in AI-generated text:
  - Low temperature (e.g. 0.2) leads to more focused and deterministic output, choosing tokens with the highest probability.
  - Medium temperature (e.g. 1.0) maintains a balance between creativity and focus, selecting tokens based on their probabilities without significant bias.
  - High temperature (e.g. 2.0) makes output more creative and adventurous, increasing the chances of selecting less likely tokens.
- top_k: Limits token selection to the k most likely next tokens. Higher values allow more diverse outputs.
- top_p: Selects from the smallest set of tokens whose cumulative probability exceeds p. Helps balance diversity and quality.
- repetition_penalty: Reduces the likelihood of repeating tokens. Values above 1.0 discourage repetition.
For the full list of generation parameters, refer to the Generation Config API.
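Note that the sampling parameters above (temperature, top_k, top_p) take effect when random sampling is enabled via the do_sample flag of the generation config; otherwise greedy decoding is used. A minimal sketch (model_path, prompt, and images are placeholders):

import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline(model_path, "CPU")

config = pipe.get_generation_config()
config.do_sample = True        # enable random (multinomial) sampling
config.temperature = 0.7
config.top_k = 50
config.top_p = 0.9
config.max_new_tokens = 100

output = pipe.generate(prompt, images, config)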
Using OpenVINO GenAI in Chat Scenario
Refer to the Chat Scenario guide for more information on using OpenVINO GenAI in chat applications.
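As a quick illustration (a minimal sketch; see the guide for details), chat mode keeps the conversation history between calls by wrapping generate() between start_chat() and finish_chat():

import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline(model_path, "CPU")

pipe.start_chat()  # keep conversation history between generate() calls
result = pipe.generate("What is on the image?", images=images, max_new_tokens=100)
result = pipe.generate("Describe it in more detail.", max_new_tokens=100)
pipe.finish_chat()  # drop the accumulated history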
Streaming the Output
Refer to the Streaming guide for more information on streaming the output with OpenVINO GenAI.
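As a quick illustration (a minimal sketch; see the guide for details), a callback passed as the streamer argument receives the generated text chunk by chunk:

import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline(model_path, "CPU")

def streamer(subword: str) -> bool:
    # Print each generated chunk as soon as it is available.
    print(subword, end="", flush=True)
    # Returning False tells the pipeline to continue generating.
    return False

pipe.generate(prompt, images=images, max_new_tokens=100, streamer=streamer)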