I am trying to create an inference session using onnxruntime-gpu, and I am thoroughly confused.
In the official onnxruntime tutorial, https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb, they create an inference session in cell [10] that looks like this:
```python
import psutil
import onnxruntime
import numpy

assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()

device_name = 'gpu'

sess_options = onnxruntime.SessionOptions()

# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time so enable it for debugging only.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))

# Please change the value according to best setting in Performance Test Tool result.
sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)

session = onnxruntime.InferenceSession(export_model_path, sess_options)

latency = []
for i in range(total_samples):
    data = dataset[i]
    ort_inputs = {
        'input_ids': data[0].cpu().reshape(1, max_seq_length).numpy(),
        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),
        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()
    }
    start = time.time()
    ort_outputs = session.run(None, ort_inputs)
    latency.append(time.time() - start)
print("OnnxRuntime {} Inference time = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))
```
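For what it is worth, my own attempt looks roughly like the sketch below. The model path and the input name are placeholders for my own model (not the notebook's BERT model), and I pass the providers list explicitly since newer onnxruntime versions want that to be explicit; the inputs are still plain numpy arrays in host memory, just like in the notebook.

```python
import numpy as np
import onnxruntime

# Placeholders: "model.onnx" and "input_ids" stand in for my own model and its input.
session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirms CUDAExecutionProvider is active

# Inputs are ordinary numpy arrays living in host memory, as in the notebook.
ort_inputs = {"input_ids": np.zeros((1, 128), dtype=np.int64)}
ort_outputs = session.run(None, ort_inputs)
```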
What I cannot understand is that in the `ort_inputs` dictionary, `.cpu()...numpy()` is called on the inputs. Yet the print statement claims that the inference session is running on the GPU. How does that make sense? Doesn't this incur host-to-device communication?
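My current understanding is that if I wanted to avoid that copy, I would need something like IOBinding, roughly as in the sketch below (the model path and tensor names are placeholders, not the ones from the notebook). Is that what the tutorial is implicitly relying on, i.e. onnxruntime copying the host numpy arrays onto the GPU behind the scenes before running the CUDA kernels?

```python
import numpy as np
import onnxruntime

# Placeholders: "model.onnx", "input_ids" and "logits" are not from the notebook.
session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Copy the host array into GPU memory once, up front.
input_ids = np.zeros((1, 128), dtype=np.int64)
input_ids_gpu = onnxruntime.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

io_binding = session.io_binding()
io_binding.bind_ortvalue_input("input_ids", input_ids_gpu)
io_binding.bind_output("logits", "cuda")  # keep the output on the GPU as well

session.run_with_iobinding(io_binding)
result = io_binding.copy_outputs_to_cpu()[0]  # explicit device-to-host copy at the end
```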