Introduction
In recent years, artificial intelligence (AI) technology has developed rapidly and has been widely deployed across applications, and the demand for high-performance computing platforms keeps growing. However, as Moore's Law approaches its limits, traditional CMOS-based computing units face great challenges in computing speed and power consumption. Optical computing leverages the properties of photons to achieve breakthroughs in energy efficiency and processing latency [1]-[2].
In 2017, Shen et al. proposed a disruptive new computing architecture for optical neural networks (ONNs) based on Mach-Zehnder interferometer (MZI) arrays [3]. The clock rate of their system is at least two orders of magnitude higher than that of electronic neural networks, and its latency is significantly lower than that of CMOS-based computing chips. While maintaining high-speed computing, its power efficiency is at least five orders of magnitude better than that of conventional graphics processing units (GPUs) and at least three orders of magnitude better than that of application-specific integrated circuits (ASICs). In addition to MZI arrays, other optical computing architectures such as micro-ring resonator (MRR) arrays [4]-[5] and waveguide modulator arrays (WMAs) [6] have also demonstrated superiority over CMOS-based computing chips.
Although optical computing features high hardware efficiency, simulation methods that cover photonic chips together with their peripheral circuits are rarely discussed in existing works [7]-[10]. Because the results of optical computing are usually affected by the peripheral circuits, an optoelectronic joint simulation platform is needed to evaluate their impact. To enable reliable evaluation of optical computing, this paper proposes a hybrid optoelectronic computing evaluation and deployment platform. The main contributions are as follows.
A theoretical analysis of the key modules of the photonic chip is conducted, and a simulation platform is proposed and built with Simulink tools.
The non-ideal factors of the platform are analyzed, including noise, gain and offset errors, and nonlinearity.
Image filtering and image classification algorithms are deployed on the proposed platform, taking into account the non-ideal factors of commercial peripheral chips and of our taped-out photonic chip.
This paper is organized as follows. Section II explains the operating principle of the proposed computing evaluation and deployment platform with analysis of non-ideal factors. Section III describes the implementation of key modules in the platform. Section IV shows the experimental results and the conclusions are given in Section V.
Principles and Architecture of the Platform
A. Block Diagram
Fig. 1 shows the structure of the optoelectronic computing evaluation and deployment platform. The architecture is divided into two parts: the photonic chip with peripheral circuits module and the fixed-point calculation operator module. The fixed-point calculation operator module is further divided into ReLU, look-up table (LUT), pooling, Leaky ReLU, batch normalization (BN), and Shift & Add operators [11].
The workflow of the photonic chip with peripheral circuits module is as follows. The high-speed digital-to-analog converter (DAC) and its driver modulate the input optical signal through the electro-optic modulator. The low-speed DAC sets the calculation matrix of the photonic chip by biasing the thermal phase shifter array. The output of the photonic chip is converted to an electrical signal by the photodetector (PD) and amplified by the transimpedance amplifier (TIA). It is then quantized and encoded by the high-speed analog-to-digital converter (ADC) for further processing in the digital domain. The photonic chip implements a 16×16 matrix for computation, enabling 256 multiply-accumulate operations (MACs) per clock cycle. The peripheral electrical modules include a 16-channel DAC and a 16-channel ADC, both transmitting 8-bit data at a clock frequency of 500 MHz. Consequently, the data rate of a single channel is 4 Gb/s, so the 16 channels carry 64 Gb/s in each direction and the bi-directional transceiver reaches a communication bandwidth of 128 Gb/s. The hardware performance is calculated as shown in (1).
\begin{equation*}TOPS = Row \times Column \times 2 \times F_{clk}\tag{1}\end{equation*}
Here, Row and Column are the numbers of rows and columns of the MAC matrix, $F_{clk}$ is the clock frequency, and the factor of 2 counts the multiplication and the addition of each MAC as separate operations. For the 16×16 optoelectronic computing chip clocked at 500 MHz, this gives 16 × 16 × 2 × 500 MHz = 0.256 × 10^12 operations per second, i.e., a hardware performance of about 0.25 TOPS.
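The arithmetic behind (1) and the bandwidth figures can be checked with a short script. This is a minimal sketch: the function names are illustrative and not part of the platform, and the numbers are the ones quoted in the text (16×16 matrix, 8-bit converters, 500 MHz clock).

```python
def peak_tops(rows: int, cols: int, f_clk_hz: float) -> float:
    """Peak throughput from (1), in tera-operations per second."""
    # Each MAC is counted as two operations (one multiply, one add).
    return rows * cols * 2 * f_clk_hz / 1e12


def converter_rate_gbps(channels: int, bits: int, f_clk_hz: float) -> float:
    """Aggregate data rate of one converter array, in Gb/s."""
    return channels * bits * f_clk_hz / 1e9


tops = peak_tops(16, 16, 500e6)
per_channel = converter_rate_gbps(1, 8, 500e6)
# The DAC array and the ADC array each carry 16 channels, one per direction.
bidirectional = 2 * converter_rate_gbps(16, 8, 500e6)

print(f"{tops:.3f} TOPS, {per_channel:.0f} Gb/s per channel, "
      f"{bidirectional:.0f} Gb/s bi-directional")
```

The script reports 0.256 TOPS, which the text rounds to 0.25 TOPS.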
B. Non-Ideal Factors
The peripheral electronic modules include the ADC, DAC, and TIA. The photonic chip module includes the electro-optic modulators, the thermal phase shifter array, and the PD. Fig. 2 depicts the block diagram with the non-ideal factors taken into account. The platform introduces output-referred noise and gain and offset errors for the high-speed DAC, dark current noise for the PD, and input-referred current noise for the TIA. Input-referred noise and gain and offset errors are introduced for the high-speed ADC, and nonlinearity and insertion loss are introduced for the electro-optic modulator. The thermal phase shifter array model includes insertion loss, and the low-speed DAC model includes output-referred noise and quantization error.
The output of a single-channel high-speed DAC can be expressed as (2).
\begin{equation*}x_I = IN \times E_{G,DAC} + V_{O,DAC} + N_{h\_DAC}\tag{2}\end{equation*}
where $IN$ is the ideal output of a single-channel high-speed DAC, $E_{G,DAC}$ is the gain error, $V_{O,DAC}$ is the offset voltage, and $N_{h\_DAC}$ is the output-referred noise.
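The DAC model in (2) maps directly onto a few lines of NumPy. The sketch below is illustrative only: the function name, the Gaussian noise assumption, and the example values (1% gain error, 2 mV offset, 1 mV RMS noise) are assumptions made here, not parameters reported in the paper.

```python
import numpy as np


def dac_output(ideal, gain=1.0, offset=0.0, noise_rms=0.0, rng=None):
    """Non-ideal high-speed DAC output following (2).

    ideal     : ideal DAC output IN (volts)
    gain      : multiplicative gain term E_{G,DAC} (1.0 means no gain error)
    offset    : offset voltage V_{O,DAC} (volts)
    noise_rms : RMS of the output-referred noise N_{h_DAC}, assumed Gaussian
    """
    rng = np.random.default_rng() if rng is None else rng
    ideal = np.asarray(ideal, dtype=float)
    noise = rng.normal(0.0, noise_rms, size=ideal.shape)
    return ideal * gain + offset + noise


# Hypothetical example values, chosen only for illustration.
codes = np.array([0.0, 0.5, 1.0])
x = dac_output(codes, gain=1.01, offset=2e-3, noise_rms=1e-3,
               rng=np.random.default_rng(0))
print(x)
```

Setting `noise_rms=0.0` isolates the deterministic gain and offset terms, which is convenient when the error sources are studied one at a time.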
\begin{equation*}X = \left[ \begin{array}{c} F\left( x_1 \right) \\ F\left( x_2 \right) \\ \vdots \\ F\left( x_{16} \right) \end{array} \right],\quad O = \left[ \begin{array}{c} o_1 \\ o_2 \\ \vdots \\ o_{16} \end{array} \right]\tag{3}\end{equation*}
In (3), the 16×1 vector X is obtained from the electro-optic modulator array, where F is a function that introduces higher-order harmonic terms and insertion loss to the output of the high-speed DAC. The output O is obtained by multiplying X by the 16×16 matrix generated by the thermal phase shifter array; the derivation of this matrix is presented in Section III. The signal to be quantized by a single-channel ADC can be expressed as (4).
\begin{equation*}\begin{array}{l} y_1 = \left( o_1 \times A_{PD} + N_{PD} + N_{TIA} \right) \times A_{TIA} \times E_{G,ADC} \\ \quad + V_{O,ADC} + N_{h\_ADC} \end{array}\tag{4}\end{equation*}
where $N_{PD}$ is the dark current noise, $A_{PD}$ is the optoelectronic responsivity, $A_{TIA}$ is the transimpedance gain, $N_{TIA}$ is the input-referred current noise, $E_{G,ADC}$ is the gain error, $V_{O,ADC}$ is the offset voltage, and $N_{h\_ADC}$ is the input-referred noise.
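Equations (3) and (4) together describe the path from the modulator inputs to the ADC input, which can be sketched end to end. Several parts of this sketch are assumptions made for illustration: the cubic form of `modulator` stands in for the unspecified function F, the random `weights` matrix is a placeholder for the 16×16 matrix derived in Section III, all noise sources are taken as Gaussian, and the numeric values are hypothetical.

```python
import numpy as np


def modulator(x, insertion_loss=0.9, k3=-0.05):
    """Hypothetical F(x): a cubic harmonic term plus a scalar insertion loss."""
    x = np.asarray(x, dtype=float)
    return insertion_loss * (x + k3 * x**3)


def receiver(o, a_pd, a_tia, gain_adc=1.0, offset_adc=0.0,
             n_pd_rms=0.0, n_tia_rms=0.0, n_adc_rms=0.0, rng=None):
    """Signal presented to the ADC quantizer, following (4)."""
    rng = np.random.default_rng() if rng is None else rng
    o = np.asarray(o, dtype=float)
    n_pd = rng.normal(0.0, n_pd_rms, o.shape)    # PD dark current noise
    n_tia = rng.normal(0.0, n_tia_rms, o.shape)  # TIA input-referred noise
    n_adc = rng.normal(0.0, n_adc_rms, o.shape)  # ADC input-referred noise
    return (o * a_pd + n_pd + n_tia) * a_tia * gain_adc + offset_adc + n_adc


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 16)                    # DAC outputs x_1 ... x_16
weights = rng.uniform(-1.0, 1.0, (16, 16)) / 16  # placeholder weight matrix
X = modulator(x)                                 # vector X in (3)
O = weights @ X                                  # vector O in (3)
y = receiver(O, a_pd=1.0, a_tia=1e3, gain_adc=0.99, offset_adc=1e-3,
             n_pd_rms=1e-6, n_tia_rms=1e-6, n_adc_rms=1e-4, rng=rng)
print(y.shape)
```

Keeping each noise source as a separate argument makes it possible to switch error sources on one at a time.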
The platform can be packaged to create a 16×16 optoelectronic computing operator chip, equipped with 16 optical/electrical signal inputs and 16 optical/electrical signal outputs. The output signal can be further processed in the optical or electrical domain to support flexible deployment of neural networks.
Optoelectronic Joint Simulation Platform
A. MZI Thermal Phase Shifter Array
The basic cell of the MZI network [3] is illustrated in Fig. 3. From left to right, two phase shifters first generate the phases that control the splitting ratio of the MZI, and the optical signal then passes through a beam splitter. Two further phase shifters on the two arms generate the phases that control the phase difference between the output arms, and the signal finally passes through a second beam splitter.
The transfer matrix of the MZI can be written in terms of standard rotation matrices, as expressed in (5).
\begin{equation*}\begin{array}{r} M_1 = \left[ \begin{array}{cc} \cos \varphi & j\sin \varphi \\ j\sin \varphi & \cos \varphi \end{array} \right] \times \left[ \begin{array}{cc} e^{j\alpha_1} & 0 \\ 0 & e^{j\beta_1} \end{array} \right] \\ \quad \times \left[ \begin{array}{cc} \cos \varphi & j\sin \varphi \\ j\sin \varphi & \cos \varphi \end{array} \right] \times \left[ \begin{array}{cc} e^{j\alpha_0} & 0 \\ 0 & e^{j\beta_0} \end{array} \right] \end{array}\tag{5}\end{equation*}
When φ = π/4, the beam splitter has a 50:50 splitting ratio. Taking into account the insertion loss $I_{MZI}$ of the MZI unit, (5) can be rewritten as
\begin{equation*}\begin{array}{l} M_1 = \left[ \begin{array}{cc} e^{j\left( \alpha_0 + \alpha_1 \right)} - e^{j\left( \alpha_1 + \beta_0 \right)} & je^{j\left( \alpha_0 + \alpha_1 \right)} + je^{j\left( \alpha_1 + \beta_0 \right)} \\ je^{j\left( \beta_0 + \beta_1 \right)} + je^{j\left( \alpha_0 + \beta_1 \right)} & e^{j\left( \beta_0 + \beta_1 \right)} - e^{j\left( \alpha_0 + \beta_1 \right)} \end{array} \right] \\ \times I_{MZI} \end{array}\tag{6}\end{equation*}
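A behavioral model of a single MZI unit can be sketched by evaluating the matrix product numerically. The sketch below follows the factor ordering printed in (5) and treats the insertion loss as a scalar amplitude factor; both choices, along with the function names and the example phases, are assumptions made for illustration. It only verifies general properties (unitarity without loss, uniform power scaling with loss) and is not a derivation of (6).

```python
import numpy as np


def beam_splitter(phi):
    """Coupler matrix of (5); phi = pi/4 gives a 50:50 split."""
    return np.array([[np.cos(phi), 1j * np.sin(phi)],
                     [1j * np.sin(phi), np.cos(phi)]])


def phase_pair(upper, lower):
    """Diagonal matrix of the two phase shifters on the upper and lower arms."""
    return np.diag([np.exp(1j * upper), np.exp(1j * lower)])


def mzi_transfer(alpha0, beta0, alpha1, beta1, phi=np.pi / 4, loss=1.0):
    """2x2 MZI transfer matrix: the product in (5) times a scalar loss."""
    bs = beam_splitter(phi)
    return loss * (bs @ phase_pair(alpha1, beta1) @ bs @ phase_pair(alpha0, beta0))


# Hypothetical phase settings, in radians.
m_ideal = mzi_transfer(0.3, 1.1, -0.7, 0.2)
m_lossy = mzi_transfer(0.3, 1.1, -0.7, 0.2, loss=0.95)

# Without loss the unit conserves optical power, so the matrix is unitary.
print(np.allclose(m_ideal.conj().T @ m_ideal, np.eye(2)))
# With loss, every output power is scaled by loss**2.
print(np.allclose(m_lossy.conj().T @ m_lossy, 0.95**2 * np.eye(2)))
```

Checking unitarity in this way is a quick sanity test for any behavioral MZI model before it is composed into a larger array.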
Fig. 3 shows a 16×16 thermal phase shifter array in which, in addition to the MZI units, three columns of crossings are used to reorder the 16 signals. For example, with the crossing columns denoted by $C_1$ to $C_3$, the first column of MZI units $U_1$ and the overall matrix R can be expressed as
\begin{align*}&U_1 = \left[ \begin{array}{cccc} M_1 & & & \\ & M_2 & & \\ & & \ddots & \\ & & & M_8 \end{array} \right] \tag{7}\\& R = U_4 \times C_3 \times U_3 \times C_2 \times U_2 \times C_1 \times U_1\tag{8}\end{align*}
where R is the 16×16 weight matrix, controlled by 88 phases. This pseudo-real-valued, non-universal mesh structure requires significantly fewer MZI units than conventional singular value decomposition (SVD) designs, reducing area and power overheads while maintaining accuracy.
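The composition in (7) and (8) can be illustrated by assembling block-diagonal MZI columns and permutation matrices numerically. This sketch rests on several assumptions: the crossing pattern (a cyclic shift by one waveguide) is hypothetical because the actual crossing layout is defined by Fig. 3, the phases are random, and every MZI is given four free phases, so the sketch has 128 parameters and does not reproduce the 88-phase parameterization of the chip.

```python
import numpy as np


def mzi(alpha0, beta0, alpha1, beta1, phi=np.pi / 4):
    """Lossless 2x2 MZI unit, using the factor ordering printed in (5)."""
    bs = np.array([[np.cos(phi), 1j * np.sin(phi)],
                   [1j * np.sin(phi), np.cos(phi)]])
    d0 = np.diag([np.exp(1j * alpha0), np.exp(1j * beta0)])
    d1 = np.diag([np.exp(1j * alpha1), np.exp(1j * beta1)])
    return bs @ d1 @ bs @ d0


def mzi_column(phases):
    """Block-diagonal 16x16 matrix U_i of (7), built from eight MZI units."""
    u = np.zeros((16, 16), dtype=complex)
    for k, p in enumerate(phases):       # phases has shape (8, 4)
        u[2 * k:2 * k + 2, 2 * k:2 * k + 2] = mzi(*p)
    return u


def crossing(perm):
    """Permutation matrix C_i that routes input perm[i] to output i."""
    c = np.zeros((16, 16))
    c[np.arange(16), perm] = 1.0
    return c


rng = np.random.default_rng(0)
U = [mzi_column(rng.uniform(0.0, 2.0 * np.pi, (8, 4))) for _ in range(4)]
shift = np.roll(np.arange(16), 1)        # hypothetical crossing pattern
C = [crossing(shift) for _ in range(3)]

# Equation (8): the rightmost factor acts on the input first.
R = U[3] @ C[2] @ U[2] @ C[1] @ U[1] @ C[0] @ U[0]
print(R.shape, np.allclose(R.conj().T @ R, np.eye(16)))
```

Because each factor is unitary in the lossless case, the product R is unitary as well, which gives a simple consistency check on the assembly.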
B. Simulink-based Platform
The electro-optic modulator and the thermal phase shifter within the photonic chip are modeled at the behavioral level according to their operating principles, while the remaining modules are modeled using Simscape components. The Simulink platform diagram is illustrated in Fig. 4.
The ADC uses the Flash ADC block from the Simulink library, with Gaussian noise added as the input-referred noise. The DAC uses the Binary Weighted DAC block from the library, with output-referred noise added. The PD takes the real part of the optical signal output by the photonic chip and converts it to an electrical signal. A low-pass filter models the bandwidth of the PD, a gain block represents the optoelectronic responsivity, and dark current noise is added at the output. For the TIA, a low-pass filter represents the bandwidth, a gain block represents the transimpedance gain, and input-referred current noise is added.
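The PD and TIA blocks described above can be mirrored outside Simulink with a simple discrete-time model. The following is a rough Python analogue, not the Simulink implementation: the first-order filter, the Gaussian noise, the function names, and all numeric values are assumptions made for illustration.

```python
import numpy as np


def low_pass(x, f_cutoff_hz, f_sample_hz):
    """First-order IIR low-pass filter (a discretized RC response)."""
    alpha = 1.0 - np.exp(-2.0 * np.pi * f_cutoff_hz / f_sample_hz)
    y = np.empty(len(x))
    state = 0.0
    for i, sample in enumerate(x):
        state += alpha * (sample - state)
        y[i] = state
    return y


def photodetector(field, responsivity, bandwidth_hz, f_sample_hz,
                  dark_noise_rms=0.0, rng=None):
    """Real part -> bandwidth limit -> responsivity gain -> dark noise at output."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.real(np.asarray(field))
    current = responsivity * low_pass(x, bandwidth_hz, f_sample_hz)
    return current + rng.normal(0.0, dark_noise_rms, current.shape)


def tia(current, gain_ohm, bandwidth_hz, f_sample_hz,
        input_noise_rms=0.0, rng=None):
    """Input-referred current noise -> bandwidth limit -> transimpedance gain."""
    rng = np.random.default_rng() if rng is None else rng
    current = np.asarray(current, dtype=float)
    noisy = current + rng.normal(0.0, input_noise_rms, current.shape)
    return gain_ohm * low_pass(noisy, bandwidth_hz, f_sample_hz)


# Hypothetical step input; only the real part of the field is detected.
field = np.full(200, 1.0 + 0.5j)
i_pd = photodetector(field, responsivity=0.8, bandwidth_hz=1e9, f_sample_hz=1e10)
v_out = tia(i_pd, gain_ohm=1e3, bandwidth_hz=1e9, f_sample_hz=1e10)
print(i_pd[-1], v_out[-1])
```

With the noise terms set to zero, the outputs settle to the product of the gains, which is a convenient baseline before the noise sources are enabled.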
Experimental Results
A. Photonic Chip Test
The die photo of the photonic chip is shown in Fig. 5. The chip includes the electro-optic modulators, thermal phase shifters, and photodetectors, with an overall area of 5 mm × 7.5 mm. Voltage values suitable for optical computing are obtained by modifying the gradient-descent variables in the PyTorch framework. The convolutional layers required by the neural network are realized by applying bias voltages to the thermal phase shifter array and are tested on hardware. The FPGA (XCZU29DR-ffvf1760-1-i) generates the required voltage values, the tunable laser (CBDX-SC-SC-SC-SC-FA) serves as the light source, and the calculation results are converted by the TIA (LMH32401) and quantized by the FPGA. The photonic chip achieves a top-5 accuracy of 86.4% on the ImageNet dataset.
B. Low-speed DAC Test
Fig. 6 presents the die photograph of the prototype. All 128 output channels cover a range of 0 to 13 V. The signal-to-noise ratio (SNR) exceeds 40 dB, and the conversion rate of the DAC exceeds 1 MS/s. By configuring the relevant pins, the output can operate in three modes: low voltage, high voltage (4× amplification), and high voltage (6× amplification). The power consumption of the whole chip is less than 3 W.
C. Simulink Simulation
The platform simulation is carried out in Simulink. The parameters of the photonic chip and the low-speed DAC are extracted from tape-out tests, while the remaining electrical peripheral circuits are modeled on commercial chips: the high-speed ADC on the AD9094, the high-speed DAC on the AD9776A, and the TIA on the HMC799.
After the relevant non-ideal parameters are set, the effective number of bits (ENOB) of a single channel is measured to be 6.8 bits at the Nyquist frequency. An error analysis of the photonic matrix yields a mean square error of $2.6068 \times 10^{-5}\ \mathrm{V}^2$ and a relative error of 2.18%. Image quality is tested by adding salt-and-pepper noise to the input image and comparing the output under each error source with that of an ideal 4×4 mean filter, in order to observe the filtering effect under the influence of non-ideal factors. The resulting peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index are shown in Table II.
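The metrics quoted above follow standard definitions, sketched here for reference. Whether the paper computed them in exactly this form is an assumption; in particular, the ENOB relation assumes a full-scale sine-wave test, the PSNR peak value of 255 assumes 8-bit images, and the function names are illustrative.

```python
import numpy as np


def sinad_for_enob(enob_bits):
    """SINAD in dB implied by an ENOB, from ENOB = (SINAD - 1.76) / 6.02."""
    return 6.02 * enob_bits + 1.76


def enob_from_sinad(sinad_db):
    """Effective number of bits from a measured SINAD in dB."""
    return (sinad_db - 1.76) / 6.02


def mse(reference, test):
    """Mean square error between two arrays of equal shape."""
    diff = np.asarray(reference, dtype=float) - np.asarray(test, dtype=float)
    return float(np.mean(diff**2))


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, test)
    return float("inf") if err == 0.0 else float(10.0 * np.log10(peak**2 / err))


# An ENOB of 6.8 bits corresponds to a SINAD of roughly 42.7 dB.
print(round(sinad_for_enob(6.8), 2))

# Hypothetical 4x4 image pair with a uniform error of 5 gray levels.
ideal = np.zeros((4, 4))
noisy = np.full((4, 4), 5.0)
print(round(psnr(ideal, noisy), 2))
```

SSIM is omitted because it requires local window statistics and is usually taken from an image-processing library rather than written by hand.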
Table II shows that the optical matrix error has the greatest impact on the filtering effect, although a relatively high-quality image can still be obtained. Next, the errors are enlarged to analyze the filtering effect in adverse cases, and the resulting images are shown in Fig. 7.
Fig. 8 depicts the impact of the various non-ideal factors on image classification results with the Modified National Institute of Standards and Technology (MNIST) database. Error1 is the noise of the high-speed ADC, high-speed DAC, PD, and TIA; error2 is the gain and offset error; error3 is the nonlinearity of the electro-optic modulator; error4 is the resolution of the low-speed DAC; and error5 is the noise of the low-speed DAC. As the non-ideal factors accumulate, the accuracy decreases to varying degrees. Each non-ideal factor is then analyzed separately. The resolution and noise of the low-speed DAC have the greatest impact on the test results. According to the test, the SNR of the low-speed DAC must be at least 52 dB to achieve an image classification accuracy of 80%.
Conclusion
This paper models and analyzes a hybrid optoelectronic computing evaluation and deployment platform. Combined with test results from commercial chips and our taped-out chips, it provides a reliable theoretical foundation and practical exploration for future hybrid optoelectronic computing platforms.