|
Principle of operation
An FPGA is, at its core, the same kind of silicon chip as a custom ASIC, built from the same transistors that form flip-flops, registers, multiplexers, and other logic gates in conventional circuits. Of course, the way those transistors are wired together cannot be changed. But architecturally the chip is organized so cleverly that the routing of signals between larger blocks can be changed: these blocks are called CLBs, configurable logic blocks.
You can also change the logic function that a CLB performs. This is possible because the entire chip is permeated with configuration memory cells built on static RAM. Each bit of this memory either controls a routing switch or forms part of the truth table of the logic function that the CLB implements.
Because the configuration memory is static RAM, the chip must, first, be configured every time power is applied and, second, it can be reconfigured a practically unlimited number of times.
CLBs sit inside a routing fabric that defines their input and output connections. At each intersection of the routing wires there are six pass switches, each controlled by its own configuration memory cell. By opening some switches and closing others, different signal routes between CLBs can be established.
Very simplistically, a CLB consists of a block that implements a Boolean function of several arguments (the Look-Up Table, LUT) and a flip-flop (FF). In modern FPGAs a LUT has six inputs, but the figure shows three for simplicity. The LUT output reaches the CLB output either asynchronously (directly) or synchronously (through the FF, which is clocked by the system clock).
It is interesting to look at how a LUT is implemented. Suppose we have some Boolean function y = (a & b) | c. Its schematic representation and truth table are shown in the figure. The function has three arguments, so its truth table has 2^3 = 8 entries. Each of them corresponds to a particular combination of the input signals. These values are computed by the FPGA design software and written into dedicated configuration memory cells.
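To make this concrete, here is a minimal Verilog sketch of the same function (the module and signal names are illustrative, not taken from any particular design). A synthesizer maps such a description onto a single 3-input LUT whose eight configuration bits are exactly the truth table values computed by the tools; the second output shows the registered path through the CLB flip-flop.

module lut_example (
    input  wire clk,
    input  wire a, b, c,
    output wire y_comb,   // "asynchronous" CLB output, straight from the LUT
    output reg  y_reg     // "synchronous" output, taken through the flip-flop
);
    // For the input order {a, b, c} = 000..111 the truth table is 0,1,0,1,0,1,1,1;
    // these are the bits the design software writes into the LUT's configuration cells.
    assign y_comb = (a & b) | c;

    always @(posedge clk)
        y_reg <= y_comb;  // sampled on the system clock, as in the CLB figure
endmodule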
The value of each cell is fed to its own input of the LUT's output multiplexer, and the input arguments of the Boolean function select which of those values appears at the output. The CLB is the most important hardware resource of an FPGA. The number of CLBs in modern FPGAs varies and depends on the device family and size; Xilinx offers devices with anywhere from roughly four thousand to three million CLBs.
Besides the CLBs, an FPGA contains a number of other important hardware resources. For example, hardware multiply-accumulate units, or DSP blocks. Each of them can multiply and add 18-bit numbers every clock cycle. In top-end devices the number of DSP blocks can exceed 6000.
Another resource is the on-chip memory blocks (Block RAM, BRAM). Each block stores 2 KB, and the total capacity of this memory, depending on the device, ranges from about 20 KB to 20 MB. Like CLBs, the BRAM and DSP blocks are connected by the routing fabric and are spread across the entire die. By chaining CLB, DSP, and BRAM blocks together, very efficient data processing pipelines can be built.
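As a rough illustration of how these resources combine, here is a small Verilog sketch of an 18-bit multiply-accumulate stage (the module and signal names are mine, purely for illustration). Synthesis tools typically map the multiplier and adder onto a DSP block and the coefficient table onto block RAM, while the surrounding control logic lands in CLBs.

module mac_stage (
    input  wire               clk,
    input  wire               rst,
    input  wire signed [17:0] sample,     // incoming data stream
    input  wire        [7:0]  coef_addr,  // which coefficient to use
    output reg  signed [47:0] acc         // running accumulator
);
    reg signed [17:0] coef_mem [0:255];   // usually inferred as a BRAM
    reg signed [17:0] coef;

    always @(posedge clk)
        coef <= coef_mem[coef_addr];      // fetch one coefficient per clock

    always @(posedge clk)
        if (rst)
            acc <= 0;
        else
            acc <= acc + sample * coef;   // one multiply-accumulate per clock
endmodule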
Applications and Benefits of FPGA
The first FPGA, created by Xilinx in 1985, contained only 64 CLBs. At the time, transistor integration was far lower than it is now, and digital devices were often built from discrete logic chips: separate register, counter, multiplexer, and multiplier ICs. For each particular device its own printed circuit board was designed, onto which these low-integration chips were placed.
FPGAs made it possible to abandon this approach. Even a 64-CLB FPGA saves considerable board space, and reconfigurability added the ability to update a device's functionality after manufacture, during operation, "in the field" (hence the name: field-programmable gate array).
Because essentially any digital circuit can be built inside an FPGA (as long as there are enough resources), one important application of FPGAs is prototyping ASICs.
ASIC development is very complex and costly, the price of a mistake is very high, and logic verification is critical. So one of the development stages, even before work on the physical layout of the chip begins, is prototyping it on one or more FPGAs.
For ASIC development, special boards are produced that contain many interconnected FPGAs. The prototype of the chip runs at much lower frequencies (perhaps tens of megahertz), but it saves money by exposing problems and bugs early.
However, in my opinion, there are more interesting applications of FPGAs. Their flexible structure makes it possible to build hardware circuits for high-speed, parallel data processing while retaining the ability to change the algorithm.
Let's think about how the CPU, GPU, FPGA and ASIC are fundamentally different.
The CPU is universal: you can run any algorithm on it, it is the most flexible platform, and it is the easiest to use thanks to the huge number of programming languages and development environments.
At the same time, because of this universality and the sequential execution of instructions, performance drops and power consumption rises. For every useful arithmetic operation, the CPU performs many auxiliary ones: fetching instructions, moving data between registers and cache, and other overhead.
At the other extreme is the ASIC. There, the required algorithm is implemented in hardware by the direct wiring of transistors; every operation serves only the execution of the algorithm, and there is no way to change it. Hence the maximum performance and the lowest power consumption of any platform. But an ASIC cannot be reprogrammed.
To the right of the CPU sits the GPU. These chips were originally designed for graphics processing, but today they are also used for mining and for general-purpose computing. They consist of thousands of small computing cores and perform operations on arrays of data in parallel.
If an algorithm can be parallelized, a GPU can deliver significant speedups over a CPU. On the other hand, sequential algorithms map to it poorly, so the platform is less flexible than the CPU. GPU development also requires special skills and knowledge of OpenCL or CUDA.
Finally, the FPGA. This platform combines the efficiency of an ASIC with the ability to change the program. FPGAs are not universal, but there is a class of algorithms and tasks that run better on them than on a CPU and even a GPU. FPGA development is harder, but new development tools are narrowing that gap.
The decisive advantage of an FPGA is the ability to process data at the rate it arrives, with minimal reaction latency. As an example, imagine a smart network router with a large number of ports: when an Ethernet packet arrives at one of them, a large set of rules must be checked before choosing an output port, and some packet fields may need to be modified or new ones added.
An FPGA handles this on the fly: the bytes of the packet have only just begun streaming into the chip from the network interface, and its header is already being analyzed. Using processors here would noticeably slow down the traffic processing rate. Of course, a custom ASIC could be made for the router and would work most efficiently of all, but what if the packet processing rules have to change? Only an FPGA delivers the required flexibility combined with high performance.
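To give a feel for what processing bytes as they arrive looks like in RTL, here is a deliberately tiny Verilog sketch: it watches a byte-per-clock stream and flags an IPv4 frame as soon as the EtherType bytes pass by, without buffering the packet. The interface signals and the one-byte-per-clock assumption are mine for illustration; a real router pipeline would use a standard streaming interface and check far more rules.

module ethertype_check (
    input  wire       clk,
    input  wire       valid,      // a new byte is present this cycle
    input  wire       sof,        // high together with the first byte of a frame
    input  wire [7:0] data,       // the byte itself
    output reg        is_ipv4     // set as soon as EtherType 0x0800 is seen
);
    reg [4:0] byte_cnt;           // position within the header
    reg [7:0] etype_hi;

    always @(posedge clk) begin
        if (valid) begin
            if (sof) begin
                byte_cnt <= 5'd1;                // data is byte 0 of the frame
                is_ipv4  <= 1'b0;
            end else begin
                if (byte_cnt != 5'd31)
                    byte_cnt <= byte_cnt + 1'b1; // saturate; only the header matters
                // the EtherType occupies bytes 12 and 13, right after the MAC addresses
                if (byte_cnt == 5'd12)
                    etype_hi <= data;
                if (byte_cnt == 5'd13)
                    is_ipv4 <= (etype_hi == 8'h08) && (data == 8'h00);
            end
        end
    end
endmodule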
Thus, FPGAs are used wherever high data processing throughput, the shortest possible reaction time, and low power consumption are needed.
FPGA in the "cloud".
In cloud computing, FPGAs are used to accelerate computation, network traffic, and access to data arrays. This also includes using FPGAs for high-frequency trading on exchanges. FPGA cards with PCI Express and optical network interfaces, made by Intel (Altera) or Xilinx, are installed in the servers.
Cryptographic algorithms, DNA sequence comparison, and scientific workloads such as molecular dynamics are an excellent fit for FPGAs. Microsoft has long used FPGAs to accelerate the Bing search engine and to implement Software Defined Networking inside the Azure cloud.
The machine learning boom has not passed FPGAs by either. Xilinx and Intel offer FPGA-based tools for working with deep neural networks. They make it possible to generate FPGA firmware that implements a particular network directly from frameworks such as Caffe and TensorFlow.
Moreover, you can try all of this without leaving home, using cloud services. For example, you can rent a virtual machine from Amazon with access to an FPGA board and all the development tools, including those for machine learning.
What else is being done on FPGAs? Just about everything: robotics, unmanned vehicles, drones, scientific instruments, medical equipment, custom mobile devices, smart surveillance cameras, and so on.
Traditionally, FPGAs were used for digital processing of one-dimensional signals (where they competed with DSP processors) in radar equipment and radio transceivers. As chip integration and performance grew, FPGA platforms came to be used more and more for high-performance computing, for example for processing two-dimensional signals "at the edge of the cloud" (edge computing).
This concept is easiest to understand with the example of a traffic camera with license plate recognition. You could take a camera that streams video over Ethernet and process the stream on a remote server. But as the number of cameras grows, so does the load on the network, which can lead to system failures.
Instead, it is better to run plate recognition on a computer installed directly inside the camera body and send the recognized plate numbers to the cloud as text. For this you can even take a relatively inexpensive, low-power FPGA and run off a battery. At the same time, the FPGA logic can still be changed later, for example if the license plate standard changes.
As for robotics and drones, two requirements are especially important in this field: high performance and low power consumption. The FPGA platform fits perfectly and is used, in particular, to build flight controllers for drones. UAVs that can make decisions on the fly are already being built.
How to develop a project on FPGA?
There are different levels of design: low, block, and high. The low level means using languages such as Verilog or VHDL and working at the register transfer level (RTL). Here you describe registers, as in a processor, and define the logic functions that transform the data between them.
FPGA circuits always run at specific clock frequencies (usually 100-300 MHz), and at the RTL level you define the circuit's behavior down to individual cycles of the system clock. This painstaking work produces the most efficient circuits in terms of performance, FPGA resource usage, and power consumption. But it demands serious digital design skills, and even with them the process is not fast.
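As a reminder of what registers with logic between them look like, here is a minimal RTL fragment in Verilog (names are illustrative): the inputs are captured into registers on each clock edge, and the result of the combinational adder between the stages is registered on the next edge.

module rtl_example (
    input  wire       clk,
    input  wire [7:0] in_a,
    input  wire [7:0] in_b,
    output reg  [8:0] sum_q
);
    reg [7:0] a_q, b_q;

    always @(posedge clk) begin
        a_q   <= in_a;        // stage 1: register the inputs
        b_q   <= in_b;
        sum_q <= a_q + b_q;   // stage 2: register the result of the logic
    end
endmodule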
At the block level, you mostly connect ready-made large blocks that perform specific functions in order to obtain the system-on-chip functionality you need.
At the high level of design, you no longer control the data at every clock cycle; instead, you concentrate on the algorithm. There are compilers, or translators, from C and C++ down to the RTL level, for example Vivado HLS. They are quite smart and can translate a wide class of algorithms into hardware.
The main advantage of this approach over RTL languages is faster algorithm development and, especially, faster testing: the C++ code can be run and verified on an ordinary computer far more quickly than algorithm changes can be tested at the RTL level. Of course, you pay for the convenience: the resulting circuit may be slower and consume more hardware resources.
We are often willing to pay that price: if the translator is used correctly, efficiency does not suffer much, and modern FPGAs have plenty of resources. In a world where time to market is critical, this trade-off is usually justified.
Often all three development styles have to be combined in a single design. Say we need to build a device that we could embed in a robot so that it can recognize objects in a video stream, for example road signs. We take an image sensor chip and connect it directly to the FPGA. For debugging we can use an HDMI monitor, also connected to the FPGA.
Frames from the camera are fed into the FPGA over an interface dictated by the sensor manufacturer (USB will not do here), processed, and output to the monitor. Frame processing requires a frame buffer, which is usually located in external DDR memory placed on the printed circuit board next to the FPGA.
If the sensor manufacturer does not provide an interface IP block for our FPGA, we will have to write one ourselves in RTL, counting clock cycles, bits, and bytes according to the data transfer protocol specification. The Preprocess, DDR Controller, and HDMI IP blocks we will most likely take ready-made and simply connect their interfaces. And the HLS block that searches for and processes the incoming data we can write in C++ and translate with Vivado HLS.
Most likely, we will also need some ready-made library with a road sign detector and classifier adapted for FPGA use. In this example I am, of course, describing a very simplified design flow, but it reflects the logic of the work correctly.
Now consider the design flow from writing RTL code to obtaining a configuration file to load into the FPGA.
So, the RTL code implementing the desired circuit has been written. Before trying it on real hardware, we need to make sure it is correct and actually solves the required task. For this, RTL simulation is performed on a computer.
We take our circuit, which so far exists only as RTL code, and place it in a virtual test bench, where we drive sequences of digital signals into its inputs, record the waveforms of the output signals over time, and compare them against the expected results. Usually errors are found and we go back to writing RTL.
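Such a virtual test bench can itself be written in Verilog. Below is a toy example (names illustrative) that exercises the lut_example module from the earlier sketch: it applies all eight input combinations and compares the combinational output against the expected function y = (a & b) | c. Real testbenches are, of course, far more elaborate and are run in the vendor's or a third-party simulator.

`timescale 1ns / 1ps
module lut_example_tb;
    reg     clk = 0;
    reg     a, b, c;
    wire    y_comb, y_reg;
    integer i, errors = 0;

    lut_example dut (.clk(clk), .a(a), .b(b), .c(c),
                     .y_comb(y_comb), .y_reg(y_reg));

    always #5 clk = ~clk;                // a 100 MHz virtual clock

    initial begin
        for (i = 0; i < 8; i = i + 1) begin
            {a, b, c} = i[2:0];          // next input combination
            #10;                         // let the signals settle
            if (y_comb !== ((a & b) | c)) begin
                errors = errors + 1;
                $display("Mismatch at a=%b b=%b c=%b: got %b", a, b, c, y_comb);
            end
        end
        if (errors == 0) $display("All 8 combinations passed");
        $finish;
    end
endmodule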
Next, the logically verified code is fed to the synthesizer. It converts the textual description of the circuit into a netlist of digital elements from the library available for this FPGA family. The netlist contains elements such as LUTs and flip-flops, plus the connections between them. At this stage the elements are not yet bound to specific hardware resources. To do that, constraints must be applied to the design; in particular, you specify which physical I/O pins of the FPGA package the logical inputs and outputs of your circuit are connected to.
The constraints must also state the clock frequencies at which the circuit is supposed to run. The synthesizer output and the constraints file are handed to the implementation tool, which, among other things, performs placement and routing (Place and Route).
The Place step binds each still-anonymous element of the netlist to a specific element inside the FPGA. Then the Route step tries to find an optimal interconnection of those elements and the corresponding configuration of the FPGA routing fabric.
Place and Route work within the constraints we imposed on the design: the I/O pins and the clock frequency. The clock period strongly affects implementation: it must not be shorter than the propagation delay through the logic on the critical path between two successive flip-flops.
Often this requirement cannot be met right away, and then you have to go back to the start and change the RTL code, for example by reducing the amount of logic on the critical path (see the sketch after this paragraph). Once implementation completes successfully, we know exactly which elements sit where and how they are connected.
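One common way to shorten a critical path is pipelining: splitting a long combinational chain with an extra register stage, which adds a clock cycle of latency but lets the design meet a higher clock frequency. A purely illustrative Verilog sketch (names are mine):

// Before: multiplier and adder form one long combinational path between flip-flops.
module slow_mac (
    input  wire               clk,
    input  wire signed [17:0] a, b,
    input  wire signed [35:0] c,
    output reg  signed [36:0] y
);
    always @(posedge clk)
        y <= a * b + c;               // multiply AND add in a single cycle
endmodule

// After: an extra register splits the path; one more cycle of latency,
// but each stage now has far less logic to traverse per clock.
module fast_mac (
    input  wire               clk,
    input  wire signed [17:0] a, b,
    input  wire signed [35:0] c,
    output reg  signed [36:0] y
);
    reg signed [35:0] p, c_q;
    always @(posedge clk) begin
        p   <= a * b;                 // stage 1: multiply only
        c_q <= c;                     // keep the addend aligned in time
        y   <= p + c_q;               // stage 2: add only
    end
endmodule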
Only after that does generation of the binary FPGA configuration file (the bitstream) begin. It remains to load it into the real hardware and check whether everything works as expected. If problems show up at this point, it means the simulation was incomplete and not all errors and shortcomings were caught there.
You can go back to the simulation stage and reproduce the problematic situation there, and if that does not help, as a last resort there is a mechanism for debugging directly in the running hardware. You simply specify which signals you want to observe over time, and the development environment generates an additional logic analyzer circuit that is placed on the chip next to the design being debugged, connects to the signals of interest, and records their values over time. The recorded waveforms of the selected signals can then be uploaded to a computer and analyzed.
There are also high-level development tools (HLS, high-level synthesis) and even ready-made frameworks for building neural networks on FPGAs. These tools output RTL code in VHDL or Verilog, which then goes through the same Synthesis → Implementation → Bitstream generation chain. They are perfectly usable, but to use them effectively you still need at least a basic understanding of RTL-level languages.
|
|