Abstract
Li Auto is the leader among the new vehicle companies and also the leader in whole-vehicle digitalization. Its futuristic digital cockpit has attracted the attention of countless users, and behind it lies a huge data system that records the real-time status of every vehicle in operation. This large-scale data is uploaded, aggregated, and analyzed along multiple dimensions, and the results feed into after-sales service, business operations, and production optimization, helping Li Auto become digitally driven across the board.
At present, Li Auto's data infrastructure platform carries the data of more than 600,000 vehicles, and the daily volume of new data exceeds 10 billion items. Moreover, as the fleet grows and monitoring indicators become more fine-grained, the overall data volume continues to expand rapidly, and the platform faces combined challenges in performance, cost, and operations and maintenance.
Li Auto first engaged with the YMatrix hyper-converged database in the fourth quarter of 2021. After rigorous testing, it began using a YMatrix cluster to host uploaded vehicle data and the related query and analysis workloads, and the cluster has now been running stably for one year. YMatrix provides Li Auto with a high-performance, low-cost, efficient, and easy-to-use time-series data solution. It supports batch and out-of-order writes of massive, highly concurrent data with ACID guarantees, and has been proven in production at 600 million points/second. It supports expansion in seconds without interrupting the business and provides petabyte-level storage capacity, which helps the platform comfortably absorb rapid growth in business data. With server usage reduced by 2/3, data-entry latency has dropped dramatically and query performance has improved significantly. The technical architecture has also been simplified, which greatly shortens the development time for business indicators and improves the efficiency of development, operations, and maintenance.
Extreme Writes, 600 Million Points/Second Peak Performance
Currently, Li Auto has more than 600,000 vehicles running online, and each vehicle is fitted with many types of sensors that monitor more than 5,000 vehicle operation indicators. These indicators are collected through the in-vehicle CAN bus, timestamped, aggregated and packaged, and then uploaded to the cloud data platform over the mobile network 24 hours a day. In theory, all data is uploaded in time order, but real operating conditions are complex. For example, when a vehicle drives into mountainous or other areas without mobile coverage, the upload is interrupted; once the network is restored, the data from the interruption period must be uploaded retroactively. In addition, because the data volume is large and collection frequencies differ between indicators, some data must be uploaded in batches.
As this shows, writing Li Auto's vehicle data is complex and poses two major challenges. First, the platform must cope with an unreliable upload environment, which requires support for delayed, out-of-order, and batch writes. Second, whereas a single device in a typical IoT scenario is monitored on no more than 100 indicators, each Li Auto vehicle has more than 5,000, so the data scale is correspondingly far larger. This is a truly massive, highly concurrent write scenario that demands extremely high throughput and stability from the database itself.
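The interrupted-then-backfilled upload pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Li Auto's actual in-vehicle software: the `StoreAndForwardUploader` class, its `submit` method, the `max_batch` parameter, and the policy of sending the live sample before the backlog are all assumptions made for the example. It shows why the cloud side receives delayed, out-of-order, batched data.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# (timestamp_ms, indicator name, value); a hypothetical sample layout
Sample = Tuple[int, str, float]


class StoreAndForwardUploader:
    """Buffers samples while offline and backfills them in batches later."""

    def __init__(self, send_batch: Callable[[List[Sample]], None], max_batch: int = 2):
        self._send_batch = send_batch            # stand-in for the mobile-network upload
        self._backlog: Deque[Sample] = deque()   # samples collected while offline
        self._max_batch = max_batch

    def submit(self, sample: Sample, online: bool) -> None:
        if not online:
            self._backlog.append(sample)         # no signal: keep the sample locally
            return
        self._send_batch([sample])               # assumed policy: live data goes first
        while self._backlog:                     # then backfill the gap in batches
            size = min(self._max_batch, len(self._backlog))
            self._send_batch([self._backlog.popleft() for _ in range(size)])


def demo() -> List[List[int]]:
    received: List[List[Sample]] = []
    uploader = StoreAndForwardUploader(received.append, max_batch=2)
    uploader.submit((1, "speed", 10.0), online=True)
    for ts in (2, 3, 4):                         # vehicle is in an area without coverage
        uploader.submit((ts, "speed", 10.0 + ts), online=False)
    uploader.submit((5, "speed", 20.0), online=True)
    # Timestamps in the order the cloud platform would receive them
    return [[sample[0] for sample in batch] for batch in received]


print(demo())  # [[1], [5], [2, 3], [4]]
```

In this sketch the sample with timestamp 5 reaches the platform before samples 2 to 4, so the receiving database has to accept rows whose timestamps are older than data it has already stored.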
The YMatrix hyper-converged time-series database provides the following write features for Li Auto's data platform:
- Support out-of-order, delayed, and batch writes, as well as upsert;
- Provide MatrixGate, a data writing component developed by YMatrix specifically for IoT scenarios. It is designed to take full advantage of distributed data systems and delivers more than 10 times the write performance of direct INSERT statements;
- Provide MARS2, a self-developed storage engine optimized for time-series data;
- Provide an ultra-wide table data model that can effectively handle more than 5,000 indicators;
- Support ACID;
- Support dynamic addition or deletion of indicators.
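To make the write semantics in this list concrete, here is a small in-memory Python model of upsert on a wide table. It is only a sketch of the behavior a client would observe, not a description of how MARS2 or MatrixGate are implemented: the `WideTable` class, its methods, and the `(vehicle_id, timestamp)` key are hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]      # (vehicle_id, timestamp_ms); an assumed row key
Row = Dict[str, float]     # indicator name -> value, i.e. one wide row


class WideTable:
    """Toy model: one row per (vehicle, timestamp), one column per indicator."""

    def __init__(self) -> None:
        self._rows: Dict[Key, Row] = {}

    def upsert(self, vehicle_id: str, ts_ms: int, indicators: Row) -> None:
        # Insert the row if the key is new; otherwise merge the new columns into it.
        self._rows.setdefault((vehicle_id, ts_ms), {}).update(indicators)

    def upsert_batch(self, records: Iterable[Tuple[str, int, Row]]) -> None:
        for vehicle_id, ts_ms, indicators in records:
            self.upsert(vehicle_id, ts_ms, indicators)

    def scan(self, vehicle_id: str) -> List[Tuple[int, Row]]:
        # Rows come back in timestamp order, whatever order they arrived in.
        keys = sorted(k for k in self._rows if k[0] == vehicle_id)
        return [(ts, self._rows[(vid, ts)]) for vid, ts in keys]


table = WideTable()
table.upsert("V1", 1000, {"speed": 60.0})
table.upsert("V1", 3000, {"speed": 62.0})
# A delayed batch arrives later: an older timestamp plus a new indicator for t=1000.
table.upsert_batch([
    ("V1", 2000, {"speed": 61.0}),
    ("V1", 1000, {"battery_temp": 31.5}),
])
print(table.scan("V1"))
# [(1000, {'speed': 60.0, 'battery_temp': 31.5}), (2000, {'speed': 61.0}), (3000, {'speed': 62.0})]
```

The late row for timestamp 2000 lands in its correct position, and the extra `battery_temp` value is merged into the existing row for timestamp 1000 instead of creating a duplicate, which also mirrors adding an indicator without rebuilding the table.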
Based on the above features, YMatrix has helped Li Auto's data platform withstand a write load of 600 million points/second in a real production environment and has cut write latency from 2 hours to less than 10 seconds, so that huge volumes of data can be used as soon as they are written. Compared with typical time-series scenarios (usually fewer than 100,000 points per second), this write scale is more than 1,500 times larger. It sits at the extreme end of time-series write workloads and demonstrates YMatrix's data writing capability.
Furthermore, YMatrix's capabilities allow the business side to analyze millisecond-level data. Finer time granularity provides more accurate information for locating and analyzing the cause of a problem, but it also means a 1,000-fold increase in data scale compared with the second-level data used day to day. In the past, the write bottleneck of the OpenTSDB cluster meant that millisecond-level data caused unacceptable latency for the business side. Now YMatrix supports low-latency ingestion of millisecond-level data, which makes millisecond data analysis possible.
Significantly reduce construction costs and easily cope with rapid business growth
Another challenge faced by Li Auto's data platform is that, as the number of vehicles running online grows rapidly, both the data volume and the query demand from the business side keep growing, so the platform is under constant pressure to expand capacity. At the current business growth rate, the platform is estimated to need expansion once every six months. Beyond the direct hardware investment, each expansion is a complex and painful process that may interfere with normal business operation. The platform therefore needs a more economical way to build its cluster without sacrificing performance or functionality, together with a technical solution that keeps the business running continuously to the greatest extent possible.
For the same data scale, YMatrix replaces the original 50-node OpenTSDB cluster with a 14-node cluster, which reduces the number of servers by about 2/3 and gives the platform effective control over hardware investment from the outset. At the same time, the YMatrix hyper-converged database provides a smooth expansion solution that scales the cluster without interrupting the business. The whole process can be carried out through a visual UI, which gives operations and maintenance (O&M) personnel a simpler, more intuitive, process-oriented experience. Expansion is no longer a journey of constant accidents but a standard operating procedure (SOP) with clear rules to follow. In addition, YMatrix provides more than 100 core O&M monitoring indicators covering write performance, query performance, cluster status, and other aspects, giving O&M personnel rich monitoring data for accurately tracking the platform's operating status and for fault location and root-cause analysis.
Improve query efficiency and significantly reduce metrics development time
The original OpenTSDB platform did not support complex time-series queries such as aggregation and window queries, so such queries had to be implemented separately on Hive and Flink clusters, which raised the technical threshold for development and increased later maintenance costs. YMatrix natively supports a comprehensive set of time-series query functions, including aggregation queries, window functions, jump queries, and difference queries. Business developers can obtain results directly from YMatrix using standard SQL, and the development time for an indicator has dropped from several days to less than 1 hour. YMatrix has also greatly improved query performance: for common queries such as a detail query on a single indicator, the time taken has fallen to less than 1 second, with the largest reductions exceeding 90%.
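As a rough illustration of what expressing such queries in standard SQL looks like, the Python sketch below runs an aggregation query and a window-function query against an in-memory SQLite database. SQLite is used here only as a stand-in so the example is runnable (window functions need SQLite 3.25 or later). The `vehicle_signal` table, its columns, and the sample values are hypothetical, and reading "difference query" as a `LAG`-based delta is an assumption; YMatrix's own time-series functions are not shown.

```python
import sqlite3


def run_queries():
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: one speed reading per vehicle per timestamp (seconds)
    conn.execute("CREATE TABLE vehicle_signal (vehicle_id TEXT, ts INTEGER, speed REAL)")
    conn.executemany(
        "INSERT INTO vehicle_signal VALUES (?, ?, ?)",
        [("V1", 0, 10.0), ("V1", 30, 20.0), ("V1", 60, 30.0), ("V1", 90, 50.0),
         ("V2", 0, 5.0), ("V2", 60, 15.0)],
    )

    # Aggregation query: average speed per vehicle per 60-second bucket
    aggregated = conn.execute(
        """
        SELECT vehicle_id, (ts / 60) * 60 AS bucket_start, AVG(speed) AS avg_speed
        FROM vehicle_signal
        GROUP BY vehicle_id, (ts / 60) * 60
        ORDER BY vehicle_id, bucket_start
        """
    ).fetchall()

    # Window-function query: change in speed since the previous reading
    deltas = conn.execute(
        """
        SELECT vehicle_id, ts,
               speed - LAG(speed) OVER (PARTITION BY vehicle_id ORDER BY ts) AS delta
        FROM vehicle_signal
        ORDER BY vehicle_id, ts
        """
    ).fetchall()

    conn.close()
    return aggregated, deltas


aggregated, deltas = run_queries()
print(aggregated)
print(deltas)
```

Each query is a single declarative statement, which is the kind of logic the paragraph above says previously required separate Hive or Flink programs.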
Conclusion
"Under massive data, a complex network environment, and strict performance requirements, YMatrix has stably supported our vehicle signal acquisition and processing business. Not only has data-entry latency been greatly reduced, but system query performance has also improved significantly, even though server usage has been reduced by 2/3," said Nie Lei, the person in charge of Li Auto's infrastructure platform.
Looking ahead, Li Auto's data platform hopes to make further use of YMatrix's hyper-converged features and OLAP capabilities to complete the replacement of its Hive data warehouse. From a technical perspective, this would further simplify the platform architecture; from a business perspective, it would unify the management of time-series data and other types of data, supporting deeper exploration of the value of data.