Recently I was asked to design a monitoring system that can track different aspects of a system, including time-series data, configuration data, alert data, hardware failures, and SNMP traps. The system needs to be scalable, highly available, and fault tolerant.
The system involves a storage box whose statistics and components need monitoring, so I chose to store the time-series data in TDengine, a highly scalable time-series database. It is a good choice for high-cardinality data because it uses a partitioned storage model and columnar encoding to store and query large volumes of data with a wide range of unique values efficiently. It also handles data ingestion, storage, and retrieval efficiently.
To handle configuration data, I chose Puppet, a configuration management tool that lets the admin configure all the major tools and components of other systems and automates the delivery and operation of applications and infrastructure. The configuration data itself is stored in PostgreSQL, a relational database management system that supports SQL queries and is reliable, scalable, performant, free, and open source. Updates and retrieval of configuration data can be handled efficiently using caching, batch updates, asynchronous updates, and indexes. Consistency and integrity of the configuration data can be ensured through transactions, foreign keys, validation, and backups.
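To make the consistency part concrete, here is a minimal sketch of an all-or-nothing configuration update guarded by a transaction, a foreign key, and a simple validation step. It uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL so that it runs anywhere; the `box` and `setting` tables and the `apply_settings` helper are hypothetical names, not part of the actual design.

```python
import sqlite3


def open_config_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the configuration database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.executescript(
        """
        CREATE TABLE box (box_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE setting (
            box_id INTEGER NOT NULL REFERENCES box(box_id),
            key    TEXT NOT NULL,
            value  TEXT NOT NULL,
            PRIMARY KEY (box_id, key)
        );
        """
    )
    conn.execute("INSERT INTO box VALUES (1, 'box-a')")
    conn.commit()
    return conn


def apply_settings(conn: sqlite3.Connection, box_id: int, settings: dict) -> bool:
    """Apply all settings in one transaction; roll back everything on any failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for key, value in settings.items():
                if not value:  # validation: reject empty values
                    raise ValueError(f"empty value for {key!r}")
                conn.execute(
                    "INSERT OR REPLACE INTO setting VALUES (?, ?, ?)",
                    (box_id, key, value),
                )
        return True
    except (ValueError, sqlite3.IntegrityError):
        return False


def setting_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM setting").fetchone()[0]
```

If one value in a batch fails validation, or the box does not exist, none of the batch is stored, which is the behaviour I would want from the real PostgreSQL-backed store as well.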
To collect and monitor statistics, I used the storage box's management tool to collect statistics at the VD and pool level, then a disk performance monitoring tool to collect statistics at the VD level, and finally an agentless monitoring tool to collect statistics from storage boxes that are not accessible through other methods. I then processed the statistics to calculate throughput, latency, and IOPS for each VD and pool, and calculated aggregated data per storage box by summing the metric values for all the VDs in the storage box. Finally, I stored the statistics in our database.
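The processing step can be sketched roughly as follows. This is only an illustration under assumptions: the `VdSample` shape (counter deltas per collection interval) and the function names are hypothetical. Throughput and IOPS are summed per storage box as described; for latency the sketch uses an IOPS-weighted average instead of a plain sum, because adding latencies together does not give a meaningful number.

```python
from dataclasses import dataclass


@dataclass
class VdSample:
    """Counter deltas for one VD over one collection interval (hypothetical shape)."""
    vd: str
    pool: str
    ops: int           # I/O operations completed in the interval
    bytes_moved: int   # bytes read plus bytes written in the interval
    busy_ms: float     # total time spent servicing those operations


def vd_metrics(sample: VdSample, interval_s: float) -> dict:
    """Turn raw counter deltas into IOPS, throughput and average latency."""
    return {
        "vd": sample.vd,
        "pool": sample.pool,
        "iops": sample.ops / interval_s,
        "throughput_bps": sample.bytes_moved / interval_s,
        "latency_ms": sample.busy_ms / sample.ops if sample.ops else 0.0,
    }


def aggregate_box(metrics: list) -> dict:
    """Aggregate per-VD metrics into one record for the whole storage box."""
    total_iops = sum(m["iops"] for m in metrics)
    total_throughput = sum(m["throughput_bps"] for m in metrics)
    # Weight each VD's latency by its IOPS so busy VDs count for more.
    weighted_latency = sum(m["latency_ms"] * m["iops"] for m in metrics)
    return {
        "iops": total_iops,
        "throughput_bps": total_throughput,
        "latency_ms": weighted_latency / total_iops if total_iops else 0.0,
    }
```

The same `aggregate_box` function can be applied to the VDs of a single pool to get pool-level figures.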
Next, I had to decide how frequently to collect the statistics and how to process and store them. The collection frequency depends on the specific needs of the organisation: if the organisation is experiencing performance problems, it may want to collect statistics more frequently; otherwise, less frequent collection is enough. I chose to collect them more frequently, based on the organisation's requirements.
After that, I needed to monitor the hardware components of the storage box, including fans, temperature sensors, PSUs, and PDUs. I chose SNMP to collect data from these components. If SNMP was not available, I used IPMI to collect the same data, and if the storage box was running Windows, I used WMI.
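The fallback between collection protocols can be written down as a small decision function. This is a sketch only: the `BoxInfo` fields are hypothetical, and the ordering (checking for Windows first, then SNMP, then IPMI) is my assumption about how the three rules above combine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoxInfo:
    """What the monitoring system knows about a storage box (hypothetical shape)."""
    name: str
    os: str
    snmp_available: bool
    ipmi_available: bool


def choose_protocol(box: BoxInfo) -> Optional[str]:
    """Pick the hardware-data collection protocol for a storage box."""
    if box.os.lower() == "windows":
        return "wmi"    # Windows boxes are collected through WMI
    if box.snmp_available:
        return "snmp"   # preferred protocol for everything else
    if box.ipmi_available:
        return "ipmi"   # fallback when SNMP is not available
    return None         # nothing usable: flag the box for manual follow-up
```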
To detect failures and report them in the monitoring system's user interface, I set a threshold for each hardware component. If the data from a component exceeds its threshold, an alert is generated. I also used predictive analytics to identify potential failures before they occur: for example, if a temperature sensor's readings are trending upwards, an alert is generated. In addition, a human operator regularly reviews the data from the hardware components to identify potential problems.
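Both checks can be sketched in a few lines. The trend check here is a plain least-squares slope over the most recent readings, which is a deliberately simple stand-in for whatever predictive model is actually used; the function names, the threshold, and the slope limit are all hypothetical.

```python
def slope_per_sample(values: list) -> float:
    """Least-squares slope of the readings against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def evaluate(readings: list, threshold: float, max_slope: float) -> list:
    """Return the kinds of alert raised by the latest window of readings."""
    alerts = []
    if readings and readings[-1] > threshold:
        alerts.append("threshold")   # the latest value is already too high
    if slope_per_sample(readings) > max_slope:
        alerts.append("trend")       # values are rising faster than allowed
    return alerts
```

For a temperature sensor with a 70 degree threshold, `evaluate([40, 42, 44, 46, 48], threshold=70, max_slope=1.0)` raises only the trend alert: the sensor is still within limits but is warming by about two degrees per sample.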
To raise call-home tickets for critical hardware component failures, I configured the storage box to send SNMP traps to the monitoring system when such a failure occurs; the monitoring system then sends email alerts and webhooks to a call-home system. I also used Grafana Alerting to notify all stakeholders via email, SMS, or phone calls. Those stakeholders can also view the metrics and statistics on a Grafana kiosk dashboard.
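The routing logic behind this can be illustrated with a small sketch. The severity levels, channel names, and payload fields below are hypothetical and are not taken from Grafana Alerting or any particular call-home system; the point is only that critical traps fan out to the call-home webhook as well as to people.

```python
import json

# Which notification channels each severity goes to (hypothetical mapping).
ROUTES = {
    "critical": ["call_home_webhook", "email", "sms", "phone"],
    "warning": ["email"],
    "info": [],
}


def route(trap: dict) -> list:
    """Return the channels a decoded trap should be sent to."""
    # Unknown severities still reach a human by email rather than being dropped.
    return ROUTES.get(trap.get("severity"), ["email"])


def call_home_payload(trap: dict) -> str:
    """Build the JSON body that would be posted to the call-home webhook."""
    return json.dumps(
        {
            "box": trap["box"],
            "component": trap["component"],
            "severity": trap["severity"],
            "message": trap.get("message", ""),
        },
        sort_keys=True,
    )
```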
The SNMP agents on the storage boxes send traps to the collector side, where the snmptrapd daemon (which receives traps rather than sending them) and a Telegraf server collect them and store them in the TDengine time-series database. Grafana was then used to visualise and analyse the trap data stored in TDengine.
To design a database schema that optimises read queries over yearly data, I used a star schema, a relational database schema that is well suited to analytical workloads.
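As an illustration of the shape of such a schema, here is a minimal sketch with one fact table and two dimension tables, plus a yearly read query. It uses Python's built-in sqlite3 module as a stand-in so that it is runnable; the table names, columns, and sample rows are all hypothetical and not the production schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_box  (box_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY,
                       year INTEGER NOT NULL, month INTEGER NOT NULL, day INTEGER NOT NULL);
CREATE TABLE fact_metric (
    box_id     INTEGER NOT NULL REFERENCES dim_box(box_id),
    date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
    iops       REAL NOT NULL,
    latency_ms REAL NOT NULL
);
CREATE INDEX idx_fact_date ON fact_metric(date_id);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the star schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_box VALUES (?, ?)", [(1, "box-a"), (2, "box-b")])
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
        [(20230101, 2023, 1, 1), (20230102, 2023, 1, 2), (20240101, 2024, 1, 1)],
    )
    conn.executemany(
        "INSERT INTO fact_metric VALUES (?, ?, ?, ?)",
        [(1, 20230101, 100.0, 2.0), (1, 20230102, 300.0, 4.0),
         (2, 20230101, 50.0, 1.0), (1, 20240101, 500.0, 5.0)],
    )
    conn.commit()
    return conn


def yearly_avg_iops(conn: sqlite3.Connection, year: int) -> dict:
    """Average IOPS per storage box for one year: the fact table joined to its dimensions."""
    rows = conn.execute(
        """
        SELECT b.name, AVG(f.iops)
        FROM fact_metric AS f
        JOIN dim_box  AS b ON b.box_id  = f.box_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        WHERE d.year = ?
        GROUP BY b.name
        ORDER BY b.name
        """,
        (year,),
    ).fetchall()
    return dict(rows)
```

The yearly query only filters on the small date dimension and then aggregates the fact rows that match, which is what makes this layout convenient for analytical reads.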
Next, I needed to optimise the read queries to minimise query execution time. Although TDengine uses a cache-first policy, I also used Materialize over TDengine, via a plug-in, to reduce the load on the database. Materialize maintains materialised views of the preprocessed data, which improves the performance and freshness of queries that involve complex aggregations or joins.
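The idea behind an incrementally maintained materialised view can be shown with a toy example: the aggregate is updated as each row arrives, so a read never has to rescan the raw data. This is a conceptual sketch in plain Python, not Materialize's API; the `AvgLatencyView` class is hypothetical.

```python
from collections import defaultdict
from typing import Optional


class AvgLatencyView:
    """Keeps the average latency per storage box up to date on every insert."""

    def __init__(self) -> None:
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def on_insert(self, box: str, latency_ms: float) -> None:
        """Fold one new measurement into the stored aggregate."""
        self._sum[box] += latency_ms
        self._count[box] += 1

    def read(self, box: str) -> Optional[float]:
        """Answer the query from the maintained state, without touching raw rows."""
        n = self._count.get(box, 0)
        return self._sum[box] / n if n else None
```

A dashboard that polls this view every few seconds costs the same no matter how many raw measurements have been ingested, which is the load reduction I was after.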
Finally, to monitor the health and performance of all the components of the monitoring system itself, I used Kafka, Prometheus, and Grafana. Kafka collects the health and performance data and feeds it to Prometheus, which stores it. Grafana and the Grafana kiosks read from Prometheus to display the data on a dashboard for the IT support team.
To summarise, the whole system consists of four subsystems: the storage box monitoring system, the configuration management system, the health and performance monitoring system, and the alert and notification management system.
The storage box monitoring system is the core subsystem. It collects various types of data from the storage box, such as throughput, latency, and IOPS at the VD and pool level, along with aggregated data. This data is sent to a Telegraf cluster; Telegraf is a plugin-driven server agent that can collect and send metrics from different sources. The Telegraf cluster feeds the data into a TDengine cloud database, a high-performance time-series database designed for IoT applications. The TDengine cloud database can be accessed by Grafana cloud and Grafana kiosks, which display the data on interactive dashboards. Grafana cloud is a hosted version of Grafana that can be accessed from anywhere, while Grafana kiosks are dedicated devices that show the dashboards in full-screen mode. To reduce the load on the TDengine cloud database, Materialize is used as a caching layer between TDengine cloud and Grafana cloud. Materialize is a streaming SQL materialised view engine that can handle high-throughput, low-latency queries.