
How to design a solution for a storage monitoring system?

Recently I was asked to design a monitoring system that can cover different aspects of a storage system, including time-series data, configuration data, alerts, hardware failures, and SNMP traps. The system needs to be scalable, highly available, and fault tolerant.

Since the system involves a storage box whose statistics and components need continuous monitoring, I chose to store the time-series data in TDengine, a highly scalable time-series database. It is a good fit for high-cardinality data because it uses a partitioned storage model and columnar encoding to efficiently store and query large volumes of data with a wide range of unique values, and it handles data ingestion, storage, and retrieval efficiently.
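As a rough illustration, here is a minimal sketch of how such time-series data could be laid out in TDengine, assuming the taospy Python client and illustrative database, metric, and tag names (one super table per metric family, with the storage box, pool, and VD carried as tags):

```python
import taos  # taospy client for TDengine

# Sketch only: host, credentials, and all metric/tag names are illustrative.
conn = taos.connect(host="tdengine-host", user="root", password="taosdata")
conn.execute("CREATE DATABASE IF NOT EXISTS storage_metrics")
conn.execute("USE storage_metrics")

# One super table for VD-level statistics; each VD becomes a sub-table,
# with box, pool, and VD names as tags (suits high-cardinality data).
conn.execute("""
    CREATE STABLE IF NOT EXISTS vd_stats (
        ts TIMESTAMP, iops INT, latency_ms FLOAT, throughput_mbps FLOAT
    ) TAGS (box_name BINARY(64), pool_name BINARY(64), vd_name BINARY(64))
""")

# Insert a sample point; the sub-table vd_box_a_001 is created automatically
# on first insert via the USING ... TAGS clause.
conn.execute(
    "INSERT INTO vd_box_a_001 USING vd_stats TAGS ('box-a', 'pool-1', 'vd-001') "
    "VALUES (NOW, 1200, 0.8, 450.5)"
)
conn.close()
```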

To handle configuration data, I chose Puppet as the configuration management tool: it lets the admin configure all the major tools and components of the other systems and automates the delivery and operation of applications and infrastructure. The configuration data itself is stored in a PostgreSQL database, a relational database management system that supports SQL queries and is highly reliable, scalable, performant, free, and open source. Configuration updates and retrieval can be made efficient with caching, batch updates, asynchronous updates, and indexes, while consistency and integrity are ensured through transactions, foreign keys, validation, and backups.
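As a sketch of how the relational side might look, assuming hypothetical table names and psycopg2 as the client, the following uses a transaction, a foreign key, and an index to keep configuration reads fast and the data consistent:

```python
import psycopg2

# Hypothetical connection details and table names; only a sketch of the idea.
conn = psycopg2.connect(host="config-db", dbname="storage_config",
                        user="monitor", password="secret")

with conn:  # psycopg2 commits on success and rolls back on error (transactional integrity)
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS storage_box (
                box_id   SERIAL PRIMARY KEY,
                hostname TEXT UNIQUE NOT NULL
            )
        """)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS box_config (
                config_id SERIAL PRIMARY KEY,
                box_id    INTEGER NOT NULL REFERENCES storage_box (box_id),  -- foreign key
                key       TEXT NOT NULL,
                value     TEXT NOT NULL,
                UNIQUE (box_id, key)
            )
        """)
        # An index on the lookup column keeps configuration reads efficient.
        cur.execute("CREATE INDEX IF NOT EXISTS idx_box_config_key ON box_config (key)")
conn.close()
```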

To collect and monitor statistics, I used the storage box's management tool to collect statistics at the VD and pool level, a disk performance monitoring tool to collect statistics at the VD level, and an agentless monitoring tool to collect statistics from storage boxes that are not accessible through other methods. I then processed the statistics to calculate throughput, latency, and IOPS for each VD and pool, calculated aggregated data per storage box by summing the metric values across all the VDs and pools in that box, and finally stored the statistics in our database.
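A minimal sketch of the aggregation step, with hypothetical sample data standing in for what the management and disk-performance tools would return:

```python
from collections import defaultdict

# Hypothetical per-VD samples; in practice these come from the collectors above.
vd_samples = [
    {"box": "box-a", "pool": "pool-1", "vd": "vd-001", "iops": 1200, "latency_ms": 0.8},
    {"box": "box-a", "pool": "pool-1", "vd": "vd-002", "iops": 900,  "latency_ms": 1.1},
    {"box": "box-a", "pool": "pool-2", "vd": "vd-003", "iops": 400,  "latency_ms": 2.3},
]

def aggregate(samples, key):
    """Sum IOPS and average latency per grouping key ('pool' or 'box')."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return {
        name: {
            "iops": sum(s["iops"] for s in group),
            "latency_ms": sum(s["latency_ms"] for s in group) / len(group),
        }
        for name, group in groups.items()
    }

print(aggregate(vd_samples, "pool"))  # per-pool rollup
print(aggregate(vd_samples, "box"))   # per-storage-box rollup
```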

Next, I had to decide how frequently to collect the statistics and how to process and store them. The collection frequency depends on the specific needs of the organisation: if it is experiencing performance problems, it may want to collect statistics more frequently; otherwise a lower frequency will do. So I chose to collect more frequently, driven by the organisation's requirements.
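One simple way to express this, assuming hypothetical interval values, is a collection loop whose polling period depends on the organisation's current state:

```python
import time

# Hypothetical intervals; the real values should follow organisational policy.
NORMAL_INTERVAL_S = 300    # relaxed collection every 5 minutes
DEGRADED_INTERVAL_S = 30   # tighter collection while a performance problem is investigated

def collect_statistics():
    # Placeholder for the collectors described above (management tool,
    # disk performance tool, agentless tool).
    print("collecting VD/pool statistics ...")

def run(performance_issue_detected: bool = False):
    interval = DEGRADED_INTERVAL_S if performance_issue_detected else NORMAL_INTERVAL_S
    while True:
        collect_statistics()
        time.sleep(interval)
```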

After that, I needed to monitor the hardware components of the storage box, including fans, temperature sensors, PSUs, PDUs, etc. I chose SNMP to collect data from these hardware components; if SNMP is not available, IPMI is used to collect the same data, and if the storage box runs Windows, WMI is used instead.
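A rough sketch of the SNMP-first, IPMI-fallback collection path, assuming the classic synchronous pysnmp high-level API and an illustrative OID and credentials (the real OIDs come from the vendor MIBs):

```python
import subprocess
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity,
)

def read_temperature_snmp(host):
    """Try to read a temperature sensor over SNMP; return None on failure."""
    # The OID below is illustrative; the actual OID depends on the vendor MIB.
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public"),
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.4.1.9999.1.1.0")),
    ))
    if error_indication or error_status:
        return None
    return float(var_binds[0][1])

def read_temperature(host):
    value = read_temperature_snmp(host)
    if value is not None:
        return value
    # Fall back to IPMI via ipmitool when SNMP is unavailable (hypothetical credentials).
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", "admin", "-P", "secret",
         "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout  # parse the vendor-specific output as needed
```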

To detect failures and report them in the monitoring system's user interface, I set a threshold for each hardware component; if the data from a component exceeds its threshold, an alert is generated. I also used predictive analytics to identify potential failures before they occur: for example, while monitoring the temperature sensors, an alert is generated if the temperature is trending upwards. In addition, a human operator regularly reviews the data from the storage box's hardware components to identify any potential problems.
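A minimal sketch of the threshold and trend checks, with hypothetical threshold values; the predictive part here is simply a least-squares slope over the recent samples:

```python
# Hypothetical thresholds per hardware component; real values come from vendor specs.
THRESHOLDS = {"fan_rpm_min": 1000, "temperature_c_max": 70, "psu_voltage_min": 11.4}

def check_temperature(samples, max_c=THRESHOLDS["temperature_c_max"]):
    """Return an alert on an absolute breach, or on a sustained upward trend."""
    latest = samples[-1]
    if latest > max_c:
        return "CRITICAL: temperature above threshold"

    # Simple predictive check: least-squares slope over the recent window.
    n = len(samples)
    mean_x, mean_y = (n - 1) / 2, sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) / \
            sum((x - mean_x) ** 2 for x in range(n))
    if slope > 0.5 and latest > max_c * 0.8:  # trending up and already near the limit
        return "WARNING: temperature trending towards threshold"
    return None

print(check_temperature([55, 57, 60, 62, 65]))  # -> WARNING: trending upwards
```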

To raise call home tickets for critical hardware component failures, I configured the storage box to send SNMP traps to the monitoring system when such a failure occurs; the monitoring system then sends email alerts and webhooks to a call home system. On top of this, Grafana Alerting notifies all stakeholders via email, SMS, or phone calls, and those stakeholders can also view the metrics and statistics on a Grafana kiosk dashboard.
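The call home path could look roughly like this, with hypothetical webhook and SMTP endpoints standing in for the real call home system:

```python
import smtplib
from email.message import EmailMessage

import requests

# Hypothetical endpoints and addresses; substitute the real call home and SMTP details.
CALL_HOME_WEBHOOK = "https://callhome.example.com/api/tickets"
SMTP_HOST = "smtp.example.com"

def raise_call_home_ticket(box, component, detail):
    """Forward a critical hardware failure to the call home system and notify by email."""
    payload = {"storage_box": box, "component": component,
               "severity": "critical", "detail": detail}
    requests.post(CALL_HOME_WEBHOOK, json=payload, timeout=10)

    msg = EmailMessage()
    msg["Subject"] = f"[CRITICAL] {component} failure on {box}"
    msg["From"] = "monitoring@example.com"
    msg["To"] = "storage-oncall@example.com"
    msg.set_content(detail)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

raise_call_home_ticket("box-a", "PSU-1", "SNMP trap: power supply failure detected")
```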

For trap handling, an snmptrapd daemon was set up to receive the SNMP traps sent by the storage boxes and forward them to a Telegraf server, which stores the traps in the TDengine time-series database. Grafana was then used to visualise and analyse the trap data that Telegraf had written to TDengine.

To design the database schema that optimises read queries over yearly data, I used a star schema, a relational schema layout that is well suited to analytical workloads.
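As an illustration of the star schema idea, here is a hypothetical fact table with two dimension tables, created through psycopg2 against the relational reporting database (all table and column names are assumptions):

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_storage_box (
    box_id   SERIAL PRIMARY KEY,
    hostname TEXT NOT NULL,
    model    TEXT
);
CREATE TABLE IF NOT EXISTS dim_date (
    date_id DATE PRIMARY KEY,
    year    INTEGER NOT NULL,
    month   INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS fact_vd_daily (
    box_id         INTEGER NOT NULL REFERENCES dim_storage_box (box_id),
    date_id        DATE    NOT NULL REFERENCES dim_date (date_id),
    vd_name        TEXT    NOT NULL,
    avg_iops       REAL,
    avg_latency_ms REAL,
    PRIMARY KEY (box_id, date_id, vd_name)
);
-- Yearly reads scan the fact table joined to the small dimension tables.
CREATE INDEX IF NOT EXISTS idx_fact_vd_daily_date ON fact_vd_daily (date_id);
"""

conn = psycopg2.connect(host="config-db", dbname="storage_reporting",
                        user="monitor", password="secret")
with conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
conn.close()
```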

Next, I needed to optimise the read queries to minimise execution time and improve performance. Although TDengine uses a cache-first policy, to further reduce the load on the database I also used Materialize on top of TDengine (via a plug-in) to create materialised views of the preprocessed data, which improves both the performance and the freshness of queries involving complex aggregations or joins.
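A sketch of the materialised-view layer, assuming Materialize's PostgreSQL-compatible wire protocol and a hypothetical source named vd_stats_source fed from the time-series data:

```python
import psycopg2

# Default Materialize connection parameters; view, source, and column names are illustrative.
conn = psycopg2.connect(host="materialize", port=6875,
                        dbname="materialize", user="materialize")
conn.autocommit = True
cur = conn.cursor()

# The view is kept incrementally up to date by Materialize, so reads stay cheap and fresh.
cur.execute("""
    CREATE MATERIALIZED VIEW yearly_box_summary AS
    SELECT box_name,
           date_trunc('year', ts) AS year,
           sum(iops)              AS total_iops,
           avg(latency_ms)        AS avg_latency_ms
    FROM vd_stats_source
    GROUP BY box_name, date_trunc('year', ts)
""")

cur.execute("SELECT * FROM yearly_box_summary")
for row in cur.fetchall():
    print(row)
conn.close()
```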

Finally, to monitor the health and performance of all the components of this monitoring system itself, I used Kafka, Prometheus, and Grafana. Kafka collects the health and performance data and feeds it to Prometheus, which stores it; Prometheus is then queried by Grafana and Grafana kiosks to display the data on a dashboard for the IT support team.
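One plausible way to bridge the Kafka health topic into Prometheus is a small consumer that exposes the consumed health events on an endpoint Prometheus can scrape; topic, broker, and metric names below are assumptions:

```python
import json

from kafka import KafkaConsumer                       # kafka-python package
from prometheus_client import Gauge, start_http_server

# Gauge exposed for Prometheus: 1 if a monitoring component reports healthy, else 0.
component_up = Gauge("monitoring_component_up",
                     "1 if the component is healthy", ["component"])

def main():
    start_http_server(9105)  # Prometheus scrapes http://<host>:9105/metrics
    consumer = KafkaConsumer(
        "monitoring-health",
        bootstrap_servers=["kafka:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        event = message.value  # e.g. {"component": "telegraf", "healthy": true}
        component_up.labels(component=event["component"]).set(1 if event["healthy"] else 0)

if __name__ == "__main__":
    main()
```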

Thus, to summarise, the whole design consists of four subsystems: the storage box monitoring system, the configuration management system, the health and performance monitoring system, and the alert and notification management system.

The storage box monitoring system is the core subsystem. It collects various types of data from the storage box, such as throughput, latency, and IOPS at the VD and pool level, plus aggregated data per box. This data is sent to a Telegraf cluster, a plugin-driven server agent that can collect and send metrics from different sources. The Telegraf cluster feeds the data into a TDengine cloud database, a high-performance time-series database designed for IoT applications. The TDengine cloud database can be accessed by Grafana cloud and Grafana kiosks, which are web-based platforms that display the data on interactive dashboards: Grafana cloud is a hosted version of Grafana that can be accessed from anywhere, while Grafana kiosks are dedicated devices that show the dashboards in full-screen mode. To reduce the load on the TDengine cloud database, Materialize is used as a caching layer between TDengine cloud and Grafana cloud; Materialize is a streaming SQL materialised view engine that can handle high-throughput, low-latency queries.
