Systems monitoring platform integrating artificial intelligence for incident response in servers

Authors

DOI:

https://doi.org/10.51252/rcsi.v5i2.811

Keywords:

alerts, IT governance, Grafana, LLM, Loki, observability

Abstract

The increasing complexity of IT management and the need to monitor critical infrastructure metrics, such as CPU usage, memory, storage, and service logs, detect failures, and respond quickly to alerts, imply the adoption of advanced technologies that enable comprehensive monitoring and efficient response. This work developed a server monitoring system with alerts sent via Telegram. Additionally, it integrates artificial intelligence to provide immediate solutions to server incidents, using tools such as Grafana and Prometheus for metric collection and Grafana Loki for log management. The OpenAI API was incorporated to analyze the logs and enhance alerts with a detailed diagnosis. A total of 311 tests were conducted, where the results showed that the system notified incidents in an average of 1.02 seconds, while the GPT model completed the analysis in an average of 2.17 seconds, allowing root causes of problems to be identified and timely recommendations for resolution to be generated. It is concluded that the integration of artificial intelligence and proactive monitoring improves incident management, suggesting future applications in IoT environments to enrich monitoring.

References

Adamyaa, D., Dinesh, S., Priya, & Soumya. (2022). Building a Monitoring Framework for a Distributed Cloud Application Using Prometheus and Chef—An Overview. International Journal for Research in Applied Science and Engineering Technology, 10(7), Article 7. https://doi.org/10.22214/ijraset.2022.45716 DOI: https://doi.org/10.22214/ijraset.2022.45716

Ahmed, S., Singh, M., Doherty, B., Ramlan, E., Harkin, K., & Coyle, D. (2022). AI for Information Technology Operation (AIOps): A Review of IT Incident Risk Prediction. 2022 9th International Conference on Soft Computing & Machine Intelligence (ISCMI), 253-257. https://doi.org/10.1109/ISCMI56532.2022.10068482 DOI: https://doi.org/10.1109/ISCMI56532.2022.10068482

Alia, P. A., Prayogo, J. S., Kriswibowo, R., & Setyadi, A. T. (2024). Implementation Open Artificial Intelligence ChattGPT Integrated With Whatsapp Bot. Advance Sustainable Science, Engineering and Technology, 6(1), Article 1. https://doi.org/10.26877/asset.v6i1.17909 DOI: https://doi.org/10.26877/asset.v6i1.17909

Anam, K., Rofi, D. N., & Meiyanti, R. (2023). Monitoring System for Temperature and Humidity Sensors in the Production Room Using Node-Red as the Backend and Grafana as the Frontend. Journal of Systems Engineering and Information Technology (JOSEIT), 2(2), Article 2. https://doi.org/10.29207/joseit.v2i2.5222 DOI: https://doi.org/10.29207/joseit.v2i2.5222

Behutiye, W., Tripathi, N., & Isomursu, M. (2024). Adopting Scrum in Hybrid Settings, in a University Course Project: Reflections and Recommendations. IEEE Access, 12, 105633-105650. Scopus. https://doi.org/10.1109/ACCESS.2024.3434662 DOI: https://doi.org/10.1109/ACCESS.2024.3434662

Bhanage, D. A., Pawar, A. V., Kotecha, K., & Abraham, A. (2023). Failure Detection Using Semantic Analysis and Attention-Based Classifier Model for IT Infrastructure Log Data. IEEE Access, 11, 108178-108197. https://doi.org/10.1109/ACCESS.2023.3319438 DOI: https://doi.org/10.1109/ACCESS.2023.3319438

Broniewski, A., Tirmizi, M. I., Zimányi, E., & Sakr, M. (2023). Using MobilityDB and Grafana for Aviation Trajectory Analysis. undefined-undefined. https://doi.org/10.3390/engproc2022028017 DOI: https://doi.org/10.3390/engproc2022028017

Butarbutar, R. T. B. D., Sasmita, G. M. A., & Pratama, I. P. A. E. (2023). Development of a Notification-Based Network Security Monitoring System Using Network Development Life Cycle (NDLC). JITTER : Jurnal Ilmiah Teknologi dan Komputer, 4(3), 1933. https://doi.org/10.24843/JTRTI.2023.v04.i03.p01 DOI: https://doi.org/10.24843/JTRTI.2023.v04.i03.p01

Chan, Y. W., Fathoni, H., Yen, H. Y., & Yang, C. T. (2022). Implementation of a Cluster-Based Heterogeneous Edge Computing System for Resource Monitoring and Performance Evaluation. IEEE Access, 10, 38458-38471. https://doi.org/10.1109/ACCESS.2022.3166154 DOI: https://doi.org/10.1109/ACCESS.2022.3166154

Ehrlinger, L., & Wöß, W. (2022). A Survey of Data Quality Measurement and Monitoring Tools. Frontiers in Big Data, 5, 850611. https://doi.org/10.3389/fdata.2022.850611 DOI: https://doi.org/10.3389/fdata.2022.850611

Erdei, R., & Toka, L. (2023). Minimizing Resource Allocation for Cloud-Native Microservices. Journal of Network and Systems Management, 31(2), 35. https://doi.org/10.1007/s10922-023-09726-3 DOI: https://doi.org/10.1007/s10922-023-09726-3

Gaol, F. L., Santoso, S., & Matsuo, T. (2022). Design and development of the application monitoring the use of server resources for server maintenance. Open Engineering, 12(1), 524-538. Scopus. https://doi.org/10.1515/eng-2022-0055 DOI: https://doi.org/10.1515/eng-2022-0055

Garcia, L. A., OliveiraJr, E., Leal, G. C. L., & Morandini, M. (2021). A Unified Feature Model for Scrum Artifacts from a Literature and Practice Perspective. 296-305. https://doi.org/10.5753/eres.2020.13740 DOI: https://doi.org/10.5753/eres.2020.13740

Ghimire, D., & Charters, S. (2022). The Impact of Agile Development Practices on Project Outcomes. Software, 1(3), Article 3. https://doi.org/10.3390/software1030012 DOI: https://doi.org/10.3390/software1030012

Hadikusuma, R. S., Lukas, L., & Bachri, K. O. (2023). Survey Paper: Optimization and Monitoring of Kubernetes Cluster using Various Approaches. Sinkron, 8(3), Article 3. https://doi.org/10.33395/sinkron.v8i3.12424 DOI: https://doi.org/10.33395/sinkron.v8i3.12424

Iqromullah, R., Khairil, K., & Suryana, E. (2023). Security System Implementation And Monitoring Networks At Sma N 10 City Of Bengkulu. Jurnal Media Computer Science, 2(2), Article 2. https://doi.org/10.37676/jmcs.v2i2.4431 DOI: https://doi.org/10.37676/jmcs.v2i2.4431

Jani, Y. (2024). Unified Monitoring for Microservices: Implementing Prometheus and Grafana for Scalable Solutions. Journal of Artificial Intelligence, Machine Learning and Data Science, 2(1), 848-852. https://doi.org/10.51219/JAIMLD/yash-jani/206 DOI: https://doi.org/10.51219/JAIMLD/yash-jani/206

Kadenic, M. D., Koumaditis, K., & Junker-Jensen, L. (2023). Mastering scrum with a focus on team maturity and key components of scrum. Information and Software Technology, 153, 107079. https://doi.org/10.1016/j.infsof.2022.107079 DOI: https://doi.org/10.1016/j.infsof.2022.107079

Kuang, J., Liu, J., Huang, J., Zhong, R., Gu, J., Yu, L., Tan, R., Yang, Z., & Lyu, M. R. (2024). Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: A Hybrid Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 369-380. https://doi.org/10.1145/3639477.3639745 DOI: https://doi.org/10.1145/3639477.3639745

Lamothe, M., Guéhéneuc, Y.-G., & Shang, W. (2021). A Systematic Review of API Evolution Literature. ACM Computing Surveys, 54, 1-36. https://doi.org/10.1145/3470133 DOI: https://doi.org/10.1145/3470133

Penkov, St., & Taneva, A. (2021). Chat Programs in the Frame of Control System. IFAC-PapersOnLine, 54(13), 52-56. https://doi.org/10.1016/j.ifacol.2021.10.417 DOI: https://doi.org/10.1016/j.ifacol.2021.10.417

Santosa, I., & Mulyana, R. (2023). The IT Services Management Architecture Design for Large and Medium-sized Companies based on ITIL 4 and TOGAF Framework. JOIV : International Journal on Informatics Visualization, 7(1), 30. https://doi.org/10.30630/joiv.7.1.1590 DOI: https://doi.org/10.30630/joiv.7.1.1590

Santoso, G., Setiawan, J., & Sulaiman, A. (2023). Development of OpenAI API Based Chatbot to Improve User Interaction on the JBMS Website. G-Tech: Jurnal Teknologi Terapan, 7(4), Article 4. https://doi.org/10.33379/gtech.v7i4.3301 DOI: https://doi.org/10.33379/gtech.v7i4.3301

Sassa, A. C., Almeida, I. A. de, Pereira, T. N. F., & Oliveira, M. S. de. (2023). Scrum: A Systematic Literature Review. International Journal of Advanced Computer Science and Applications, 14(4), Article 4. https://doi.org/10.14569/IJACSA.2023.0140420 DOI: https://doi.org/10.14569/IJACSA.2023.0140420

Simili, E., Stewart, G., Roy, G., Skipsey, S., & Britton, D. (2021). A hybrid system for monitoring and automated recovery at the Glasgow Tier-2 cluster. EPJ Web of Conferences, 251. https://doi.org/10.1051/epjconf/202125102047 DOI: https://doi.org/10.1051/epjconf/202125102047

Sun, Y., Ye, K., & Xu, C.-Z. (2020). PLMSys: A Cloud Monitoring System Based on Cluster Performance and Container Logs. En Q. Zhang, Y. Wang, & L.-J. Zhang (Eds.), Cloud Computing – CLOUD 2020 (Vol. 12403, pp. 111-125). Springer International Publishing. https://doi.org/10.1007/978-3-030-59635-4_8 DOI: https://doi.org/10.1007/978-3-030-59635-4_8

Yu, Q., Zhao, N., Li, M., Li, Z., Wang, H., Zhang, W., Sui, K., & Pei, D. (2024). A survey on intelligent management of alerts and incidents in IT services. Journal of Network and Computer Applications, 224, 103842. https://doi.org/10.1016/j.jnca.2024.103842 DOI: https://doi.org/10.1016/j.jnca.2024.103842

Downloads

Published

2025-07-20

How to Cite

Espinosa-Luna, B. H., Castillo-Oliva, J., García-Gutiérrez, W. F., & Mendoza-de-los-Santos, A. C. (2025). Systems monitoring platform integrating artificial intelligence for incident response in servers. Revista Científica De Sistemas E Informática, 5(2), e811. https://doi.org/10.51252/rcsi.v5i2.811