Originally posted on Forbes.
AIOps is an exciting area where artificial intelligence is leveraged to automate infrastructure operations and DevOps. It reduces the number of incidents through proactive monitoring and remediation. Public cloud providers and large-scale data center operators are already implementing AIOps to reduce their cost of operations.
One of the classic use cases of AIOps is the proactive scaling of elastic infrastructure. Instead of constantly monitoring the CPU or RAM utilization to trigger an auto-scale event, a deep learning model gets trained on a dataset representing the timeline, the inbound traffic, and the number of compute instances serving the application. The model then predicts the optimal capacity. The shift from reactive to proactive scaling saves thousands of dollars for retail companies with consumer-facing websites during events like Black Friday and Cyber Monday.
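To make the idea of proactive scaling concrete, here is a minimal sketch in Python. It deliberately swaps the deep learning model for a much simpler seasonal-naive forecast (averaging the same hour across previous days), and the function names, the 20% headroom, and the per-instance throughput figure are all assumptions made for illustration, not part of any cloud provider's API.

```python
import math
from statistics import mean


def forecast_next_hour(traffic_history, season_length=24, seasons=3):
    """Seasonal-naive forecast for the next hour of traffic.

    traffic_history is a list of hourly request rates, oldest first.
    The prediction is the average of the same hour in up to `seasons`
    previous days (a stand-in for a trained model).
    """
    if len(traffic_history) < season_length:
        raise ValueError("need at least one full season of history")
    samples = []
    # The hour to predict sits at index len(traffic_history); the same
    # hour one season earlier sits season_length positions before it.
    idx = len(traffic_history) - season_length
    while idx >= 0 and len(samples) < seasons:
        samples.append(traffic_history[idx])
        idx -= season_length
    return mean(samples)


def instances_needed(predicted_rps, rps_per_instance, headroom=0.2, min_instances=1):
    """Translate predicted traffic into an instance count with safety headroom."""
    required = predicted_rps * (1 + headroom) / rps_per_instance
    return max(min_instances, math.ceil(required))


# Two days of hourly traffic with the same daily shape (hypothetical data).
history = [100 + 10 * hour for hour in range(24)] * 2
predicted = forecast_next_hour(history)
capacity = instances_needed(predicted, rps_per_instance=50)
```

The point of the sketch is the order of operations: capacity is derived from a traffic prediction made ahead of time, rather than from a CPU or RAM alarm that fires after the load has already arrived.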
Amazon Web Services already offers this capability to its users in the form of EC2 predictive scaling. But ML-driven scaling is just the tip of the AIOps iceberg.
The power of AIOps lies in its ability to automate the functions typically performed by DevOps engineers and Site Reliability Engineers (SRE). It will significantly improve the CI/CD pipelines implemented for software deployment by intelligently monitoring the mission-critical workloads running in staging and production environments.
Large Language Models (LLMs) such as GPT-3 from OpenAI will revolutionize software development, deployment, and observability, which is crucial for maintaining the uptime of workloads.
GitHub Copilot, a feature that brought AI-enabled pair programming to developers, writes compact and efficient code, significantly accelerating the development cycle. Behind the scenes, GitHub Copilot uses Codex, an ML model based on GPT-3. Codex can write programs in dozens of languages, including Python and Go. It’s been trained on 159 GB of Python code from 54 million GitHub repositories. With plug-ins for popular IDEs such as VS Code and Neovim, Copilot empowers developers to automate much of their coding.
Once the code is committed, AI reviews and analyzes it to find blind spots in programs that may prove expensive. Amazon CodeGuru is a classic example of an AI-driven tool that analyzes and profiles code. It identifies critical issues and recommends ways to improve code quality.
A modern CI/CD pipeline takes the code that has passed all the tests and approvals and packages it into artifacts such as container images or JAR files. This step involves identifying the dependencies of the software and including them in the package. DevOps engineers are responsible for writing the Dockerfile that defines the software’s dependencies and the base image. This step is as crucial as software development: a mistake can prove expensive, leading to performance degradation. DevOps engineers can rely on LLMs to generate a well-optimized definition for packaging the software. The image below shows the output from ChatGPT generating a Dockerfile.
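As a rough illustration of what such a packaging definition contains (this is not the ChatGPT output referenced above), the Python sketch below renders a small Dockerfile for a Python service. The helper name `render_dockerfile`, the base image tag, and the file names are hypothetical choices made for this example.

```python
import json


def render_dockerfile(base_image, requirements_file, command, port=None):
    """Render a minimal Dockerfile for a Python service as a string."""
    lines = [
        f"FROM {base_image}",
        "WORKDIR /app",
        # Copy the dependency manifest first so the install layer can be
        # cached between builds when only application code changes.
        f"COPY {requirements_file} .",
        f"RUN pip install --no-cache-dir -r {requirements_file}",
        "COPY . .",
    ]
    if port is not None:
        lines.append(f"EXPOSE {port}")
    # json.dumps yields the exec form, e.g. CMD ["python", "app.py"].
    lines.append(f"CMD {json.dumps(command)}")
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile(
    base_image="python:3.11-slim",        # assumed base image
    requirements_file="requirements.txt",  # assumed dependency manifest
    command=["python", "app.py"],          # assumed entry point
    port=8080,
)
```

Whether the definition is written by hand, by a script like this, or by an LLM, the two decisions that matter most are the ones the paragraph names: the base image and the dependency list.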
Once the software is packaged as container images, deployment comes into the picture. DevOps engineers write YAML files targeting the Kubernetes environment. LLMs trained on popular YAML definitions can effectively generate well-optimized markup to deploy microservices. Below is a screenshot of ChatGPT generating the Kubernetes YAML definition to deploy the container.
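For readers who have not seen one, the sketch below builds the kind of Deployment definition described here (it is not the ChatGPT output from the screenshot). It uses Python and emits JSON, which `kubectl apply -f` accepts alongside YAML; the function name, image reference, replica count, and port are assumptions for illustration.

```python
import json


def deployment_manifest(name, image, replicas=2, container_port=8080):
    """Build a minimal Kubernetes apps/v1 Deployment as a Python dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": container_port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical service name and image reference.
manifest = deployment_manifest("web", "registry.example.com/web:1.0.0")
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` to a file gives something that can be applied to a cluster. A production definition would normally add resource requests and limits plus readiness and liveness probes, which this sketch leaves out.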
When the software is deployed into production, observability is needed to contextualize the monitoring of the entire stack. Instead of tracking individual metrics such as CPU and RAM utilization in isolation, observability brings events, logs, and traces into context to quickly identify the root cause of a problem. SREs then swing into action to remediate and get the application back to life. The mean time between failures (MTBF) directly impacts the SLAs offered by the operations team.
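The link between MTBF and an SLA can be made concrete with a little arithmetic. The Python sketch below assumes a hypothetical 30-day window and a made-up incident list; it also computes mean time to recovery (MTTR), which is not mentioned above but is the usual companion metric, since availability depends on both how often failures occur and how long each one lasts.

```python
from datetime import datetime, timedelta


def reliability_metrics(window_start, window_end, incidents):
    """Compute MTBF, MTTR, and availability over an observation window.

    incidents is a list of (failure_time, recovery_time) tuples that
    fall inside the window and do not overlap.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = window_end - window_start
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = total - downtime
    count = len(incidents)
    return {
        "mtbf": uptime / count,          # average uptime per failure
        "mttr": downtime / count,        # average time to recover
        "availability": uptime / total,  # fraction of the window spent up
    }


# Hypothetical month with three one-hour outages.
start = datetime(2023, 1, 1)
end = start + timedelta(days=30)
outages = [
    (start + timedelta(days=5), start + timedelta(days=5, hours=1)),
    (start + timedelta(days=14), start + timedelta(days=14, hours=1)),
    (start + timedelta(days=22), start + timedelta(days=22, hours=1)),
]
metrics = reliability_metrics(start, end, outages)
```

With these invented numbers the service is up for 717 of 720 hours, so MTBF is 239 hours and availability is roughly 99.58%, which would miss a 99.9% SLA. Fewer failures (higher MTBF) or faster recovery (lower MTTR) both move the figure up.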
While GPT-3-based models such as Codex, GitHub Copilot, and ChatGPT assist developers and operators, the same GPT-3 model can come to the rescue of SREs. An LLM trained on logs emitted by popular open source software can analyze them and find anomalies that may lead to downtime. Combined with the observability stack, these models can automate most of the actions a typical SRE performs. Observability companies such as New Relic, ScienceLogic, and Datadog have integrated machine learning into their stacks. The promise of this integration is self-healing applications with minimal administrative intervention.
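To show the shape of log-based anomaly detection, here is a minimal Python sketch. It is a plain statistical baseline (a z-score over per-minute error counts), not an LLM and not how any of the vendors named above implement their products. The log format, with an ISO-style timestamp at the start of each line, and the threshold of three standard deviations are assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev

ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def error_counts_per_minute(log_lines):
    """Count error-level lines per minute.

    Assumes each line starts with a timestamp like 2023-01-01T10:20:05,
    so the first 16 characters identify the minute.
    """
    counts = Counter()
    for line in log_lines:
        minute = line[:16]
        # Adding 0 still registers quiet minutes in the baseline.
        counts[minute] += 1 if ERROR_PATTERN.search(line) else 0
    return counts


def anomalous_minutes(counts, threshold=3.0):
    """Return minutes whose error count is far above the average."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(m for m, c in counts.items() if (c - mu) / sigma > threshold)


# Hypothetical log: twenty quiet minutes followed by a burst of errors.
logs = [f"2023-01-01T10:{m:02d}:00 INFO request ok" for m in range(20)]
logs += [f"2023-01-01T10:20:{s:02d} ERROR connection refused" for s in range(10)]
flagged = anomalous_minutes(error_counts_per_minute(logs))
```

A model trained on real logs would look at message content and sequence rather than raw counts, but the workflow is the same: build a picture of normal behavior, flag deviations, and hand the flagged window to an automated remediation step or an on-call engineer.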
Large Language Models and proven time-series analysis are set to redefine the functions of DevOps and SRE. They will play a significant role in ensuring that the software running in the cloud and modern infrastructure is always available.