Customers are migrating from traditional data centers and adopting cloud to take advantage of the agility and scalability offered by public and private cloud services.
This shift has significantly changed IT operations. Cloud infrastructure and services are dynamic and constantly changing, so they require new monitoring approaches, tools, and solutions that work at cloud scale.
This article offers a viewpoint on an effective cloud operations solution for modern cloud infrastructure and applications, however complex or dynamic they may be.
CloudOps and why it is important.
Cloud adoption alone is not enough. Once workloads are on the cloud, they need to be managed efficiently so that they run optimally and securely. This matters not only for the agility, scalability and resiliency expected from cloud adoption but also for cost optimization.
This is where cloud operations (CloudOps) comes in. CloudOps covers the practices and processes, enabled by technology solutions, that are used to manage cloud environments. It includes tasks such as provisioning, scaling, monitoring, and troubleshooting, while ensuring performance, security, and compliance. CloudOps differs from traditional IT operations (ITOps): ITOps is mostly about managing static IT infrastructure in data centers, whereas CloudOps is about managing and maintaining IT systems hosted on one or more clouds across service models (IaaS, PaaS, SaaS, BPaaS) and deployment models (private, public, hybrid and multi-cloud).
Challenges in managing Cloud with current ITOps practices.
- Elastic, Dynamic, Ephemeral Infrastructure and Compute – Workloads running on cloud present a different challenge from an operations perspective. Traditional IT deployments typically use hardware and virtual machine-based compute. On cloud, traditional workloads run on virtual machines (IaaS) alongside workloads that run on smaller units of compute such as containers or serverless functions. These smaller units are elastic, dynamic and ephemeral, and may last only for the duration of execution.
- Scaled Monitoring – With dynamic scaling, containers and serverless compute can grow from one instance to hundreds or even thousands very quickly. Such constantly changing infrastructure requires newer methods and tools that can monitor applications and infrastructure at cloud scale.
- Diverse Tools with Little Integration – Enterprises currently use multiple tools from various vendors to monitor different aspects of applications and infrastructure (e.g., CMDB, logging, monitoring, reporting, analytics). These tools are not integrated and report in silos. The result is poor visibility into overall performance, plus inefficiency and higher cost because manual intervention is needed.
- Limited Automation – Current ITOps automation focusses primarily on enterprise ITSM, such as workflow automation for service requests and problem management, patching and alerting, and much less on problem resolution and prevention.
- Metrics Measurement – Because processes and tools are disparate, existing metrics and KPIs do not give a clear measure of operations efficiency.
Key Principles for a CloudOps solution
When we look at the transformation of IT operations from current ITOps to next-generation CloudOps, the following are the key principles of such a solution:
- Integrated and Unified – Bringing together IT systems’ tools and processes to create integrated and unified operations, such as integrated logging, monitoring, analytics, and management.
- Unified Dashboard – Having a dashboard for 360-degree visibility into all aspects of CloudOps, such as system health, performance trends and business metrics.
- Real-time / Near real-time – Collecting, analyzing, and responding to data (logs, metrics, etc.) received from various systems in real time or near real time. This enables an organization to detect and respond to issues as they arise rather than waiting for a breakdown or a scheduled check.
- Automated – Using automation technologies as much as possible for CloudOps tasks and giving the machine the first right of refusal, so that a person steps in only when automation cannot handle the task.
Technology Implications – CloudOps Solution
The technology solution for a CloudOps platform comprises various capabilities, each catering to an individual area. The solution capability map below depicts the solution components of a NextGen CloudOps delivery platform. All these capabilities are part of the overall target cloud operating model (ToM) created for the enterprise. The ToM provides a standardized cloud management and operations service across the customer’s on-prem and cloud environments.
Exhibit 1: CloudOps Solution Capability Map (please refer to the end of the article for the diagram)
a) Unified Dashboard and Reporting
Getting an aggregate view of system health with new metrics that focus on business SLAs and the performance trends of services rather than isolated data points from hosts. For example, rather than focusing on a host-level issue such as elevated CPU, the focus could be on latency for a web application; if latency starts to surge, action is taken immediately. Similarly, aggregate system health needs to be visualized in time-series graphs, heat maps, host maps, etc.
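The latency example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `percentile` and `latency_breach`, the choice of the 95th percentile, the 500 ms threshold and the sample data are all assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breach(samples_ms, threshold_ms, pct=95):
    """Return True when the chosen latency percentile exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms


# Hypothetical request latencies (ms) for one web service over a short window.
window = [100] * 18 + [900, 950]
if latency_breach(window, threshold_ms=500):
    print("p95 latency above threshold: trigger the response action")
```

The check looks at what users of the service experience, so it fires on a latency surge regardless of which host or container caused it.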
b) Monitoring and Observability
Instrumenting, gathering, and monitoring data from all aspects of the infrastructure, including compute resources, applications, and cloud services, to analyze the connections between them. This also involves making the collected metrics readily available on a centralized platform. The result is a comprehensive understanding of the system’s performance and operation, with observability across a diverse and distributed technology mix, business processes and customer journeys.
c) Log Management
Integrated tagging and labelling for compute, cloud services, APIs, security, network, firewall, etc. Consistent tags make it possible to identify and aggregate log and metrics data, which in turn helps in identifying and resolving problems quickly.
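As a rough sketch of how a shared tagging scheme helps aggregation, the Python snippet below groups log records from different sources by one tag. The record layout, the tag names (`service`, `env`) and the helper `group_by_tag` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_tag(records, tag_key):
    """Group log messages by the value of one tag; untagged records stay visible."""
    groups = defaultdict(list)
    for record in records:
        value = record.get("tags", {}).get(tag_key, "untagged")
        groups[value].append(record["message"])
    return dict(groups)


# Hypothetical log records from different sources sharing one tagging scheme.
logs = [
    {"source": "firewall", "tags": {"service": "checkout", "env": "prod"}, "message": "connection denied"},
    {"source": "api-gateway", "tags": {"service": "checkout", "env": "prod"}, "message": "HTTP 502"},
    {"source": "vm", "tags": {"service": "search", "env": "prod"}, "message": "disk 80% full"},
    {"source": "network", "message": "link flap"},
]
by_service = group_by_tag(logs, "service")
```

Here the firewall and API gateway entries end up side by side under `checkout`, while the record without tags lands in an `untagged` bucket, which makes gaps in tagging coverage easy to spot.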
d) Automation
Moving from tactical to strategic automation. This includes typical runbook automation for day-to-day tasks, ranging from operational automation, shift-left tasks and scripted incident/alert resolution to activities like user access management, hotfix and patch deployment, database housekeeping and backup task management. It can also include automation for event correlation and handling, self-healing capabilities that automatically resolve and prevent issues, and a move to an everything-as-code environment.
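One way to picture scripted alert resolution with a human fallback is the small dispatcher below. It is a sketch under stated assumptions: the alert fields, the `RUNBOOKS` mapping, and the functions `restart_service`, `handle_alert` and `open_ticket` are hypothetical stand-ins, and a real runbook would call an orchestration or ITSM tool instead of returning strings.

```python
def restart_service(alert):
    """Stand-in remediation; a real runbook would call an orchestration tool here."""
    return f"restarted {alert['resource']}"


def open_ticket(alert, reason):
    """Stand-in escalation; a real implementation would raise an incident."""
    return {"status": "escalated", "detail": reason}


def handle_alert(alert, runbooks, escalate):
    """Try the scripted fix first; hand over to a person if none exists or it fails."""
    action = runbooks.get(alert["type"])
    if action is None:
        return escalate(alert, reason="no runbook")
    try:
        return {"status": "auto-resolved", "detail": action(alert)}
    except Exception as exc:  # a failed fix must not hide the incident
        return escalate(alert, reason=f"runbook failed: {exc}")


# Hypothetical mapping from alert type to remediation script.
RUNBOOKS = {"service_down": restart_service}

resolved = handle_alert({"type": "service_down", "resource": "web-01"}, RUNBOOKS, open_ticket)
escalated = handle_alert({"type": "disk_corruption", "resource": "db-02"}, RUNBOOKS, open_ticket)
```

Automation gets the first attempt at every alert, and anything it cannot handle, or fails to fix, is escalated rather than dropped.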
e) Security and Compliance
This ensures that the cloud platform always meets security and compliance requirements. It includes security event monitoring, endpoint protection, vulnerability checks, threat detection, patch and certificate management, etc., to meet industry standards and regulatory compliance requirements.
f) Availability and Resiliency
Ensuring that the system is always operational and can handle unexpected events. This includes implementing early warning systems, redundancy and failover mechanisms (backup and restore), a disaster recovery plan and archival systems. It also includes proactive monitoring to identify potential failure points before failures occur.
g) Application Services
Interfacing with application services, which include application development, change and maintenance, integration, non-functional requirements, and quality assurance, and ensuring that the controls required for CloudOps are designed and built into the applications themselves rather than added as a post-release exercise.
h) Platform Engineering Services
This involves core platform engineering services for the cloud platform, such as service design, provisioning, capacity management and service catalogue management.
i) Integrated Service Management (iSM)
Implementing ITSM processes, service catalog and request fulfilment, major incident management (MIM), self-service and knowledge management to ensure that services meet the organization’s overall goals and objectives.
j) FinOps / Cost Management
This allows organizations to understand and baseline their cloud needs. It provides visibility into spend on cloud services, the ability and tools to optimize usage and implement recommendations, cost transparency with stakeholders, and a mechanism to charge back or show back costs to the lines of business (LoBs). Essentially, it is the monitoring and control of cloud cost from a holistic perspective. It includes defining and implementing a clear cloud cost management framework to manage cloud economics: consumption and metering, performance, optimization (initial and continuous), trend analytics and cost assignment. For details on Cloud Cost Management, you may refer to the following article
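The cost assignment step can be illustrated with a simple show-back calculation. The sketch below assumes that billing line items carry a `lob` tag and that costs arrive as decimal strings; the record layout, the tag name and the function `showback` are hypothetical, and the figures are invented for the example.

```python
from collections import defaultdict
from decimal import Decimal


def showback(line_items, tag_key="lob"):
    """Sum cost per line of business; untagged spend is reported, not hidden."""
    totals = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "unallocated")
        totals[owner] += Decimal(item["cost"])
    return dict(totals)


# Hypothetical billing line items; costs are strings to avoid float rounding.
bill = [
    {"service": "compute", "cost": "120.50", "tags": {"lob": "retail"}},
    {"service": "storage", "cost": "30.25", "tags": {"lob": "retail"}},
    {"service": "database", "cost": "75.00", "tags": {"lob": "payments"}},
    {"service": "compute", "cost": "12.10"},
]
report = showback(bill)
for lob, total in sorted(report.items()):
    print(f"{lob}: {total}")
```

Keeping an explicit `unallocated` bucket is a deliberate choice in this sketch: spend that nobody owns is usually the first place to look for both tagging gaps and savings.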
k) DevSecOps
Practices and approaches that integrate the three areas of efficiency, security, and quality of the product. This includes automated solutions for areas such as CI/CD pipeline management, self-service, orchestration, change, release and deployment.
l) Ops Analytics
Monitoring and optimizing performance of cloud-based systems and applications using data and analytics tools. This includes gathering log and metrics data from various sources such as cloud native monitoring services, APM and infrastructure management tools, to gain insights into system performance trends for business SLA measurement and to make informed decisions to optimize the system.
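A performance trend of this kind can be approximated by comparing the most recent window of a metric with the window before it. The following is a minimal sketch, assuming a daily series of a positive-valued metric; the function `trend_ratio`, the sample values and the 20% tolerance are assumptions for illustration only.

```python
from statistics import mean


def trend_ratio(series, window):
    """Ratio of the latest window's mean to the preceding window's mean."""
    if len(series) < 2 * window:
        raise ValueError("need at least two full windows of data")
    recent = series[-window:]
    previous = series[-2 * window:-window]
    return mean(recent) / mean(previous)


# Hypothetical daily p95 latency (ms) gathered from monitoring tools.
daily_p95 = [100, 100, 100, 150, 150, 150]
ratio = trend_ratio(daily_p95, window=3)
if ratio > 1.2:  # assumed tolerance before the trend is flagged
    print(f"latency trending up by {ratio - 1:.0%}")
```

A ratio well above 1 signals a worsening trend that may threaten a business SLA before any single data point looks alarming.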
CloudOps helps in realizing the benefits envisaged during cloud adoption. It involves practices and processes underpinned by technology solutions for various aspects of CloudOps. Monitoring and logging, Automation, Analytics and Security are the key pillars of such a solution.
Migrating from traditional ITOps to CloudOps is a journey. All organizations adopting cloud should consider CloudOps early in that journey so that all operational requirements are captured and operational controls are built into the cloud infrastructure and the applications being built on or migrated to it.