In the past few months, we looked at the first five of the seven domains that are on the CompTIA Cloud+ certification entry-level exam (number CV0-001). This month, the focus turns to the last two domains as we conclude the discussion.
The sixth domain — Systems Management — has four topic areas that are beneath it:
- Explain policies and procedures as they relate to a cloud environment
- Given a scenario, diagnose, remediate and optimize physical host performance
- Explain common performance concepts as they relate to the host and the guest
- Implement appropriate testing techniques when deploying cloud services
Together, these four topics, and this domain, combine for 11 percent of the exam questions. Once again, because this is an entry-level exam, there is a heavy focus on definitions and knowledge as opposed to actual implementation. That said, each of the four topic areas is examined in order below.
Policies and Procedures
Network and IP planning must be done carefully; it is something you will never be fully satisfied with and will constantly tweak. A back-out plan should be in place and, as you make those tweaks, it is essential to monitor regularly and document everything.
As you document, keep in mind what you would want to find if you suddenly had to fill in for another administrator who unexpectedly had to be away, and try to create that level of help for anyone who might need to succeed you.
Change management best practices (a good article on which can be found here) require you to create usable documentation. They also expect a level of configuration control and an approval process that approves with caution yet does not unnecessarily bog things down with inflexible formalities. While asset accountability is not usually thought of as a glamorous part of any workplace, it is a vital one.
Configuration standardization can be accomplished by using either a Change Manager, a Change Team/Board, or some combination of the two. It is imperative that changes take into consideration all stakeholders.
To automate configuration management, you can employ a CMDB — Configuration Management Database — to help with the approval process and configuration control as well as aid with documentation (for more of an overview, look here).
Capacity Management involves monitoring for changes and looking to see what is trending. Maintenance Windows should be identified and used for server upgrades and patch installation to minimize downtime and service disruptions as much as possible.
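To make the idea of watching what is trending more concrete, here is a minimal Python sketch that fits a straight line to evenly spaced usage samples and estimates how many sampling periods remain before a capacity limit is reached. The function names, the sample figures, and the assumption that growth is linear are illustrative only; they do not come from the exam objectives or from any particular monitoring tool.

```python
def linear_trend(samples):
    """Return (slope, intercept) of a least-squares line through evenly spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to see a trend")
    mean_x = (n - 1) / 2              # samples are taken at x = 0, 1, ..., n-1
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_full(samples, capacity):
    """Estimate sampling periods left before the trend line reaches capacity.

    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    current = intercept + slope * (len(samples) - 1)
    return max(0.0, (capacity - current) / slope)


# Hypothetical weekly disk usage in GB against a 200 GB volume.
weekly_usage_gb = [100, 110, 120, 130]
print(periods_until_full(weekly_usage_gb, capacity=200))
```

With these made-up numbers, usage grows 10 GB per week, so the sketch projects seven more weeks before the volume fills. Real usage is rarely this tidy, which is why the trend should be rechecked regularly.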
Systems Life Cycle Management tools, of which Microsoft Operations Framework (MOF) and the Information Technology Infrastructure Library (ITIL) are two, provide guidelines and structure for operations. As an example, MOF can be broken into four components:
Microsoft classifies MOF as a solution accelerator, and the latest version of it can be found and downloaded here.
Optimizing Physical Host Performance
Every administrator wants to get as much performance out of their systems as possible and some key areas are:
- Disk performance: Indexing and caching can help with this. Key metrics to watch are data throughput and the number of requests
- Disk tuning: Defragmentation can decrease file load times, but only if the drives are hard drives and not solid state. Important metrics to watch/tweak are block sizes and controller-related variables
- Disk latency: While this technically fits in with the categories above, it can be monitored by keeping an eye on the average response time
- Swap disk space: While the performance associated with almost every file improves if the data is contiguous, nowhere is this more the case than with the swap file/space being used. Be sure to set aside more than enough space to allow this to grow without needing to be in more than one location
- I/O tuning: SQLIO, while intended for use with SQL Server, is a tool from Microsoft that can be used to determine the I/O capacity of a given configuration (it can be downloaded here). Once you are armed with that, you can more easily tweak the I/O request size
There are any number of performance management and monitoring tools available. Some to know for the exam are:
- Performance Logs and Alerts
- Windows Management Instrumentation (WMI) objects (more information is available here). Two sets to monitor are VirtualMachine counters and VirtualNetwork counters
- SAR (system activity) and STRACE (system calls/signals)
Regardless of the tool(s) used, focus on eliminating bottlenecks and document everything you do. As a general rule, the more of everything (RAM, processors, hard disk space) the better. Be sure to monitor usage and mix resource-intensive workloads to avoid excessive demands on any weak areas. You should regularly create baselines with load profiles and include in them counters related to throughput, transactions, and latency.
When it comes to hypervisor configuration best practices, you want to monitor I/O throttling and CPU wait time. You also want to tweak Memory Ballooning variables as much as possible. Memory Ballooning makes the guest aware of low memory on the host and transfers the memory shortage from the host to the VM, where the guest can decide which pages to page out without the hypervisor needing to oversee the process.
It is vital to thoroughly test the impact of any and all changes to the virtual environment and try to head off problems before they occur. Common issues, both in and out of the virtual environment, include:
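As a rough illustration of putting a baseline to work, the Python sketch below compares current counter readings against a stored baseline and reports any counter that has drifted by more than a chosen tolerance. The counter names, the 20 percent tolerance, and the sample values are all hypothetical; in practice the numbers would come from whichever monitoring tool you use.

```python
def compare_to_baseline(baseline, current, tolerance=0.20):
    """Return {counter: fractional change} for counters that drifted past tolerance.

    baseline and current map counter names to numeric readings.
    Counters missing from current, or with a zero baseline, are skipped.
    """
    deviations = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            deviations[name] = round(change, 3)
    return deviations


# Hypothetical baseline taken under a known load profile.
baseline = {"throughput_mbps": 400, "transactions_per_sec": 1000, "latency_ms": 5}
# Hypothetical readings taken later under the same load profile.
current = {"throughput_mbps": 380, "transactions_per_sec": 700, "latency_ms": 9}

for counter, change in compare_to_baseline(baseline, current).items():
    print(f"{counter}: {change:+.0%} versus baseline")
```

In this made-up case, throughput is within tolerance, while transactions (down 30 percent) and latency (up 80 percent) would be flagged for a closer look.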
- Disk Failure
- HBA Failure
- Memory Failure
- NIC Failure
- CPU Failure
Input/output operations per second (IOPS) is the de facto standard for measuring disk performance. Know that there are two types of disk operations that occur — reading and writing — and optimization is possible with either operation.
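To see how IOPS relates to the other disk metrics mentioned earlier, the Python sketch below uses two common rules of thumb: a single spinning disk can complete roughly one operation per average seek plus half a rotation, and throughput is IOPS multiplied by block size. The function names and drive figures are assumptions for illustration, and real-world results vary with caching, queue depth, and the read/write mix.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def estimated_iops(avg_seek_ms, rpm):
    """Rule-of-thumb random IOPS for one spinning disk."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))


def throughput_mb_per_s(iops, block_size_kb):
    """Throughput implied by an IOPS figure at a given block (request) size."""
    return iops * block_size_kb / 1024


# Hypothetical 15,000 RPM drive with a 3 ms average seek time.
iops = estimated_iops(avg_seek_ms=3.0, rpm=15_000)
print(f"Estimated IOPS: {iops:.0f}")
print(f"Throughput at 4 KB blocks:  {throughput_mb_per_s(iops, 4):.2f} MB/s")
print(f"Throughput at 64 KB blocks: {throughput_mb_per_s(iops, 64):.2f} MB/s")
```

The same IOPS figure yields very different throughput depending on block size, which is one reason block size appears among the disk tuning variables.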
File systems are proprietary and each offers something to the market. You want to pick and choose the file system you'll use according to your needs and focus on how to maximize file system performance for that particular file system. Look, as well, at metadata performance: create, remove, and check as much as possible.
Caching can be done with RAM and some controllers. When used with controllers, it delays writes (which can leave you in a vulnerable position in the event of a crash), but can speed performance during read operations by reading ahead sequentially during each operation.
Bandwidth can often be a bottleneck and network performance hinges on it. Throughput can be increased by aggregating multiple resources to appear as one — using either bonding or teaming. Jumbo frames are very large Ethernet frames: by sending a lot at once, they are less processor intensive.
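The effect of jumbo frames on processing overhead can be estimated by counting how many frames a transfer needs at each frame size. The Python sketch below assumes a standard MTU of 1,500 bytes, a commonly used jumbo MTU of 9,000 bytes, and 40 bytes of IPv4 and TCP headers per frame with no options; the function name and the transfer size are illustrative.

```python
import math

STANDARD_MTU = 1500   # typical Ethernet MTU in bytes
JUMBO_MTU = 9000      # commonly used jumbo frame MTU in bytes
IP_TCP_HEADERS = 40   # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(payload_bytes, mtu, header_bytes=IP_TCP_HEADERS):
    """Number of frames required to carry payload_bytes of application data."""
    per_frame = mtu - header_bytes
    if per_frame <= 0:
        raise ValueError("MTU must be larger than the header overhead")
    return math.ceil(payload_bytes / per_frame)


transfer = 1_000_000  # hypothetical 1 MB transfer
print("Standard frames:", frames_needed(transfer, STANDARD_MTU))
print("Jumbo frames:   ", frames_needed(transfer, JUMBO_MTU))
```

Fewer frames means fewer headers to build and fewer per-frame interrupts to service, which is where the processor savings come from. Every device along the path must support the larger frame size for this to work.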
Network latency focuses on delays: PRTG's Network Latency Test is an example of a tool that can be used and it can be found here. Know that DSL and cable connections can have higher latency than a dedicated T-1.
Hop counts identify how many "stops" there are along the way and can be seen by using tracert. QoS (Quality of Service) is used to identify data and prioritize it, and is particularly useful when working with a load balancer to balance the load. Multipathing creates redundant routes, while scaling can be done vertically, horizontally, or diagonally.
Vertical scaling is also known as scaling up and involves adding resources (memory, processors, etc.) to one node. Horizontal scaling is also known as scaling out and it involves adding more nodes (think of a web farm). Diagonal scaling is a hybrid of the other two.
Testing techniques, naturally, differ based on what you want to test. You can test almost anything mentioned previously (latency, bandwidth, storage, load balancing, etc.), as well as replication (Repadmin.exe and Dcdiag.exe, part of Windows Support Tools, are the primary tools for monitoring replication with Windows Server), and application servers (closely monitor all event logs with Event Viewer and other tools). There are many load-testing services available that will allow you to test application performance.
A vulnerability assessment can, and should, be done to find the weaknesses before others do. This usually involves using a vulnerability scanner: a software application that checks networks, computers, or even applications for known security holes. Some of the most well-known vulnerability scanners are:
It is important to have separation of duties (SoD) during testing to prevent error and fraud. An additional benefit of having more than one person involved is a redundancy in knowledge and ability.
Business Continuity in the Cloud
The seventh and final domain — Business Continuity in the Cloud — has but two topic areas beneath it:
- Compare and contrast disaster recovery methods and concepts
- Deploy solutions to meet availability requirements
When you look at this domain, the first thing to notice is that it is the least heavily weighted, making up just 8 percent of the exam questions. Look closer, though, and you realize that with only two topic areas, each is worth 4 percent. That makes each one worth more than any other topic area on the exam, so understanding them fully is imperative as you study. That said, both of the topic areas are examined in order below.
Disaster Recovery Methods and Concepts
RAID was discussed in a previous article looking at the Infrastructure domain (RAID falls beneath exam objective 3.2), but the varieties to know are RAID 0, 1, 5, 6, 1+0, 0+1. Redundancy is often used synonymously with RAID, but it is important to not overlook redundancy in every component. Failover is the act of switching over to the redundant system.
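As a quick refresher on what the RAID levels listed above cost in capacity, the Python sketch below computes usable space for a set of identical disks. The function name and the example disk counts are illustrative, and the minimum disk counts reflect the usual textbook definitions; specific controllers may impose different limits.

```python
def usable_capacity(level, disks, disk_size_tb):
    """Return usable capacity in TB for identical disks at a given RAID level."""
    # level -> (minimum disks, function giving the number of disks' worth of usable space)
    rules = {
        "0": (2, lambda n: n),         # striping only, no redundancy
        "1": (2, lambda n: 1),         # mirroring: one disk's worth of space
        "5": (3, lambda n: n - 1),     # one disk's worth of parity
        "6": (4, lambda n: n - 2),     # two disks' worth of parity
        "1+0": (4, lambda n: n // 2),  # stripe of mirrors
        "0+1": (4, lambda n: n // 2),  # mirror of stripes
    }
    if level not in rules:
        raise ValueError(f"unknown RAID level: {level}")
    minimum, data_disks = rules[level]
    if disks < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} disks")
    if level in ("1+0", "0+1") and disks % 2:
        raise ValueError(f"RAID {level} needs an even number of disks")
    return data_disks(disks) * disk_size_tb


# Hypothetical array of four 2 TB disks.
for level in ("0", "5", "6", "1+0", "0+1"):
    print(f"RAID {level}: {usable_capacity(level, 4, 2)} TB usable")
```

With four 2 TB disks, RAID 0 offers all 8 TB but no redundancy, RAID 5 gives 6 TB, and RAID 6, 1+0, and 0+1 each give 4 TB.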
Failback is the process of returning operations to the original system or site once it has been restored after a failover. This too can involve replication. Site mirroring allows you to have another location that can be used if the primary one is destroyed, and that location will fall into one of three categories: hot, warm, or cold.
A hot site is a location that can provide operations within hours of a failure. This type of site would have servers, networks, and telecommunications equipment in place to reestablish service in a short time. Hot sites provide network connectivity, systems, and preconfigured software to meet the needs of an organization.
Databases can be kept up-to-date using network connections. These types of facilities are expensive, and they're primarily suitable for short-term situations. A hot site may also double as an offsite storage facility, providing immediate access to archives and backup media.
A hot site is also referred to as an active backup model. Many hot sites also provide office facilities and other services so that a business can relocate a small number of employees to sustain operations. Given the choice, every organization would choose to have a hot site. Doing so is often not practical, however, on the basis of cost.
A warm site provides some of the capabilities of a hot site, but it requires the customer to do more work to become operational. Warm sites provide computer systems and compatible media capabilities. If a warm site is used, administrators and other staff will need to install and configure systems to resume operations.
For most organizations, a warm site could be a remote office, a leased facility, or another organization with which yours has a reciprocal agreement. Another term for a warm site/reciprocal site is active/active model. Warm sites may be for your exclusive use, but they don't have to be.
A warm site requires more advanced planning, testing, and access to media for system recovery. Warm sites represent a compromise between a hot site, which is very expensive, and a cold site, which isn't preconfigured. An agreement between two companies to provide services in the event of an emergency is called a reciprocal agreement.
Usually, these agreements are made on a best-effort basis: There is no guarantee that services will be available if the site is needed. Make sure your agreement is with an organization that is outside your geographic area. If both sites are affected by the same disaster, the agreement is worthless.
A cold site is a facility that isn't immediately ready to use. The organization using it must bring along its equipment and network. A cold site may provide network capability, but this isn't usually the case; the site provides a place for operations to resume, but it doesn't provide the infrastructure to support those operations.
Cold sites work well when an extended outage is anticipated. The major challenge is that the customer must provide all the capabilities and do all the work to get back into operation. Cold sites are usually the least expensive to put into place, but they require the most advanced planning, testing, and resources to become operational — occasionally taking up to a month to make operational.
Almost anywhere can be a cold site; if necessary, users could work out of your garage for a short time. Although this may be a practical solution, it also opens up risks that you must consider. For example, while you're operating from your garage, will the servers be secure should someone break in?
Herein lies the problem. The likelihood that you'll need any of these facilities is low — most organizations will never need to use these types of facilities. The costs are usually based on subscription or other contracted relationships, and it's difficult for most organizations to justify the expense.
In addition, planning, testing, and maintaining these facilities is difficult; it does little good to pay for any of these services if they don't work and aren't available when you need them. One of the most important aspects of using alternative sites is documentation.
To create an effective site, you must have solid documentation of what you have, what you're using, and what you need in order to get by.
Geographical diversity in redundancy can keep your data accessible in the event a natural disaster affects one part of the country. A monitoring center, for example, might have its primary operations in Boulder, Colorado, and another center in Indianapolis, Indiana, providing redundancy in data storage and retrieval to help ensure uninterrupted access to the data.
Acronyms associated with redundancy and recovery are:
- RTO = Recovery Time Objective
- RPO = Recovery Point Objective
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Recovery
- (MTTF = Mean Time to Failure)
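Two of these acronyms combine into a widely used availability estimate: availability equals MTBF divided by the sum of MTBF and MTTR. The Python sketch below applies that formula and converts the result into expected downtime per year. The function names and the sample figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected hours of downtime in a 365-day year."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical system: fails on average every 999 hours, takes 1 hour to recover.
a = availability(999, 1)
print(f"Availability: {a:.3%}")
print(f"Expected downtime: {expected_downtime_hours_per_year(999, 1):.2f} hours/year")
```

Shortening recovery time (a lower MTTR) improves availability just as surely as making failures rarer (a higher MTBF), which is why both figures matter when you compare recovery options against an RTO.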
You always want to identify the mission critical requirements and make sure they are met at a minimum.
The key to fault tolerance is high availability. This can be helped by local clustering / geoclustering and a focus on non-high availability resources. It is conceivable that fault tolerance of hardware alone is not enough: to achieve high availability, make sure the path is not the single point of failure (SPoF) — use multiple paths to provide connectivity redundancy.
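The point about the path being a single point of failure can be shown with basic availability arithmetic: components in series multiply their availabilities, while redundant components in parallel fail only if all of them fail. The Python sketch below is a simplified model that assumes independent failures; the function names and the 99 percent figures are illustrative.

```python
import math


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def parallel_availability(single, copies):
    """Availability of redundant copies in which at least one must be up."""
    return 1 - (1 - single) ** copies


# Hypothetical figures: each server and each network path is 99% available.
clustered_servers = parallel_availability(0.99, 2)

one_path = series_availability([clustered_servers, 0.99])
two_paths = series_availability([clustered_servers, parallel_availability(0.99, 2)])

print(f"Clustered servers, single path:     {one_path:.4%}")
print(f"Clustered servers, redundant paths: {two_paths:.4%}")
```

In this model the clustered servers reach 99.99 percent on their own, but a single 99 percent path drags the overall figure back below 99 percent. Adding a second path brings it to roughly 99.98 percent.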
The goal of load balancing is to distribute the load across multiple systems to prevent overloading any one server. More on it can be found here.
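One of the simplest ways to distribute load is round robin, in which each new request goes to the next server in a fixed rotation. The Python sketch below is a minimal illustration of that one strategy, not a description of how any particular load balancer product works; the server names and request IDs are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin(servers, requests):
    """Assign each request to the next server in rotation.

    Returns a list of (request, server) pairs in arrival order.
    """
    if not servers:
        raise ValueError("at least one server is required")
    pool = cycle(servers)
    return [(request, next(pool)) for request in requests]


# Hypothetical web farm of three servers handling seven requests.
assignments = round_robin(["web1", "web2", "web3"], range(1, 8))
for request, server in assignments:
    print(f"request {request} -> {server}")

print(Counter(server for _, server in assignments))
```

Round robin ignores how busy each server actually is, so production load balancers often offer alternatives such as weighting or least-connections selection. The goal is the same in every case: no single server carries a disproportionate share.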
Summing It Up
There are seven domains on the CompTIA Cloud+ certification exam (CV0-001) and this month we walked through the topics covered on the sixth and seventh of them. Combined, all seven articles should aid in your study for the exam.