When it comes to trends, few match the speed with which higher education schools are adding data analytics — or something similar sounding — to their computer science or information technology programs. Sometimes this involves adding new courses on data visualization or governance to an existing curriculum.
At other times, it entails repackaging existing statistics courses to include more of these topics under a less drab-sounding title. The fuel behind this trend is employers reporting that they have difficulty finding job candidates with these skills, and that they are paying a premium for workers who do possess them.
Mirroring the trend seen in academia, certification providers are also looking to serve this market by creating new certifications to authenticate skills candidates possess, whether they pursue a formal education or not. One of the many vendors now offering something new in this space is tech industry association CompTIA.
CompTIA launched its new Data+ certification, the first in a planned quartet of data-centric credentials, in February of this year. Once earned, the certification is good for three years.
The Data+ certification requires passing a single exam (DA0-001; currently priced at $239) consisting of 90 questions that must be answered in 90 minutes. The minimum passing score is 675 on a scale of 100-900, and the questions are predominantly multiple-choice.
While no experience is required, it is suggested that candidates have worked between 18 and 24 months in an analyst role. A basic understanding of analytical tools, statistics, and data visualization is also recommended.
The five domains and their weighting, as well as objectives, are as follows:
Domain 1: Data Concepts and Environments — 15 percent
1.1 Identify basic concepts of data schemes and dimensions
Know the difference between relational and non-relational databases, as well as the snowflake and star schema concepts.
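A star schema can be sketched with Python's built-in sqlite3 module: one central fact table joined directly to denormalized dimension tables (a snowflake schema would further normalize the dimensions into sub-tables). All table and column names here are illustrative, not from the exam:

```python
import sqlite3

# Star schema: a central fact table referencing denormalized dimensions.
# (Table and column names are illustrative.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE fact_sales  (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO dim_date VALUES (1, '2022-02-28')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 9.99)")

# The typical star-schema query joins the fact table out to its dimensions.
row = conn.execute("""
    SELECT p.name, d.day, f.amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
""").fetchone()
```

In a snowflake variant, `category` would move out of `dim_product` into its own `dim_category` table, trading simpler storage for an extra join.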
1.2 Compare and contrast different data types
Think of programming and the various types of data that can be declared and the results of doing so (the amount of space allotted for each type, the way it can be compressed, etc.).
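The space-per-type point can be seen directly with Python's array module, where the declared type code fixes how many bytes each element occupies:

```python
from array import array

# The declared type controls how much space each element occupies:
# 'b' is a signed 8-bit integer, 'q' a signed 64-bit integer,
# and 'd' a 64-bit (double-precision) float.
small_ints = array('b', [1, 2, 3])   # 1 byte per element
big_ints   = array('q', [1, 2, 3])   # 8 bytes per element
floats     = array('d', [1.0, 2.0])  # 8 bytes per element
```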
1.3 Compare and contrast common data structures and file formats
Be able to explain the difference between structured and unstructured data, as well as common delimiters used within files (commas and tabs).
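The delimiter point is easy to see with Python's csv module: the same records, parsed with the right delimiter, yield identical structured rows whether they arrive comma- or tab-delimited:

```python
import csv
import io

# The same records, once comma-delimited and once tab-delimited.
comma_data = "id,name\n1,Ada\n2,Grace\n"
tab_data   = "id\tname\n1\tAda\n2\tGrace\n"

comma_rows = list(csv.reader(io.StringIO(comma_data)))
tab_rows   = list(csv.reader(io.StringIO(tab_data), delimiter="\t"))
```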
Domain 2: Data Mining — 25 percent
2.1 Explain data acquisition concepts
Know the various ways of collecting data including web scraping, surveys, sampling, and observations.
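Of the acquisition methods listed, sampling is the easiest to demonstrate in a few lines; this sketch uses Python's random module to draw a simple random sample without replacement (the population of 100 "respondents" is invented for illustration):

```python
import random

# Simple random sampling: draw a fixed-size sample without replacement.
population = list(range(1, 101))   # e.g. 100 survey respondents (illustrative)
random.seed(42)                    # fixed seed so the draw is repeatable
sample = random.sample(population, k=10)
```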
2.2 Identify common reasons for cleansing and profiling datasets
One of the primary reasons for cleansing a dataset is to reduce redundancy. Other issues include handling missing values, working with non-parametric values, handling outliers, and deciding how to treat invalid entries.
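A toy cleansing pass covering three of those issues (duplicates, missing values, invalid entries) might look like the following; the field names and the decision to impute a placeholder age are illustrative choices, not exam requirements:

```python
# Toy cleansing pass: drop exact duplicates, fill missing values,
# and reject invalid entries. (Field names are illustrative.)
raw = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},      # duplicate
    {"id": 2, "age": None},    # missing value
    {"id": 3, "age": -5},      # invalid entry
    {"id": 4, "age": 41},
]

seen, cleaned = set(), []
default_age = 0
for row in raw:
    key = (row["id"], row["age"])
    if key in seen:
        continue                          # redundancy: skip duplicates
    seen.add(key)
    if row["age"] is None:
        row = {**row, "age": default_age} # impute a placeholder value
    if row["age"] < 0:
        continue                          # drop invalid entries
    cleaned.append(row)
```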
2.3 Given a scenario, execute data manipulation techniques
Transposing data may be necessary to normalize it before it can be worked with. It may also need to be concatenated (or appended), merged, or otherwise manipulated before being analyzed.
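Transposing and concatenating can both be sketched in a few lines of plain Python; `zip(*rows)` is the idiomatic transpose of a list of rows, and list addition appends one extract to another (the quarterly figures are invented for illustration):

```python
# Transposing rows into columns with zip(*...).
rows = [(1, 2, 3),
        (4, 5, 6)]
columns = list(zip(*rows))   # each tuple is now one column of the original

# Concatenating (appending) two extracts into a single dataset.
q1 = [("Jan", 100), ("Feb", 120)]   # illustrative figures
q2 = [("Mar", 90)]
appended = q1 + q2
```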
2.4 Explain common techniques for data manipulation and query optimization
While much of the previous topic dealt with manipulation, optimization can be accomplished by indexing, creating subsets, or implementing temporary tables.
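Indexing and temporary tables can both be demonstrated with Python's built-in sqlite3 module; this is a minimal sketch with an invented orders table, using EXPLAIN QUERY PLAN to confirm the index is actually used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"cust{i % 10}", i * 1.5) for i in range(1000)])

# An index lets lookups on the indexed column avoid a full table scan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# A temporary table can hold an intermediate subset for later queries.
conn.execute("""CREATE TEMP TABLE big_orders AS
                SELECT * FROM orders WHERE total > 1000""")

# EXPLAIN QUERY PLAN shows whether the optimizer chose the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust3'"
).fetchone()
uses_index = "idx_orders_customer" in plan[-1]
```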
Domain 3: Data Analysis — 23 percent
3.1 Given a scenario, apply the appropriate descriptive statistical methods
While any number of data analysis tools can come up with remarkable findings, it is often necessary to offer a general description of the data set itself. Common variables to look at include the mean, median, mode, range, variance, and standard deviation.
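All six of those descriptive measures are available in (or trivially derived from) Python's standard statistics module; the data set here is invented for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data set

desc = {
    "mean":     statistics.mean(data),
    "median":   statistics.median(data),
    "mode":     statistics.mode(data),
    "range":    max(data) - min(data),
    "variance": statistics.pvariance(data),  # population variance
    "stdev":    statistics.pstdev(data),     # population standard deviation
}
```

Note the population/sample distinction: `statistics.variance` and `statistics.stdev` give the sample versions, which divide by n - 1 instead of n.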
3.2 Explain the purpose of inferential statistical methods
Many more inferential methods exist, but the ones CompTIA expects knowledge of are t-tests, Z-score, p-values, and Chi-squared. A knowledge of correlation and regression is also necessary, as is being able to identify the difference between Type I (false positive) and Type II (false negative) errors.
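Two of those methods reduce to short formulas that are worth internalizing: a Z-score is (x - mean) / standard deviation, and the chi-squared statistic sums (observed - expected)^2 / expected over all categories. Both can be computed by hand with the standard library (the data sets are invented for illustration):

```python
import statistics

# Z-score: how many standard deviations an observation sits from the mean.
data = [10, 12, 14, 16, 18]                   # illustrative observations
mu = statistics.mean(data)
sigma = statistics.pstdev(data)               # population standard deviation
z_scores = [(x - mu) / sigma for x in data]

# Chi-squared statistic for observed vs. expected counts (goodness of fit):
# sum over categories of (observed - expected)^2 / expected.
observed = [18, 22, 20, 40]                   # illustrative counts
expected = [25, 25, 25, 25]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```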
3.3 Summarize types of analysis and key analysis techniques
Trend analysis, performance analysis, link analysis, and exploratory analysis are the four to focus on for this topic.
3.4 Identify common data analytics tools
A basic knowledge of the existence of numerous tools is expected, as well as knowing which tool would be applicable for which analysis. Being a "vendor-neutral" exam does not mean that various vendors' tools never appear on it, just that no one vendor is favored over another.
Domain 4: Visualization — 23 percent
4.1 Given a scenario, translate business requirements to form a report
While any report created on the data needs to contain the pertinent information, it is important to realize that each report should be written to speak to a particular audience, and the language used (acronyms, industry-speak, etc.) should not be above the level of that audience.
4.2 Given a scenario, use appropriate design components for reports and dashboards
A standard report will include elements such as a cover page, version number, table of contents, body, and any appendices.
4.3 Given a scenario, use appropriate methods for dashboard development
Some considerations when designing a dashboard include whether the data shown will be live (dynamic) or not (static), and any filtering. It is also crucial to consider access permissions needed, approvals, and customer types (tailor the information for the viewer). Know, as well, that wireframes are traditionally used to mock up the dashboard during development.
4.4 Given a scenario, apply the appropriate type of visualization
Borrowing heavily from a similar topic on the Project+ exam, you need to be able to identify the differences between charts that can be produced. A quick Google search (or even Wikipedia) will show you the various chart types, but make sure you can recognize — and know the differences — between the following:
• Line chart
• Pie chart
• Bubble chart
• Scatter plot
• Bar chart
• Heat map
• Geographic map
• Tree map
• Stacked chart
• Word cloud
4.5 Compare and contrast types of reports
This is mostly common sense — know that some reports are needed more frequently than others, some focus on historic data while others are real-time, and much of the difference is driven by the specific purpose behind why each report was created.
Domain 5: Data Governance, Quality, and Controls — 14 percent
5.1 Summarize important data governance concepts
Collecting data is one thing, while protecting it and shielding it from unauthorized eyes is another. Much of the focus of this topic is similar to what might be found on the Security+ exam in terms of recognizing access and security requirements, storage options (as well as the pros and cons of each), and the classification of data into PII (personally identifiable information), PHI (protected health information), or PCI (payment card industry).
5.2 Given a scenario, apply data quality control concepts
Quality dimensions as well as rules and metrics come into play here. It is important to understand that there are five key dimensions data can be measured on: consistency, accuracy, completeness, integrity, and limitations.
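Two of those dimensions, completeness and consistency, lend themselves to simple automated checks. This toy sketch (field names and records are invented for illustration) measures completeness as the share of records with the required field populated, and flags a consistency violation when the same value appears in more than one representation:

```python
# Toy quality checks against two dimensions: completeness (no required
# field missing) and consistency (the same value represented the same
# way everywhere). Field names and records are illustrative.
records = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "us"},      # inconsistent casing
    {"id": 3, "country": None},      # incomplete record
]

complete = [r for r in records if r["country"] is not None]
completeness = len(complete) / len(records)

# If normalizing case shrinks the set of values, the raw data held
# multiple representations of the same value: a consistency failure.
raw_values = {r["country"] for r in complete}
normalized = {v.upper() for v in raw_values}
consistent = len(raw_values) == len(normalized)
```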
5.3 Explain master data management (MDM) concepts
Sort of a catchall, the goal is to understand how business and technology preserve shared master data assets through processes and standardization. This becomes essential when compliance issues enter the picture and/or mergers and acquisitions require multiple databases to be combined.
With only 19 topic areas beneath the five domains, this exam is one of the more compact of the CompTIA certification exams, and many of the topics are commonsensical. What can trip you up is the 20 different data analytics tools beneath objective 3.4, and the fact that a basic familiarity with each is expected.
The table below lists those, as well as a good starting point for additional information on each. It is worth noting that some minor discrepancies exist between the list in the CompTIA objectives and actual product names (Rapid mining versus RapidMiner, for example):
Data Analytics Tool — Overview Starting Point
Apex — https://apex.oracle.com/en/platform/powered-by-oracle/
AWS QuickSight — https://aws.amazon.com/quicksight/
BusinessObjects — https://www.sap.com/products/technology-platform/bi-platform.html
Datorama — This was acquired by Salesforce and has been rolled beneath its umbrella of products
Domo — https://www.domo.com/
IBM Cognos — https://www.ibm.com/products/cognos-analytics
IBM SPSS Modeler — https://www.ibm.com/products/spss-modeler
IBM SPSS — https://www.ibm.com/spss
Microsoft Excel — While knowledge of the whole program is helpful, the area to focus on is the Data Analysis ToolPak which (once installed) is found beneath the Data tab and allows for many types of statistical testing on the data.
MicroStrategy — https://www.microstrategy.com/en
Minitab — https://www.minitab.com/en-us/
Power BI — https://powerbi.microsoft.com/en-us/
Python — You are not expected to have mastered the entire programming language to sit for this exam, but rather to know that it is commonly used for data analysis purposes.
Qlik — https://www.qlik.com/us/
R — As with Python, you are not expected to have mastered all of the programming language, but to know that it is commonly used for data mining and statistical analysis.
Rapid mining — https://rapidminer.com/
SAS — https://www.sas.com/en_us/home.html
Stata — https://www.stata.com/
Structured Query Language (SQL) — Unlike Python and R, where only a general awareness is expected, here you should know the major command possibilities available in SQL.
Tableau — https://www.tableau.com/tableau-analysts
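For the SQL entry above, the "major command possibilities" boil down to a handful of verbs: CREATE, INSERT, SELECT (with WHERE, GROUP BY, and ORDER BY clauses), UPDATE, and DELETE. This sketch exercises each of them through Python's built-in sqlite3 driver, with an invented sales table:

```python
import sqlite3

# Exercise the core SQL verbs against an in-memory database.
# (The sales table and figures are illustrative.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100), ("East", 150), ("West", 85)])

conn.execute("UPDATE sales SET amount = amount + 10 WHERE region = 'West'")
conn.execute("DELETE FROM sales WHERE amount < 90")

# Aggregate with GROUP BY and sort with ORDER BY.
totals = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").fetchall()
```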
When looking at this lengthy list, it helps to compare it against the software CompTIA recommends to those studying for the exam. While the latter is not meant to be a comprehensive list, it shines a light on what should be used in a lab or practice environment, and that can signal what is likely to be weighted more heavily on the exam.
In this case, the recommended list includes SQL, Eclipse, Anaconda, R Studio, Microsoft Office Suite (for Excel), Tableau, and Power BI.