This feature first appeared in the Summer 2017 issue of Certification Magazine.
Historically, I've considered myself an "old school" database administrator (DBA). For me, database performance must be a second priority behind a tightly controlled schema that constrains data in tables and table relationships. #DataNerd
Alas, technology moves on, and today we find the relational model overshadowed by the faster, more scalable NoSQL databases and the realm of "big data." This makes sense in a way, given how enormous today's business data sets are. For instance, consider the human genome, or the petabytes of data collected by a worldwide network of appliance sensors.
For this article, we will begin by defining the five-dollar buzzwords you can expect to see in the big data space, chief among them "big data" and "NoSQL" themselves. We will continue by examining the top-level big data specializations. We will finish by aligning industry certifications with their corresponding big data specializations.
Defining our terms
Here is how MongoDB, the company behind the world's most popular non-relational database, defines big data:
Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently.
Put another way, the volume, velocity, or variety of data makes it, as they say, too hot to handle without specialized tools. Nowadays we have the Internet of Things (IoT), a massive network of systems that use sensors to gather and aggregate an unspeakable volume of data. For instance, Facebook accepts 500 terabytes of new data every day!
Add to the huge data volumes the velocity at which that data changes. I marvel at the data velocity behind online gaming systems like Xbox Live, in which millions of users around the world modify leaderboard statistics every second. Finally, there is data variety. From plaintext log files to binary multimedia files, today's databases need to be agile and possess dynamic schemas that adapt to ever-changing data definitions.
If you're familiar with relational database management systems, then you know the entire point of these databases is a controlled schema and strong data consistency. Even experienced systems administrators and DBAs consider scaling out relational systems into failover clusters a "dark art."
Engineers design big data database systems for speed, distributed processing, and failover redundancy. Whereas relational databases use Structured Query Language (SQL) as their data definition language, big data databases typically employ a Not Only SQL, or NoSQL, design pattern. It's a misnomer to read NoSQL as "Not SQL" because many big data databases use a SQL-like query syntax.
NoSQL databases have a limited or absent schema, and instead of relying upon related tables to build data models, they use one or more of the following methods:
- Key-value: Data is represented as a series of key-value pairs
- Document: Key-value pair data is organized into data structures called "documents"
- Wide column: Data is organized as column sets and distributed across multiple nodes
- Graph: Data is stored as a mesh network (for example, a Facebook user is linked to her friends, and the friends have friends, etc., forming a radial data graph)
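To make these models concrete, here is a minimal Python sketch (illustrative only; the record names and values are invented) contrasting how the same user record might look in a key-value store versus a document store. Plain Python dicts stand in for the database; a real document database such as MongoDB stores similar structures as BSON documents.

```python
# Key-value style: everything is addressed by a single opaque key,
# and the value is typically an uninterpreted blob or string.
kv_store = {
    "user:1001": '{"name": "Ada"}',
}

# Document style: key-value pairs are grouped into a richer structure.
# Documents can nest sub-documents and embed graph-like links.
user_doc = {
    "_id": 1001,
    "name": "Ada",
    "friends": [1002, 1003],      # links to other users, as in a social graph
    "logins": [                   # nested, schema-free sub-documents
        {"ts": "2017-06-01T08:00:00Z", "device": "xbox"},
    ],
}

# Documents in the same collection need not share the same fields --
# this is the "limited or absent schema" in action.
other_doc = {"_id": 1002, "name": "Grace", "country": "US"}

print(user_doc["friends"])  # → [1002, 1003]
```

Note that `other_doc` carries a `country` field that `user_doc` lacks; a relational table would force both rows into one fixed column set.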
Thus far I have spoken broadly; you may wonder who the major players in the big data space are. Let me sum it up for you and then we'll proceed into big data specializations:
- Relational databases: Microsoft SQL Server, Oracle Database, MySQL
- Document-oriented non-relational databases: MongoDB, CouchDB, RavenDB
- Wide-column store non-relational databases: Apache HBase (part of the Hadoop technology stack), Apache Cassandra
Of course, there are plenty of other technologies out there. For this article I have chosen to focus on the biggest of the big data players: Hadoop and MongoDB.
Big data specializations
At the highest level of abstraction, any big data solution needs to be highly available, fault tolerant, and able to turn massive data volumes into meaningful analysis.
After all, there is no point in deploying a big data database unless the system is highly available and can withstand failures. Likewise, a big data solution is worthless if it can aggregate huge data volumes but provides no way to search for trends and display meaningful reports.
The Engineering big data specialization is that of a big data architect. This job role involves designing big data solutions for maximum performance, scale, and reliability. Big data engineers understand how relational and non-relational systems work, how to operate in an open-source environment, and how to apply programming algorithms to solve business problems.
To that point, the big data engineer is a proficient programmer, particularly in Java.
The Administration big data specialization is responsible for implementing the big data solution proposed by the engineer. Remember that one of the biggest selling points of big data is high availability; thus, the big data administrator needs expert-level proficiency with network storage, failover clustering, and large-scale orchestration and configuration management platforms.
As an example, consider Apache Hadoop. Hadoop is not a single product but a portfolio of related technologies including, but not limited to:
- Hadoop Distributed File System (HDFS): High-performance storage format
- MapReduce: Distributed processing engine
- Spark: An alternative distributed processing engine that can run in other big data systems
- YARN: Cluster resource manager
- HBase: The actual wide-column store database
- Hive: Data warehouse layer and query language
- Pig: Data analysis programming/scripting language
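To give a feel for MapReduce's distributed processing model, here is a single-machine Python sketch of the classic word-count job. This is only the map/shuffle/reduce idea in miniature; production Hadoop jobs are typically written in Java and run across a cluster, with the framework handling the shuffle between nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # → 2
```

Each phase operates independently on its inputs, which is exactly what lets Hadoop spread the map and reduce work across many nodes.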
The Analytics big data specialization is a "Sherlock Holmes" job role. This person dives headfirst into the large data sets found in big data systems and seeks to isolate trends, hidden patterns, or other valuable data that the business can use to make better decisions.
The analytics specialization involves a fair bit of programming. The three "go to" programming languages used by big data analysts are Java, R, and Python.
The Data Science big data specialization is a work role similar to that of a statistician. The data scientist applies additional technologies such as machine learning and predictive modeling to unearth insights missed by analysts.
Moreover, the data scientist is responsible for generating "pretty picture" reports using industry leading data visualization software such as Tableau or QlikView.
The role of certification
You may consider a career in big data from one of several different viewpoints:
- Relational database administrator or developer who sees that big data is the way of the future
- IT Ops professional who wants to move into database design, development, and/or administration
- IT newcomer who wants to maximize their career prospects
Big data certification is weird (that's a technical term) inasmuch as most big data technology is open-source. Therefore, you'll find that big data certifications are tied to a particular vendor's implementation of a big data technology stack. Let's take a closer look at a few of the major big data certifications and frame them within their associated big data job specialization.
Cloudera is a software company that provides its own big data solution stack derived from the Apache Hadoop codebase. In fact, you will find that Cloudera, as well as most other big data technology providers, has employees who contribute to the underlying source projects.
Although Cloudera certifications use their own Hadoop implementation, you can claim validated Apache Hadoop expertise on your resume because Cloudera is Hadoop.
Cloudera Certified Associate (CCA) is the entry-level credential; your choices here are:
- CCA Spark and Hadoop Developer: fits the Engineer specialization
- CCA Data Analyst: fits the Analyst specialization
- CCA Administrator: fits the Administrator specialization
The Cloudera Certified Professional (CCP) tier has one title: CCP Data Engineer, which fits the Engineer specialization.
The Cloudera exams are online proctored exams that can be taken from the convenience of your home or office. The registration cost is $295 per exam, per attempt. The exams themselves consist of traditional multiple-choice items as well as practical items.
That's right — you will be connected to a live Hadoop cluster located in Cloudera's cloud, and then be required to solve a number of case study problems by manipulating the technology.
Certified Analytics Professional (CAP) Certification
The CAP is a globally-recognized data analytics credential created by the Institute for Operations Research and the Management Sciences (INFORMS), an international consortium of research professionals.
To earn the CAP credential, you need to pass a single, $695 computer-based exam that consists of 100 multiple-choice items. The prerequisites to qualify to take the exam, however, are steep:
- Bachelor's degree
- Five years of documented analytics experience
The advantages of the CAP are that it qualifies your big data analytics expertise in a vendor-neutral context, and that it's a prestigious title. The disadvantage is that you are unable to take the exam unless you fulfill the prerequisites.
Data Science Certifications
The entry-level data science credentials center on the most popular data visualization tools. Tableau Software offers Qualified Associate and Qualified Professional titles in their Tableau Desktop 10 and Tableau Server 10 products.
Qlik offers three certifications for their QlikView product: Designer, Developer, and System Administrator.
First sandbox, then certify
Since much of the big data technology out there is open-source, you don't have to get certified to jump in and mess around. My best advice to you is to dive into the Hadoop stack, dig around, and see where your interest and aptitude take you. Once you've determined which big data specialization matches your interests and goals, then you can map out a certification strategy.