Job profile: Become a data engineer
Posted on
November 12, 2019
by
What do you need to know to become a successful data engineer? And how do data engineers differ from data scientists?

A good data engineer is instrumental in feeding highly accurate information to data scientists. Data engineers are responsible for the creation and maintenance of the analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance, and testing of data architectures such as databases and large-scale processing systems.

As part of this larger responsibility, data engineers are also in charge of creating the data set processes used in modeling, mining, acquisition, and verification. As we explore this position, we will break down each of its duties and discover, perhaps, why someone might want to become a data engineer and what may be in store for the future of the profession.

What does a data engineer do?

First off, a data engineer handles Big Data infrastructure. What specifically that entails depends on what you use. If you are using AWS or Azure, then you will be responsible for spinning up servers, connectors, VPNs, gateways, Hadoop — name any cloud technology that allows the connection of people or consumers with their data and the data engineer is going to be responsible for that.

A data engineer straddles both the familiar engineering role, which typically addresses traditional network and server infrastructure, and the relatively new realm of big data. They also dance and dabble in the database programming realm, as well as in more traditional development areas. Because they feed and facilitate data scientists' insights, they are sometimes referred to as the data scientist's right-hand man.

Hard-core Hadoop work is usually the cutoff between data engineer and data scientist. Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of Big Data using the MapReduce programming model. It was originally designed for computer clusters built from commodity hardware. Big Data solves big problems.
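To make the MapReduce programming model a little more concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop or any Hadoop API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration. On a real cluster the framework would spread the map and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one file on the cluster.
documents = [
    "big data solves big problems",
    "data engineers feed data scientists",
]
counts = word_count(documents)
print(counts)
```

Each step only ever looks at one document or one key at a time, which is why the work can be split across commodity machines.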

While a data scientist handles the analysis of Big Data, a data engineer handles the care and grooming of the database. You can't have a great solution for Big Data unless your hardware is right and your database is operating correctly. Space is always going to be a factor when you are dealing with data. You, as a data engineer, will be responsible for this area of concern.

What about the data sets? That responsibility is shared between the data engineer and the data scientist. Grooming the data to understand what to pull and what not to pull is a full-time job. Optimizing queries so that each data set contains exactly what you need, and nothing more, takes real skill and tedious, focused work. As a quick aside, a data engineer also needs all of the soft skills that everyone in IT should continually work on.
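As a rough illustration of pulling only what you need, the sketch below compares a broad query with a narrow one using Python's built-in `sqlite3` module. The `orders` table, its columns, the sample rows, and the index name are all hypothetical. A real Big Data platform would use a different engine, but the idea of selecting fewer columns and filtering rows at the source carries over.

```python
import sqlite3

# In-memory database with a made-up table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, region, amount, notes) VALUES (?, ?, ?, ?)",
    [
        ("Acme", "EMEA", 250.0, "priority account"),
        ("Bolt", "APAC", 75.0, "new customer"),
        ("Cyan", "EMEA", 40.0, "trial order"),
        ("Dune", "EMEA", 120.0, "repeat customer"),
    ],
)

# An index on the filter column lets the engine avoid scanning every row.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")

# Broad pull: every column of every row, whether or not it is needed.
broad = conn.execute("SELECT * FROM orders").fetchall()

# Narrow pull: only the needed columns, filtered in the database itself.
narrow = conn.execute(
    "SELECT customer, amount FROM orders "
    "WHERE region = ? AND amount >= ? ORDER BY customer",
    ("EMEA", 100.0),
).fetchall()

print(len(broad), "rows pulled by the broad query")
print(narrow)
```

The narrow query hands the data scientist two small rows instead of the whole table, and the parameter placeholders keep the filter values out of the SQL string.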

Keeping up with the profession

The big idea behind Big Data analytics is fairly clear-cut: find interesting patterns hidden in large amounts of data, train machine learning models to spot those patterns, and put those models into production to act on them automatically. Understanding data is the most important job of any organization, and supporting all of the tools that facilitate that understanding is the primary role of the data engineer.

There are a few trends that are going to take off and only get bigger in regard to Big Data and the Data Engineer's position. Deep Learning will get deeper. Organizations will expand deep learning beyond its initial use cases, like computer vision and natural language processing (NLP), and find new and creative ways of implementing the powerful technology.

This will also prop up demand for GPUs, which are the favored processors for training deep learning models. It's unclear if new processor types, including ASICs, TPUs, and FPGAs, will become available. But there's clearly demand for faster training and inference too. Large financial institutions have already found that neural network algorithms are better at spotting fraud than traditional machine learning approaches, and the exploration into new use cases will continue in 2019.

As technology advances, the skills mix does too. In 2019, you can expect to see continued huge demand for anybody who can put a neural network into production. The cloud is big, and getting bigger. In 2018, the three biggest public cloud vendors grew at a rate approaching 50 percent. With an array of big data tools and technology — not to mention cheap storage for housing all that data — it will be hard to resist the allure of the cloud.

Certification and education

As a data engineer, what can/should you do to keep up on the emerging technology? Data engineering typically requires a more hybrid approach to education than other, more traditional careers. While a lawyer, for example, generally studies a particular branch of law, data engineers often start out with a broad-based computer science or information technology degree that is then enhanced by vendor-specific certification programs and training materials.

A few of the certifications that will put you further ahead in 2019 are as follows:

Probably the best one to get currently is Google's Professional Data Engineer. This certification establishes that the student is familiar with data engineering principles and can function as either an associate or a professional in the field.

Next we have IBM Certified Data Engineer: Big Data. This certification focuses more on Big Data-specific applications of data engineering skill sets rather than general skills, but is considered a gold standard by many.

The one I am currently attempting is the CCP Data Engineer from Cloudera. This one is specific to Cloudera's solutions and shows that the student has experience in ETL tools and analytics.

Secondary certifications, such as the MCSE (Microsoft Certified Solutions Expert), cover a wide range of topics, but have specific sub-certifications such as MCSE: Data Management and Analytics. I hold several of these, and they are very valuable.

Go for IT

While all of these are amazing to get, they don't replace a solid understanding of business requirements and customer service skills. Business acumen is an essential commodity for any employee, let alone for a data engineer trying to help their organization solve some complicated problems.

If you decide to become a data engineer, just know that the road is focused and narrow — but you will have plenty of fellow travelers and a bright light at the end of the journey. I wish you a safe journey and a happy end.

About the Author
Nathan Kimpel

Nathan Kimpel is a seasoned information technology and operations executive with a diverse background in all areas of company functionality, and a keen focus on all aspects of IT operations and security. Over his 20 years in the industry, he has held every job in IT and currently serves as a Project Manager in the St. Louis (Missouri) area, overseeing 50-plus projects. He has years of success driving multi-million dollar improvements in technology, products and teams. His wide range of skills includes finance, as well as ERP and CRM systems. Certifications include PMP, CISSP, CEH, ITIL and Microsoft.
