De Notes
Data engineering is the process of designing and building systems that let people collect
and analyze raw data from multiple sources and formats. These systems empower
people to find practical applications of the data, which businesses can use to thrive.
Data analysis is challenging because the data is managed by different technologies and
stored in various structures. Yet, the tools used for analysis assume the data is
managed by the same technology and stored in the same structure. This rift can cause
headaches for anybody trying to answer questions about business performance.
Consider the data a business holds about its customers, such as order records and
customer support records. Together, this data provides a comprehensive view of the
customer. However, these data sets are independent, which makes answering certain
questions, like what types of orders result in the highest customer support costs, very
difficult.
Data engineering unifies these data sets and lets you find answers to your questions
quickly and efficiently.
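As a rough illustration of why unified data helps, the sketch below joins two small,
made-up data sets to answer the question about support costs per order type. The field
names (order_id, order_type, cost) and the sample values are assumptions for the example,
not taken from any real system.

```python
from collections import defaultdict

# Hypothetical sample data: two independent data sets that share an order_id.
orders = [
    {"order_id": 1, "order_type": "subscription"},
    {"order_id": 2, "order_type": "one-time"},
    {"order_id": 3, "order_type": "subscription"},
]
support_tickets = [
    {"order_id": 1, "cost": 12.0},
    {"order_id": 1, "cost": 8.0},
    {"order_id": 2, "cost": 5.0},
    {"order_id": 3, "cost": 20.0},
]


def support_cost_by_order_type(orders, tickets):
    """Join tickets to orders on order_id and total the cost per order type."""
    type_by_order = {o["order_id"]: o["order_type"] for o in orders}
    totals = defaultdict(float)
    for ticket in tickets:
        order_type = type_by_order.get(ticket["order_id"])
        if order_type is not None:  # skip tickets with no matching order
            totals[order_type] += ticket["cost"]
    return dict(totals)


costs = support_cost_by_order_type(orders, support_tickets)
most_expensive = max(costs, key=costs.get)  # order type with the highest total cost
```

Once both data sets share a key and a format, the question reduces to a lookup and a sum.
With the data sets kept separate, the same question would need manual cross-referencing.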
Unifying data sets typically involves several steps:
Acquisition: Finding all the different data sets around the business
Cleansing: Finding and cleaning any errors in the data
Conversion: Giving all the data a common format
Disambiguation: Interpreting data that could be interpreted in multiple ways
Deduplication: Removing duplicate copies of data
Once these steps are complete, the data may be stored in a central repository such as a
data lake or data lakehouse. Data engineers may also copy and move subsets of the data
into a data warehouse.
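The cleansing, conversion, disambiguation, and deduplication steps listed above can be
sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the
records, the field names, and the two date formats are invented for illustration, and the
rule that slash-separated dates are day-first is an assumption made to resolve the
ambiguity.

```python
from datetime import datetime

# Hypothetical raw records gathered from two sources with different date formats.
raw_records = [
    {"email": " Ana@Example.com ", "signup": "2023-01-15"},
    {"email": "ana@example.com", "signup": "15/01/2023"},  # same person, other format
    {"email": "", "signup": "2023-02-01"},                 # missing email
    {"email": "bo@example.com", "signup": "02/03/2023"},   # 2 March or 3 February?
]

# Assumed formats. Treating slash dates as day-first is the disambiguation rule.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_iso_date(text):
    """Conversion: return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(records):
    """Cleanse, convert, and deduplicate records, keeping the first occurrence."""
    seen = set()
    result = []
    for rec in records:
        email = rec["email"].strip().lower()  # cleansing: trim and normalize case
        signup = to_iso_date(rec["signup"])   # conversion: one common date format
        if not email or signup is None:       # cleansing: drop unusable records
            continue
        key = (email, signup)
        if key in seen:                       # deduplication
            continue
        seen.add(key)
        result.append({"email": email, "signup": signup})
    return result


clean = prepare(raw_records)
```

Here the two records for the same person collapse into one only after cleansing and
conversion, which is why deduplication comes last in this sketch.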
Once a data set has been fully cleaned and formatted through data engineering, it’s
easier and faster to read and understand. Since businesses create data constantly, it’s
important to find software that will automate some of these processes.
The right software stack helps extract information and value from your data by moving it
through end-to-end journeys known as “data pipelines.” As the information travels
through a pipeline, it may be transformed, enriched, and summarized several times.
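One simple way to picture a pipeline is as a chain of functions, each taking the output
of the previous stage. The sketch below is illustrative only: the stage names mirror the
wording above, while the event fields and the country-to-region lookup are hypothetical.

```python
from functools import reduce


def transform(rows):
    """Transform: convert amounts from strings to floats."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


def enrich(rows):
    """Enrich: add a region from a hypothetical reference table."""
    region_by_country = {"DE": "EMEA", "US": "AMER"}
    return [
        {**r, "region": region_by_country.get(r["country"], "UNKNOWN")} for r in rows
    ]


def summarize(rows):
    """Summarize: total the amount per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals


def run_pipeline(data, stages):
    """Pass the data through each stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


events = [
    {"country": "DE", "amount": "10.5"},
    {"country": "US", "amount": "4.0"},
    {"country": "DE", "amount": "2.5"},
]
summary = run_pipeline(events, [transform, enrich, summarize])
```

Keeping each stage as a separate function makes it possible to reorder, test, or replace
stages independently, which is roughly what pipeline software automates at scale.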
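Data engineers work with a range of tools and technologies, including:
ETL Tools: ETL (extract, transform, load) tools move data between systems. They
extract data from a source, apply rules that “transform” it into a form more suitable
for analysis, and load the result into a destination system.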
SQL: Structured Query Language (SQL) is the standard language for querying
relational databases.
Python: Python is a general-purpose programming language. Data engineers may choose
to use Python for ETL tasks.
Cloud Data Storage: Cloud storage services such as Amazon S3, Azure Data Lake
Storage (ADLS), and Google Cloud Storage.
Query Engines: Engines run queries against data to return answers. Data engineers
may work with engines like Dremio Sonar, Spark, Flink, and others.
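To show how several of these tools fit together, here is a small ETL sketch that uses
only the Python standard library. It extracts rows from CSV text, transforms them, loads
them into a table, and then answers a question with SQL. The CSV contents and the orders
table are hypothetical, and the in-memory SQLite database is an assumption standing in
for a real warehouse or query engine.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from a source system.
RAW_CSV = """order_id,order_type,amount
1,subscription,19.99
2,one-time,5.00
3,subscription,19.99
"""


def extract(text):
    """Extract: read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast each field to the type the target table expects."""
    return [
        (int(r["order_id"]), r["order_type"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: create the table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, order_type TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite in memory stands in for the destination system (an assumption).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Query the loaded data with SQL: how many orders of each type?
result = conn.execute(
    "SELECT order_type, COUNT(*) FROM orders GROUP BY order_type ORDER BY order_type"
).fetchall()
```

In practice the source would likely be files in cloud storage and the destination a
warehouse or lakehouse, but the extract, transform, load, and query steps keep the same
shape.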