Data Integrity and Compliance

The document discusses the importance of data integrity for analysis. It provides examples of how data integrity can be compromised through replication, transfer, or manipulation of data. Issues like inconsistent date formats across a global company are used to illustrate how data integrity errors can occur and affect analysis. Checking data integrity is important before analysis to ensure the data is valid, complete, and consistent.

Uploaded by lamnt.vnu

Data integrity and compliance

This reading illustrates the importance of data integrity using an example of a global
company’s data. Definitions of terms that are relevant to data integrity will be
provided at the end.

A strong analysis depends on the integrity of the data. If the data you're using is
compromised in any way, your analysis won't be as strong as it should be. Data
integrity is the accuracy, completeness, consistency, and trustworthiness of data
throughout its lifecycle. That might sound like a lot of qualities for the data to live up
to. But trust me, it's worth it to check for them all before proceeding with your
analysis. Otherwise, your analysis could be wrong. Not because you did something
wrong, but because the data you were working with was wrong to begin with.

When data integrity is low, it can cause anything from the loss of a single pixel in an
image to an incorrect medical decision. In some cases, one missing piece can make all
of your data useless.

Data integrity can be compromised in lots of different ways. There's a chance data
can be compromised every time it's replicated, transferred, or manipulated in any
way.

• Data replication is the process of storing data in multiple locations. If you're
  replicating data at different times in different places, there's a chance your
  data will be out of sync. This data lacks integrity because different people
  might not be using the same data for their findings, which can cause
  inconsistencies.
• There's also the issue of data transfer, which is the process of copying data
  from a storage device to memory, or from one computer to another. If your
  data transfer is interrupted, you might end up with an incomplete data set,
  which might not be useful for your needs.
• The data manipulation process involves changing the data to make it more
  organized and easier to read. Data manipulation is meant to make the data
  analysis process more efficient, but an error during the process can
  compromise the efficiency.

Finally, data can also be compromised through human error, viruses, malware,
hacking, and system failures, which can all lead to even more headaches. I'll stop
there. That's enough potentially bad news to digest.
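
One common way to detect an interrupted transfer or an out-of-sync replica is to compare a checksum of the source with a checksum of the copy. The sketch below is only an illustration: the sample data and the function names `sha256_of` and `transfer_is_intact` are assumptions made for this example, not part of the original reading.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def transfer_is_intact(source: bytes, received: bytes) -> bool:
    """True only if the received copy is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(received)


# Hypothetical sample data standing in for a file being transferred.
source = b"order_id,order_date\n1,2020-10-12\n2,2020-12-10\n"
interrupted = source[:30]  # simulate a transfer that stopped partway

print(transfer_is_intact(source, source))       # True
print(transfer_is_intact(source, interrupted))  # False
```

A matching checksum only shows that the copy equals the source. It says nothing about whether the source itself was accurate to begin with.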

In a lot of companies, the data warehouse or data engineering team takes care of
ensuring data integrity. Checking data integrity is a vital step in processing your data
to get it ready for analysis, whether you or someone else at your company is doing it.

Scenario: calendar dates for a global company



Calendar dates are represented in a lot of different short forms. Depending on where
you live, a different format might be used.

• In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
• In other countries, the national standard is YYYY-MM-DD so October 12, 2020
  becomes 2020-10-12.
• In the United States, (MM/DD/YY) is the accepted format so October 12, 2020
  is going to be 10/12/20.

Now, think about what would happen if you were working as a data analyst for a
global company and didn’t check date formats. Well, your data integrity would
probably be questionable. Any analysis of the data would be inaccurate. Imagine
ordering extra inventory for December when it was actually needed in October!
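
To make the ambiguity concrete, the short sketch below parses the same string under the day-first and month-first conventions using Python's standard `datetime.strptime`. The variable names are illustrative only.

```python
from datetime import datetime

raw = "12/10/20"

# The same characters, read under two different conventions.
as_day_first = datetime.strptime(raw, "%d/%m/%y").date()    # DD/MM/YY
as_month_first = datetime.strptime(raw, "%m/%d/%y").date()  # MM/DD/YY

print(as_day_first.isoformat())    # 2020-10-12
print(as_month_first.isoformat())  # 2020-12-10
```

Both calls succeed without any error, which is why this kind of problem tends to go unnoticed unless the format is checked deliberately.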

A good analysis depends on the integrity of the data, and data integrity usually
depends on using a common format. So it is important to double-check how dates
are formatted to make sure what you think is December 10, 2020 isn’t really October
12, 2020, and vice versa.
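
One possible way to enforce a common format is to record which convention each data source uses and convert everything to YYYY-MM-DD on the way in. The sketch below assumes such a mapping exists. The `SOURCE_FORMATS` dictionary, the source labels, and the `to_iso` function are hypothetical names introduced for this example.

```python
from datetime import datetime

# Hypothetical mapping from a data source to the date format it is known to use.
SOURCE_FORMATS = {
    "day_first": "%d/%m/%y",    # DD/MM/YY
    "iso": "%Y-%m-%d",          # YYYY-MM-DD
    "month_first": "%m/%d/%y",  # MM/DD/YY
}


def to_iso(raw: str, source: str) -> str:
    """Parse raw using the format declared for its source and return YYYY-MM-DD.

    Raises KeyError for an unknown source and ValueError when raw does not
    match the declared format, so a mismatch is not silently accepted.
    """
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.date().isoformat()


records = [
    ("12/10/20", "day_first"),
    ("2020-10-12", "iso"),
    ("10/12/20", "month_first"),
]
normalized = [to_iso(raw, source) for raw, source in records]
print(normalized)  # ['2020-10-12', '2020-10-12', '2020-10-12']
```

The key design choice is that the format comes from what is known about the source, not from guessing based on the value, because a value such as 12/10/20 is valid under more than one convention.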

Here are some other things to watch out for:


• Data replication compromising data integrity: Continuing with the example,
  imagine you ask your international counterparts to verify dates and stick to
  one format. One analyst copies a large dataset to check the dates. But because
  of memory issues, only part of the dataset is actually copied. The analyst
  would be verifying and standardizing incomplete data. That partial dataset
  would be certified as compliant but the full dataset would still contain dates
  that weren't verified. Two versions of a dataset can introduce inconsistent
  results. A final audit of results would be essential to reveal what happened and
  correct all dates.
• Data transfer compromising data integrity: Another analyst checks the dates
  in a spreadsheet and chooses to import the validated and standardized data
  back to the database. But suppose the date field from the spreadsheet was
  incorrectly classified as a text field during the data import (transfer) process.
  Now some of the dates in the database are stored as text strings. At this point,
  the data needs to be cleaned to restore its integrity.
• Data manipulation compromising data integrity: When checking dates,
  another analyst notices what appears to be a duplicate record in the database
  and removes it. But it turns out that the analyst removed a unique record for a
  company’s subsidiary and not a duplicate record for the company. Your
  dataset is now missing data and the data must be restored for completeness.
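
Each of the three scenarios above could, in principle, be caught by a small check run before the data is trusted. The sketch below is one possible set of such checks. All function names, field names, and sample rows are hypothetical and chosen only for illustration.

```python
from datetime import date


def copy_is_complete(source_rows: list, copied_rows: list) -> bool:
    """Replication check: the copy must contain as many rows as the source."""
    return len(source_rows) == len(copied_rows)


def rows_with_text_dates(rows: list, field: str) -> list:
    """Transfer check: return the rows whose date field is not a date object."""
    return [row for row in rows if not isinstance(row[field], date)]


def is_true_duplicate(a: dict, b: dict) -> bool:
    """Manipulation check: only rows that match on every field are duplicates."""
    return a == b


# Replication: only part of the dataset was copied.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
partial_copy = source[:2]
print(copy_is_complete(source, partial_copy))  # False

# Transfer: one date arrived as a text string.
imported = [
    {"id": 1, "order_date": date(2020, 10, 12)},
    {"id": 2, "order_date": "2020-12-10"},
]
print([row["id"] for row in rows_with_text_dates(imported, "order_date")])  # [2]

# Manipulation: same name, but one record belongs to a subsidiary.
company = {"name": "Example Co", "entity": "company"}
subsidiary = {"name": "Example Co", "entity": "subsidiary"}
print(is_true_duplicate(company, subsidiary))  # False
```

A row count is a coarse check: it catches a truncated copy but not rows whose contents changed, so it complements a full audit and does not replace one.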

Conclusion

Fortunately, with a standard date format and compliance by all people and systems
that work with the data, data integrity can be maintained. But no matter where your
data comes from, always be sure to check that it is valid, complete, and clean before
you begin any analysis.

Reference: Data constraints and examples

As you progress in your data journey, you'll come across many types of data
constraints (or criteria that determine validity). The table below offers definitions and
examples of data constraint terms you might come across.

| Data constraint | Definition | Examples |
|---|---|---|
| Data type | Values must be of a certain type: date, number, percentage, Boolean, etc. | If the data type is a date, a single number like 30 would fail the constraint and be invalid |
| Data range | Values must fall between predefined maximum and minimum values | If the data range is 10-20, a value of 30 would fail the constraint and be invalid |
| Mandatory | Values can’t be left blank or empty | If age is mandatory, that value must be filled in |
| Unique | Values can’t have a duplicate | Two people can’t have the same mobile phone number within the same service area |
| Regular expression (regex) patterns | Values must match a prescribed pattern | A phone number must match ###-###-#### (no other characters allowed) |
| Cross-field validation | Certain conditions for multiple fields must be satisfied | Values are percentages and values from multiple fields must add up to 100% |
| Primary-key | (Databases only) value must be unique per column | A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program. |
| Set-membership | (Databases only) values for a column must come from a set of discrete values | Value for a column must be set to Yes, No, or Not Applicable |
| Foreign-key | (Databases only) values for a column must be unique values coming from a column in another table | In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table |
| Accuracy | The degree to which the data conforms to the actual entity being measured or described | If values for zip codes are validated by street location, the accuracy of the data goes up. |
| Completeness | The degree to which the data contains all desired components or measures | If data for personal profiles required hair and eye color, and both are collected, the data is complete. |
| Consistency | The degree to which the data is repeatable from different points of entry or collection | If a customer has the same address in the sales and repair databases, the data is consistent. |
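
Several of the constraints in the table can be expressed as simple programmatic checks. The sketch below is a minimal illustration in Python, not a full validation framework. The record fields (`signup_date`, `age`, `phone`, `consent`, `pct_online`, `pct_store`) and the function names are assumptions invented for this example, and the 10-20 range simply reuses the example from the table.

```python
import re
from datetime import date

PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")     # ###-###-####
ALLOWED_ANSWERS = {"Yes", "No", "Not Applicable"}    # set-membership values


def check_record(record: dict) -> list:
    """Return the names of the constraints that the record violates."""
    failures = []
    if not isinstance(record.get("signup_date"), date):
        failures.append("data type")
    age = record.get("age")
    if age is None:
        failures.append("mandatory")
    elif not 10 <= age <= 20:
        failures.append("data range")
    if not PHONE_PATTERN.fullmatch(str(record.get("phone", ""))):
        failures.append("regex")
    if record.get("consent") not in ALLOWED_ANSWERS:
        failures.append("set-membership")
    if record.get("pct_online", 0) + record.get("pct_store", 0) != 100:
        failures.append("cross-field")
    return failures


def duplicated_values(records: list, field: str) -> set:
    """Unique check: return the values of field that appear more than once."""
    seen, dupes = set(), set()
    for record in records:
        value = record[field]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


good = {"signup_date": date(2020, 10, 12), "age": 15, "phone": "555-010-1234",
        "consent": "Yes", "pct_online": 60, "pct_store": 40}
bad = {"signup_date": 30, "age": 30, "phone": "5550101234",
       "consent": "Maybe", "pct_online": 60, "pct_store": 30}

print(check_record(good))  # []
print(check_record(bad))
# ['data type', 'data range', 'regex', 'set-membership', 'cross-field']
print(duplicated_values([good, dict(good)], "phone"))  # {'555-010-1234'}
```

Returning the list of failed constraints, instead of stopping at the first failure, makes it easier to see everything that needs cleaning in a record at once.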
