Data Cleaning
April 2016
Dealing with messy data
(Figure: The four steps of data cleaning)

Measurement errors: Data is generally intended to measure some physical process, subjects or objects, i.e. the waiting time at the water point, the size of a population, the incidence of diseases, etc. In some cases, these measurements are undertaken by human processes that can have systematic or random errors in their design (i.e. improper sampling strategies) and execution (i.e. misuse of instruments, bias, etc.). Identifying and solving such inconsistencies goes beyond the scope of this document. It is recommended to refer to the ACAPS Technical Brief How sure are you? to get an understanding of how to deal with measurement errors.

A large part of data entry errors can be prevented by using an electronic form (e.g. ODK) and conditional entry.

Processing errors: In many settings, raw data are pre-processed before they are entered into a database. This data processing is done for a variety of reasons: to reduce the complexity or noise in the raw data, to aggregate the data at a higher level, and in some cases simply to reduce the volume of data being stored. All these processes have the potential to produce errors.
Data integration errors: It is rare for a database of significant size and age to contain data from a single source, collected and entered in the same way over time. Very often, a database contains information collected from multiple sources via multiple methods over time. An example is the tracking of the number of people affected throughout the crisis, where the definition of "affected" is being refined or changed over time. Moreover, in practice, many databases evolve by merging other pre-existing databases. This merging task almost always requires some attempt to resolve inconsistencies across the databases involving different data units, measurement periods, formats, etc. Any procedure that integrates data from multiple sources can lead to errors. The merging of two or more databases will both identify errors (where there are differences between the two databases) and create new errors (i.e. duplicate records).
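To make the duplicate-record risk concrete, below is a minimal sketch in Python/pandas; all file and column names are hypothetical. It merges two site lists and flags records that now appear twice.

    import pandas as pd

    # Two hypothetical site lists, collected by different teams.
    round1 = pd.read_excel("sites_round1.xlsx")
    round2 = pd.read_excel("sites_round2.xlsx")

    # Integrating the two sources appends the records of both...
    merged = pd.concat([round1, round2], ignore_index=True)

    # ...which can both reveal inconsistencies and create new errors,
    # e.g. the same site reported twice, possibly with different values.
    duplicates = merged[merged.duplicated(subset=["site_id"], keep=False)]
    print(duplicates.sort_values("site_id"))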
Table 1 below illustrates some of the possible sources and types of errors in a large assessment, at three basic levels: when filling in the questionnaire, when entering data into the database, and when performing the analysis.

Table 1: Sources of data error

Measurement stage
- Lack or excess of data: form missing; form double, collected repeatedly; answering box or options left blank; more than one option selected when not allowed.
- Outliers and inconsistencies: correct value filled out in the wrong box; not readable; writing error; answer given is out of the expected (conditional) range.

Entry stage
- Lack or excess of data: lack or excess of data transferred from the questionnaire; form or field not entered; value entered in the wrong field; inadvertent deletion or duplication during database handling.
- Outliers and inconsistencies: outliers and inconsistencies carried over from the questionnaire; value incorrectly entered, misspelling; value incorrectly changed during previous data cleaning; transformation (programming) error.

Processing and analysis stage
- Lack or excess of data: lack or excess of data extracted from the database; data extraction, coding or transfer error; deletions or duplications by the analyst.
- Outliers and inconsistencies: outliers and inconsistencies carried over from the database; data extraction, coding or transfer error; sorting errors (spreadsheets); data-cleaning errors.

Adapted from Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005).

Inaccuracy of a single measurement and data point may be acceptable, and related to the inherent technical error of the measurement instrument. Hence, data cleaning should focus on those errors that are beyond small technical variations and that produce a major shift within or beyond the analysis. Similarly, and under time pressure, consider the diminishing marginal utility of cleaning more and more compared to other demanding tasks such as analysis, visual display and interpretation.

- Understand when and how errors are produced during the data collection and workflow.
- Resources for data cleaning are limited. Prioritisation of errors related to population numbers, geographic location, affected groups and date is particularly important because such errors contaminate derived variables and the final analysis.

The following sections of this document offer a step by step approach to data cleaning.

C. First Things First

The first thing to do is to make a copy of the original data in a separate workbook and name the sheets appropriately, or save it in a new file. ALWAYS keep the source files in a separate folder and change their attribute to READ-ONLY, to avoid modification of any of the files.

D. Screening Data

To prepare data for screening, tidy the dataset by transforming the data into an easy to use format. Within a tidied dataset:

- Fonts have been harmonised
- Text is aligned to the left, numbers to the right
- Each variable has been turned into a column and each observation into a row
- There are no blank rows
- Column headers are clear and visually distinct
- Leading spaces have been deleted
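Where cleaning is scripted rather than done by hand, several of these tidying steps can be automated. A minimal sketch in Python/pandas, with hypothetical file and sheet names:

    import pandas as pd

    # Work on a copy of the data, never on the read-only source file.
    df = pd.read_excel("assessment_data_copy.xlsx", sheet_name="data")

    # Remove rows that are entirely blank.
    df = df.dropna(how="all")

    # Clean column headers: strip stray leading/trailing spaces.
    df.columns = df.columns.str.strip()

    # Delete leading/trailing spaces in every text column.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()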
Afterwards, examine data for the following possible errors:
A first option, pairwise deletion, uses all available data points and ignores only the values that are missing on some variables. In this case, pairwise deletion will result in different sample sizes for each variable. Pairwise deletion is useful when the sample size is small or if missing values are numerous, because there are not many values to start with.

A second option is to delete all cases with missing values. You are then left with complete data for all cases. The disadvantage of this approach is that the sample size of the data is reduced, resulting in a loss of statistical power and increased error in estimation (wider confidence intervals). It can also affect the representativeness of a sample: after removing the cases with non-random missing values from a small dataset, the sample size could be insufficient. In addition, results may be biased in the case of non-random missing values: the characteristics of cases with missing values may be different from those of cases without missing values.

Try conducting the same test using both deletion methods to see how the outcome changes. Note that in these techniques, "deletion" means exclusion within a statistical procedure, not deletion (of variables or cases) from the dataset.

Under certain conditions, maximum likelihood approaches have also proven efficient for dealing with missing data. This method does not impute any data, but rather uses all the data available for the specific cases to compute maximum likelihood estimates.

Detailing the technicalities, appropriateness and validity of each technique goes beyond the scope of this document. Ultimately, choosing the right technique depends on how much data are missing, why the data are missing, the patterns, randomness and distribution of missing values, the effects of the missing data, and how the data will be used for analysis. It is strongly recommended to refer to a statistician in the case of a small dataset with a large number of missing values.

Pragmatically, for needs assessments with few statistical resources, creating a copy of the variable and replacing missing values with the mean or median may often be enough, and preferable to losing cases in multivariate analysis from small samples.

There are several methods to deal with missing data, including deleting cases with missing values, imputing, and the maximum likelihood approach. However, providing an explanation of why data are missing ("women could not be interviewed", "the last questionnaire section could not be filled in due to lack of time") may be much more informative to end users than a plethora of statistical fixes.
- Set up a dummy variable with value 0 for those who answered the question and value 1 for those who did not. Use this variable to show the impact of different methods.
- Look for meaning in non-random missing values. Maybe the respondents are indicating something important by not answering one of the questions.
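For analysts working in a scripting environment, the options above can be compared side by side. A minimal sketch in Python/pandas, with hypothetical file and variable names; the choice of method should still follow the considerations discussed above.

    import pandas as pd

    df = pd.read_excel("assessment_data_copy.xlsx")  # hypothetical file

    # Dummy variable: 1 if the answer is missing, 0 otherwise,
    # to show the impact of the different methods.
    df["hh_income_missing"] = df["hh_income"].isna().astype(int)

    # Option 1 - pairwise deletion: each statistic uses all values
    # available for that variable (pandas skips NaN by default),
    # so sample sizes differ per variable.
    means_pairwise = df[["hh_income", "hh_size"]].mean()

    # Option 2 - listwise deletion: keep only complete cases, at the
    # cost of sample size and possibly representativeness.
    complete = df.dropna(subset=["hh_income", "hh_size"])

    # Pragmatic option: copy the variable and fill missing values
    # with the median, keeping the original column untouched.
    df["hh_income_imputed"] = df["hh_income"].fillna(df["hh_income"].median())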
Create a change log within the workbook, where all information related to modified fields is recorded. This will serve as an audit trail showing any modifications, and will allow a return to the original value if required. Within the change log, store the following information:
- Table (if multiple tables are implemented)
- Column, Row
- Date changed
- Changed by
- Old value
- New value
- Comments
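In a scripted workflow, the change log can be filled in automatically whenever a value is edited. The helper below is a sketch only, not an established API; the fields follow the list above, and all names are hypothetical.

    import pandas as pd
    from datetime import date

    change_log = []  # one dict per modification

    def set_value(df, row, column, new_value, changed_by, comment="", table="main"):
        """Change one cell and record the edit in the change log (sketch)."""
        old_value = df.at[row, column]
        df.at[row, column] = new_value
        change_log.append({
            "Table": table,
            "Column": column,
            "Row": row,
            "Date changed": date.today().isoformat(),
            "Changed by": changed_by,
            "Old value": old_value,
            "New value": new_value,
            "Comments": comment,
        })

    # Example: fix a spelling error, then save the audit trail
    # as its own sheet alongside the data.
    # set_value(df, 12, "district", "Bogra", "analyst A", "spelling fix")
    # pd.DataFrame(change_log).to_excel("change_log.xlsx", index=False)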
Make sure to document what data cleaning steps and procedures were implemented or followed, by whom, how many responses were affected, and for which questions. ALWAYS make this information available when sharing the dataset internally or externally (i.e. by enclosing the change log in a separate worksheet).

I. Adapt Process

Once errors have been identified, diagnosed, treated and documented, and if data collection/entry is still ongoing, the person in charge of data cleaning should give instructions to enumerators or data entry operators to prevent further mistakes, especially if errors are identified as non-random. Feedback will ensure common errors are not repeated and will improve the assessment validity and the precision of outcomes. Main recommendations or corrections can include:
- Programming of data capture, data transformations, and data extractions may need revision.
- Corrections of questions in the questionnaire form.
- Amendment of the assessment protocol, design, timing, enumerators training, data collection, and quality control procedures.
- In extreme cases, it may be necessary to re-conduct some field assessments (a few sites), or to contact key informants or enumerators again to ask for additional information or more details, or to confirm some records.

Data cleaning often leads to insight into the nature and severity of error-generating processes:
- Identify the basic causes of the errors detected and use that information to improve data collection and the data entry process, to prevent those errors from re-occurring.
- Reconsider prior expectations and/or review or update quality control procedures.

Recoding

Recoding can take several forms:
- Recoding a categorical variable (e.g. ethnicity, occupation, an "other" category, spelling corrections, etc.).
- Recoding a continuous variable (e.g. age) into a categorical variable (e.g. age group).
- Combining the values of a variable into fewer categories (e.g. grouping all problems caused by access constraints).
- Combining several variables to create a new variable (e.g. the food consumption score, building an index based on a set of variables).
- Defining a condition based on certain cut-off points (e.g. population "at risk" vs. "at acute risk").
- Changing a level of measurement (e.g. from interval to ordinal scale).

Conceptually, a distinction is needed between:
- Activities related to recoding qualitative data, i.e. responses to open questions.
- Activities that include transforming and deriving new values out of others, such as creating calculations (i.e. percentages), parsing, merging, etc. Here, the analyst is re-expressing what the data says (i.e. re-expressing deviation as a % change, weighted or moving average, etc.). The data has (normally) already gone through a cleaning stage before being transformed.

For both types, recoding variables or values can serve both the purpose of cleaning dirty data and/or transforming clean data. This section focuses primarily on the former rather than on the re-expression of values, which will be tackled more extensively in another chapter of the data handbook on data transformation.

Recoding categorical variables starts with a full listing of all variants generated by a variable, together with their frequencies. The variant list can be copied into a new sheet, to create a table of variants and their desired replacements. ALWAYS keep a copy of the original values, and try out different recoding schemes before settling on a final one.
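In a scripting environment, the same listing-and-replacement approach might look like the minimal Python/pandas sketch below; the variable, its variants and the file name are hypothetical.

    import pandas as pd

    df = pd.read_excel("assessment_data_copy.xlsx")  # hypothetical file

    # Full listing of all variants of the variable, with frequencies:
    # the starting point for building the recoding table.
    print(df["water_source"].value_counts(dropna=False))

    # Table of variants and their desired replacements.
    recode_map = {
        "tubewell": "Tube well",
        "Tube-well": "Tube well",
        "river/stream": "Surface water",
    }

    # ALWAYS keep a copy of the original values before recoding.
    df["water_source_original"] = df["water_source"]
    df["water_source"] = df["water_source"].replace(recode_map)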
…data that have been entered after will not be correct.
- Planning and budgeting for data cleaning is essential.
- Organising data improves efficiency, i.e. by sorting data on location or records by enumerator.
- Prevention is better than cure. It is far more efficient to prevent an error than to have to find it and correct it later.
- The responsibility for generating clean data belongs to everyone: enumerators, custodians and users.
- Prioritisation reduces duplication. Concentrate on those records where extensive data can be cleaned at the lowest cost, or that are of most value to end users.
- Feedback is a two-way street: data users or analysts will inevitably carry out error detection and must provide feedback to data custodians. Develop feedback mechanisms and encourage users to report errors.
- Education and training improve techniques: poor training of enumerators and data entry operators is the cause of a large proportion of the errors. Train on quality requirements (readability, etc.) and documentation.
- Data cleaning processes need to be transparent and well documented, with a good audit trail, to reduce duplication and to ensure that, once corrected, errors never re-occur.
- Documentation is the key to good data quality. Without good documentation, it is difficult for users to determine the appropriateness of the data for use, and difficult for custodians to know what data quality checks have been carried out, and by whom.

N. Tools and Tutorials for Data Cleaning

Spreadsheets such as Excel offer the capability to easily sort data, calculate new columns, move and delete columns, and aggregate data. For data cleaning of humanitarian assessment data, ACAPS developed a specific Technical Note providing a step by step approach in Excel and detailing cleansing operations, supported by a demo workbook.

For generic instructions about how to use Excel formulas, functionalities or options to clean data, several Microsoft Office guidance notes are available:
- Spell checking
- Removing duplicate rows
- Finding and replacing text
- Changing the case of text
- Removing spaces and nonprinting characters from text
- Fixing numbers and number signs
- Fixing dates and times
- Merging and splitting columns
- Transforming and rearranging columns and rows
- Reconciling table data by joining or matching

Third-party providers

OpenRefine (ex-Google Refine) and LODRefine are powerful tools for working with messy data, cleaning it, or transforming it from one format into another. Videos and tutorials are available to learn about the different functionalities offered by this software. The facets function is particularly useful, as it can very efficiently and quickly give a feel for the range of variation contained within the dataset.

Detailed data cleansing tutorials and courses are also available at the School of Data:
http://schoolofdata.org/handbook/recipes/cleaning-data-with-spreadsheets/
http://schoolofdata.org/handbook/courses/data-cleaning/

Two specialised tools to accomplish many of these tasks are used in ACAPS. The first is Trifacta Wrangler, the new version of Data Wrangler by the Stanford Visualization Group. Trifacta Wrangler is a user-friendly tool that can automatically find patterns in the data based on what is selected, and automatically makes suggestions of what to do with those patterns. Beautiful and useful. The other cleaning star is Data Monarch from Datawatch, which integrates many wrangling, cleaning and enrichment functionalities that can take hours in Microsoft Excel.

O. Sources and Background Readings

ACAPS. 2013. How to Approach a Dataset – Preparation. Available at: http://www.acaps.org/resourcescats/downloader/how_to_approach_a_dataset_part_1_data_preparation/163/1375434553
And its auxiliary workbook, available at:
http://www.acaps.org/resourcescats/downloader/how_to_approach_a_dataset_data_management/164

ACAPS. 2012. Severity Rating, A Data Management Note. http://www.acaps.org/resourcescats/downloader/severity_rating_data_management_note/87/1376302232

Benini, A. 2011. Efficient Survey Data Entry – A Template for Development NGOs. Friends in Village Development Bangladesh (FIVDB). http://aldo-benini.org/Level2/HumanitData/FIVDB_Benini_EfficientDataEntry_110314.pdf

Buchner, D. M. Research in Physical Medicine and Rehabilitation. http://c.ymcdn.com/sites/www.physiatry.org/resource/resmgr/pdfs/pmr-viii.pdf

Chapman, A. D. 2005. Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data. http://www.gbif.org/orc/?doc_id=1262

Den Broeck, J. V., Cunningham, S. A., Eeckels, R., Herbst, K. 2005. Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities. South Africa, Africa Centre for Health and Population Studies.

Limaye, N. 2005. Clinical Data Management – Data Cleaning.

Henning, J. 2009. Data Cleaning. http://blog.vovici.com/blog/bid/19211/Data-Cleaning

Joint IDP Profiling Service (JIPS). Retrieved July 2013. Manual Data Entry Staff. http://jet.jips.org/pages/view/toolmap

Data Cleaning Guidelines (SPSS and Stata). http://fsg.afre.msu.edu/survey/Data_Cleaning_Guidelines_SPSS_Stata_1stVer.pdf

Munoz, J. 2005. A Guide for Data Management of Household Surveys. Santiago, Chile, Household Sample Surveys in Developing and Transition Countries. http://unstats.un.org/unsd/hhsurveys/

Osborne, J. W. 2013. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. California, SAGE.

Psychwiki. Retrieved 7 September 2009. Identifying Missing Data. http://www.psychwiki.com/wiki/Identifying_Missing_Data

Psychwiki. Retrieved 11 September 2009. Dealing with Missing Data. http://www.psychwiki.com/wiki/Dealing_with_Missing_Data

Psychwiki. Retrieved 7 September 2009. Missing Values. http://www.psychwiki.com/wiki/Missing_Values

Sana, M., Weinreb, A. A. 2008. Insiders, Outsiders, and the Editing of Inconsistent Survey Data. Sociological Methods & Research, Volume 36, Number 4, SAGE Publications. http://www.academia.edu/1256179/Insiders_Outsiders_and_the_Editing_of_Inconsistent_Survey_Data

The Analysis Institute. 2013. Effectively Dealing with Missing Data without Biasing Your Results. http://theanalysisinstitute.com/missing-data-workshop/

Wikipedia. Retrieved 31 July 2013. Data Cleansing. http://en.wikipedia.org/wiki/Data_cleansing
Annex 1 – Checklist for Data Cleaning

Prepare for data cleaning
- Planning is essential. Make sure tools, material and contacts for cleaning data are available:
  o The questionnaire forms
  o The contacts of team leaders or enumerators, in case they need to be contacted for questions
  o The original database
  o A translator, if necessary
  o Visual analysis software (i.e. Tableau Public)
  o Spreadsheet (Excel) or database (Access, Stata, etc.) software.
  o Some would add coffee and music, and a place without noise and disturbance.
- Identify the data custodian. He/she will generally be responsible for managing and storing the data, as well as for the supervision of the data cleaning, the consolidation of the changes, and the update and maintenance of the change log.

Establish, document and communicate
- Train the data entry operators on how the questionnaire is populated. Explain the instructions given to enumerators. If possible, include data entry staff in the data collectors' training to facilitate internal communication.
- Establish decision rules for when to change a value and when NOT to.
- Establish procedures to document data that was modified or not collected, i.e. "missing" or "not collected".
- Explain how to use the change log file.
- Communicate to data entry operators or analysts the procedures to be followed and who to inform of detected errors.
- Establish communication channels for communicating detected errors. Written communication is recommended.
- For rapid assessments, where data analysis, mapping and visualisation generally coincide with data entry and cleaning, communicate regularly to analysts, GIS officers and graphic designers which parts of the datasets are clean and usable.
- Establish clear reporting procedures in case additional errors are identified. Plan with the team which variables are a priority for cleaning.

Review records
- If a sampling strategy was used, the records must be verified first. Verify that all the sites have been entered, including those where the assessment was not completed (this is not relevant in the case of purposive sampling). Compare records with the assessment teams' field trip reports or the spreadsheet where you tracked the visited locations.
- Assign and check a unique ID for each site or household (see the sketch at the end of this annex).
- Check for duplicate cases as a regular routine for each of the data rows. Remove any blank cases where the key variables have been entered but there are no data in any of the variables. Verify first that the blank cases should be removed and how this could affect other data in the row.

Screen, diagnose and treat data
- First clean filter questions, i.e. when the population is asked if they did or had a particular activity based on a response (yes/no). In that case there should be data in the following table in the questionnaire (or column in the database) if the response is "yes", and no data if the response is "no".
- Review the skip rules within the questionnaire and run checks in the database to look for invalid or missing values in variables based on the skip rules (a scripted check is sketched at the end of this annex).
- Clean questions with min or max response values ("tick three options only", "what are the top three priorities among the 5 following choices", etc.).
- Inspect the remaining variables sequentially and as they are recorded in the data file. Create a general summary table of descriptive statistics, where for each variable the min, max, mean, median, sum and count are available.

(Screenshot of summary statistics table from Aldo Benini, ACAPS Technical Note on how to approach a dataset, preparation)

- If the variable is a categorical/qualitative variable, check if spelling is consistent and run a frequency count:
  o Look at the counts to see if they are reasonable for the sample – is the set of data complete?
  o All values should have labels if the variable is categorical. Check the range of values.
- If the variable is a continuous/interval variable, run descriptive statistics such as min, max, mode, mean and median.
  o Look at minimum and maximum values. Are they reasonable? Look especially at whether "0" values are really "0" and not missing values.
  o Are the mean and median as expected?
- Inspect data for missing values (blanks, explicit missing-value codes). Decide:
  o Which blank cells need to be filled with zeros (because they represent genuine negative observations, such as "no", "not present", "option not taken", etc.)
  o Which to leave blank (if the convention is to use blanks for missing or not applicable)
  o Which to replace with some explicit missing-value code (if we want all missing values to be explicitly coded).
- Verify that in binary variables (yes/no), the positive value is coded as "1" and the negative as "0".
- Check the distribution of the values (use box plots if available). Look at the extremes and check them against the questionnaire, even if the value is possible and may seem reasonable. If one value is an extreme, other variables in the same record may be incorrect as well. Look out for the 5 smallest/largest values.
- Compare the data between two or more variables within the same case to check for logical issues. I.e., can the head of the household be less than 17 years old? Compare age with marital status: is the person too young to have been married? Do the proportions sum up to 100%?
- Where there are questions asking about a "unit", the data must be standardised to a specific unit, i.e. when a response is collected using the unit specified by the respondent. For instance, units for area can be acre, hectare and square metre. To standardise the area unit, a lookup table can be used to merge in the conversion value to convert all areas to hectares (a scripted sketch is given at the end of this annex).
- Check for consistency within a set of cases: if there is a spouse, it is expected that the spouse will be of a different gender. The child of the head of household is not expected to be older than the head. The parent of the head cannot be younger than the head.
- Recode variables. Replace unhelpful entries (e.g. misspellings, verbose descriptions, the category "others", etc.) with more suitable variants, in a consistent manner. Reasons for recoding are: spelling corrections, date (day, month, year) formatting, translation, language style and simplification, clustering, pre-fixes to create better sorting in tables, combination (in categorical variables), rounding (in continuous variables), and possibly others.
- Sort the file in various ways (by individual variables or groups of variables) to see if data errors that were not found previously can be identified.

Final considerations
- If the data are cleaned by more than one person, the final step is to merge all the spreadsheets together so that there is only one database. The comments or change logs made as the cleaning progresses should be compiled into one document. Problem data should be discussed in the documentation file.
- Update cleaning procedures, the change log and the data documentation file as the cleaning progresses.
- Provide feedback to enumerators, team leaders or data entry operators if the data collection and entry process is still ongoing. If the same mistakes are made by one team or enumerator, make sure to inform the culprit.
- Be prepared: data cleaning is a continuous process. Some problems cannot be identified until analysis has begun. Errors are discovered as the data is manipulated by analysts, and several stages of cleaning are generally required as inconsistencies are discovered. In rapid assessments, it is very common that errors are detected even during the peer review process.
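The sketch below, referenced from several items in this checklist, illustrates a few of these checks in Python/pandas. All file, column and category names are hypothetical; the area conversion factors are standard values.

    import pandas as pd

    df = pd.read_excel("assessment_data_copy.xlsx")  # hypothetical file

    # Unique ID check: list any site IDs that occur more than once.
    print(df[df["site_id"].duplicated(keep=False)])

    # Skip rule check: if "has_latrine" is "no", the follow-up
    # question "latrine_type" should be empty.
    print(df[(df["has_latrine"] == "no") & df["latrine_type"].notna()])

    # Summary table of descriptive statistics for screening.
    summary = df.select_dtypes("number").agg(
        ["min", "max", "mean", "median", "sum", "count"])
    print(summary)

    # Standardise areas to hectares by merging in a lookup table
    # of conversion factors.
    to_hectare = pd.DataFrame({
        "area_unit": ["hectare", "acre", "square metre"],
        "factor": [1.0, 0.4047, 0.0001],
    })
    df = df.merge(to_hectare, on="area_unit", how="left")
    df["area_ha"] = df["area"] * df["factor"]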
Requirements
- Assessment and survey experience
- Large scale data entry experience required

Education
- Degree in statistics or demographics and/or a degree in IT

Experience
- 2-3 years of experience with a statistics institute and/or relevant work experience
- Proven experience with data cleansing and management of large volumes of quantitative and qualitative data
- Proven experience with the management and operation of databases

Language
- Fluent in written and spoken English (or the survey language used)

Skills
- Professionalism
- Excellent written and oral communication skills
- Good knowledge of word processing software (Word, Excel, PowerPoint, email)
- Understanding of the principles of statistical and demographic analysis
- Understanding of survey techniques
- Excellent report drafting skills
- Strong typing skills
- Strong proofreading skills
- Excellent command of IT tools; high level of computer literacy
- Rigour and accuracy
- Proven ability to meet deadlines; ability to work well under pressure
- Good interpersonal skills and ability to work in a multi-cultural environment; strong ability to work in teams
- Experience working with the international humanitarian community is an advantage
Education
- Secondary education; diploma in information/data management an asset

Experience
- 1-2 years of experience with a statistics institute and/or relevant work experience
- Proven experience with data entry and management of large volumes of quantitative and qualitative data
- Proven experience with the management and operation of databases

Language
- Fluent in written and spoken English (or the survey language used)

Skills
- Strong typing skills
- Data entry skills
- Strong proofreading skills
- Analytical skills
- Excellent command of IT tools; high level of computer literacy
- Rigour and accuracy
- Proven ability to meet deadlines
- Good interpersonal skills and ability to work in a multi-cultural environment
- Experience working with the international humanitarian community is an advantage
Requirements
- Assessment and survey experience
- Large scale data entry experience required

Education
- Degree in statistics or demographics and/or a degree in IT

Experience
- 3-5 years of experience with a statistics institute and/or relevant work experience
- Proven experience with data entry and management of large volumes of quantitative and qualitative data
- Proven experience with the management and operation of databases

Language
- Fluent in written and spoken English (or the international language used)

Responsibilities
- Ensuring that procedures for checking, coding and entering data are followed
- Monitoring data entry staff
- Checking the quality of the work conducted by data entry staff during data checking, coding and entry, and providing all assistance necessary
- Keeping a documented overview of the daily work; producing a daily report on data checking, coding and entry
- Writing procedures for data cleaning and editing; supervising data cleaning; consolidating the data change logs from data entry operators; documenting data problems; updating the master database regularly with the latest changes
- Ordering questionnaires and returning them to the archives after the data has been entered
- Ensuring technical documents are kept in good condition
- Ensuring working hours are respected, as well as order and discipline in the workplace