Research Guides: Data Applications Services: Data Wrangling

What Is Data Wrangling?

Data is often messy. It can have missing values or incorrect formatting. It may need to be merged from two or more sources, or have its number of variables reduced. In fact, it has been estimated that cleaning and preparing data accounts for upwards of 80% of the data analysis process (Dasu and Johnson, 2003). This process of cleaning and preparing data for analysis is known as data wrangling (sometimes called data munging or data remediation). Data wrangling can be time-consuming and arduous, but it is a critical stage in the data analysis process because analyzing bad data leads to unreliable results. Schedule a consultation for more information or assistance with data wrangling.

There are four main steps in the data wrangling process:

Discovery
Transformation
Validation
Publishing

1. Discovery

In the discovery step, you decide what kind of data you need to answer your research questions. You then locate that data and examine it briefly to work out how you will prepare it for analysis in the data wrangling steps that follow.
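A cursory examination can be as simple as counting rows and empty cells. The sketch below is only an illustration using Python's standard library; the sample data, the column names, and the `profile` helper are all hypothetical and stand in for whatever file you are actually inspecting.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for a real file on disk.
SAMPLE_CSV = """id,age,state
1,34,OH
2,,oh
3,29,
3,29,
"""

def profile(rows):
    """Return the number of rows and the count of empty cells per column."""
    missing = Counter()
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
    return total, dict(missing)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
row_count, missing_by_column = profile(rows)
print(row_count, missing_by_column)
```

Even this small profile hints at work for the later steps: empty cells in two columns, inconsistent capitalization in `state`, and a repeated row.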

Here are some resources for finding data and creating a data management plan:

Find Data

Create a Data Management Plan

2. Transformation

The transformation step consists of three components: structuring, cleaning, and enriching.

Data structuring

The raw data you have gathered needs to be structured into a format that is compatible with the analytical model you will use to interpret it. For instance, if you are merging two or more datasets, make sure the merged result is in a form that fits your analytical approach.
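As one possible illustration of structuring, the sketch below gives two sources a shared set of column names and then merges them on a common key. The records, the column names, and the `rename` and `inner_join` helpers are hypothetical, and the join assumes each key appears only once in the second dataset.

```python
# Hypothetical records from two sources that describe the same people
# but use different column names.
survey = [{"ID": "1", "Age": "34"}, {"ID": "2", "Age": "41"}]
clinic = [{"person_id": "1", "visits": "3"}, {"person_id": "3", "visits": "1"}]

def rename(rows, mapping):
    """Rename columns according to mapping, leaving other columns unchanged."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

def inner_join(left, right, key):
    """Keep only rows whose key appears in both datasets, combining their columns.

    Assumes key values are unique in right.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

survey_std = rename(survey, {"ID": "id", "Age": "age"})
clinic_std = rename(clinic, {"person_id": "id"})
merged = inner_join(survey_std, clinic_std, "id")
print(merged)
```

An inner join drops people who appear in only one source. Whether that is appropriate depends on your analytical approach; a different kind of join may fit better.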

Data cleaning

Now that your data is all together, you need to remove the errors that would distort your analysis. Typical steps include deleting duplicate observations, removing observations with missing values, and standardizing inputs. The goal is to leave as few errors as possible.
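The cleaning steps named above can be sketched in a few lines. This is a minimal example under stated assumptions: the records, the column names, and the `clean` helper are hypothetical, and dropping incomplete rows is only one way to treat missing values.

```python
def clean(rows, required, uppercase):
    """Standardize text, drop rows missing a required value, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Standardize inputs: trim whitespace and unify capitalization.
        tidy = {k: (v or "").strip() for k, v in row.items()}
        for column in uppercase:
            tidy[column] = tidy[column].upper()
        # Remove observations with a missing value in a required column.
        if any(tidy[column] == "" for column in required):
            continue
        # Delete duplicate observations (compared after standardizing).
        fingerprint = tuple(sorted(tidy.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned

# Hypothetical raw records.
raw = [
    {"id": "1", "age": "34", "state": "OH "},
    {"id": "1", "age": "34", "state": "oh"},
    {"id": "2", "age": "", "state": "MI"},
    {"id": "3", "age": "29", "state": "mi"},
]
cleaned = clean(raw, required=["id", "age"], uppercase=["state"])
print(cleaned)
```

Standardizing before checking for duplicates matters here: the first two records only look identical once the stray space and the lowercase state code are fixed.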

Enriching data

Your data is now in a more usable form. However, now you need to decide whether you have all of the data needed for your analysis. Your data may be enriched or augmented by adding values from other datasets, if needed. If you do enrich your data, you will need to repeat the transformation step to ensure that your data is suitable for analysis.
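Enrichment often amounts to looking up extra values in another dataset. In the sketch below, the lookup table, the records, and the `enrich` helper are all hypothetical and serve only to show the idea.

```python
# Hypothetical lookup table taken from a second dataset.
REGION_BY_STATE = {"OH": "Midwest", "MI": "Midwest", "NY": "Northeast"}

def enrich(rows, lookup, source_column, new_column, default=""):
    """Add new_column to every row by looking up the value in source_column."""
    return [
        {**row, new_column: lookup.get(row[source_column], default)}
        for row in rows
    ]

people = [{"id": "1", "state": "OH"}, {"id": "2", "state": "TX"}]
enriched = enrich(people, REGION_BY_STATE, "state", "region")
print(enriched)
```

The second record has no match in the lookup table, so its new `region` value is empty. Gaps like this are one reason the transformation step has to be repeated after enriching.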

Online Resources:

OpenRefine

Stanford Visualization Group Data Wrangler

3. Validation

The validation step is where you check all of the work done in the transformation step to ensure the consistency and quality of your data. This step is often accomplished through automated processes that require some programming skills.
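An automated check can be a short script that applies a few rules to every row and reports what fails. The sketch below is one possible shape for such a script; the rules (unique `id`, `age` between 0 and 120), the records, and the function names are assumptions chosen for illustration.

```python
def parse_int(text):
    """Return text as an integer, or None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

def validate(rows):
    """Return a list of human-readable problems found in rows."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        # Rule 1 (assumed): every id appears only once.
        if row["id"] in seen_ids:
            problems.append(f"row {number}: duplicate id {row['id']}")
        seen_ids.add(row["id"])
        # Rule 2 (assumed): age is a whole number between 0 and 120.
        age = parse_int(row["age"])
        if age is None or not 0 <= age <= 120:
            problems.append(f"row {number}: invalid age {row['age']!r}")
    return problems

# Hypothetical records to check.
records = [
    {"id": "1", "age": "34"},
    {"id": "1", "age": "29"},
    {"id": "2", "age": "abc"},
    {"id": "3", "age": "150"},
]
problems = validate(records)
for problem in problems:
    print(problem)
```

An empty list of problems means the data passed every rule you wrote, not that the data is free of errors, so the rules should reflect what your analysis depends on.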

4. Publishing

Now that your data has been validated, it is ready to be put into an open file format and shared for future analysis.

For more information on publishing, please visit Data Sharing.
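CSV is one widely used open file format. As a minimal sketch, the example below writes validated records to a UTF-8 CSV file and reads them back to confirm nothing was lost. The records, the file name, and the `publish_csv` helper are hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```python
import csv
import tempfile
from pathlib import Path

def publish_csv(rows, path):
    """Write rows to a UTF-8 CSV file, with a header taken from the first row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical validated records.
validated = [
    {"id": "1", "age": "34", "state": "OH"},
    {"id": "3", "age": "29", "state": "MI"},
]
output_path = Path(tempfile.mkdtemp()) / "validated.csv"
publish_csv(validated, output_path)

# Read the file back to confirm that it round-trips.
with open(output_path, newline="", encoding="utf-8") as handle:
    round_trip = list(csv.DictReader(handle))
print(round_trip == validated)
```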

References

Stobierski, T. (2021). Data wrangling: What it is & why it’s important. Harvard Business School Online, https://online.hbs.edu/blog/post/data-wrangling.

Coursera Staff (2023). What Is Data Wrangling? Definition, Steps, and Why It Matters. Coursera, https://www.coursera.org/articles/data-wrangling.

Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. John Wiley & Sons.

Osborne, J. W. (2013). Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage Publications.
