
Steps to Data Wrangling

Data wrangling can be a long and difficult process, but it becomes more manageable when you break it into these six steps.


  1. Discovering: In this step, you work out what the data will be used for. For example, if you are trying to create a targeted advertisement, the purpose of the data is to tell you which products the user likes. How you wrangle the data also depends on what type of data it is.

  2. Structuring: This is where you actually organize the data. Companies often arrange the data in tables, with rows for individual records and columns for their attributes. This makes the data easier to analyze.

  3. Cleaning: This is when all the unnecessary data is filtered out. There could be many outliers or errors that skew the data, and these need to be removed or corrected. Additionally, null values are handled and inconsistent labels are standardized, which increases the data quality. For example, soccer is called football in some places, but both words refer to the same sport, so they should be stored under a single label.

  4. Enriching: In this step, you look at your data and ask what further information can be derived from it. Basically, you dig deeper to see what new data you can work out from what you already have.

  5. Validating: This step is important because it essentially checks that the data is consistent, of good quality, and secure, and it is done with repetitive programming sequences. For example, dates may be written as MM/DD/YY, DD/MM/YY, MM/DD/YYYY, or even DD/MM/YYYY. Through validation, these programming sequences rewrite the dates so that they are all in the same form.

  6. Publishing: Finally, the data is prepared for downstream use. The data can now be analyzed and used effectively.
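
To make the cleaning, enriching, and validating steps more concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the sample records, the field names (`user`, `sport`, `date`, `date_format`), the synonym table, and the choice of YYYY-MM-DD as the common date form. Note that a string like 03/04/23 is ambiguous on its own, so the sketch assumes each record says which format its source used.

```python
from datetime import datetime

# Hypothetical raw records, already structured as rows (dicts) with named columns.
# "date_format" is an assumed hint telling us how each source writes its dates.
raw_records = [
    {"user": "a", "sport": "Soccer", "date": "03/14/2023", "date_format": "%m/%d/%Y"},
    {"user": "b", "sport": "football", "date": "14/03/23", "date_format": "%d/%m/%y"},
    {"user": "c", "sport": None, "date": "03/15/23", "date_format": "%m/%d/%y"},
]

# Assumed synonym table: different words for the same sport map to one label.
SPORT_SYNONYMS = {"football": "soccer", "soccer": "soccer"}


def clean(record):
    """Cleaning: replace nulls with a placeholder and standardize labels."""
    sport = record["sport"]
    if sport is None:
        label = "unknown"
    else:
        key = sport.strip().lower()
        label = SPORT_SYNONYMS.get(key, key)
    return {**record, "sport": label}


def enrich(record):
    """Enriching: derive a new field (day of the week) from an existing one."""
    parsed = datetime.strptime(record["date"], record["date_format"])
    return {**record, "weekday": parsed.strftime("%A")}


def validate(record):
    """Validating: rewrite every date in one form (YYYY-MM-DD).

    strptime raises ValueError if a date does not match its declared format,
    so malformed dates are caught here instead of being passed downstream.
    """
    parsed = datetime.strptime(record["date"], record["date_format"])
    result = {k: v for k, v in record.items() if k != "date_format"}
    result["date"] = parsed.strftime("%Y-%m-%d")
    return result


# Apply the steps in the same order as the list above.
wrangled = [validate(enrich(clean(r))) for r in raw_records]

for row in wrangled:
    print(row)
```

After this runs, all three rows share one date form and one label for the sport, and each row has gained a `weekday` field. In practice a library such as pandas would usually handle these steps on full tables, but the logic is the same.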

