Data Cleaning of Absenteeism Dataset with Python

Absenteeism Data Cleaning and Preprocessing

This project showcases my data cleaning and preprocessing skills using the Python Pandas library within a Jupyter Notebook environment. The objective was to take a raw dataset on employee absenteeism and transform it into a clean, well-structured format suitable for analysis and machine learning modeling.

Tools Used

  • Python
  • Pandas
  • Jupyter Notebook

Data Cleaning and Preprocessing Steps

The following steps were taken to clean and prepare the dataset.

Step 1: Initial Data Loading and Exploration

The first step was to load the dataset and get a high-level overview of its structure, data types, and missing values.


import pandas as pd

# Load the raw data
raw_csv_data = pd.read_csv("Absenteeism-data.csv")

# Display the dataframe
raw_csv_data

# Get a summary of the dataframe
print(raw_csv_data.info())
              

Output of raw_csv_data.head():

Screenshot of the raw data's first five rows

Output of raw_csv_data.info():

Screenshot of the data info summary
Step 2: Dropping Unnecessary Columns

The 'ID' column was identified as a unique identifier for each employee and not a feature that would be useful for analysis. Therefore, it was dropped.


# Create a copy to preserve the original data
df = raw_csv_data.copy()

# Drop the 'ID' column
df = df.drop(['ID'], axis=1)
              

Step 3: Handling Categorical Data ('Reason for Absence')

The 'Reason for Absence' column contained categorical data. To make it suitable for machine learning, I used one-hot encoding to convert these categories into numerical dummy variables.


# Get dummy variables for the 'Reason for Absence' column
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first=True)
              

Step 4: Date and Time Manipulation

The 'Date' column was converted from a string to a datetime object. From this, I extracted the month and the day of the week to create new, potentially insightful features.


# Convert 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Extract month value
df['Month Value'] = df['Date'].apply(lambda x: x.month)

# Extract day of the week
df['Day of the Week'] = df['Date'].apply(lambda x: x.weekday())
              

Step 5: Final Cleaned Dataset

After all the cleaning and preprocessing steps, a final, clean dataframe was created, ready for analysis.


# Create a final cleaned dataframe
df_cleaned = df.copy()

# Display the head of the cleaned dataframe
print(df_cleaned.head(10))
              

Output of df_cleaned.head(10):

Screenshot of the final cleaned data's first ten rows

Results & Impact

Through this data cleaning project, I successfully transformed a raw, messy dataset into a clean and structured one, demonstrating my ability to:

  • Identify and handle irrelevant or redundant data.
  • Convert categorical features into a numerical format suitable for modeling.
  • Perform feature engineering by extracting valuable information from existing columns (like dates).
  • Follow a systematic and organized data cleaning workflow.
  • Prepare a dataset that is ready for exploratory data analysis, visualization, and machine learning.
This project is a strong testament to my foundational skills in data preparation, which are crucial for any data-driven role.

Learnings and Takeaways

This project was a great exercise in applying data cleaning techniques in a practical scenario. Key takeaways include:

  • Deepened my proficiency in using the Pandas library for a wide range of data manipulation tasks.
  • Gained practical experience in feature engineering, specifically with datetime objects.
  • Reinforced the importance of a step-by-step, methodical approach to data cleaning.
  • Improved my ability to document and present a data cleaning process clearly in a Jupyter Notebook.

Project information