Getting started with EDA and Feature Engineering

Saurav Yadav
3 min read · Feb 27, 2022

Why do we need Exploratory Data Analysis?

Sometimes the data we get is not ready for our machine learning models. It may contain NaN values, duplicate values, and other problems. Exploratory Data Analysis helps us find these problems so that we can get rid of them and make our data clean and usable for machine learning models.
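As a small illustration of the checks mentioned above, here is a minimal sketch with pandas. The table is made up for the example (the "Airline" and "Price" columns and their values are assumptions, not the article's dataset); it only shows how NaN values and duplicate rows can be counted and removed.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value and one duplicated row.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"],
    "Price": [3897.0, np.nan, 3897.0, 4107.0],
})

# Count of NaN values in each column.
missing_per_column = df.isna().sum()

# Number of rows that are identical to an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Drop rows containing any NaN, then drop repeated rows.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```

Dropping rows is only one option; depending on the column, filling missing values may be a better choice.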

Why do we need Feature Engineering?

Feature Engineering encapsulates various data engineering techniques such as selecting relevant features, handling missing data, encoding the data, and normalizing it. It is one of the most crucial tasks and plays a major role in determining the outcome of a model.

Image from analyticsvidhya

Performing EDA and Feature Engineering on Flight Prediction.

You can get the dataset from either Kaggle or my GitHub.

Steps for EDA and Feature Engineering

1. Import some important libraries

2. Load the dataset
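A minimal sketch of these two steps, assuming pandas is used. To keep the example runnable without a download, a few made-up rows are read from an in-memory CSV; the rows, the columns other than "Date_of_Journey", "Route", "Dep_Time", "Arrival_Time" and "Duration", and the file name in the comment are all assumptions.

```python
import io

import pandas as pd

# Stand-in for the real file: made-up rows in the layout the article describes.
sample_csv = io.StringIO(
    "Airline,Date_of_Journey,Route,Dep_Time,Arrival_Time,Duration,Price\n"
    "IndiGo,24/02/2019,BLR-DEL,22:20,01:10,2h 50m,3897\n"
    "Air India,1/05/2019,CCU-BLR,05:50,13:15,7h 25m,7662\n"
    "SpiceJet,12/06/2019,DEL-BOM,09:25,04:25,19h,13882\n"
)

# With a downloaded file this might instead be
# pd.read_csv("flights.csv")  (hypothetical file name).
df = pd.read_csv(sample_csv)

# First rows of the table.
print(df.head())
```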

This is what our data looks like

Fetching some important information from the dataset

From this information we can see that most of the columns are of “object” dtype. We need to convert them to int or float so that we can use them to train our models.
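One way to fetch this information is sketched below, assuming pandas. The sample rows are made up; `info()` prints the dtypes, and the list comprehension collects the columns that are not numeric yet.

```python
import pandas as pd

# Made-up rows; only the column layout matters here.
df = pd.DataFrame({
    "Date_of_Journey": ["24/02/2019", "1/05/2019"],
    "Dep_Time": ["22:20", "05:50"],
    "Duration": ["2h 50m", "7h 25m"],
    "Price": [3897, 7662],
})

# Prints column names, non-null counts and dtypes.
df.info()

# Columns that still hold text and need converting before training.
non_numeric = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]
print(non_numeric)
```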

LET'S START FEATURE ENGINEERING

1. We have already seen what “Date_of_Journey” looks like (24/02/2019). We want this column to be of integer type, so we split it into three separate columns: “Date”, “Month” and “Year”.

2. Drop the “Date_of_Journey” column.

Now that we have split “Date_of_Journey” into three columns of integer type, we will do the same kind of conversion for each of the remaining columns.
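Steps 1 and 2 can be sketched as follows, assuming pandas and a day/month/year text format like the 24/02/2019 example. The rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date_of_Journey": ["24/02/2019", "1/05/2019", "12/06/2019"]})

# Split "day/month/year" on "/" into three text columns, then cast to int.
parts = df["Date_of_Journey"].str.split("/", expand=True)
df["Date"] = parts[0].astype(int)
df["Month"] = parts[1].astype(int)
df["Year"] = parts[2].astype(int)

# The original text column is no longer needed.
df = df.drop(columns=["Date_of_Journey"])

print(df)
```

An alternative would be to parse the column with `pd.to_datetime(..., format="%d/%m/%Y")` and read `.dt.day`, `.dt.month` and `.dt.year`; the string split is used here because it mirrors the description above.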

3. Next, take the “Arrival_Time” column, split it into two separate columns, and convert them to integer type.

Do the same for similar columns such as “Dep_Time” and “Duration”.
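A minimal sketch of this step, assuming the time columns hold text such as "22:20" and the duration holds text such as "2h 50m" or "19h". These formats, the sample rows, the helper name `split_clock` and the new column names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Arrival_Time": ["01:10", "13:15", "04:25"],
    "Duration": ["2h 50m", "7h 25m", "19h"],
})


def split_clock(frame, column, prefix):
    """Replace an 'HH:MM' text column with integer hour and minute columns."""
    parts = frame[column].str.extract(r"(\d{1,2}):(\d{2})")
    out = frame.drop(columns=[column])
    out[prefix + "_hour"] = parts[0].astype(int)
    out[prefix + "_min"] = parts[1].astype(int)
    return out


df = split_clock(df, "Arrival_Time", "Arrival")
df = split_clock(df, "Dep_Time", "Dep")

# Duration such as "2h 50m" or "19h": a missing part counts as 0.
hours = df["Duration"].str.extract(r"(\d+)h", expand=False).fillna("0").astype(int)
minutes = df["Duration"].str.extract(r"(\d+)m", expand=False).fillna("0").astype(int)
df["Duration_hour"] = hours
df["Duration_min"] = minutes
df = df.drop(columns=["Duration"])

print(df)
```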

4. Now convert the categorical columns to numbers using label encoding or one-hot encoding, and drop columns that are not useful, such as “Route”.
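Both encodings can be sketched with pandas alone, as below. The "Airline" and "Total_Stops" columns and their values are assumptions made for illustration; only "Route" is named in the text. Which encoding suits which column is a judgment call.

```python
import pandas as pd

df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "SpiceJet"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop", "non-stop"],
    "Route": ["BLR-DEL", "CCU-IXR-BBI-BLR", "DEL-LKO-BOM", "CCU-BLR"],
})

# Drop the column we do not plan to use.
df = df.drop(columns=["Route"])

# Label encoding: each category becomes an integer code.
# Codes follow the sorted order of the category names.
df["Total_Stops"] = df["Total_Stops"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per airline.
df = pd.get_dummies(df, columns=["Airline"], dtype=int)

print(df)
```

Label encoding keeps a single column but implies an order between categories, while one-hot encoding avoids that at the cost of extra columns.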

Checking the info of our dataset.

Now all the columns are of integer type.

This is what the data looks like after doing some feature engineering and EDA on the dataset

CONCLUSION

Convert the dataset into meaningful columns, remove features that are not useful, use label encoding, and use some plotting techniques. This dataset was short and easy. If you want the full code with proper EDA and feature engineering, you can fork it from my GitHub.


Saurav Yadav

Hi, I am an AI | Machine learning enthusiast with a good hand in Mathematics. Skills = [‘Data analysis’, ‘Data mining’, ‘DS/Algos (Beginner)’, ‘Mathematics’]