Data Pre-processing

(These tutorials assume that you already know the basics of Python.)

The first step of building any machine learning model is always going to be data pre-processing.



So, let's understand it with a dataset. The link below is the dataset used in this tutorial.

 https://drive.google.com/file/d/1cKxzNrnIMq4RxdMcBz3nlr7YtYaPhn5_/view?usp=sharing

Country    Age    Salary    Purchased
France     44     72000     No
Spain      27     48000     Yes
Germany    30     54000     No
Spain      38     61000     No
Germany    40               Yes
France     35     58000     Yes
Spain             52000     No
France     48     79000     Yes
Germany    50     83000     No
France     37     67000     Yes

 

This is the data we are going to use in this tutorial.

First, let’s understand the dataset.

This is a dataset from an e-commerce business operating across Europe. Each row gives the country a customer is from, the customer's age, the customer's salary and, finally, whether or not that customer purchased the item.

 

Let's define what an independent variable is and what a dependent variable is.

In this case, Country, Age and Salary are the independent variables, and the purchase decision is the dependent variable, because whether a person purchases depends on Country, Age and Salary. The purchase is made (or not) on the basis of these independent variables.

Let's start by importing the libraries we will need:-

   import numpy as np

   import pandas as pd

   import os

 

After importing the libraries, if you want to know the working directory your script is running in (this is where Data.csv should be placed, unless you pass its full path), follow this:-

 os.getcwd()

 
Output:- 'c:\\Users\\burhan\\Desktop\\Machine Learning in Python\\Machine Learning A-Z (Codes and Datasets)\\Part 1 - Data Preprocessing\\Data.csv'
 
This is the path on my computer; you will have a different path on yours.
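If Data.csv is not in that directory, you can either move it there or pass the file's full path to pandas. As a small optional check (the path in the comment below is only a placeholder, not a real location):

 # list the files in the current working directory; Data.csv should appear here
 print(os.listdir())
 # if it does not, either move the file or change the working directory, e.g.
 # os.chdir(r'C:\path\to\your\dataset\folder')   # placeholder path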
Now, let's import the dataset using the pandas library and give the resulting DataFrame a name.
 
 stats = pd.read_csv('Data.csv')
 stats

 

Output:-

     Country   Age   Salary Purchased
0     France  44.0  72000.0        No
1      Spain  27.0  48000.0       Yes
2    Germany  30.0  54000.0        No
3      Spain  38.0  61000.0        No
4    Germany  40.0      NaN       Yes
5     France  35.0  58000.0       Yes
6      Spain   NaN  52000.0        No
7     France  48.0  79000.0       Yes
8    Germany  50.0  83000.0        No
9     France  37.0  67000.0       Yes

 

After importing the dataset, let's divide it into two parts: the independent variables (X) and the dependent variable (y).

 X = stats.iloc[:, :-1].values

 y = stats.iloc[:, -1].values

 

You can use any variable names here instead of X and y.

 print(X)

Output:-

[['France' 44.0 72000.0]

['Spain' 27.0 48000.0]

['Germany' 30.0 54000.0]

['Spain' 38.0 61000.0]

['Germany' 40.0 nan]

['France' 35.0 58000.0]

['Spain' nan 52000.0]

['France' 48.0 79000.0]

['Germany' 50.0 83000.0]

['France' 37.0 67000.0]]

 

 print(y)

Output :-

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

 

If you look at the data assigned to variable X, you will notice some entries printed as nan. What is nan?

nan (short for "not a number") marks the values that are missing in the original dataset, as you can see if you look at it closely.

So, to fill those empty places, follow the next step.

Here, we will be using a Python library named scikit-learn. We will fill the empty places with the average of the other values in the same column. That is,

the nan in Age will be replaced by the mean of all the other ages, and

the nan in Salary will be replaced by the mean of all the other salaries.

Why do we take the average and not some arbitrary number? Because arbitrary values would distort the data and hurt the prediction accuracy, while the mean preserves the overall distribution of the column.
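As a quick sanity check (assuming the dataset has already been loaded into the stats DataFrame as above), you can compute those column means directly with pandas, which skips missing values by default:

 # pandas ignores NaN when computing the mean
 print(stats['Age'].mean())      # about 38.78, the value used for the missing age
 print(stats['Salary'].mean())   # about 63777.78, the value used for the missing salary

Now, to actually fill in the missing values: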

 from sklearn.impute import SimpleImputer

 imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

 imputer.fit(X[:, 1:])

 X[:, 1:] = imputer.transform(X[:, 1:])

 print(X)

Output:-

[['France' 44.0 72000.0]

['Spain' 27.0 48000.0]

['Germany' 30.0 54000.0]

['Spain' 38.0 61000.0]

['Germany' 40.0 63777.77777777778]

['France' 35.0 58000.0]

['Spain' 38.77777777777778 52000.0]

 ['France' 48.0 79000.0]

['Germany' 50.0 83000.0]

['France' 37.0 67000.0]]

Here, impute is the module of the scikit-learn library from which the SimpleImputer class is imported. You can check out the scikit-learn documentation link given after the blog.

We set strategy='mean' so that missing entries are filled with the column mean, and missing_values=np.nan so that the imputer knows which entries count as missing.

imputer is just a variable name; you can use any name you like. imputer.fit() computes the mean of each numeric column of X, so that the nan values can be filled with the mean of their respective columns.

imputer.transform() then returns the same columns with every nan replaced by its column's mean.
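Since we fit and transform on the same data here, the two calls can also be combined into a single fit_transform() call; this is just a shorthand and gives the same result:

 # shorthand: compute the column means and fill the nans in one step
 X[:, 1:] = imputer.fit_transform(X[:, 1:])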

 

Now, the country names are given as strings and cannot be fed directly into the training and test sets, because most machine learning algorithms only accept numeric data. So we will convert the country names (categorical data) into binary vectors using OneHotEncoder(), another class from the scikit-learn library.

 from sklearn.compose import ColumnTransformer

 from sklearn.preprocessing import OneHotEncoder

 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

 X = np.array(ct.fit_transform(X))

 print(X)

Output:-

[[1.0 0.0 0.0 44.0 72000.0]

[0.0 0.0 1.0 27.0 48000.0]

[0.0 1.0 0.0 30.0 54000.0]

[0.0 0.0 1.0 38.0 61000.0]

[0.0 1.0 0.0 40.0 63777.77777777778]

[1.0 0.0 0.0 35.0 58000.0]

[0.0 0.0 1.0 38.77777777777778 52000.0]

[1.0 0.0 0.0 48.0 79000.0]

[0.0 1.0 0.0 50.0 83000.0]

[1.0 0.0 0.0 37.0 67000.0]]

 

If you look carefully, the country names have been changed to binary vectors: France = 1 0 0, Spain = 0 0 1 and Germany = 0 1 0.

Here, ColumnTransformer() is used to choose the column on which the one-hot encoding will be done. In transformers, 'encoder' is simply the name given to this transformation, OneHotEncoder() is the transformer to apply, and [0] selects the column to transform (column 0, Country). remainder='passthrough' tells ColumnTransformer to keep all the other columns as they are instead of dropping them.

When applying the fit_transform() method, wrapping the result in np.array() is necessary because fit_transform() does not guarantee a NumPy array as output, while the fit() method we will call later on the training set expects one.
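If you want to confirm which position in the binary vector corresponds to which country, you can inspect the fitted encoder inside the ColumnTransformer (the attributes below assume a reasonably recent version of scikit-learn):

 # OneHotEncoder stores the categories in sorted order: France, Germany, Spain,
 # which is why France = 1 0 0, Germany = 0 1 0 and Spain = 0 0 1
 print(ct.named_transformers_['encoder'].categories_)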

 

 from sklearn.preprocessing import LabelEncoder

 le = LabelEncoder()

 y = le.fit_transform(y)

 print(y)

Output:-

[0 1 0 0 1 1 0 1 0 1]

Here, the y variable is also categorical data ('Yes'/'No'), so we have to convert it into numeric labels as well; for that we use a scikit-learn class called LabelEncoder().

Assign LabelEncoder() to any variable name you wish. fit_transform() fits the encoder on y and transforms the labels into numeric values as shown above: 'No' becomes 0 and 'Yes' becomes 1.
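If you want to see how the labels were mapped, or to go back from the numbers to the original labels, the fitted LabelEncoder keeps the classes it learned:

 # classes_ lists the original labels in the order of their numeric codes
 print(le.classes_)                    # ['No' 'Yes']  ->  No = 0, Yes = 1
 print(le.inverse_transform([0, 1]))   # maps the codes back to ['No' 'Yes']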

 

Now, let's split the data into a training set and a test set.

What is a training set?

It is the portion of the data the model is trained on: the machine learns the patterns in it and uses them to make its predictions.

What is a test set?

It is the portion of the data held back from training and used to compare the model's predictions against the actual values.

 from sklearn.model_selection import train_test_split

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)

Here, test_size determines what fraction of the data goes into the test set. I recommend keeping it at 0.3 or below, because the more data you feed the machine during training, the more accurate its predictions tend to be. random_state=1 simply fixes the random shuffle so that you get the same split every time you run the code.
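A quick way to confirm the split is to print the shapes: with 10 rows and test_size=0.2, 8 rows go into the training set and 2 into the test set.

 print(X_train.shape, X_test.shape)   # (8, 5) (2, 5)
 print(y_train.shape, y_test.shape)   # (8,) (2,)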

 print(X_train)

Output:-

[[0.0 0.0 1.0  38.77777777777778  52000.0]

[0.0 1.0 0.0  40.0  63777.77777777778]

[1.0 0.0 0.0  44.0  72000.0]

[0.0 0.0 1.0  38.0  61000.0]

[0.0 0.0 1.0  27.0  48000.0]

[1.0 0.0 0.0  48.0  79000.0]

[0.0 1.0 0.0  50.0  83000.0]

[1.0 0.0 0.0  35.0  58000.0]]

 

 print(X_test)

Output:-

[[0.0 1.0 0.0 30.0 54000.0]

[1.0 0.0 0.0 37.0 67000.0]]

 

 print(y_train)

Output:-

[0 1 0 0 1 1 0 1]

 

 print(y_test)

Output:-

[0 1]

 

Now, the next step we are going to apply is called Feature Scaling.

 from sklearn.preprocessing import StandardScaler

 sc = StandardScaler()

 X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

 X_test[:, 3:] = sc.transform(X_test[:, 3:])

Here, the Age and Salary columns are converted to standardized form, i.e. each value becomes (value - mean) / standard deviation, so that both features end up on a comparable scale and neither dominates the other. Note that fit_transform() is applied only to the training set; on the test set we call transform() so that it is scaled with the mean and standard deviation learned from the training data, exactly as future unseen data would be.
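If you are curious about the exact numbers the scaler uses, the fitted StandardScaler exposes the training-set mean and standard deviation of each scaled column; every value is transformed as (value - mean_) / scale_:

 # mean_ and scale_ hold the training-set mean and standard deviation
 # of the Age and Salary columns (in that order)
 print(sc.mean_)
 print(sc.scale_)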

 

 print(X_train)

Output:-

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]

[0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]

[1.0 0.0 0.0 0.566708506533324 0.633562432710455]

[0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]

[0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]

[1.0 0.0 0.0 1.1475343068237058 1.232653363453549]

[0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]

[1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]

 

 print(X_test)

Output :-

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]

[1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]

  

Link for scikit learn library:- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
