Data Pre-processing
(These tutorials assume that you know the basics of Python.)
The first step of any machine learning algorithm you build is always going to be data pre-processing.
So, let's understand it using a dataset. The link below is the dataset that is used.
https://drive.google.com/file/d/1cKxzNrnIMq4RxdMcBz3nlr7YtYaPhn5_/view?usp=sharing
| Country | Age | Salary | Purchased |
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |
This is the data we are going to use in this tutorial.
First, let's understand the dataset.
This is the dataset from an e-commerce business operating across Europe. It records the country each customer ordered from, the customer's age, the customer's salary, and finally whether or not the customer purchased the item.
Let's define what an independent variable is and what a dependent variable is.
In this case Country, Age and Salary are independent variables, and the purchase decision of a person is the dependent variable, because it depends on Country, Age and Salary: the purchase is made on the basis of these independent variables.
Let's start by importing the libraries we will be needing:
import numpy as np
import pandas as pd
import os
After importing the libraries, if you want to know the path of your dataset, run:
os.getcwd()
Output:- 'c:\\Users\\burhan\\Desktop\\Machine Learning in Python\\Machine Learning A-Z (Codes and Datasets)\\Part 1 - Data Preprocessing\\Data.csv'
This is the path my dataset was in; you will have a different path on your computer.
Now, let's import the dataset using the pandas library and give it a name.
stats = pd.read_csv('Data.csv')
stats
Output:-
| Country | Age | Salary | Purchased |
| France  | 44  | 72000  | No        |
| Spain   | 27  | 48000  | Yes       |
| Germany | 30  | 54000  | No        |
| Spain   | 38  | 61000  | No        |
| Germany | 40  |        | Yes       |
| France  | 35  | 58000  | Yes       |
| Spain   |     | 52000  | No        |
| France  | 48  | 79000  | Yes       |
| Germany | 50  | 83000  | No        |
| France  | 37  | 67000  | Yes       |
After importing the dataset, let's divide it into two parts: the independent variables (X) and the dependent variable (y).
X = stats.iloc[:, :-1].values
y = stats.iloc[:, -1].values
Here you can use any variable names instead of X and y.
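To make the iloc slicing concrete, here is a minimal sketch on a tiny made-up frame (the values below are just for illustration): `:-1` keeps every column except the last, while `-1` keeps only the last column.

```python
import pandas as pd

# Minimal sketch on a made-up mini-frame: iloc[:, :-1] selects all rows
# and every column except the last; iloc[:, -1] selects only the last column.
df = pd.DataFrame({
    'Country': ['France', 'Spain'],
    'Age': [44, 27],
    'Purchased': ['No', 'Yes'],
})
X = df.iloc[:, :-1].values  # features: Country and Age
y = df.iloc[:, -1].values   # target: Purchased
print(X)
print(y)
```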
print(X)
Output:-
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
print(y)
Output:-
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
If you look at the data assigned to variable X, what is nan?
nan marks the empty values that were not given in the original dataset, as you can see if you look at it closely.
So, now, to fill those empty places, follow the next step.
Here, we will be using a Python library named scikit-learn. We will fill the empty places with the average of the other values given in the same column. That is,
nan in Age will be the mean of all the other Ages.
nan in Salary will be the mean of all the other Salaries.
Why are we taking the average and not a random number? Because a random number would distort the data and hurt the prediction accuracy of the model, while the mean keeps the column's overall distribution intact.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
print(X)
Output:-
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Here impute is the module of the scikit-learn library from which the SimpleImputer class is imported. You can check out the scikit-learn documentation link given after the blog.
We set strategy='mean', and missing_values=np.nan tells the imputer which values to treat as missing.
imputer is just a variable name; you can pick any name you like. imputer.fit() learns the mean of each numeric column of X, so that the nan values can be filled with the mean of their respective column.
imputer.transform() then returns the data with the nan values replaced by those means.
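SimpleImputer also supports strategies other than 'mean'. Here is a small sketch with made-up numbers showing 'median' as an alternative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (numbers are made up for illustration)
col = np.array([[27.0], [np.nan], [38.0], [48.0]])

mean_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
median_imp = SimpleImputer(missing_values=np.nan, strategy='median')

# 'mean' fills nan with (27 + 38 + 48) / 3; 'median' fills it with 38.0
print(mean_imp.fit_transform(col).ravel())
print(median_imp.fit_transform(col).ravel())
```

The median is often preferred when a column contains outliers, since a single extreme salary would drag the mean away from the typical value.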
Now, the countries, which are given as strings, cannot be fed into the training and test sets, because the algorithm only accepts numeric data. So, we will convert the country names (categorical data) into binary vectors using OneHotEncoder(), another class from the scikit-learn library.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
Output:-
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
If you look carefully, the country names have been changed to binary vectors: France = 1 0 0, Spain = 0 0 1 and Germany = 0 1 0.
Here, ColumnTransformer() is used to choose the column on which the one-hot encoding will be done. In transformers, 'encoder' is just a name for this transformation step; then the transformer to apply is specified, in this case OneHotEncoder(); and finally the column to transform is selected (column 0). remainder='passthrough' in ColumnTransformer() means that the other columns are kept as they are, so only the specified column is one-hot encoded.
When applying the fit_transform() method, wrapping the result in np.array() is necessary because the fit() methods we will call later on the training set expect a NumPy array.
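For quick inspection, pandas offers a similar result through get_dummies(); this is just an illustrative alternative, not what the rest of this tutorial uses:

```python
import pandas as pd

# get_dummies creates one indicator column per category
# (columns come out in alphabetical order: France, Germany, Spain)
countries = pd.DataFrame({'Country': ['France', 'Spain', 'Germany']})
dummies = pd.get_dummies(countries, columns=['Country'])
print(dummies)
```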
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
Output:-
[0 1 0 0 1 1 0 1 0 1]
Here, the y variable is also categorical data, so we have to convert it into numeric labels; for that we used a scikit-learn class called LabelEncoder().
Assign LabelEncoder() to any variable you wish. fit_transform() is a method that fits the encoder on the y variable and transforms the labels into numeric values as above (No = 0, Yes = 1).
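A small sketch of how LabelEncoder behaves: classes are numbered in alphabetical order, and inverse_transform() recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['No', 'Yes', 'No'])  # 'No' -> 0, 'Yes' -> 1
print(codes)
print(le.classes_)                  # the learned label order
print(le.inverse_transform(codes))  # back to the original strings
```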
Now, let's split the data into a training set and a test set.
What is a training set?
It is the set on which the model is trained to identify patterns; the machine learns from this data in order to make predictions.
What is a test set?
It is the set used to compare the model's predictions (made on the basis of the training set) against the actual outcomes.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)
Here, we have to decide what percentage of the data should become the test set. I recommend always keeping it at 0.3 or below, to increase the accuracy of the model, because the more data you feed the machine during training, the more accurate its predictions will be.
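A quick sketch (with made-up arrays) confirming what test_size=0.2 does with 10 samples: 8 go to training and 2 to testing, and random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features each
y_demo = np.arange(10)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(Xtr.shape, Xte.shape)  # (8, 2) and (2, 2)
```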
print(X_train)
Output:-
[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
print(X_test)
Output:-
[[0.0 1.0 0.0 30.0 54000.0]
[1.0 0.0 0.0 37.0 67000.0]]
print(y_train)
Output:-
[0 1 0 0 1 1 0 1]
print(y_test)
Output:-
[0 1]
Now, the step we are going to apply is called Feature Scaling.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
Here, the Age and Salary columns are converted to standardized form so that both features are on a comparable scale for the machine. Note that fit_transform() is used only on the training set; on the test set we use transform() alone, so that the test data is scaled with the mean and standard deviation learned from the training set and no information leaks from the test set into the model.
print(X_train)
Output:-
[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
print(X_test)
Output:-
[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
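What StandardScaler computes can be checked by hand: each value becomes z = (x - mean) / std, using the column's mean and (population) standard deviation. A sketch with made-up ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([27.0, 35.0, 38.0, 40.0, 44.0, 48.0, 50.0])  # toy values

# Standardize by hand: subtract the mean, divide by the population std
z = (ages - ages.mean()) / ages.std()

# StandardScaler does the same thing column-wise
sc_check = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()
print(np.allclose(z, sc_check))  # the two agree
```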
Link for scikit learn library:- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
