Even in your unknown, you have to take a leap of faith and just start moving toward what you want in life…
Taking a walk down memory lane, I remember how much my face hurt from smiling the moment I received my congratulatory email from SCA, after taking the leap of faith to sit the test and make my submission for a slot as a mentee in the data science track.
The feeling didn’t last long. I started doubting myself and felt overwhelmed thinking about how I could see this through. Yet here I am, writing an article on my final project as a mentee with SCA: building a model with machine learning.
Machine Learning is not the future, it’s the present.
Recently, I was looking for natural hair products that could help grow my hair, because I was unsatisfied with the rate at which it had been growing since my big chop two years ago.
Scrolling through my timeline sometime this month, I came across an ad for natural products and clicked to check it out and maybe make a purchase. From that day on, I noticed that whenever I’m logged into my Instagram application, two or more natural products pop up on my timeline. Going through the materials for this month's mentorship program by She Code Africa, I realized this was machine learning in action.
Interest in using machine learning in our daily lives keeps expanding as the amount of available data increases.
What is Machine Learning?
According to SAS Insights, machine learning is a method of data analysis that automates analytical model building.
It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
A machine is said to learn from past experience (the data fed in) with respect to some class of tasks.
For example, assume that a machine has to predict whether or not a customer will buy a specific product, e.g. an anti-malaria drug, this year. The machine predicts by looking at the products the customer has bought each year; if he buys anti-malaria drugs every year, there is a high probability he will buy them this year as well.
Machine learning is crucial to data scientists because of its ‘high-value predictions, which can guide better decisions and smart actions in real-time without human intervention’. As models are exposed to new data, they are able to adapt independently, learning from previous computations to produce reliable, repeatable decisions and results.
Types of Machine Learning
* Supervised Learning
This is when the model is trained on a labelled dataset. An algorithm is trained and, at the end, the model that most accurately predicts the output from the input data is picked. Training continues until the level of performance is high enough.
We have two types of Supervised Learning techniques:
- Classification.
- Regression.
- Classification: A supervised learning task where the output is a defined label (a discrete value). It can be binary classification, where the model predicts one of two classes (0 or 1; yes or no), or multiclass classification, where the model chooses among more than two classes.
There are two main types of classification problems:
- Binary or binomial classification: exactly two classes to choose between (usually 0 and 1, true and false, or positive and negative)
- Multiclass or multinomial classification: three or more classes of the outputs to choose from
- Regression: A supervised learning task where the output is a continuous value. Here we predict a value as close as possible to the true output value and then evaluate the prediction by calculating an error value. The smaller the error, the more accurate the model.
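To make the idea of "calculating an error value" concrete, here is a minimal sketch of a regression model and its error. The data is made up for illustration, and the choice of scikit-learn's LinearRegression and mean absolute error is my assumption, not something from the project itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up data: hours studied (input) and exam score (continuous output).
hours = np.array([[1], [2], [3], [4], [5], [6]])
scores = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 73.0])

# Fit a straight line to the labelled examples.
model = LinearRegression().fit(hours, scores)
predicted = model.predict(hours)

# The error measures how far the predictions are from the true values.
mae = mean_absolute_error(scores, predicted)
print(f"Mean absolute error: {mae:.2f}")
```

A smaller mean absolute error means the predicted scores sit closer to the real ones.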
* Unsupervised Learning
It is a type of learning where we don’t give the model a target while training, i.e. the training data has only input values.
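As a small illustration of learning without a target, the sketch below groups made-up points with K-Means clustering. K-Means is only one example of an unsupervised technique and was not part of my project; the data and settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up inputs only: no target column is given to the model.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],
])

# Ask the algorithm to find two groups on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(points)
print(groups)
```

The model is never told which group a point belongs to; it discovers the two groups from the input values alone.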
My SCA Mentorship Programme Project.
For my data science mentorship program with SCA, I worked on building a model for the case study “Financial Inclusion in Africa”, a Zindi competition, using supervised machine learning.
The objective of this competition was to create a machine learning model to predict which individuals are most likely to have or use a bank account. See the competition to learn more about the data.
The main dataset contains demographic information and the financial services used by approximately 33,610 individuals across East Africa. The data was extracted from various FinScope surveys ranging from 2016 to 2018.
The data has been split into training and test sets. The test set contains all the information about each individual except whether the respondent has a bank account. Our main goal is to accurately predict the likelihood that an individual has a bank account, i.e. Yes = 1, No = 0.
What Type of Supervised Learning Problem Is This?
From the goal, which is to accurately predict the likelihood that an individual has a bank account, in the form Yes = 1 or No = 0, we can tell it’s a binary classification problem.
Steps to Model
Step 1: Define Preprocessing Steps
I downloaded the datasets, saved them in a folder and opened a new Jupyter notebook in the same directory, so that reading in the files would be seamless.
N.B: Whatever I do to my train dataset, I do to my test dataset.
- Initial setup: Imported the Python packages needed to build the model.
- Read the datasets: Read the train and test files, e.g. dataTest = pd.read_csv('Test_v2.csv')
- Check the information about the datasets: Used print(dataTrain.info()) to know the type of data (i.e. object or integer) in our dataset. Most of the columns were categorical, and since machine learning models work mostly with numerical data, they had to be converted. Categorical data are variables that contain label values rather than numeric values.
- Encode categorical data: Used one-hot encoding to encode the categorical data with pandas.get_dummies. One-hot encoding turns your categorical data into a binary vector representation. Check out this link for more information on pandas get_dummies.
- Check for missing values: Most datasets come with missing values; you can look up reasons for missing data and how they can be handled. Using the command dataTrain.isnull().sum(), I found that the dataset had no missing values.
- Declare predictive features and target variable: Assigned the variable ‘y’ to the target and ‘X’ to the predictive features, after dropping the columns no longer needed to build my model.
- Split the dataset into training and test sets.
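The preprocessing steps above can be sketched roughly as follows. This is not my actual notebook: to keep it runnable without the competition files, it uses a tiny made-up DataFrame in place of pd.read_csv, and the column names (uniqueid, location_type, age, bank_account) and the split settings are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up stand-in for the training file; the column names are assumptions.
dataTrain = pd.DataFrame({
    "uniqueid": [f"id_{i}" for i in range(8)],
    "location_type": ["Rural", "Urban", "Rural", "Urban",
                      "Rural", "Urban", "Rural", "Urban"],
    "age": [24, 35, 41, 29, 53, 38, 19, 47],
    "bank_account": ["No", "Yes", "No", "Yes", "No", "No", "No", "Yes"],
})

# Check column types and count missing values per column.
dataTrain.info()
print(dataTrain.isnull().sum())

# Target: Yes = 1, No = 0.
y = dataTrain["bank_account"].map({"Yes": 1, "No": 0})

# Features: drop the identifier and the target, then one-hot encode
# the remaining categorical columns.
X = pd.get_dummies(dataTrain.drop(columns=["uniqueid", "bank_account"]))

# Hold out part of the training data to evaluate the model.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)
```

The same encoding has to be applied to the test file so that both datasets end up with the same columns.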
Step 2: Define the Model
Logistic Regression Model Development and Training
In defining my model I adopted one of the most commonly used models, logistic regression, with the familiar LogisticRegression class. Logistic regression is a predictive analysis algorithm based on the concept of probability. It is used to predict a binary outcome: either something happens or it does not. This can be expressed as Yes/No, Pass/Fail, etc. The confusion matrix plot shows that the model performs well on predicting class No and poorly on predicting class Yes; this may be caused by the imbalance in the data provided (the target variable has more ‘No’ values than ‘Yes’ values). The mean accuracy score is 0.885.
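A minimal sketch of training logistic regression and reading its confusion matrix might look like this. It runs on synthetic, imbalanced data generated with scikit-learn rather than the Zindi dataset, so the numbers it prints will not match my results; max_iter and the split settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 85% class 0), not the Zindi dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Define and train the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_val)

accuracy = accuracy_score(y_val, predictions)
# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_val, predictions)
print("Accuracy:", accuracy)
print(matrix)
```

With imbalanced data, accuracy alone can look good while one class is predicted badly, which is why the confusion matrix is worth checking.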
The model can be improved using feature scaling, class balancing, hyperparameter tuning (e.g. grid search), etc. To know more, you can click this link.
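Here is one possible way to combine those three ideas in scikit-learn, again on synthetic data. I did not necessarily use these exact settings in my project: the grid of C values, the F1 scoring and the 5-fold cross-validation are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data, not the Zindi dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

pipeline = Pipeline([
    # Feature scaling: put every feature on a comparable scale.
    ("scale", StandardScaler()),
    # Class balancing: weight the rarer class more heavily.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Hyperparameter tuning: try several regularisation strengths with grid search.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information leaks from the validation folds.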
LightGBM Model Development and Training
This is another model; it uses tree-based learning algorithms. I needed to convert the training data into the LightGBM dataset format (this is required when training with LightGBM's native API), then fitted the model on the training data and made predictions, which gave me an accuracy score of 0.886. By accuracy, the LightGBM model is slightly better than logistic regression (0.886 versus 0.885). This model can be improved using any of these methods.
Aside from the models used above, the Naive Bayes classifier, K-Nearest Neighbors, Support Vector Machines and decision trees can also be used to build classification models. Here is a link to my work on GitHub.
Overall, this article won’t be complete without me saying thank you to She Code Africa for this opportunity, and for firsts like writing my first article ever!
Within three months of being an SCA data science mentee, I learnt the art of discipline, time management, how to multitask, and how to communicate with people and ask for help when I needed it. I learnt to applaud colleagues when they are praised for their good efforts rather than be upset, and to push myself to be better. Most importantly, my knowledge of data science went from zero to a hundred after being supplied with great resources, assigned to one of the best mentors, Precious Kolawole, and teamed up with good colleagues like Florence, Doyin, Purity and Eniola. It is beautiful how you all readily shared your knowledge and created time to tutor me when I seemed lost. To everyone dear to my heart who encouraged me on this voyage, my ever-supportive siblings who were always the first to engage with my article and give constructive criticism, and my other two favourite persons, Ifeoluwa and Dimeji: you all rock!
I honestly wish this programme wouldn’t end so soon, but I’ll take solace in the SCA Slack community that has been made available to every African woman in tech to keep learning and connecting with people and opportunities.