SCA Mentorship Quest 1.
Budding Data Scientist
“Mama, I made it!” is the phrase I literally want to scream out at this point. I actually made it through the first of the three months allocated to my data science mentorship program, organised by She Codes Africa.
The journey of a thousand miles begins with a single step.
The SCA mentorship program
One of my New Year resolutions was to be more deliberate about my journey into tech. I realized I needed a community of like minds, since I had no friends or acquaintances in tech, so I joined the SCA Slack community and came across the mentorship program.
I had been searching for some form of accountability, and getting into the mentorship program and having a mentor seemed like my opportunity to finally get it. The application task was quite challenging, and I didn’t complete and submit it until a few hours before the deadline. At that point I didn’t think I would be selected.
Days later I got my first congratulatory email of the year from SCA on my selection as one of the mentees for SCA cohort 4 mentorship program. It seemed like a miracle.
Fast forward to now, it’s one of the best decisions I have made. I could never have been this productive and focused in just a month without structured and detailed resources.
A summary of what I learnt in my first month
An introduction to the Python programming language, and to how a data analyst uses it, was a very good way to start the journey as a data scientist because Python has a simple, easy-to-use syntax.
Three of the most widely used Python libraries for data science are NumPy, Pandas, and Matplotlib.
NumPy : A library that makes a variety of mathematical and statistical operations easier; it is also the basis for many features of the pandas library. It is a core scientific computing library and uses arrays for input and output.
Pandas : A Python library created specifically to facilitate working with data. It is built on top of the NumPy package (which handles the numeric data in tabular form) and has built-in data structures that ease the process of data manipulation.
Matplotlib : A visualization library that makes it quick and easy to generate charts from your data.
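To make the three libraries a bit more concrete, here is a minimal sketch of them working together. The names and scores are made up for illustration, and the chart is saved to memory only so the snippet runs anywhere; in a notebook you would normally just display it.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: maths on a whole array at once
scores = np.array([55, 70, 85, 90])   # made-up scores
mean_score = scores.mean()

# pandas: the same numbers as a labelled table
df = pd.DataFrame({"name": ["Ada", "Bola", "Chi", "Dayo"], "score": scores})
top = df[df["score"] > mean_score]    # rows above the average

# Matplotlib: a bar chart of the table
fig, ax = plt.subplots()
ax.bar(df["name"], df["score"])
ax.set_ylabel("score")
buffer = io.BytesIO()
fig.savefig(buffer, format="png")     # pass a file name instead to keep the image
plt.close(fig)
```

NumPy does the arithmetic, pandas gives the numbers labels, and Matplotlib turns the table into a picture.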
The mathematical concepts behind data science were the second task we were introduced to. I came to realize that learning them helps one make informed decisions about how likely an event is to take place, based on a pattern in collected data. Here’s a link to an article I wrote on it.
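As a tiny illustration of estimating how likely an event is from collected data, here is a sketch using relative frequency. The rain record is invented for the example, and this is only one simple way to estimate a chance, not necessarily the method covered in the program.

```python
# Hypothetical record of 10 days: True means it rained that day
rained = [True, False, False, True, True, False, True, False, True, True]

# Relative frequency: days with rain divided by all days observed
p_rain = sum(rained) / len(rained)
print(f"Estimated chance of rain: {p_rain:.0%}")
```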
Python with the pandas library was useful for cleaning and sorting the data we were provided into a dataframe (table).
Data is an integral part of analysis and is often stored in files (CSV, Excel, JSON, XML, SQL, etc.).
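Here is a small sketch of loading two of those formats with pandas. To keep it self-contained, the "files" are short made-up strings wrapped in `io.StringIO`; with real data you would pass a file path instead.

```python
import io

import pandas as pd

# Made-up file contents, standing in for real files on disk
csv_text = "name,age\nAda,28\nBola,31\n"
json_text = '[{"name": "Ada", "age": 28}, {"name": "Bola", "age": 31}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls give back the same kind of table, whatever format the data started in.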
A dataframe is a two-dimensional data structure that organizes data into rows and columns. It can be created from lists, dictionaries and NumPy arrays.
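A quick sketch of those three ways of building a dataframe, with made-up values:

```python
import numpy as np
import pandas as pd

# From a list of rows (column names supplied separately)
from_lists = pd.DataFrame([["Ada", 28], ["Bola", 31]], columns=["name", "age"])

# From a dictionary: keys become column names
from_dict = pd.DataFrame({"name": ["Ada", "Bola"], "age": [28, 31]})

# From a NumPy array
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

print(from_lists.shape)  # (number of rows, number of columns)
```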
Python has different data types; some of the common ones are integer, floating-point number, string and Boolean (a single character is simply a string of length one). Knowing your data types is really important, especially when cleaning and analyzing a data set, and that leads us to the task for week three.
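A short sketch of checking data types, first for plain Python values and then for the columns of a dataframe. The table is made up; it shows a common cleaning problem where numbers arrive as text. Note that in Python a single character is just a string of length one.

```python
import pandas as pd

print(type(7))       # <class 'int'>
print(type(3.14))    # <class 'float'>
print(type("a"))     # <class 'str'>
print(type(True))    # <class 'bool'>

# Made-up table where the ages were read in as text
df = pd.DataFrame({"age": ["28", "31"], "height": [1.6, 1.7]})
df["age"] = df["age"].astype(int)   # convert the text to whole numbers
print(df.dtypes)                    # one type per column
```

Once `age` is numeric, operations like sums and averages work on it as expected.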
Why do we need to clean a data set? Duplicated data or missing values can distort the analysis, hence the need.
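To see the effect, here is a minimal sketch with a made-up table containing one duplicated row and one missing score. Dropping both is just one possible fix; depending on the data, filling in missing values can be a better choice.

```python
import numpy as np
import pandas as pd

# Made-up data: "Bola" appears twice and "Chi" has no score
df = pd.DataFrame({
    "name": ["Ada", "Bola", "Bola", "Chi"],
    "score": [80.0, 65.0, 65.0, np.nan],
})

n_duplicates = df.duplicated().sum()   # rows identical to an earlier row
n_missing = df["score"].isna().sum()   # empty cells in the score column

# The duplicate is counted twice and the missing value is skipped silently
print(df["score"].mean())

clean = df.drop_duplicates().dropna()
print(clean["score"].mean())
```

The two averages differ, which is exactly why the cleaning step matters.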
The process of converting data from the initial “raw” form into another format to prepare the data for further analysis is called data cleaning. It is also known as data pre-processing or data wrangling.
It includes operations like:
- Data Sorting: Arranging values in ascending or descending order.
- Data Filtration: Creating a subset of the available data.
- Data Reduction: Removing or replacing unwanted values.
- Data Access: Reading or writing data files.
- Data Processing: Performing aggregation, statistical and similar operations on specific values.
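The operations in the list above can be sketched with pandas roughly like this. The cities and sales figures are invented, and `-1` is a made-up placeholder for an unknown value; here the unwanted row is simply dropped, though replacing the value is another option.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Lagos", "Abuja", "Lagos", "Kano"],
    "sales": [250, -1, 400, 150],   # -1 is a hypothetical "unknown" marker
})

# Data Sorting: highest sales first
sorted_df = df.sort_values("sales", ascending=False)

# Data Filtration: only the Lagos rows
lagos = df[df["city"] == "Lagos"]

# Data Reduction: drop the row holding the unwanted value
reduced = df[df["sales"] >= 0]

# Data Processing: total sales per city
totals = reduced.groupby("city")["sales"].sum()

# Data Access: write the cleaned table out as CSV text
csv_text = reduced.to_csv(index=False)
```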
If you ever thought cleaning day-old dishes was the hardest chore ever, you should probably try cleaning a data set.
Honestly, data cleaning was a struggle the first time. I got so overwhelmed that I completely forgot how to push my work to GitHub when I needed to submit my task to my mentor. But wait till you get the hang of the process; you will love every bit of cleaning, like I eventually did.
Here is a link to the article I wrote on cleaning a data set. It contains some basic steps I used in cleaning my data, along with a GitHub link to the first data I cleaned.