The way machines learn, as we have already seen, is closely related to the way humans learn. The key element is repetition. It all begins with data and the use of this data to find patterns and make predictions. Through the repetitive process of training, our algorithm learns to modify itself and become more accurate. This process has a reason and a goal: to allow the computer to learn automatically, without human intervention.
Machine learning is a very broad field that requires a lot of time and effort to fully understand. However, by breaking the concepts into smaller parts and using universal terms, we will be able to grasp the key ideas. To begin, we define the three main types of machine learning methods: supervised learning, unsupervised learning, and reinforcement learning.
The fact of the matter is that data is changing the face of our world. From helping cure diseases to boosting a company's revenue, data is allowing our world to grow. For businesses, harnessing data can help find new customers, track social media interaction with the brand, improve customer retention, capture market trends, predict sales trends, and much more. In other words, for businesses, data is essential. However, it is not the mere acquisition or availability of data that allows businesses to profit from it. Instead, it is the way data is analyzed that gives businesses the opportunity to grow. The question, therefore, is simple: how can we analyze the data we acquire in a way that allows us to benefit the most from it?
While there are a multitude of conventional ways to analyze data, our mission is to benefit from powerful machine learning models that analyze our data. Although we will also be using other visual models that we find practical for our particular dataset, the machine learning aspect will be our main goal. After all, it is this machine learning aspect that will allow our product to be scaled in such a way that it can serve as a useful tool for a wide variety of businesses.
To begin with, let's import one of our scraped datasets and use some visual models to find useful patterns.
Use the following dropdown to select the model you want to view. We encourage you to analyze each model and find interesting patterns within our data. Each model will be accompanied by a few words to describe it and offer brief insight. Feel free to review this link, which contains the raw dataset on which the models are based. The data is very similar to the datasets available on the previous page; however, we have included some additional attributes.
While the models we used allowed us to analyze our particular dataset effectively, the reality is that in subsequent projects we may have to scrape data of different types and sizes. What models are available to you if you want to visualize your own data? To help give you insight into the wide variety of data visualization models available, please visit this link. While the data represented will not be our own, the purpose of pointing you toward them is to help you understand the different ways data can be visualized and give you an idea of what models you may find relevant for your own projects.
We have seen a few visualization models we can use to analyze our data. While this allows us to further our understanding of the value our data can provide, the true value comes from being able to make accurate predictions for future products. After all, our goal with this project is to not only serve as a tool for learning but also as a practical tool for businesses. With this in mind, we will be showing you how to use the data we have collected to train a linear regression machine learning model. We will define key terms and then walk you through the process of training such a model.
In order to properly understand the following section, we must define a few terms:
Linear regression is used to model the relationship between two variables by fitting a linear equation to observed data. One of these variables is an explanatory variable while the other is a response variable.
A response variable (also known as a dependent variable) is the focus of a question in a study or experiment: it is the variable whose variation depends on other variables. The response variable is the subject of change within an experiment, often as a result of differences in the explanatory variables. In our models, the response variable will be "Rating".
An explanatory variable (also known as an independent variable) is the variable which influences or predicts the values. It explains changes in the response variable. In our models, we will make use of various explanatory variables such as "Five_Stars", "Days_Since_First_Listed", and others to predict a response.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Pandas is the most popular Python library for data analysis. It offers data structures and operations for manipulating numerical tables and time series.
Scikit-learn is a free software machine learning library for the Python programming language. It provides various classification, regression, and clustering algorithms.
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
Feel free to come back and re-read a definition if you stumble across one of these terms and don't quite understand what it means. Now that we have defined the terms we will be working with, it is time to begin setting up our environment and train some models.
We will be using the Statsmodels library for teaching purposes. However, it is recommended you spend most of your time on the scikit-learn library, since it allows for greater machine learning functionality. Throughout this lesson we will be showing snippets of code that were run in a Jupyter Notebook. If you wish to learn more about Jupyter Notebook, click here. The following snippet shows our import of the pandas and matplotlib libraries:
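A minimal reconstruction of those imports might look like this (the Agg backend line is our addition so the code also runs outside a notebook, in environments without a display):

```python
import pandas as pd

import matplotlib
matplotlib.use("Agg")  # render figures without a display; drop this line inside Jupyter
import matplotlib.pyplot as plt
```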
Now that we have imported the necessary libraries, we will load our data file. It is a "csv" file named "shoes_all.csv" that includes around 85 data entries, each one a shoe product scraped from Amazon.
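The loading step presumably used pandas' read_csv. Since shoes_all.csv is not bundled with this page, the sketch below builds a small stand-in DataFrame with the same column names used later in the lesson, so the code is self-contained (the values are illustrative, not taken from the scraped data):

```python
import pandas as pd

# In the original notebook the real dataset is loaded like this:
# df = pd.read_csv("shoes_all.csv")

# Stand-in rows carrying the attributes used later in the lesson
# (illustrative values, not the real scraped data).
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3],
    "Five_Stars":              [80, 60, 40, 88, 52, 68],
    "Rating_Count":            [320, 120, 45, 510, 90, 200],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85],
})
print(df.shape)  # (6, 5)
```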
We will now use the matplotlib library to visualize our data in a scatterplot. We will create three graphs each with the "Rating" attribute on the y-axis. On the x-axis we will call three different attributes: "Days_Since_First_Listed", "Five_Stars", and "Rating_Count". Let's see if any of these three scatterplots show a clear correlation.
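A sketch of those three scatterplots, again on a small stand-in DataFrame rather than the real shoes_all.csv data (the headless-backend line is our addition):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the scraped shoe data (illustrative values).
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3],
    "Five_Stars":              [80, 60, 40, 88, 52, 68],
    "Rating_Count":            [320, 120, 45, 510, 90, 200],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300],
})

# Three scatterplots, each with "Rating" on the y-axis
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["Days_Since_First_Listed", "Five_Stars", "Rating_Count"]):
    ax.scatter(df[col], df["Rating"])
    ax.set_xlabel(col)
    ax.set_ylabel("Rating")
fig.tight_layout()
fig.savefig("rating_scatterplots.png")
```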
The graph with "Five_Stars" on the x-axis is the only one that shows a clear correlation out of the three. Therefore, we will focus on the relationship between "Five_Stars" and "Rating". The approach for predicting a quantitative response using a single feature or attribute is called simple linear regression. We will start with simple linear regression because it makes it easier to understand what's going on. Simple linear regression takes the following form:
y = β0 + β1x
What does each term represent?
β0 (the intercept) is the predicted value of y when x = 0, and β1 (the slope) is the change in y associated with a one-unit increase in x. Together, β0 and β1 are called the model coefficients. Through training, our model will "learn" the values of these coefficients, and once we have learned them, we will be able to predict the "Rating" attribute, which is the goal of our model.
Generally (and in our case), coefficients are estimated using the least squares criterion, which means we find the line that minimizes the sum of squared residuals. To learn more about this process, visit this link.
What do the elements in this diagram represent?
How do the model coefficients relate to the least squares line?
Here is a graphical depiction of those calculations:
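As a numerical companion to that depiction, the least squares coefficients can be computed in closed form. This NumPy sketch is our own addition, on toy points constructed to lie near a line similar to the one our model will learn; it recovers the slope and intercept that minimize the sum of squared residuals:

```python
import numpy as np

# Toy points lying near the line y = 2.6 + 0.025x (illustrative, not our real data)
x = np.array([40.0, 55.0, 70.0, 85.0])
y = np.array([3.62, 3.95, 4.37, 4.71])

# Closed-form least squares estimates:
#   beta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
#   beta0 = y_mean - beta1 * x_mean
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Residuals are the vertical distances from each point to the fitted line
residuals = y - (beta0 + beta1 * x)
print(beta0, beta1, np.sum(residuals ** 2))
```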
We will now use Statsmodels to estimate the coefficients of our model. The first line imports the Statsmodels library. The next line creates a fitted model in one line. The last line prints the coefficients:
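A self-contained sketch of those steps follows. The stand-in DataFrame is our addition (the original code operated on the loaded shoes_all.csv data); its rows are generated from Rating = 2.6 + 0.025374 × Five_Stars, so the fit recovers coefficients close to the values quoted later in the lesson:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data built from Rating = 2.6 + 0.025374 * Five_Stars
df = pd.DataFrame({
    "Five_Stars": [40, 55, 70, 85],
    "Rating": [3.61496, 3.99557, 4.37618, 4.75679],
})

# Create a fitted model in one line using the formula interface
lm = smf.ols(formula="Rating ~ Five_Stars", data=df).fit()

print(lm.params)  # Intercept ~ 2.6, Five_Stars ~ 0.025374
```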
How do we interpret the Five_Stars coefficient (β1)? A one-unit increase in "Five_Stars" is associated with a β1 increase in the predicted "Rating". In other words, the coefficient tells us how much the response moves for each unit change in the explanatory variable.
Now that we understand what the model coefficients mean and their importance in our model, we can use them to make predictions. We will first make this prediction manually and later use the Statsmodels library to confirm this prediction and automate the process. Our β0 value (the intercept) is 2.6. Our β1 value is 0.025374. Let's say our Five_Stars attribute value is 80. What would we predict the Rating value to be in this scenario?
y = 2.6 + (0.025374 * 80)
y = 4.63
The value of "Rating" would be 4.63. Let's see if this value is corroborated by our model.
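A sketch of that confirmation step, on the same stand-in data as before (rows generated from Rating = 2.6 + 0.025374 × Five_Stars, our addition, so the prediction lands near the value quoted in the text):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data built from Rating = 2.6 + 0.025374 * Five_Stars
df = pd.DataFrame({
    "Five_Stars": [40, 55, 70, 85],
    "Rating": [3.61496, 3.99557, 4.37618, 4.75679],
})
lm = smf.ols(formula="Rating ~ Five_Stars", data=df).fit()

# The formula interface expects a DataFrame, so we wrap the new value in one
X_new = pd.DataFrame({"Five_Stars": [80]})
print(lm.predict(X_new))  # ~ 4.6299
```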
We created a new DataFrame because the Statsmodels formula interface requires it. We gave the attribute "Five_Stars" a value of 80 and used the model to make a prediction on this value. The result, as you can see, is 4.629942, nearly identical to our previously calculated value of 4.63.
Now that our model has calculated the coefficients necessary to make accurate predictions based on the data we have trained it with, we can show what the least squares line looks like on a graph.
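One way to draw that line is to predict along the observed range of "Five_Stars" and plot the predictions over the scatterplot. Again this sketch uses our stand-in data rather than the real dataset, and the headless-backend line is our addition:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data built from Rating = 2.6 + 0.025374 * Five_Stars
df = pd.DataFrame({
    "Five_Stars": [40, 55, 70, 85],
    "Rating": [3.61496, 3.99557, 4.37618, 4.75679],
})
lm = smf.ols(formula="Rating ~ Five_Stars", data=df).fit()

# Predict along the observed range to trace the fitted line
grid = pd.DataFrame({"Five_Stars": np.linspace(40, 85, 50)})
fig, ax = plt.subplots()
ax.scatter(df["Five_Stars"], df["Rating"], label="data")
ax.plot(grid["Five_Stars"], lm.predict(grid), color="red", label="least squares line")
ax.set_xlabel("Five_Stars")
ax.set_ylabel("Rating")
ax.legend()
fig.savefig("least_squares_line.png")
```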
How well does our model fit the data? To find out, we can calculate the R-squared value of our model. What is R-squared? R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. It is a simple equation:
R-squared = Explained variation / Total variation
This value is always a number between 0 and 1. In general, the higher the R-squared, the better the model fits the data. Let's calculate the R-squared of our model.
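In Statsmodels, the fitted result exposes this value directly as the rsquared attribute. The sketch below uses stand-in data with a little noise (our addition) so the value falls below 1; on the real shoe data the lesson reports approximately 0.80:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data with some noise, so R-squared is below 1 (illustrative values)
df = pd.DataFrame({
    "Five_Stars": [40, 55, 70, 85, 50, 75],
    "Rating": [3.7, 3.9, 4.4, 4.7, 3.8, 4.5],
})
lm = smf.ols(formula="Rating ~ Five_Stars", data=df).fit()
print(lm.rsquared)  # the real dataset yields ~0.80
```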
An R-squared of 0.80 is rather good; our model is a success.
Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:
y = β0 + β1x1 + ... + βnxn
Each x represents a different attribute or feature, and each has its own coefficient.
In the following snippet, you will see we used Statsmodels to estimate coefficients for multiple features including: "Five_Stars", "Rating_Count", "Days_Since_First_Listed", and "Fit_As_Expected". The output of the code will show us the values of these coefficients that the model has calculated.
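A self-contained sketch of that multiple-feature fit follows; the stand-in DataFrame is our addition, so the printed coefficients will differ from the values discussed below for the real dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in rows with all four explanatory variables (illustrative values)
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3, 4.0, 4.5],
    "Five_Stars":              [80, 60, 40, 88, 52, 68, 55, 74],
    "Rating_Count":            [320, 120, 45, 510, 90, 200, 150, 280],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300, 500, 220],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85, 0.75, 0.65],
})

# One formula term per feature; Statsmodels adds the intercept automatically
lm = smf.ols(
    formula="Rating ~ Five_Stars + Rating_Count + Days_Since_First_Listed + Fit_As_Expected",
    data=df,
).fit()
print(lm.params)  # intercept plus one coefficient per feature
```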
How do we interpret these coefficients? A one-unit increase in "Five_Stars" corresponds to an increase in "Rating" of 2.453566e-02, which is equal to 0.02453566. The same logic applies to the remaining features. As we can see, "Days_Since_First_Listed" has a negative coefficient, meaning that each additional unit of "Days_Since_First_Listed" decreases the predicted "Rating" by 2.663208e-07.
From these coefficients we can deduce that the only relevant features from these four (the features that have a clear effect on "Rating") are "Five_Stars" and "Fit_As_Expected". Keep in mind that adding more features/attributes to the model can never decrease the R-squared value, even when the new features are irrelevant, so a higher R-squared alone does not prove a better model. The following snippet shows the R-squared value of our multiple linear regression model:
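A self-contained sketch of that step, refitting the multiple-feature model on our stand-in data (so the numeric value will differ from the real dataset's):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in rows with all four explanatory variables (illustrative values)
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3, 4.0, 4.5],
    "Five_Stars":              [80, 60, 40, 88, 52, 68, 55, 74],
    "Rating_Count":            [320, 120, 45, 510, 90, 200, 150, 280],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300, 500, 220],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85, 0.75, 0.65],
})
lm = smf.ols(
    formula="Rating ~ Five_Stars + Rating_Count + Days_Since_First_Listed + Fit_As_Expected",
    data=df,
).fit()
print(lm.rsquared)
```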
We will now redo some of the Statsmodels code above in scikit-learn:
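A sketch of the scikit-learn version, again on our stand-in DataFrame (the intercept and coefficients it prints will therefore differ from the values interpreted below for the real data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in rows with all four explanatory variables (illustrative values)
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3, 4.0, 4.5],
    "Five_Stars":              [80, 60, 40, 88, 52, 68, 55, 74],
    "Rating_Count":            [320, 120, 45, 510, 90, 200, 150, 280],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300, 500, 220],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85, 0.75, 0.65],
})
features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
X = df[features]
y = df["Rating"]

model = LinearRegression().fit(X, y)
print(model.intercept_)  # the intercept of the model
print(model.coef_)       # one coefficient per feature, in the order of `features`
```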
The first number in the output (2.3657) is the intercept of the model. The following numbers, [2.45356634e-02, 1.37728661e-06, -2.66320762e-07, 3.59436673e-03], each apply to one of the features in this order: "Five_Stars", "Rating_Count", "Days_Since_First_Listed", and "Fit_As_Expected".
We can use the model to predict a rating by giving it some inputs (the order of inputs will be the same as above):
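A sketch of that prediction; the setup is repeated so the snippet runs on its own, and the input row is a hypothetical product of our own invention:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data repeated here for self-containment (illustrative values)
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3, 4.0, 4.5],
    "Five_Stars":              [80, 60, 40, 88, 52, 68, 55, 74],
    "Rating_Count":            [320, 120, 45, 510, 90, 200, 150, 280],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300, 500, 220],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85, 0.75, 0.65],
})
features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
model = LinearRegression().fit(df[features], df["Rating"])

# One hypothetical product, inputs in the same feature order as above
X_new = pd.DataFrame([[80, 300, 150, 0.9]], columns=features)
print(model.predict(X_new))
```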
Finally, we show the R-squared value of our model:
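In scikit-learn, the score method of a fitted LinearRegression returns the R-squared value. A compact sketch on the same stand-in data (so the number will differ from the real dataset's):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data repeated here for self-containment (illustrative values)
df = pd.DataFrame({
    "Rating":                  [4.6, 4.1, 3.6, 4.8, 3.9, 4.3, 4.0, 4.5],
    "Five_Stars":              [80, 60, 40, 88, 52, 68, 55, 74],
    "Rating_Count":            [320, 120, 45, 510, 90, 200, 150, 280],
    "Days_Since_First_Listed": [150, 400, 900, 60, 700, 300, 500, 220],
    "Fit_As_Expected":         [0.9, 0.7, 0.8, 0.95, 0.6, 0.85, 0.75, 0.65],
})
features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
X = df[features]
y = df["Rating"]
model = LinearRegression().fit(X, y)

print(model.score(X, y))  # R-squared on the training data
```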
We hope this lesson has helped you come to a greater understanding of machine learning. The linear regression algorithm we have covered in this lesson serves as the first step if you wish to continue expanding your knowledge on the subject. Machine learning is such a large topic that it would take a myriad of lessons to completely cover. For that reason, we urge you to keep searching for ways to improve your understanding. We have provided a few useful resources on our "more info" page if you wish to continue this journey.
This concludes the section on machine learning and visual representations of data. Since the purpose of this website is primarily to inform, we wanted to provide enough information for basic understanding of the tools we used in the project. To further these concepts, we have developed a web app and encourage you to visit it once you feel comfortable with the concepts described on this page. This web app for dynamic visualizations is the focus of the next page.