The way machines learn, as we have already seen, is closely related to the way humans learn. The key element is *repetition*. It all begins with data and the use of this data to find patterns and make predictions. Through the repetitious process of training, our algorithm *learns* to modify itself and become more accurate. This process has a clear **goal**: to allow the computer to learn *automatically*, without human intervention.

Machine learning is a very broad field that requires a lot of time and effort to fully understand. However, by breaking up the concepts into smaller parts and using universal terms, we will be able to grasp the key ideas. To begin, we define the three main types of machine learning methods: **supervised learning**,
**unsupervised learning**, and **reinforcement learning.**

That concludes our overview of machine learning. In the following section we will discuss the practical applications of machine learning, specifically what models we can use to help us visualize our data and analyze it through novel perspectives.

The fact of the matter is that data is changing the face of our world. From helping cure diseases to boosting a company's revenue, data is allowing our world to grow. For businesses, harnessing data can help find new customers, track social media interaction with the brand, improve customer retention, capture market trends, predict sales trends, and much more. In other words, for businesses, **data is essential**. However, it is not the acquisition or availability of data that allows businesses to profit from it. Instead, it is the way data is *analyzed* that gives businesses the opportunity to grow. The question, therefore, is simple: how can we analyze the data we acquire in a way that allows us to benefit the most from it?

While there are a multitude of conventional ways to analyze data, our mission is to benefit from powerful machine learning models that analyze our data. Although we will also be using other visual models that we find practical for our particular dataset, the machine learning aspect will be our main goal. After all, it is this machine learning aspect that will allow our product to be scaled in such a way that it can serve as a useful tool for a wide variety of businesses.

To begin with, let's import one of our scraped datasets and use some visual models to find useful patterns.
**Use the following dropdown to select the model you want to view**. We encourage you to analyze each model and find interesting patterns within our data. Each model will be accompanied by a few words to describe it and offer brief insight. Feel free to review this link, which contains the raw dataset on which the models are based. The data is very similar to the datasets available on the previous page. However, we have included some additional attributes.

While the models we used allowed us to analyze *our* particular dataset effectively, the reality is that in subsequent projects we may have to scrape data of different **types** and/or **sizes**. What are the different models available to you if you want to visualize your own data? To help give you insight into the wide variety of data visualization models available, please visit this link. While the data represented will not be our own, the purpose of pointing you towards them is to help you understand the different ways data can be visualized and give you an idea of what models you may find relevant for your own projects.

We have seen a few visualization models we can use to analyze our data. While this allows us to further our understanding of the value our data can provide, the true value comes from being able to make accurate predictions for future products. After all, our goal with this project is for it to serve not only as a tool for learning but also as a practical tool for businesses. With this in mind, we will show you how to use the data we have collected to train a *linear regression* machine learning model. We will define key terms and then walk you through the process of training such a model.

In order to properly understand the following section, we must define a few terms:

*Linear regression* is used to model the relationship between two variables by fitting a linear equation to observed data. One of these variables is an **explanatory** variable while the other is a **response** variable.

A *response variable* (also known as a dependent variable) is the focus of a question in a study or experiment. It is that variable whose variation depends on other variables. The response variable is the subject of change within an experiment, often as a result of differences in the explanatory variables. In our models, the response variable will be "Rating".

An *explanatory variable* (also known as an independent variable) is the variable that influences or predicts the response; it explains changes in the response variable. In our models, we will make use of various explanatory variables such as "Five_Stars", "Days_Since_First_Listed", and others to predict a response.

*Matplotlib* is a comprehensive library for creating static, animated, and interactive visualizations in Python.

*Pandas* is the most popular Python library for data analysis. It offers data structures and operations for manipulating numerical tables and time series.

*Scikit-learn* is a free software machine learning library for the Python programming language. It allows for various classification, regression, and clustering algorithms.

*Statsmodels* is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

*Feel free to come back and re-read a definition if you stumble across one of these terms and don't quite understand what it means. Now that we have defined the terms we will be working with, it is time to begin setting up our environment and training some models.*

We will be using the Statsmodels library for teaching purposes. However, it is recommended you spend most of your time on the scikit-learn library, since it allows for greater machine learning functionality. Throughout this lesson we will be showing snippets of code that were run in a *Jupyter Notebook*. If you wish to learn more about Jupyter Notebook, click here.
The following snippet shows our import of the pandas and matplotlib libraries:
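The original notebook cell is not reproduced on this page; a minimal equivalent of that import cell would be:

```python
# The two libraries used throughout this lesson.
import pandas as pd               # data loading and analysis
import matplotlib.pyplot as plt   # plotting
```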

Now that we have imported the necessary libraries, we will load our data file. It is a "csv" file named "shoes_all.csv" that includes around 85 data entries, each one a shoe product scraped from **Amazon**.
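Since "shoes_all.csv" is not bundled with this page, the sketch below first writes a few *hypothetical* stand-in rows (using the attribute names from this lesson; the real file holds roughly 85 scraped shoe products) and then loads them the same way the lesson does:

```python
import pandas as pd

# Hypothetical stand-in for the real scraped file "shoes_all.csv".
with open("shoes_all.csv", "w") as f:
    f.write(
        "Rating,Five_Stars,Rating_Count,Days_Since_First_Listed,Fit_As_Expected\n"
        "4.6,80,1200,350,75\n"
        "4.3,68,450,900,60\n"
        "4.8,88,3100,120,82\n"
    )

# Load the csv into a DataFrame and inspect the first rows.
df = pd.read_csv("shoes_all.csv")
print(df.head())
```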

We will now use the matplotlib library to visualize our data in a scatterplot. We will create three graphs, each with the "Rating" attribute on the y-axis. On the x-axis we will plot three different attributes: "Days_Since_First_Listed", "Five_Stars", and "Rating_Count". Let's see if any of these three scatterplots show a clear correlation.
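The plotting cell itself is not shown here; a sketch of it, using a handful of hypothetical stand-in rows in place of the real shoes_all.csv data, might look like:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook, use plt.show() instead
import matplotlib.pyplot as plt

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating": [4.6, 4.3, 4.8, 4.1, 4.5],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600],
    "Five_Stars": [80, 68, 88, 62, 76],
    "Rating_Count": [1200, 450, 3100, 90, 800],
})

# Three scatterplots, each with "Rating" on the y-axis.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["Days_Since_First_Listed", "Five_Stars", "Rating_Count"]):
    ax.scatter(df[col], df["Rating"])
    ax.set_xlabel(col)
    ax.set_ylabel("Rating")
fig.tight_layout()
fig.savefig("rating_scatterplots.png")
```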

The graph with "Five_Stars" on the x-axis is the only one that shows a clear correlation out of the three. Therefore, we will focus on the relationship between "Five_Stars" and "Rating". The approach for predicting a quantitative response using a single feature or attribute is called *simple linear regression*. We will start with simple linear regression because it makes it easier to understand what's going on. Simple linear regression takes the following form:

y = β_{0} + β_{1}x

What does each term represent?

- y is the response
- x is the feature
- β_{0} is the intercept
- β_{1} is the coefficient for x

Together, β_{0} and β_{1} are called the **model coefficients**. Through training, our model will "learn" the values of these coefficients. And once we have learned them, we will be able to predict the "Rating" attribute, which is the goal of our model.

Generally (and in our case), coefficients are estimated using the **least squares criterion**, which means we find the mathematical line which *minimizes* the **sum of squared residuals**. To learn more about this process, visit this link.

What do the elements in this diagram represent?

- The black dots are the **observed values** of x and y.
- The green line is our **least squares line**: the prediction made by the algorithm.
- The blue lines are the **residuals**: the distances between the observed values and the least squares line.
- RSS is the *residual sum of squares*.
- y_{i} is the observed result.
- ŷ_{i} is the model prediction.

How do the model coefficients relate to the least squares line?

- β_{0} is the **intercept** (the value of y when x = 0)
- β_{1} is the **slope** (the change in y divided by the change in x)

Here is a graphical depiction of those calculations:

We will now use Statsmodels to estimate the model coefficients of our model. The first line will import the statsmodel library. The next line of code will create a fitted model in one line. The last line will print the coefficients:
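That cell is not reproduced on this page; a sketch of it, assuming a few hypothetical stand-in rows in place of the real shoes_all.csv data, would be:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating": [4.6, 4.3, 4.8, 4.1, 4.5, 4.0],
    "Five_Stars": [80, 68, 88, 62, 76, 58],
})

# Create a fitted model in one line, then print the coefficients.
model = smf.ols(formula="Rating ~ Five_Stars", data=df).fit()
print(model.params)  # Intercept (β0) and Five_Stars (β1)
```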

How do we interpret the *Five_Stars* coefficient (β_{1})?

- A "unit" increase in Five_Stars is **associated with** a 0.025374 "unit" increase in Rating.
- For example, if the Five_Stars attribute increases from 85 to 86, Rating would be expected to increase by 0.025374 as well. If Five_Stars decreases from 90 to 85, Rating would be expected to decrease by 5 × 0.025374, or 0.12687. Note that in this case the *change* in Rating is negative, while β_{1} itself remains positive.

Now that we understand what the model coefficients mean and their importance in our model, we can use the model to make predictions. We will first make this prediction manually and later use the Statsmodels library to confirm this prediction and automate the process. Our β_{0} value (the intercept) is **2.6**. Our β_{1} value is 0.025374. Let's say our Five_Stars attribute value is 80. What would we predict the Rating value to be in this scenario?

y = 2.6 + (0.025374 * 80)

y = 4.63

The value of "Rating" would be 4.63. Let's see if this value is corroborated by our model.
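The confirmation cell is not shown here; a sketch of it, again assuming hypothetical stand-in rows for the real data, might be:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating": [4.6, 4.3, 4.8, 4.1, 4.5, 4.0],
    "Five_Stars": [80, 68, 88, 62, 76, 58],
})
model = smf.ols("Rating ~ Five_Stars", data=df).fit()

# The formula interface expects a DataFrame, so wrap the new value in one.
X_new = pd.DataFrame({"Five_Stars": [80]})
pred = model.predict(X_new)  # predicted Rating for Five_Stars = 80
print(pred)
```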

We created a new DataFrame because the Statsmodels formula interface requires it. We gave the attribute "Five_Stars" a value of 80 and used the model to make a prediction on this value. The result, as you can see, is 4.629942, nearly identical to our previously calculated value of 4.63.

Now that our model has calculated the coefficients necessary to make accurate predictions based on the data we have trained it with, we can show what the least squares line looks like on a graph.
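A sketch of how that graph could be produced, using hypothetical stand-in rows in place of the real data: since a fitted simple linear regression is a straight line, predicting at the smallest and largest observed Five_Stars values and connecting the two points is enough.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook, use plt.show() instead
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating": [4.6, 4.3, 4.8, 4.1, 4.5, 4.0],
    "Five_Stars": [80, 68, 88, 62, 76, 58],
})
model = smf.ols("Rating ~ Five_Stars", data=df).fit()

# Predict at the two extremes of Five_Stars, then draw the line between them.
x_line = pd.DataFrame({"Five_Stars": [df["Five_Stars"].min(), df["Five_Stars"].max()]})
y_line = model.predict(x_line)

plt.scatter(df["Five_Stars"], df["Rating"])            # observed values
plt.plot(x_line["Five_Stars"], y_line, color="green")  # least squares line
plt.xlabel("Five_Stars")
plt.ylabel("Rating")
plt.savefig("least_squares_line.png")
```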

How well does our model fit the data? To find out, we can calculate the **R-squared** value of our model. What is R-squared? R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. It is a simple equation:

R-squared = Explained variation / Total variation

This value is always a number between 0 and 1. In general, the higher the R-squared, the better the model fits the data. Let's calculate the R-squared of our model.
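In Statsmodels, the fitted model exposes this value directly through its `rsquared` attribute; a sketch with hypothetical stand-in rows:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating": [4.6, 4.3, 4.8, 4.1, 4.5, 4.0],
    "Five_Stars": [80, 68, 88, 62, 76, 58],
})
model = smf.ols("Rating ~ Five_Stars", data=df).fit()

# R-squared: the share of Rating's variation explained by the model.
print(model.rsquared)
```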

An R-squared of 0.80 is rather good; our model is a success.

Simple linear regression can easily be extended to include multiple features. This is called **multiple linear regression**:

y = β_{0} + β_{1}x_{1} + ... + β_{n}x_{n}

Each *x* represents a different attribute or feature, and each has its own coefficient.

In the following snippet, you will see we used Statsmodels to estimate coefficients for multiple features including: "Five_Stars", "Rating_Count", "Days_Since_First_Listed", and "Fit_As_Expected". The output of the code will show us the values of these coefficients that the model has calculated.
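That cell is not reproduced here; a sketch of it, with hypothetical stand-in rows for the real data, would extend the formula with the extra features:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating":                  [4.6, 4.3, 4.8, 4.1, 4.5, 4.0, 4.4, 4.7],
    "Five_Stars":              [80, 68, 88, 62, 76, 58, 72, 84],
    "Rating_Count":            [1200, 450, 3100, 90, 800, 60, 500, 2100],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600, 2000, 760, 240],
    "Fit_As_Expected":         [75, 60, 82, 55, 70, 50, 66, 78],
})

# One coefficient per feature, plus the intercept.
formula = "Rating ~ Five_Stars + Rating_Count + Days_Since_First_Listed + Fit_As_Expected"
model = smf.ols(formula, data=df).fit()
print(model.params)
```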

How do we interpret these coefficients? A one-unit increase in "Five_Stars" is associated with an increase in "Rating" of 2.453566e-02, which is equal to 0.02453566. This same logic applies to the remaining features. As we can see, "Days_Since_First_Listed" has a negative coefficient, meaning that a one-unit increase in "Days_Since_First_Listed" *decreases* the "Rating" by 2.663208e-07.

From these coefficients we can deduce that the only relevant features (the features that have a clear effect on the "Rating") out of these four are "Five_Stars" and "Fit_As_Expected". **Keep in mind that adding more features/attributes to the model will never decrease the R-squared value on the training data, even when the extra features are irrelevant**. The following snippet shows the R-squared value of our multiple linear regression model:
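As before, the fitted model's `rsquared` attribute gives the value; a sketch with hypothetical stand-in rows:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating":                  [4.6, 4.3, 4.8, 4.1, 4.5, 4.0, 4.4, 4.7],
    "Five_Stars":              [80, 68, 88, 62, 76, 58, 72, 84],
    "Rating_Count":            [1200, 450, 3100, 90, 800, 60, 500, 2100],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600, 2000, 760, 240],
    "Fit_As_Expected":         [75, 60, 82, 55, 70, 50, 66, 78],
})
formula = "Rating ~ Five_Stars + Rating_Count + Days_Since_First_Listed + Fit_As_Expected"
model = smf.ols(formula, data=df).fit()

# R-squared of the multiple linear regression model.
print(model.rsquared)
```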

We will now redo some of the Statsmodels code above in scikit-learn:
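A sketch of the scikit-learn version, with hypothetical stand-in rows for the real data: scikit-learn takes a feature matrix `X` and a target vector `y` instead of a formula string.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating":                  [4.6, 4.3, 4.8, 4.1, 4.5, 4.0, 4.4, 4.7],
    "Five_Stars":              [80, 68, 88, 62, 76, 58, 72, 84],
    "Rating_Count":            [1200, 450, 3100, 90, 800, 60, 500, 2100],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600, 2000, 760, 240],
    "Fit_As_Expected":         [75, 60, 82, 55, 70, 50, 66, 78],
})

features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
X = df[features]
y = df["Rating"]

lr = LinearRegression().fit(X, y)
print(lr.intercept_)  # β0
print(lr.coef_)       # one coefficient per feature, in the order of `features`
```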

The first number output (2.3657) is the intercept of the model. The following numbers, [ 2.45356634e-02 1.37728661e-06 -2.66320762e-07 3.59436673e-03], each apply to one of the features, in this order: "Five_Stars", "Rating_Count", "Days_Since_First_Listed", and "Fit_As_Expected".

We can use the model to predict a rating by giving it some inputs (the order of inputs will be the same as above):
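A sketch of that prediction, again with hypothetical stand-in rows (the input values below are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating":                  [4.6, 4.3, 4.8, 4.1, 4.5, 4.0, 4.4, 4.7],
    "Five_Stars":              [80, 68, 88, 62, 76, 58, 72, 84],
    "Rating_Count":            [1200, 450, 3100, 90, 800, 60, 500, 2100],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600, 2000, 760, 240],
    "Fit_As_Expected":         [75, 60, 82, 55, 70, 50, 66, 78],
})
features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
lr = LinearRegression().fit(df[features], df["Rating"])

# One new observation, with inputs in the same order as `features`.
X_new = pd.DataFrame([[80, 1200, 350, 75]], columns=features)
pred = lr.predict(X_new)  # predicted Rating
print(pred)
```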

Finally, we show the R-squared value of our model:
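In scikit-learn, `score` on a fitted `LinearRegression` returns the R-squared; a sketch with hypothetical stand-in rows:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in rows; in the lesson, df is loaded from shoes_all.csv.
df = pd.DataFrame({
    "Rating":                  [4.6, 4.3, 4.8, 4.1, 4.5, 4.0, 4.4, 4.7],
    "Five_Stars":              [80, 68, 88, 62, 76, 58, 72, 84],
    "Rating_Count":            [1200, 450, 3100, 90, 800, 60, 500, 2100],
    "Days_Since_First_Listed": [350, 900, 120, 1400, 600, 2000, 760, 240],
    "Fit_As_Expected":         [75, 60, 82, 55, 70, 50, 66, 78],
})
features = ["Five_Stars", "Rating_Count", "Days_Since_First_Listed", "Fit_As_Expected"]
X = df[features]
y = df["Rating"]
lr = LinearRegression().fit(X, y)

# R-squared of the model on the training data.
print(lr.score(X, y))
```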

*We hope this lesson has helped you come to a greater understanding of machine learning. The linear regression algorithm we have covered in this lesson serves as the first step if you wish to continue expanding your knowledge on the subject. Machine learning is such a large topic that it would take a myriad of lessons to cover completely. For that reason, we urge you to keep searching for ways to improve your understanding. We have provided a few useful resources on our "more info" page if you wish to continue this journey.*

This concludes the section on machine learning and visual representations of data. Since the purpose of this website is primarily to *inform*, we wanted to provide enough information for basic understanding of the tools we used in the project. To further these concepts, we have developed a **web app** and encourage you to visit it once you feel comfortable with the concepts described on this page. This web app for dynamic visualizations is the focus of the next page.
