This lecture is about linear regression. The primary references are ISLR chapter 3 and the r4ds chapters referenced throughout.

Regression is about understanding the relationship between a dependent variable, y, and a bunch of explanatory variables, X. For example, consider the movies data set scraped by Mine Cetinkaya-Rundel.

# you only need to install this package if you want to recreate the 3d scatter plot below
# install.packages('plot3D')
library(tidyverse)


movies <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')

# fix a missing value!
movies[movies[, 'title' ] == 'The End of America', 'runtime'] <- 73

dim(movies)
## [1] 651  32
head(movies)
## # A tibble: 6 × 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>        <chr>       <chr>   <dbl>       <chr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <chr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <chr>, critics_score <dbl>,
## #   audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## #   best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## #   best_dir_win <chr>, top200_box <chr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>

Each of the 651 rows is a different movie. The columns include data from IMDb and Rotten Tomatoes such as imdb_rating, mpaa_rating (PG-13, R, etc), critics_score, etc.

Notation warning: there are a million synonyms for the X variables

  • X
  • explanatory variables
  • independent variables
  • predictors
  • input
  • features

and for the y variable

  • y
  • dependent variable
  • response
  • output
  • outcome

I will attempt to be consistent(ish) but may use these interchangeably. Most of the time we will deal with one y variable and multiple X variables (hence lower case for y and upper case for X). It is certainly possible to deal with multiple y variables.

What do we want to do?

Some questions we might try to answer using the movies data set:

  • Is there a relationship between imdb_rating and critics_score?
  • How strong is this relationship?
  • Is the relationship linear?
  • Is this relationship different for different genres?
  • Are Rotten Tomatoes critics scores or audience scores more predictive of IMDb scores?
  • How accurately can we estimate the effect of each score on the IMDb rating?
  • If we only know the Rotten Tomatoes information, how accurately can we predict IMDb scores (or number of votes)?

Let’s just consider one x variable, critics_score, for now. The y variable is imdb_rating. A scatter plot is the simplest way to look at the relationship between two variables.

ggplot(data=movies) +
    geom_point(aes(x=critics_score, y=imdb_rating))

What is the simplest way to model the relationship between critics_score and imdb_rating? Plop a line through the data.

# geom_smooth with method=lm adds a simple linear regression fit
ggplot(data=movies) +
    geom_point(aes(x=critics_score, y=imdb_rating)) +
    geom_smooth(aes(x=critics_score, y=imdb_rating), color='red', method=lm, se=FALSE)

Lines

For our purposes a model is a simple mathematical formula, \(f(x)\), mapping x to y (critics_score to imdb_rating). Linear regression means \(f\) is linear. With one predictor and one response variable the equation of a line is given by

\[f(x) = ax + b\] where \(a, b \in \mathbb{R}\) are real numbers. \(a\) is the slope of the line and \(b\) is the intercept. For the line above \(a = 0.029\) and \(b = 4.8\).
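As a sanity check, you can draw this exact line on the scatter plot yourself with geom_abline, using the slope and intercept quoted above (a quick sketch; the values come from the lm fit shown later).

# add the line f(x) = 0.029 x + 4.8 to the scatter plot by hand
ggplot(data=movies) +
    geom_point(aes(x=critics_score, y=imdb_rating)) +
    geom_abline(slope=0.029, intercept=4.8, color='red')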

The simplest mathematical object is a line. Approximating complicated relationships with lines is actually one of the most powerful principles in math/science/engineering and underlies many concepts from theoretical physics, to engineering, to statistics and machine learning. The premise of calculus, quoting my math professor from freshman year of college, is

Curves are hard. Lines are easy.

A linear relationship is easy to interpret: for every extra point of critics_score the model predicts an additional \(a\) increase in imdb_rating (with \(a = 0.029\), a 10 point bump in critics_score predicts an imdb_rating about 0.29 higher). So how do we pick which line, i.e. how do we select the slope \(a\) and intercept \(b\)?

Which line?

There are many different criteria one might use to select a reasonable line for a set of data. The best criterion depends on what assumptions you make. This is worth repeating: anytime you use a mathematical model you make a bunch of assumptions. Quoting the late, great David MacKay,

You can’t do inference without making assumptions. - David MacKay

This is one of those statistical mantras you should tattoo on your arm.

So far we have made one assumption: a linear model captures what we want to capture about the data. We need a few more assumptions to get to a particular linear model. There are roughly two (not mutually exclusive) ways of coming up with a procedure for fitting a model:

  • fit a statistical distribution
  • optimize some, hopefully reasonable, mathematical/geometric criterion (this is called minimizing a loss function)

Statistical modeling

If you have studied linear regression before you probably learned the following statistical model

\[y = a x + b + \epsilon\] \[\epsilon \sim N(0, \sigma^2)\] For a given \(x\) you get to \(y\) by computing \(ax + b\) then adding Gaussian noise \(\epsilon\). This model says all the \(y\) data points should lie on the line \(ax+b\), but each data point has some added, random noise.
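One way to internalize this model is to simulate from it. The sketch below generates data from \(y = ax + b + \epsilon\) with made-up values of \(a\), \(b\) and \(\sigma\) (these numbers are just for illustration) and checks that lm approximately recovers the slope and intercept.

# simulate from y = a*x + b + N(0, sigma^2) noise; a, b, sigma are made up
set.seed(343)
n <- 100
a <- 0.03
b <- 4.8
sigma <- 0.7

x <- runif(n, min=0, max=100)
y <- a * x + b + rnorm(n, mean=0, sd=sigma)

# the fitted intercept and slope should be close to b and a
coef(lm(y ~ x))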

Randomness or noise is often described as measurement error, which certainly plays a role. To me, randomness is more about a lack of information. You can’t reasonably expect to exactly predict a movie’s IMDb rating based solely on its critics score. With just this information you can certainly learn something; randomness is a way of saying “with the information I have I believe the following with some degree of uncertainty.” Statistical modeling is an exercise in both humility and optimism: I know I can’t be perfect, but how well can I do?

Understanding the statistical perspective on modeling is important. You will learn about it in a class like STOR 455. See chapter 3 from ISLR for more details.

Optimization

An alternative perspective on modeling is the optimization perspective. To me this perspective is easier to understand and underemphasized in statistics departments. Pure optimization perspectives are not a priori better or worse than pure statistical perspectives; they are just (usually) different.

Returning to simple linear regression (simple means one x variable), let’s come up with a way of measuring how well a line fits our data. Here are a bunch of potential lines

We want a way of measuring how well a line fits the data. Equivalently (since statisticians are pessimists), we want a way of measuring how poorly a line fits the data. We are looking for a loss function (also see chapter 2 of ISLR).

The residuals are the vertical distances from the data points to the line (red lines below).

I’m going to introduce a bit of notation. Call the data \(x_1, \dots, x_n, y_1, \dots, y_n\) where \(n = 651\), i.e. \(x_1 = 45\) and \(y_1 = 5.5\) for the first movie. Suppose we have set the \(a, b\) parameters of the model (e.g. \(a=.04\) and \(b=7\)). Then the \(i\)th residual is \(r_i = y_i - (ax_i + b)\).

A reasonable loss function to choose is the sum of the absolute values of the residuals (why is there an absolute value?) i.e.

\[ L(a, b| x_1^n, y_1^n) = \sum_{i=1}^n |r_i| = \sum_{i=1}^n |y_i - (ax_i + b)|\] where \(L\) is the loss function. The notation \(a, b| x_1^n, y_1^n\) means \(a, b\) are the parameters we want to set and we are given (i.e. have already observed) the values of \(x_1, \dots, x_n\) and \(y_1, \dots, y_n\). If this math formula doesn’t look appealing just think about the geometric intuition.

We want a line that is as close as possible to each data point. We are measuring “close” by the vertical distance (i.e. the absolute value of the residual). The absolute value is one reasonable choice, but why not square the residuals, or take any other power? i.e.

\[ L(a, b| x_1^n, y_1^n) = \sum_{i=1}^n |r_i|^2\] or \[ L(a, b| x_1^n, y_1^n) = \sum_{i=1}^n |r_i|^{395}\] Any of these choices gives a (somewhat) reasonable loss function. The larger the exponent the more the loss function cares about outliers, points far away from the line, so too large an exponent means the loss function may be too sensitive to outliers.
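To make this concrete, here is a small sketch that evaluates the absolute and squared loss for one candidate line on the movies data (the values \(a=.04\) and \(b=7\) are the arbitrary example parameters from above, not the fitted ones).

# a candidate line (not the fitted one)
a <- 0.04
b <- 7

# residuals r_i = y_i - (a * x_i + b)
r <- movies$imdb_rating - (a * movies$critics_score + b)

# absolute loss and squared loss for this candidate line
sum(abs(r))
sum(r^2)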

The most common choice is the squared loss, i.e. \(r_i^2\). The squared loss has many, many nice properties you can read about in ISLR chapter 3. One property of the squared loss is that it turns out to give the same model (values of \(a\) and \(b\)) as the Gaussian noise model above!

Almost every time someone is talking about a linear model they are using the squared loss function.

Fitting a simple linear model in R

Use the lm function to fit a linear model in R

# see below for note about the ~ notation
lin_reg <- lm(imdb_rating ~ critics_score, movies)
summary(lin_reg)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93679 -0.39499  0.04512  0.43875  2.47556 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.8075715  0.0620690   77.45   <2e-16 ***
## critics_score 0.0292177  0.0009654   30.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6991 on 649 degrees of freedom
## Multiple R-squared:  0.5853, Adjusted R-squared:  0.5846 
## F-statistic: 915.9 on 1 and 649 DF,  p-value: < 2.2e-16

Analytically compute the simple linear regression fit

Most algorithms in Machine Learning require some kind of numerical method such as gradient descent to fit. A lot of machine learning research involves developing new methods to fit existing models or developing new models that can be fit with existing methods. This is where computer science and optimization become critical to machine learning.
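As a toy example of the numerical route, you can hand the squared loss for simple linear regression directly to a general purpose optimizer like optim and check that it lands near the same slope and intercept as lm. This is only a sketch of the idea; it is not how lm is actually implemented.

# squared loss as a function of the parameters par = c(a, b)
sq_loss <- function(par, x, y){
    sum((y - (par[1] * x + par[2]))^2)
}

# minimize the loss numerically, starting from a = 0, b = 0
opt <- optim(par=c(0, 0), fn=sq_loss,
             x=movies$critics_score, y=movies$imdb_rating)
opt$par  # roughly (slope, intercept)

# compare with lm (which reports the intercept first)
coef(lm(imdb_rating ~ critics_score, movies))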

In some cases we can find a closed form solution for a model (aka an analytic solution). For linear regression this means finding the values of \(a, b\) that minimize \(L(a, b| x_1^n, y_1^n)\) given above. This optimization problem is an exercise in freshman year calculus, i.e. compute the derivatives, set them to zero and solve for \(a^*, b^*\). Recall the \(x_i, y_i\) are given numbers.

\[L(a, b| x_1^n, y_1^n) = \sum_{i=1}^n |r_i|^2\] \[ = \sum_{i=1}^n (y_i - a x_i - b )^2\] Taking derivatives, \[ \frac{dL(a, b)}{da} = -2 \sum_{i=1}^n x_i (y_i - a x_i - b )\] \[ \frac{dL(a, b)}{db} = -2 \sum_{i=1}^n (y_i - a x_i - b )\] Now set these two equations equal to zero

\[\sum_{i=1}^n x_i (y_i - a x_i - b ) = 0\] \[ \sum_{i=1}^n (y_i - a x_i - b ) = 0\] and do a little algebra to find

\[a^* = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^n (x_i-\overline{x})^2}\] \[ b^* = \overline{y} - \overline{x} a^*\] where \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\) and \(\overline{y} = \frac{1}{n}\sum_{i=1}^n y_i\) are the means.
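You can check these formulas directly in R; the results should agree with the coefficients reported by lm above (a minimal sketch).

x <- movies$critics_score
y <- movies$imdb_rating

# closed form solution for the slope and intercept
a_star <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b_star <- mean(y) - mean(x) * a_star

c(a_star, b_star)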

More than one X variable

The formula imdb_rating ~ critics_score + audience_score in the code below says to regress the response variable imdb_rating on critics_score and audience_score. This is called Wilkinson-Rogers notation, or just formula notation, a useful mini-language for writing models in R.

lin_reg <- lm(imdb_rating ~ critics_score + audience_score, movies)
summary(lin_reg)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51964 -0.19767  0.03466  0.30671  1.22691 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.647241   0.062471   58.38   <2e-16 ***
## critics_score  0.011816   0.000954   12.39   <2e-16 ***
## audience_score 0.034703   0.001340   25.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4904 on 648 degrees of freedom
## Multiple R-squared:  0.7962, Adjusted R-squared:  0.7956 
## F-statistic:  1266 on 2 and 648 DF,  p-value: < 2.2e-16
movies %>% 
    select(imdb_rating, critics_score, audience_score)
## # A tibble: 651 × 3
##    imdb_rating critics_score audience_score
##          <dbl>         <dbl>          <dbl>
## 1          5.5            45             73
## 2          7.3            96             81
## 3          7.6            91             91
## 4          7.2            80             76
## 5          5.1            33             27
## 6          7.8            91             86
## 7          7.2            57             76
## 8          5.5            17             47
## 9          7.5            90             89
## 10         6.6            83             66
## # ... with 641 more rows

The lm object computes a lot of useful statistics such as p-values for each variable. Linear regression is very important and you should understand most of the statistics output by the summary function; many inferential questions can be answered with these statistics.
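For example, you can pull the coefficient table (estimates, standard errors, t statistics and p-values) out of the summary object, and get confidence intervals for the coefficients with confint (a quick sketch using the fit above).

# coefficient table: estimate, std. error, t value, p-value
summary(lin_reg)$coefficients

# 95 percent confidence intervals for the coefficients
confint(lin_reg)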

Predictions

Say we want to get the model’s prediction for the IMDb rating of a movie with a Rotten Tomatoes critics score of 80 and an audience score of 80. You can do this manually,

# new points
critics_score_new <- 80
audience_score_new <- 80

# get the model coefficients
beta <- lin_reg$coefficients
beta
##    (Intercept)  critics_score audience_score 
##     3.64724065     0.01181605     0.03470355
# manually compute the prediction, first term is the intercept
imdb_rating_pred <- beta[1] + beta[2] * critics_score_new + beta[3] * audience_score_new
imdb_rating_pred
## (Intercept) 
##    7.368808

Usually you will use the predict function. First you create a new data frame with the points at which you want to predict. This new data frame should have the same column names as the original data frame, but only include the x variables

# column names should be the same as the original data frame used to train the model
new_data <- tibble(critics_score = critics_score_new, 
                   audience_score=audience_score_new)

predict(lin_reg, newdata=new_data)
##        1 
## 7.368808

The modelr package has some functions that automate a lot of prediction tasks (see r4ds chapter 23).
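For instance, modelr’s add_predictions attaches a column of model predictions to a data frame (a minimal sketch, assuming the modelr package is installed).

library(modelr)

# adds a column called pred with the model's predictions
new_data %>% 
    add_predictions(lin_reg)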

Visual diagnostics

Understanding what’s going on with a linear model is an important skill that you learn a lot about in a class like STOR 455. There are lots of numerical summaries that help you understand the model, but visual summaries can be very helpful. Unfortunately, most visual summaries are restricted to 2 dimensions.

Warning: most of the time 2 dimensional plots, maybe with a color or shape, are the most informative. While you can add lots of aesthetics or make 3d plots (e.g. below), you often don’t get much value over several well considered 2d plots.

A lot of helpful plots come from comparing the residuals to other quantities, such as the fitted values (the model’s predictions).

diagnostics <- tibble(predictions = lin_reg$fitted.values,
                      residuals = lin_reg$residuals)

ggplot(diagnostics) +
    geom_point(aes(x=predictions, y=residuals))

Geometry of linear regression

Simple linear regression puts a line through the data. When there is more than one x variable, linear regression puts a hyperplane through the data. When we have two x variables the hyperplane is an ordinary plane in 2 + 1 = 3 dimensions.

When there are p x variables and one y variable, linear regression gives a p dimensional hyperplane in p + 1 dimensional space. If that statement makes your head hurt then you’re in good company. Warning: reasoning about high dimensional objects can be very challenging. Typically the best place to start is to use 2 and 3 dimensional examples to get intuition about the higher dimensional analogue. This is often sufficient to understand what you need (such as higher dimensional hyperplanes), but the analogies can break down (high dimensional space is a weird place). An important skill in math is being able to suspend your disbelief in the right way.

When you have 3 or more predictors you obviously can’t plot the full data (unless you are Bill Thurston and your 4d intuition is on point). You can, however, make lots of 2d plots (see below).
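For example, two well considered 2d views of the movies data might map a third variable to color or facet on a categorical variable (a sketch; the particular variables here are just illustrative).

# imdb_rating vs critics_score with audience_score mapped to color
ggplot(movies) +
    geom_point(aes(x=critics_score, y=imdb_rating, color=audience_score))

# the same scatter plot, faceted by genre
ggplot(movies) +
    geom_point(aes(x=critics_score, y=imdb_rating)) +
    facet_wrap(~genre)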

Categorical variables: factors

Linear regression operates on numerical variables, but a lot of data is not numerical. For example, genre is a categorical variable (e.g. Drama, Comedy). mpaa_rating is an ordinal variable meaning it has a natural order ( G < PG < PG-13 < R < NC-17 < Unrated). We will focus on categorical variables.

From a math point of view, the trick for non-numerical variables is to turn them into numbers somehow (usually by using dummy variables). From a programming point of view, R has a nice(ish) way of naturally dealing with categorical data: factors.

There is even an R package that makes dealing with factors easier

library(forcats)

From r4ds chapter 15, “In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.”

Factors can handle both unordered and ordered categorical variables. For more details read r4ds chapter 15.

You can create a factor variable

fact <- factor(c('a', 'a','d', 'b', 'c', 'c', 'b'))
fact
## [1] a a d b c c b
## Levels: a b c d

Notice the Levels printed out with the factor object: these are the categories. Levels are ordered implicitly (usually alphabetically). You can change the ordering, i.e.

# passing levels= sets the order of the categories explicitly
factor_rating <- factor(movies$mpaa_rating,
                        levels = c("G", "PG", "PG-13", "R", "NC-17", "Unrated"))
levels(factor_rating)
## [1] "G"       "PG"      "PG-13"   "R"       "NC-17"   "Unrated"

Some functions will automatically treat string variables in a data frame as a factor variable. However, in general you should tell the data frame that a variable is a factor

# notice the <fctr> data type
movies %>% 
    mutate(mpaa_rating=factor(mpaa_rating)) %>% 
    select(mpaa_rating)
## # A tibble: 651 × 1
##    mpaa_rating
##         <fctr>
## 1            R
## 2        PG-13
## 3            R
## 4           PG
## 5            R
## 6      Unrated
## 7        PG-13
## 8            R
## 9      Unrated
## 10     Unrated
## # ... with 641 more rows

Let’s create a new data frame called data with just a few columns we’re interested in and specify the factor variables

data <- movies %>% 
        select(imdb_rating, imdb_num_votes,
               critics_score, audience_score,
               runtime, genre, mpaa_rating,
               best_pic_win) %>% 
        mutate(genre=factor(genre),
               mpaa_rating=factor(mpaa_rating), 
               best_pic_win=factor(best_pic_win))

and now fit a linear model

# imdb_rating ~ . means regress imdb_rating on everything else
lin_reg <- lm(imdb_rating ~. , data)
summary(lin_reg)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42000 -0.17686  0.02987  0.25947  1.13153 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.425e+00  1.688e-01  20.283  < 2e-16 ***
## imdb_num_votes                  8.336e-07  1.958e-07   4.258 2.37e-05 ***
## critics_score                   1.028e-02  9.426e-04  10.906  < 2e-16 ***
## audience_score                  3.228e-02  1.352e-03  23.887  < 2e-16 ***
## runtime                         4.178e-03  1.071e-03   3.902 0.000106 ***
## genreAnimation                 -4.192e-01  1.794e-01  -2.337 0.019777 *  
## genreArt House & International  2.822e-01  1.399e-01   2.018 0.044056 *  
## genreComedy                    -1.137e-01  7.683e-02  -1.480 0.139392    
## genreDocumentary                4.044e-01  1.065e-01   3.796 0.000161 ***
## genreDrama                      1.005e-01  6.737e-02   1.492 0.136138    
## genreHorror                     1.030e-01  1.153e-01   0.893 0.372004    
## genreMusical & Performing Arts  1.274e-01  1.500e-01   0.849 0.396097    
## genreMystery & Suspense         2.617e-01  8.572e-02   3.053 0.002363 ** 
## genreOther                     -5.107e-02  1.301e-01  -0.392 0.694886    
## genreScience Fiction & Fantasy -2.128e-01  1.641e-01  -1.297 0.195164    
## mpaa_ratingNC-17               -1.561e-01  3.486e-01  -0.448 0.654486    
## mpaa_ratingPG                  -1.484e-01  1.269e-01  -1.170 0.242572    
## mpaa_ratingPG-13               -1.394e-01  1.309e-01  -1.065 0.287236    
## mpaa_ratingR                   -9.380e-02  1.259e-01  -0.745 0.456608    
## mpaa_ratingUnrated             -1.735e-01  1.436e-01  -1.209 0.227282    
## best_pic_winyes                -1.066e-01  1.870e-01  -0.570 0.568781    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4598 on 630 degrees of freedom
## Multiple R-squared:  0.8258, Adjusted R-squared:  0.8203 
## F-statistic: 149.4 on 20 and 630 DF,  p-value: < 2.2e-16

You’ll notice R introduced a bunch of new variables (called dummy variables) such as mpaa_ratingPG, mpaa_ratingNC-17, best_pic_winyes, etc. Notice, however, that there is no best_pic_winno (see the warning below).

Dummy variables: replace a categorical variable, \(x\), that has \(K\) categories with \(K-1\) new indicator variables \(d_1, \dots, d_{K-1}\). For a given observation, the indicator \(d_k\) is 1 if \(x\) is in the \(k\)th category and 0 otherwise.

For example, best_pic_win has two categories so we introduce one new dummy variable best_pic_winyes. For mpaa_rating there are 6 categories so we introduce 5 new variables.

Warning: for linear regression one commonly introduces \(K-1\) dummy variables instead of \(K\); it turns out that the remaining category gets absorbed by the intercept term.

The upshot is that once we have replaced categorical variables with dummy variables our data matrix is only numbers!
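You can see the numeric matrix R actually builds with model.matrix, which applies the same dummy coding that lm uses under the hood (a quick sketch using the data frame defined above).

# the design matrix lm works with: intercept, numeric columns, dummy variables
X <- model.matrix(imdb_rating ~ ., data)
head(X[, 1:6])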

Non-linear models

Many relationships are not linear. For example,

ggplot(data) +
    geom_point(aes(x=imdb_num_votes, y=imdb_rating))

In general fitting a non-linear model is challenging, but there are two ways of using a linear model to make a non-linear model. The first is through a data transformation, i.e. instead of the number of votes maybe we use \(\sqrt{\text{number of votes}}\)

# ggplot can automatically plot the linear regression line
ggplot(data) +
    geom_point(aes(x=sqrt(imdb_num_votes), y=imdb_rating)) +
    geom_smooth(aes(x=sqrt(imdb_num_votes), y=imdb_rating), color='red', method=lm, se=FALSE)

A related trick is to add a bunch of transformed variables into the X data frame. For example, for a variable \(x\) we could add all of \(\sqrt{x}, x^2, x^3, \log(x)\) to the data frame; we now have 4 additional variables in the data frame.

data_trans <- data %>% 
                mutate(nv_sqrt = sqrt(imdb_num_votes),
                       nv_sq = imdb_num_votes^2,
                       nv_cube = imdb_num_votes^3,
                       nv_log = log(imdb_num_votes))

We can now fit the linear regression model

lin_reg_trans <- lm(imdb_rating ~., data_trans)

summary(lin_reg_trans)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = data_trans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48236 -0.18993  0.03556  0.24993  1.18680 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.285e+00  5.715e-01   5.748 1.41e-08 ***
## imdb_num_votes                 -2.048e-06  5.748e-06  -0.356 0.721692    
## critics_score                   1.033e-02  9.427e-04  10.956  < 2e-16 ***
## audience_score                  3.158e-02  1.380e-03  22.889  < 2e-16 ***
## runtime                         4.054e-03  1.077e-03   3.766 0.000182 ***
## genreAnimation                 -4.318e-01  1.792e-01  -2.409 0.016263 *  
## genreArt House & International  3.384e-01  1.420e-01   2.384 0.017431 *  
## genreComedy                    -9.155e-02  7.760e-02  -1.180 0.238561    
## genreDocumentary                4.965e-01  1.144e-01   4.340 1.66e-05 ***
## genreDrama                      1.342e-01  6.883e-02   1.950 0.051595 .  
## genreHorror                     1.147e-01  1.155e-01   0.993 0.321118    
## genreMusical & Performing Arts  1.828e-01  1.524e-01   1.199 0.230920    
## genreMystery & Suspense         2.753e-01  8.609e-02   3.198 0.001455 ** 
## genreOther                     -5.774e-04  1.324e-01  -0.004 0.996522    
## genreScience Fiction & Fantasy -1.893e-01  1.645e-01  -1.151 0.250198    
## mpaa_ratingNC-17               -1.413e-01  3.484e-01  -0.405 0.685316    
## mpaa_ratingPG                  -1.558e-01  1.267e-01  -1.230 0.219066    
## mpaa_ratingPG-13               -1.698e-01  1.313e-01  -1.293 0.196461    
## mpaa_ratingR                   -1.072e-01  1.258e-01  -0.852 0.394534    
## mpaa_ratingUnrated             -1.573e-01  1.443e-01  -1.090 0.276323    
## best_pic_winyes                -8.826e-02  1.887e-01  -0.468 0.640119    
## nv_sqrt                         1.378e-03  3.000e-03   0.459 0.646099    
## nv_sq                           2.944e-12  8.569e-12   0.344 0.731251    
## nv_cube                        -1.566e-18  5.885e-18  -0.266 0.790243    
## nv_log                          6.081e-03  8.751e-02   0.069 0.944619    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4589 on 626 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.821 
## F-statistic: 125.3 on 24 and 626 DF,  p-value: < 2.2e-16

Now the resulting model is not linear in imdb_num_votes (also looks terrible!)

pred_df <- tibble(imdb_rating_pred = unname(predict(lin_reg_trans)),
                  imdb_num_votes=data_trans$imdb_num_votes,
                  imdb_rating=data_trans$imdb_rating)

ggplot(pred_df) +
    geom_point(aes(x=imdb_num_votes, y=imdb_rating)) +
    geom_line(aes(x=imdb_num_votes, y=imdb_rating_pred), color='red')