HW 3

George Han
03/27/13
Regression and Multivariate Data Analysis STAT-UB 17
Homework 3
Professor Simonoff
The Popularity of YouTube Videos
YouTube, a video sharing website, has gained massive popularity since its launch in
2005. Anyone with a free YouTube account can upload videos to the internet. These videos
range from amateur to professional, from videos of sports to videos of music. Some even boast
hundreds of millions of views, and sometimes even billions. What makes certain YouTube videos
so popular? Are there any specific factors that are closely associated with video popularity?
This report will use statistical methods to test for associations between video popularity
and several potential predicting factors. It is interesting to look for these factors because insights
may be revealed into the mechanics of internet video popularity. The response variable will be
YouTube video popularity, using the number of video views as a numerical unit of measurement.
Potential predictors are:
Video likeability. Most YouTube videos feature an easily accessible rating mechanism in
the form of a like button and a dislike button. Only those with YouTube accounts can
like or dislike videos, and even so, every YouTube account can only register one like or
one dislike for every video. A rating of a like or a dislike is not permanent; raters can with
a single click easily switch likes to dislikes, dislikes to likes, and can even undo ratings.
In this report, video likeability will be quantified as the ratio of the number of video likes
to the number of video dislikes. This ratio will be used instead of the ratio of the number
of video likes to the number of video views because the latter ratio does not distinguish
between neutral viewers who neither like nor dislike videos, and because dislike is a
fundamental component of likeability.
Number of subscribers to the uploader. Like the like/dislike feature, most YouTube videos
feature another easily accessible mechanism in the form of a subscribe button.
Similarly, only those with YouTube accounts can subscribe to uploaders, and every
YouTube account can only subscribe to a particular uploader once. Once subscribed, the
user will receive electronic notifications if the corresponding uploader uploads a new
video. Users can unsubscribe at any time. Some uploaders have only a few subscribers,
whereas others have millions.
Number of videos uploaded by uploader. While some uploaders only have uploaded a
few videos, some have uploaded hundreds.
Number of video comments. Most YouTube videos have a comment feature which allows
those with YouTube accounts to post comments directly onto the web site hosting the
video. Users can post new comments, respond to previous comments with a reply
feature, and can even rate individual comments with another like/dislike feature focused
on comments. Some videos have only a few comments, while others may have millions.
If uploaders want, they can disable the comment feature on their uploaded videos. This is
typically done if the video of interest is controversial (e.g. a video pertaining to gay
rights) and thus would garner comments in the form of cyber-arguments, or if the video is
exceedingly bad, in order to block hate-comments.
Running length of the video measured in seconds. Most YouTube uploaders are allowed
by YouTube to upload videos of maximum length 15 minutes. However, the longest video
in this data set is only 6 minutes and 16 seconds. Some users have uploaded videos of
only a few seconds (sometimes even only one second), and some users have uploaded
videos with running times as long as many weeks.
Length of the video's title. All YouTube videos have titles. In this report, title length will
be quantified as the count of all characters in the video's title as visible on the video's
web page, including all spaces, special characters, and foreign characters.
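The two derived predictors defined above, the like/dislike ratio and the title length, can be sketched in Python as follows. The function names and the example counts are hypothetical and are not taken from the data set; the only things carried over from the report are that likeability is likes divided by dislikes and that title length counts every character, spaces included.

```python
def like_dislike_ratio(likes: int, dislikes: int) -> float:
    """Likeability as defined in this report: likes divided by dislikes."""
    if dislikes == 0:
        # Assumption: the report does not say how a video with zero dislikes
        # would be handled, so this sketch refuses to guess.
        raise ValueError("ratio is undefined when there are no dislikes")
    return likes / dislikes


def title_length(title: str) -> int:
    """Count every character in the title, including spaces and symbols."""
    return len(title)


# Hypothetical counts and title, for illustration only.
ratio = like_dislike_ratio(likes=1200, dislikes=400)
letters = title_length("Example Video - Official")
print(ratio, letters)
```

A ratio above 1 means more likes than dislikes; a ratio below 1 means the opposite.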
Since these values are constantly changing because the viewing of YouTube videos is an
ongoing process, it is necessary to clarify that all analyses in this report are based on data values
collected on date 03/11/2013 starting at approximately 12:00PM noon. The data for this report
consists of video statistics for the Top 30 most popular videos on YouTube of all time based on
the number of video views, which were found on YouTube itself, at
http://www.youtube.com/charts/videos_views?t=a as well as on the individual YouTube web
pages for each video*. 28 of these 30 videos are music videos. Therefore, specifically, this report
tries to answer the following question: for YouTube videos ranked in the list Top 30 most popular
videos on YouTube of all time based on the number of video views, are there any statistically
significant associations between video views and the six predictors listed above?
Although it is unclear exactly how YouTube's view count algorithm counts views
for individual YouTube videos, YouTube states on its web page that "a view is counted whenever
someone watches a video on YouTube" and that "we do not get more specific than this to avoid
attempts at artificially inflating view counts." Thus, it is widely accepted that measures are taken
to prevent bias of the statistic through practices such as repeated web page refreshing or visiting,
use of spamming techniques, use of malware or other illegal software, etc.
It is expected that greater popularity should be associated with:
Higher video likeability because it is expected that more users should desire to view
videos that are appealing and because users will desire to view appealing videos more
than once. However, there is an exception: in the rare case that a video is so
exceedingly bad (very low like/dislike ratio) that viewers can barely believe what they
are seeing, the video may be virally sensationalized as a figure of ridicule, resulting in a
large number of video views.
Greater numbers of subscribers to the uploader because uploaders generally tend to
upload videos of similar type, e.g. uploaders of music videos are expected to upload more
music related content. Therefore, it is expected that viewers who enjoy a particular video
might enjoy to similar degrees the other videos of the video's uploader, potentially
leading to subscriptions so that they will receive notifications if that uploader uploads
additional videos; also, more uploader subscribers means that when the uploader uploads a
new video, more people will be aware of that video's uploading, potentially resulting in
more views.
Greater numbers of videos uploaded by uploader because those who upload more can be
expected to be generally more experienced in video creation and thus able to produce
videos that are more appealing and therefore more popular.
Greater number of comments on the video's YouTube page because if a video has traits
that should make it popular, those traits may also potentially incline viewers to contribute
by publicizing their opinions and/or thoughts on the video.
Greater running length, because it might be expected that videos at the Top 30 level are
obviously popular, and therefore the longer videos should tend to have more content as
well as more work put into them, resulting in more popularity. In this report, this will be
quantified as the number of seconds which the video runs for. However, this speculation
is much less substantial than the previous ones.
Greater number of letters in the title, because videos that have potential to be very
popular should be expected to be named more professionally with longer titles that may
include a title supplemented by additional information such as contributing artists or
names of notable people who may appear in the video; however, this speculation is also
much less substantial than the previous ones.
[Figure: six scatterplots of Views against each predictor: Views vs. LD, Views vs. Subscribers, Views vs. Videos, Views vs. Comments, Views vs. Length, Views vs. Letters.]
At first glance, there does not appear to be much going on in any of the six above
scatterplots. There do appear to be some extreme values present in nearly all six, however. These
will be addressed and dealt with shortly.
The least squares multiple regression line is:

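$$\text{Views} = 3.292 \times 10^{8} - 6.262 \times 10^{5} \cdot \text{LD} - 1.962 \times 10^{1} \cdot \text{Subscribers} - 2.484 \times 10^{5} \cdot \text{Videos} + 9.576 \times 10^{1} \cdot \text{Comments} + 1.476 \times 10^{5} \cdot \text{Length} + 1.236 \times 10^{6} \cdot \text{Letters}$$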
The intercept coefficient is useless to interpret because it is nonsensical in the context of
the data: a video cannot exist, let alone be viewed, if all predictors have value 0 because a video
cannot be 0 seconds long, cannot have 0 characters in the title, and cannot exist without its
uploader having at least 1 upload because that video counts as an upload.
Holding all other predictors constant:
An increase in the like/dislike ratio of 1 (this ratio can increase either as a result of a
greater increase in new likes than in new dislikes or an undoing of previously registered
dislikes) is associated with an expected estimated decrease in views of 6.262 × 10^5, or
approximately 626,200.
An increase in the number of subscribers to the uploader of 1 is associated with an
expected estimated decrease in views of 1.692 × 10^1, or approximately 16.92.
An increase in the number of videos uploaded by the uploader of 1 is associated with an
expected estimated decrease in views of 2.484 × 10^5, or approximately 248,400.
An increase in the number of video comments of 1 is associated with an expected
estimated increase in views of 9.576 × 10^1, or approximately 95.76.
An increase in the length of the video of 1 second is associated with an expected estimated
increase in views of 1.476 × 10^5, or approximately 147,600.
An increase in the number of characters in the video's title of 1 is associated with an
expected estimated increase in views of 1.236 × 10^6, or approximately 1,236,000.
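The "holding all other predictors constant" reading can be illustrated with a short Python sketch that plugs values into the fitted line. The coefficients are copied from the regression line as printed in this report, with the signs implied by the interpretations; the predictor values for the example video are hypothetical and do not come from the data set.

```python
# Coefficients as printed in the fitted regression line of this report.
COEFFICIENTS = {
    "Intercept": 3.292e8,
    "LD": -6.262e5,
    "Subscribers": -1.962e1,
    "Videos": -2.484e5,
    "Comments": 9.576e1,
    "Length": 1.476e5,
    "Letters": 1.236e6,
}


def predict_views(**predictors: float) -> float:
    """Evaluate the fitted line for one video's predictor values."""
    total = COEFFICIENTS["Intercept"]
    for name, value in predictors.items():
        total += COEFFICIENTS[name] * value
    return total


# Hypothetical video, for illustration only.
base = dict(LD=10.0, Subscribers=1_000_000, Videos=50,
            Comments=500_000, Length=240, Letters=40)

# Add one comment and leave every other predictor unchanged.
one_more_comment = dict(base, Comments=base["Comments"] + 1)
effect = predict_views(**one_more_comment) - predict_views(**base)
print(round(effect, 2))
```

The difference between the two predictions is the Comments coefficient, whatever values the other predictors are held at.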
The regression is of moderate strength with R^2 = 0.5893, meaning that 58.93% of the
variability in the target variable can be accounted for by the predicting variables. The residual
standard error is 162,600,000, representing the standard deviation of points formed around the
regression line. Although this value may seem alarmingly large, it is actually reasonable because
values for the target variable range from the low hundred millions to the low billions.
Roughly 68% of actual target values should be within +/- 162,600,000 of the predicted target values, 95%
within +/- 2 * 162,600,000, and 99.7% within +/- 3 * 162,600,000. The p-value of the
t-statistic for each regression coefficient is the probability of observing a t-statistic at least
as extreme as the one obtained if the true coefficient were 0. The p-values correspond to the
following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with views,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with views, holding all other predictors equal)
In this report, regression coefficients will be considered statistically significant if their
p-values are less than 0.05, the α = 0.05 level of significance. Unfortunately, none of them are,
except for the intercept (barely, at p-value = 0.0449, but useless to interpret because the
intercept is not meaningful in the context of the data as previously mentioned) and the number of
video comments (p-value = 2.24 × 10^-5); only for these can we reject the null hypothesis in favor
of the alternate hypothesis. This means
that the only predictor that appears to have an association with views is the number of comments.
This probably means that the regression is poorly specified. The p-value of the F-statistic, 0.001177,
however, is indeed significant at α = 0.05, meaning that the model as a whole is statistically
significant. It corresponds to the following hypothesis test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis in favor of the alternate hypothesis, meaning that the
model as a whole has some predictive capability.
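As a rough consistency check on the overall F-test, the F-statistic can be recovered from R^2 with the standard identity F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The sketch below is in Python; the report does not print the F-statistic itself, so the value computed here is an illustration derived from the reported R^2, n = 30 observations, and p = 6 predictors, not output from the original analysis.

```python
def f_statistic(r_squared: float, n: int, p: int) -> float:
    """Overall F-statistic of a regression with n observations and p predictors."""
    explained = r_squared / p                       # explained share per predictor
    unexplained = (1.0 - r_squared) / (n - p - 1)   # unexplained share per residual df
    return explained / unexplained


# Values reported for the full model on all 30 videos.
f_full = f_statistic(r_squared=0.5893, n=30, p=6)
print(round(f_full, 2))
```

The result is compared against an F distribution with 6 and 23 degrees of freedom to obtain the p-value of the overall test.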
Standardized residual plots:
[Figure: four panels: Normal Probability Plot of Residuals (Standardized Residuals vs. Normal Scores), Residuals vs. Fitted Values, Histogram of the Residuals (Frequency), Residuals vs. Order of the Data.]
The normal probability plot implies that the data are fairly normally distributed, but have
a fat right tail as well as some outliers, which may prove problematic. The residuals vs. fitted
values plot implies that the data exhibit some heteroscedasticity, as evident in the outward
fanning as the fitted values increase, and it also displays some outliers. The histogram is
approximately normally distributed, but displays some outliers too. The residuals vs. the order of
the data plot is well behaved, but like the others, displays some outliers. Overall, the various outliers
clearly need to be addressed. Also, the heteroscedasticity in the data may mean that estimates of
regression coefficients are less accurate, and that measures of predictive accuracy are misleading.
The other assumptions for linear regression hold up: the data appear to be normally distributed,
and the errors do not appear to be significantly correlated with one another.
Once again looking at the scatterplots of each predictor vs. the target, we see some points
that may potentially be problematic outliers or leverage points.
[Figure: the six scatterplots of Views vs. LD, Subscribers, Videos, Comments, Length, and Letters, repeated.]
Below is a series of diagnostic plots to assess the magnitude of any possible outliers or
leverage points.
[Figure: Diagnostic Plots. Index plots of Cook's distance, Studentized residuals, Bonferroni p-value, and hat-values against observation Index.]
The topmost plot shows the Cook's distances for each observation. These measure how
much an observation influences the fitted regression coefficients. Any observation with a Cook's
distance above 1 should be studied further, and here, the flagged observations are 1
(0.874) and 2 (2.470). Although observation 1's Cook's distance is lower than 1, it is very close
to 1, and so it should still be studied further.
The second plot from the top shows the standardized residuals for each observation.
These measure how far an observation lies from where the fitted regression implies it should be.
Any observation with a standardized residual beyond +/- 2.5 should be studied further because
such a value would occur due to pure chance only about 1% of the time, and here,
those are observations 1 (6.855) and 2 (-7.049).
The bottommost plot shows the hat values (Hi) for each observation. These, based on x
values, measure how far particular cases are from the rest of the x values, indicating
leverage. Any observation with a hat value of 2.5((p + 1)/n) or greater should be studied further,
where p is the number of predicting variables in the regression (6) and n is the total number of
observations (30), so for this regression the cutoff is 2.5((p + 1)/n) = 2.5((6 + 1)/30) = 0.583.
Here, observation 2 (0.728) exceeds the cutoff; observation 30 (0.423) falls below it but is
flagged here as well.
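The three screening rules can be applied mechanically, as in the Python sketch below. It uses only the diagnostic values quoted in the text (values that are not quoted are stored as None), and the dictionary layout and function name are hypothetical. Applied literally, the cutoffs leave observation 1's Cook's distance and observation 30's hat value just short of being flagged, although the report studies both further.

```python
N_OBS = 30          # videos in the data set
N_PREDICTORS = 6    # predictors in the full model

COOK_CUTOFF = 1.0
RESIDUAL_CUTOFF = 2.5
HAT_CUTOFF = 2.5 * (N_PREDICTORS + 1) / N_OBS   # 2.5((p + 1)/n)

# Only the values quoted in the report; None means "not quoted in the text".
quoted = {
    1: {"cook": 0.874, "residual": 6.855, "hat": None},
    2: {"cook": 2.470, "residual": -7.049, "hat": 0.728},
    30: {"cook": None, "residual": None, "hat": 0.423},
}


def flags(values: dict) -> list:
    """Return the names of the diagnostics whose cutoff is met or exceeded."""
    found = []
    if values["cook"] is not None and values["cook"] > COOK_CUTOFF:
        found.append("cook")
    if values["residual"] is not None and abs(values["residual"]) > RESIDUAL_CUTOFF:
        found.append("residual")
    if values["hat"] is not None and values["hat"] >= HAT_CUTOFF:
        found.append("hat")
    return found


flagged = {obs: flags(values) for obs, values in quoted.items()}
print(round(HAT_CUTOFF, 3), flagged)
```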
A table with specific values for diagnostic tests for potential extreme values (contains
observation numbers, standardized residual values, hat values, and Cook's distances):
It appears that the main observations of interest regarding extreme values are 1, 2, and 30.
Observation 1 has a large standardized residual as well as a relatively large Cook's distance.
Observation 2 has a large standardized residual, a large hat value, as well as a large Cook's
distance. Observation 30 has a large hat value. Observation 3 does not have any significantly
large value for any of these diagnostics, and neither does observation 30 apart from its hat value.
Observations 1, 2, and 30 are tabled below in order:
Obs   Title                                       Views       LD
1     PSY GANGNAM STYLE (#####) M/V               1.41E+09    10.66793
2     Justin Bieber Baby ft. Ludacris             8.38E+08    0.465565
30    Tootin' Bathtub Baby Cousins - Official     2.55E+08    0.574198

Remaining columns (Subscribers, Videos, Comments, Length, Letters), values as they appear:
3142113 1 5510834 53 1 3613746 9023632 25 78609 40
Observation 1 is an influential outlier because it has a large standardized residual as well
as a relatively large Cook's distance. This is because it is a hugely popular internet sensation
music video by famous Korean-pop (K-pop) singer PSY. The video is catchy as well as flashy,
and is the only YouTube video to have surpassed 1 billion views (currently 1,407,168,102). It is
unclear why this video is so popular because according to the scatterplots above, other
videos with similar values for the six predictors seem to have substantially fewer views.
The ridiculous rise to fame of Gangnam Style is throwing off the reliability of the data in this
report, so this observation will be removed.
Observation 2 is also an influential outlier because it has a large standardized residual, a
large hat value, as well as a large Cook's distance. It is the only observation in this report that is
flagged by all three diagnostics for extreme values! This is because it is by Justin Bieber, and
because the video is called "Baby." Loved by some but hated by many due to his drippy style
oriented towards pre-teen females, his controversial videos have made him not only intolerably
famous among those to whom he orients his videos, but also infamous among the majority who
cannot tolerate him, as evident in Baby's extremely low like/dislike ratio of 0.466
(approximately 2 dislikes for every 1 like). Clearly, something about Bieber is throwing off the
reliability of the data in this report, and so for the greater good of least squares regression, this
observation will be removed.
Observation 30 is not an outlier because its standardized residual is not outstanding, but it
is a leverage point because of its high hat value. This is because it is another controversial video.
The video is a 36-second-long, poorly animated comedic short involving two babies in a bathtub.
They are blabbering about inappropriate topics while a catchy music track plays in the
background. The attempt at humor is clearly a failure, as evident in the extremely low like/dislike
ratio of 0.574 (also representing approximately 2 dislikes for every 1 like). The uploader of this
video has uploaded a monstrous total of 540 videos, nearly five times as many as the next highest
in terms of uploads in this data set (observation 9, Michel Telo Ai Se Eu Te Pego Oficial
(Assim voce me mata), 119 uploads). The uploader also disabled comments for Tootin' Bathtub
Baby Cousins Official, probably because he or she did not want to receive massive inflows of
negative feedback, which he or she undoubtedly would have, similar to what happened with
Bieber's Baby. In conclusion, the video is a good example of the previously mentioned
exception that in some cases, due to videos being bad, lower likeability will result in higher view
counts. This observation will be removed.
It is interesting to note that two of the three observations that will be removed are about
babies. These observations also have extremely low like/dislike ratios. Maybe the quality of
pertaining to babies can be used as another potential additional diagnostic to test for extreme
values regarding data of YouTube videos.
The same six scatterplots as before but without observations 1, 2, and 30:
[Figure: six scatterplots without observations 1, 2, and 30: Views vs. LD, Views vs. Subscribers, Views vs. Videos, Views vs. Comments, Views vs. Length, Views vs. Letters.]
The data look much more well behaved. The removal of observations 1, 2, and 30 has
shifted the trend in Views vs. Subscribers from no trend to negative trend, Views vs. Videos from
negative trend to no trend, Views vs. Length from no trend to negative trend, and Views vs.
Letters from no trend to positive trend.
The new least squares multiple regression line is:

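$$\text{Views} = 2.846 \times 10^{8} - 6.137 \times 10^{5} \cdot \text{LD} - 1.443 \times 10^{1} \cdot \text{Subscribers} - 2.436 \times 10^{5} \cdot \text{Videos} + 1.413 \times 10^{2} \cdot \text{Comments} + 8.451 \times 10^{2} \cdot \text{Length} + 2.101 \times 10^{6} \cdot \text{Letters}$$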
Interestingly, the removals of observations 1, 2, and 30 have clearly made the model
worse. R^2 has decreased from 0.5893 to 0.3783, meaning that even less variability in the target
variable can be accounted for by the predicting variables now. The residual standard error is
93,210,000, which is lower than the previous 162,600,000, as expected because
observations 1 and 2 have very large view counts which would have inflated the residual
standard error. The p-values of the t-statistics for each of the regression coefficients have also
changed, some for the better and others for the worse: the p-value of the intercept changed from
0.0449 to 0.00834, like/dislike ratio from 0.7968 to 0.68580, number of subscribers to the
uploader from 0.4060 to 0.35100, number of videos uploaded by uploader from 0.5321 to
0.79028, number of video comments from 2.24 × 10^-5 to 0.02070, video length from 0.8033 to
0.99804, and number of characters in the video title from 0.6223 to 0.22552. Those that were
significant are still significant, and those that were not significant are still not significant (at the
α = 0.05 level of significance). However, the p-value of the F-statistic has changed from 0.001177
to 0.1091, from significant to not significant at the 0.05 level of significance. Restating the
hypotheses of the F-test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
For the new model, since 0.1091 is greater than 0.05, we fail to reject the null hypothesis
and cannot conclude that the new model has any predictive capability. This is interesting because it
implies that while the model appears to have no predictive capability, the inclusion of observations 1, 2,
and 30 creates the illusion that the model has some predictive capability.
[Figure: standardized residual plots for the new model, with axes Normal Scores, Fitted Values, Frequency, and Order of the Data.]
The normal probability plot implies that the data have a slightly short left tail, a slightly
fat right tail, and data from both tails having slightly higher than expected residuals. The
residuals vs. fitted values plot implies that the data are roughly normally distributed. The
histogram is right skewed, implying some deviation from normality. The residuals vs. the order
of the data plot displays a somewhat non-linear trend, but this is useless to interpret because the data
are not time series data and therefore cannot contain autocorrelation. Overall, there may be some
deviations from normality in the data (the data might not be inherently linear). This means that
potentially, some estimates of regression coefficients may be inappropriate and that some parts of
the signal might be mistakenly treated as noise. The other assumptions for linear regression
hold up: there appears to be little to no heteroscedasticity and the errors do not appear to be
significantly correlated with one another.
But the analysis does not end with this because there may yet be some relationships
between predictor variables, i.e. multicollinearity, which affects the quality of regression on the
model. Below is a series of scatterplots of each predictor variable with each other predictor
variable (ignore the row and column pertaining to Title because those have to do with the
categorical titles of the YouTube videos, as well as the row and column pertaining to views
because those have to do with the correlation between predictor variables and the target variable,
which does not impact multicollinearity):
[Figure: scatterplot matrix of Title, Views, LD, Subscribers, Videos, Comments, Length, and Letters.]
Correlations in any of these scatterplots indicate potential multicollinearity. Because there
do not appear to be any noticeable correlations in any of the above scatterplots of predictor
variables vs. predictor variables, there does not appear to be any multicollinearity. However, we
must examine this topic further. A matrix of the correlations (Pearson correlation coefficients) of
each predictor variable with each other predictor variable as well as the corresponding p-values:
At the α = 0.05 level of significance, there appears to be only one pair of predictors that
shows a statistically significant correlation, and that is the number of subscribers to the uploader and
video length. This pair exhibits r = 0.580 with p-value = 0.002, statistically significant at α =
0.05, a correlation of moderate strength. Although the number of subscribers to the uploader also
appears to be somewhat correlated with the number of characters in the video title, exhibiting r =
-0.345 with p-value 0.078, the p-value is not statistically significant at α = 0.05. However, 0.078
is very close to 0.05 (the next lowest p-value being 0.135), so it may be argued that the
correlation between the number of subscribers to the uploader and the number of characters in
the video title is indeed worthy of consideration. It is also important to note that both
aforementioned potential occurrences of multicollinearity have to do with the number of
subscribers to the uploader. This leads us to the issue of model selection: should that particular
predictor be omitted from the model?
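The significance of a Pearson correlation is usually judged with the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against a t distribution with n - 2 degrees of freedom. The Python sketch below assumes that the p-values above come from this standard test and that n = 27 (the 30 videos minus the three removed observations); neither assumption is stated explicitly in the report.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t-statistic for testing whether a Pearson correlation differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


N_VIDEOS = 27  # assumption: 30 videos minus observations 1, 2, and 30

t_subscribers_length = correlation_t_statistic(0.580, N_VIDEOS)
t_subscribers_letters = correlation_t_statistic(-0.345, N_VIDEOS)
print(round(t_subscribers_length, 2), round(t_subscribers_letters, 2))
```

With 25 degrees of freedom the two-sided 5% critical value is roughly 2.06, so the first correlation clears it and the second does not, in line with the reported p-values of 0.002 and 0.078.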
To determine the best model for this set of data, best subsets analysis will be used. A table
of the results from a best subsets regression:
In theory, the best model should feature the following characteristics: parsimony (fewer
predictors is more desirable, holding all else equal), an R^2 (R-Sq) similar to its adjusted R^2
(R-Sq(adj)), and a low Mallows' Cp (C-p). Here, the lowest Mallows' Cp (0.9) corresponds to the
model of two predictors: the number of subscribers to the uploader and the number of video
comments. This model also has an R^2 of 32.0% and an adjusted R^2 of 26.3%, two values that are
reasonably close to each other. It appears that this model is best. The runner-up is not as good;
the next lowest Mallows' Cp is 1.2, with R^2 and adjusted R^2 of 37.3% and 29.1%, respectively,
corresponding to a model with three variables.
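The adjusted R^2 values quoted from the best subsets table can be reproduced from R^2 with the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The Python sketch below assumes n = 27 videos (after the three removals); the function name is hypothetical.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Penalize R^2 for the number of predictors p, given n observations."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


N_VIDEOS = 27  # assumption: 30 videos minus observations 1, 2, and 30

two_predictors = adjusted_r_squared(0.320, N_VIDEOS, p=2)
three_predictors = adjusted_r_squared(0.373, N_VIDEOS, p=3)
print(round(100 * two_predictors, 1), round(100 * three_predictors, 1))
```

The penalty grows with each added predictor, which is why the three-variable model's larger R^2 translates into only a small gain in adjusted R^2.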
The best subsets regression above concludes that the model of two variables, the number
of subscribers to the uploader and the number of video comments, is best as long as views is the
response variable.
The least squares multiple regression line is:

$$\text{Views} = 3.653 \times 10^{8} - 2.139 \times 10^{1} \cdot \text{Subscribers} + 1.303 \times 10^{2} \cdot \text{Comments}$$
The intercept coefficient is useless to interpret in the context of the data because it is
extremely unlikely for a situation to arise in which a video with 3.653 × 10^8, or approximately
365,300,000, views is associated with an uploader who has 0 subscribers (though an uploader
can indeed easily achieve 0 video comments by disabling comments).
Holding all other predictors constant:
An increase in the number of subscribers to the uploader of 1 is associated with an
expected estimated decrease in views of 2.139 × 10^1, or approximately 21.39.
An increase in the number of video comments of 1 is associated with an expected
estimated increase in views of 1.303 × 10^2, or approximately 130.3.
The regression is of weak to moderate strength with R^2 = 0.3197, meaning that 31.97% of
the variability in the target variable can be accounted for by the predicting variables. The residual
standard error is 89,010,000, representing the standard deviation of points formed around the
regression line. Roughly 68% of actual target values should be within +/- 89,010,000 of the predicted
target values, 95% within +/- 2 * 89,010,000, and 99.7% within +/- 3 * 89,010,000. It makes
sense that this residual standard error is smaller than those of the other models mentioned in this
report because this model uses fewer parameters than the others, leaving more degrees of
freedom for estimating the error variance. The p-values of the t-statistics for each of the coefficient values are all significant
at the α = 0.05 level of significance (they are 2.33 × 10^-9, 0.0308, and 0.0154, respectively, for the
intercept, the number of subscribers to the uploader, and the number of comments). They
correspond to the following hypothesis tests:
H0: the regression coefficient of interest is 0 (the predictor has no association with views,
holding all other predictors equal)
HA: the regression coefficient of interest is different from 0 (the predictor has some association
with views, holding all other predictors equal)
Here, for each coefficient, we can reject the null hypothesis in favor of the alternate
hypothesis because all p-values are below 0.05, meaning that all predictors have some
association with views.
The p-value of the F-statistic, 0.009824, is also significant at α = 0.05, meaning that the
model as a whole is statistically significant too. It corresponds to the following hypothesis test:
H0: all regression coefficients are simultaneously 0 (model has no predictive capability)
HA: at least one regression coefficient is not 0 (model has some predictive capability)
We can reject the null hypothesis in favor of the alternate hypothesis, meaning that the
model as a whole has some predictive capability.
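The final two-predictor line and the "within 2 residual standard errors" rule of thumb can be combined into a rough prediction band, as in the Python sketch below. The coefficients and residual standard error are those reported above; the subscriber and comment counts for the example video are hypothetical. The band ignores the uncertainty in the estimated coefficients, so it is narrower than a proper prediction interval would be.

```python
# Reported estimates for the two-predictor model.
INTERCEPT = 3.653e8
B_SUBSCRIBERS = -2.139e1
B_COMMENTS = 1.303e2
RESIDUAL_SE = 89_010_000


def predict_views(subscribers: float, comments: float) -> float:
    """Evaluate the fitted two-predictor line."""
    return INTERCEPT + B_SUBSCRIBERS * subscribers + B_COMMENTS * comments


def rough_band(prediction: float, k: float = 2.0) -> tuple:
    """Prediction +/- k residual standard errors (k = 2 covers roughly 95%)."""
    return prediction - k * RESIDUAL_SE, prediction + k * RESIDUAL_SE


# Hypothetical video, for illustration only.
predicted = predict_views(subscribers=1_000_000, comments=500_000)
low, high = rough_band(predicted)
print(round(predicted), round(low), round(high))
```

Because the band is about 356 million views wide, the model separates videos in the Top 30 only coarsely, which fits the weak to moderate R^2.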
[Figure: standardized residual plots for the two-predictor model, with axes Normal Scores, Fitted Values, Frequency, and Order of the Data.]
The normal probability plot implies that the data have a short left tail, a fat right tail, and
data from both tails having slightly higher than expected residuals. The residuals vs. fitted values
plot implies that the data are roughly normally distributed. The histogram is right skewed,
implying some deviation from normality. The residuals vs. the order of the data plot displays a
somewhat non-linear trend, but this is useless to interpret because the data are not time series data
and therefore cannot contain autocorrelation. Overall, there may be some deviations from normality
in the data (the data might not be inherently linear). This means that potentially, some estimates
of regression coefficients may be inappropriate and that some parts of the signal might be
mistakenly treated as noise. The other assumptions for linear regression hold up: there appears
to be little to no heteroscedasticity and the errors do not appear to be significantly correlated with
one another.
In conclusion, for YouTube videos ranked in the list Top 30 most popular videos on
YouTube of all time based on the number of video views, there indeed appear to be statistically
significant associations between video views and two of the six tested predictors: the number of
subscribers to the uploader and the number of video comments. This potentially implies that the
popularity of YouTube videos has to do with both how much viewers want to subscribe to
uploaders and how much viewers feel the urge to contribute a comment or a few.
* Notes regarding data collection: the actual data value for the number of comments for the video
Tootin' Bathtub Baby Cousins Official is Not Applicable (NA) because the uploader has
disabled comments for that video. Therefore, there are no comments, so in this report the NA
value has been changed to 0. All other data values were recorded as observed in actuality.
The data:
