MATH38141 Regression Analysis – Coursework
This coursework accounts for 20% of the overall mark for this course and may take around 10 hours
to complete. Please present your solution in the form of a report, which you should upload on
Blackboard as a single file before the deadline. You can use R to perform your calculations, but you
must show the formulae in the text (not as R code) that you have used for the calculations. Marks
will be awarded for correct and accurate calculations and their interpretation. Interpretations
should be explained in words, referring to the context of the exercise, rather than naming generic
symbols only. High marks will be less likely if the presentation of the results is unclear, too short
or unnecessarily long and confusing, or if any formulas used in the calculations are missing from
the text.
1. Jane, an amateur cinema enthusiast, has decided to collect information on the success of her
top 10 favourite Hollywood movies. The dataset contains the following four variables for each
film:
- BoxOffice – Box office net sales (money earned from ticket sales) in the first year, in millions of US dollars ($);
- Production – Production costs, in millions of dollars;
- Promotion – Promotional costs, in millions of dollars;
- Books – Total book sales (money earned from the sales of the books the movie is based on), in millions of dollars.
A tab-delimited text file with a table of the data, called films.txt, is available on Blackboard.
(a) Draw scatterplots of BoxOffice against each of the other three variables. Describe any observable trends in your plots.
(b) Formulate a multiple linear regression model for the dataset, using BoxOffice as the response and the remaining three variables as regressors.
(c) Calculate the LSEs and construct 95% confidence intervals for all regression coefficients.
(d) Provide an interpretation for the estimated coefficients obtained in (c).
(e) Calculate and provide an interpretation for the R² statistic for the model.
(f) Jane argues that, when fitting a multiple linear regression model to the data using BoxOffice as the response and the other variables as the explanatory variables, the intercept term β₀ should be set to zero. Is this argument reasonable? Why?
Excited about discovering more about her favourite films, Jane decides to test a theory: is the success of a film really linked to the success of the book, or does one only need to know how much money was invested in producing and advertising the film? To investigate whether Books also affects BoxOffice, Jane fits two multiple linear regression models to the BoxOffice data:
- Model 1, with explanatory variables Production and Promotion;
- Model 2, with explanatory variables Production, Promotion and Books.
(g) Decide which one is the reduced model. Then fill in the following ANOVA table to compare the nested models.
(h) Calculate the p-value associated with the significance of Books. Do you think Books should be included in the multiple linear regression model?
(i) Regressing BoxOffice on Books alone, test at the 5% level the significance of Books under this simple linear regression model. Does your conclusion contradict that given in (h)? Comment.
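The comparison of two nested regression models, as in parts (g) and (h), can be illustrated with a minimal sketch. It uses made-up synthetic data rather than films.txt, and the helper names `solve` and `ols` are hypothetical; it is only meant to show how the residual sums of squares, R² and the nested-model F statistic relate to one another, not to replace the written formulae the report requires.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """Return (beta_hat, RSS) from the normal equations X'X beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    return beta, rss

# Synthetic data (an assumption for illustration): y depends on x1 and x2 only.
random.seed(0)
n = 30
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 5) for _ in range(n)]
x3 = [random.uniform(0, 3) for _ in range(n)]
y = [2.0 + 1.5 * a + 0.8 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Reduced model: intercept, x1, x2. Full model: adds x3.
X_reduced = [[1.0, a, b] for a, b in zip(x1, x2)]
X_full = [[1.0, a, b, c] for a, b, c in zip(x1, x2, x3)]

beta_r, rss_r = ols(X_reduced, y)
beta_f, rss_f = ols(X_full, y)

# R^2 of the full model: 1 - RSS / TSS.
ybar = sum(y) / n
tss = sum((yi - ybar) ** 2 for yi in y)
r2_full = 1 - rss_f / tss

# Nested-model F statistic with q extra parameters in the full model.
q = len(X_full[0]) - len(X_reduced[0])
df_full = n - len(X_full[0])
f_stat = ((rss_r - rss_f) / q) / (rss_f / df_full)

print("RSS reduced:", rss_r)
print("RSS full:   ", rss_f)
print("R^2 full:   ", r2_full)
print("F statistic:", f_stat, "on", q, "and", df_full, "degrees of freedom")
```

The F statistic would then be compared with the F distribution on q and n − p degrees of freedom (for example with `pf` in R) to obtain a p-value. The normal equations are solved directly here only to keep the sketch dependency-free; in practice a fitted-model routine such as R's `lm` does the same job.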
2. A dataset concerns the net sales of shops in various locations in the USA. It contains the following variables:
• ANS: Annual net sales (in thousands of $);
• NSF: Number of square feet (in thousands);
• INV: Inventory, i.e. the total price of goods owned by the shop (in thousands of $);
• ASA: Amount spent on advertising (in thousands of $);
• SSD: Size of sales district (in thousands of families);
• NCS: Number of competing stores in the district.
A tab-delimited text file with a table of the data, called greens.txt, is available on Blackboard.
A multiple linear regression model Ω is proposed to describe the relationship between the response variable ANS and the other five explanatory variables (NSF, INV, ASA, SSD, NCS).
A retail expert believes, however, that the variation in ANS can be adequately explained by the variable INV alone, and hence proposes a simple linear regression model ω for the data.
(a) Specify the models, and state the model assumptions clearly.
(b) Calculate the residual sums of squares from fitting the multiple linear regression model Ω and the simple linear regression model ω, respectively.
(c) Explain why in (b) the residual sum of squares of Ω is not larger than that of ω.
(d) Under model Ω, test whether the regression coefficients of ASA and SSD are 15 and 10, respectively, at the 10% significance level, and explain your conclusions.
(e) Suppose that we want to compare how well two new shops in two locations will perform:
Calculate the predicted difference in annual net sales between shop 2 and shop 1. Do we predict the two shops to perform significantly differently at the 5% significance level?
(f) It is suggested that the relationship between ANS and INV depends on the number of competing stores in the district, i.e. on NCS.
- Propose a new model to reflect this suggestion, making sure that the simple linear regression model of ANS on INV is nested within the new model. Exclude the other regressor variables (i.e. NSF, ASA and SSD).
- Carry out a hypothesis test to compare the simple linear regression model against the new model and draw conclusions.
- Based on the fitted new model, plot four fitted regression lines on the same diagram to display the relationships between ANS and INV for the four values of NCS of 0, 4, 8 and 12, respectively. Comment on the changes in the relationship between ANS and INV for these four different amounts of competition.
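Tests of several coefficient values at once, such as the one asked for in part (d), are usually handled with the general linear hypothesis. The standard form of the test statistic is sketched below under the usual normal-error assumptions. The notation is assumed here rather than taken from the coursework: X is the n × p design matrix, A is a q × p matrix of full row rank that picks out the coefficients being tested, c is the vector of hypothesised values, and RSS is the residual sum of squares of the unrestricted model.

```latex
H_0 : A\beta = c,
\qquad
F = \frac{(A\hat\beta - c)^{\top}
          \left[ A (X^{\top}X)^{-1} A^{\top} \right]^{-1}
          (A\hat\beta - c)}
         {q\,\hat\sigma^{2}}
\;\sim\; F_{q,\,n-p} \quad \text{under } H_0,
\qquad
\hat\sigma^{2} = \frac{\mathrm{RSS}}{n-p}.
```

H_0 is rejected at significance level α when F exceeds the upper α quantile of the F distribution on q and n − p degrees of freedom. The same statistic can be written as the scaled increase in the residual sum of squares when the model is refitted under the restriction Aβ = c, which ties it to the comparison of nested models.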