
DS 202 - Data Science

🗄️ Get the data

What data will you be using?

You will be using two distinct datasets for this summative.

Part 1

Your dataset, for this part, comes from the Office for National Statistics and relates to UK GDP figures.

Preparation

  1. Download the data by clicking on the button below.

Part 2

In this part, you will be re-using the same dataset as in summative 1.

Preparation

Click on the button below to re-download the dataset:

ℹ️ About the dataset

📋 Your Tasks

What do we actually want from you?

Part 1: Show us your dplyr muscles! (20 marks)

  1. Load the data into a data frame called uk_gdp. Freely explore the data on your own.

  2. Unlike in the previous summative and formative, this dataset does not come in a clean format and will require some work before it can be used.

    1. Remove the rows from the data frame that do not contain quarterly GDP figures (i.e. rows whose Title values are not of the form 1955 Q1)
    2. Clean up and/or rename the columns so that they have more tractable and meaningful names
  3. Create a new variable called gdp_lag that contains the GDP of the previous quarter.

  4. Calculate the percentage of quarterly GDP growth

    $$\frac{\mathrm{GDP}_t - \mathrm{GDP}_{t-1}}{\mathrm{GDP}_{t-1}} \times 100$$

    and store it in a new data frame variable called quarterly_change

  5. How many times did the percentage of quarterly GDP growth dip below 0, and when? A technical recession is defined as two consecutive quarters of negative quarterly GDP growth. Can you identify periods of technical recession? (A rough sketch of how these steps might fit together is shown after this list.)
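Below is a minimal, non-authoritative sketch of how these steps could fit together, assuming the downloaded file is called uk_gdp.csv, that the quarter labels live in a column called Title, and that the GDP figures sit in the second column; adapt the names to the file you actually get.

```r
library(dplyr)
library(readr)
library(stringr)

# File name is an assumption; use the file you downloaded
uk_gdp <- read_csv("uk_gdp.csv")

uk_gdp <- uk_gdp %>%
  # Task 2a: keep only rows whose Title looks like "1955 Q1"
  filter(str_detect(Title, "^\\d{4} Q[1-4]$")) %>%
  # Task 2b: more tractable names (assumes the GDP figures are in column 2)
  rename(quarter = Title, gdp = 2) %>%
  mutate(gdp = as.numeric(gdp)) %>%
  # Task 3: GDP of the previous quarter (assumes rows are in chronological order)
  mutate(gdp_lag = lag(gdp)) %>%
  # Task 4: percentage of quarterly GDP growth
  mutate(quarterly_change = (gdp - gdp_lag) / gdp_lag * 100)

# Task 5: quarters of negative growth...
uk_gdp %>% filter(quarterly_change < 0)

# ...and technical recessions: the current AND the previous quarter are negative
uk_gdp %>% filter(quarterly_change < 0 & lag(quarterly_change) < 0)
```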

Part 2: Create a baseline model (50 marks)

In summative 1, we focused on predicting central_government_net_borrowing_pounds_million.

Here, we’ll change tack a bit and focus on net_investment_pounds_million.

We will tackle this as a classification task. We aim to create a logistic regression model to predict whether the net investment will increase or decrease in the next fiscal year (i.e. from the beginning of April in a given year to the end of March in the following year).

As in the previous section, you don’t need to use a chunk for each question. Feel free to organise your code and markdown for this part as you see fit.

  1. Load the UK public finances dataset into a data frame called uk_public_finances.

  2. Create a data frame called yearly_uk_finances based on the uk_public_finances data frame. This new data frame should:

    • keep the same numerical variables as the original data frame
    • aggregate (i.e. sum) the monthly values of these numerical variables for each fiscal year
  3. The button below allows you to download the dataset as it would have looked if you had successfully completed questions 1 and 2 of this part.

Create a binary target variable called is_net_investment_up. The variable should be set to 1 if the net_investment_pounds_million variable in the current year is 5% higher than the net_investment_pounds_million variable in the previous fiscal year. Otherwise, it should be set to 0.

To avoid problems, don’t use a recipe here; just use mutate to create the variable. A minimal sketch of questions 1 and 2, together with this step, is shown below.
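The sketch below is one possible approach, not a model answer. The file name, the existence of a monthly date column, and the use of lubridate are assumptions, so adjust them to your copy of the summative 1 dataset.

```r
library(dplyr)
library(readr)
library(lubridate)

# File name is an assumption; re-use the dataset from summative 1
uk_public_finances <- read_csv("uk_public_finances.csv")

yearly_uk_finances <- uk_public_finances %>%
  # A UK fiscal year runs April to March, so January-March observations belong
  # to the fiscal year that started the previous calendar year
  # (assumes a monthly `date` column; rename to match your data)
  mutate(fiscal_year = if_else(month(date) >= 4, year(date), year(date) - 1)) %>%
  group_by(fiscal_year) %>%
  # Keep the numerical variables and sum their monthly values per fiscal year
  summarise(across(where(is.numeric), sum), .groups = "drop")

yearly_uk_finances <- yearly_uk_finances %>%
  arrange(fiscal_year) %>%
  mutate(
    # 1 if net investment is 5% higher than in the previous fiscal year, else 0
    # (the very first fiscal year has no previous value, so it stays NA)
    is_net_investment_up = as.integer(
      net_investment_pounds_million > 1.05 * lag(net_investment_pounds_million)
    )
  )
```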

  1. Create a logistic regression model using a single valid predictor. This could be either a column already in the data frame or a new column you create using mutate or with a recipe. (A rough sketch covering this task and the next few is shown after this list.)

  2. Set the last year in the data set as the test set. Use the previous years as the training set.

  3. Use whatever metric you feel is most apt for this task to evaluate your model’s performance. Explain why you chose this metric.

  4. Explain what the regression coefficients mean in the context of this problem.

  5. Comment on the goodness-of-fit of your model and its predictive power.
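One possible shape for the baseline is sketched below, assuming the yearly_uk_finances data frame built above and a placeholder predictor called some_predictor (not a real column; swap in a valid one). Accuracy is used purely for illustration, since choosing and justifying the metric is part of the task.

```r
library(tidymodels)

# The outcome must be a factor for classification models
model_data <- yearly_uk_finances %>%
  filter(!is.na(is_net_investment_up)) %>%
  mutate(is_net_investment_up = factor(is_net_investment_up))

# Last fiscal year as the test set, all earlier years as the training set
test_data  <- model_data %>% filter(fiscal_year == max(fiscal_year))
train_data <- model_data %>% filter(fiscal_year <  max(fiscal_year))

# Logistic regression with a single predictor
# (`some_predictor` is a placeholder, not a real column name)
baseline_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(is_net_investment_up ~ some_predictor, data = train_data)

# Predict on the held-out year and compute the metric you can best justify
preds <- augment(baseline_fit, new_data = test_data)
preds %>% accuracy(truth = is_net_investment_up, estimate = .pred_class)

# Coefficients on the log-odds scale, for your interpretation
tidy(baseline_fit)
```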

Part 3: Model some more (30 marks)

Now is your time to shine!

Come up with your own feature selection or feature engineering strategy and try to achieve better model performance than you did before.

Don’t forget to validate your results using the appropriate resampling techniques!
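One way (certainly not the only one) to make the resampling time-aware is rsample’s rolling_origin, so that every resample trains on past fiscal years and is assessed on the year that follows. The window sizes and the placeholder formula below are assumptions.

```r
library(tidymodels)

# Each split trains on earlier fiscal years and assesses on the next one,
# so the model never gets to see the future
folds <- rolling_origin(
  model_data %>% arrange(fiscal_year),
  initial    = 10,   # first 10 fiscal years as the starting training window (assumption)
  assess     = 1,    # assess on the following fiscal year
  cumulative = TRUE  # let the training window grow as time moves forward
)

cv_results <- fit_resamples(
  logistic_reg() %>% set_engine("glm"),
  is_net_investment_up ~ some_predictor,  # placeholder formula
  resamples = folds,
  metrics   = metric_set(accuracy)
)

# Average performance across all of the time-ordered splits
collect_metrics(cv_results)
```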

Whatever you do, this is what we expect from you:

  1. Show us your code and your model.

  2. Explain your choices (of feature engineering or cross-validation strategy).

  3. Evaluate your model’s performance. If you created a new model, compare it to the baseline model. If you performed a more robust cross-validation, compare it to the single train-test split you did in the previous section.

✔️ How we will grade your work

Here, we start to get more rigid about grading your work. If you follow all the instructions, you can expect a score of around 70/100. Only if you go above and beyond what is asked of you in a meaningful way will you get a higher score. Simply adding more code or text will not get you a higher score; you need to add interesting insights or analyses to get a distinction.

⚠️ You will incur a penalty if you only submit a .qmd file and not also a properly rendered .html file alongside it!
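If you are not using the Render button in RStudio, something along these lines produces the HTML file, assuming the quarto R package and the Quarto CLI are installed (the file name is a placeholder):

```r
# Renders the .qmd and writes the .html next to it (file name is a placeholder)
quarto::quarto_render("summative2.qmd")
```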

Part 1: Show us your dplyr muscles! (20 marks)

Here is a rough rubric for this part:

  • 5 marks: You wrote some code but filtered the data incorrectly or did not follow the instructions.
  • 10 marks: You cleaned the initial data frame correctly, but you might have made some mistakes when creating your lag and/or your GDP quarterly change columns, or your conclusions for Task 5 are not correct.
  • 15 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • 20 marks: You did everything correctly, and your submission was perfect. Wow! Your code and markdown were well-organised, and your answers were concise and to the point.

Part 2: Create a baseline model (50 marks)

Here is a rough rubric for this part:

  • <10 marks: A deep fail. There is no code, or the code/markdown is so insubstantial or disorganised that we cannot understand what you did.
  • 10-20 marks: A fail. You wrote some code and text but ignored important aspects of the instructions (such as not using logistic regression).
  • 20-30 marks: You made some critical mistakes or did not complete all the tasks. For example: your pre-processing step was incorrect, your model contained some data leakage (seeing the future), or perhaps your analysis of your model was way off.
  • 30-35 marks: Good; you just made minor mistakes in your code, or your analysis demonstrated some minor misunderstandings of the concepts.
  • ~35 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • >35 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function. We are likely to print a photo of your submission and hang it on the wall of our offices.

Part 3: Model some more (30 marks)

Here is a rough rubric for this part:

  • <10 marks: A fail. There is no code, the code/markdown is so insubstantial or disorganised that we cannot understand what you did, or you wrote some code and text but ignored important aspects of the instructions.
  • 10-20 marks: Good, although you made mistakes in your code, or your analysis demonstrated some misunderstandings of the concepts.
  • ~22 marks: You did everything correctly as instructed. Your submission just fell short of perfect. Your code or markdown could be more organised, or your answers were not concise enough (unnecessary, overly long text).
  • >22 marks: Impressive! You impressed us with your level of technical expertise and deep knowledge of the intricacies of the logistic function. We are likely to print a photo of your submission and hang it on the wall of our offices.

How to get help and how to collaborate with others

🙋 Getting help

You can post general coding questions on Slack but should not reveal code that is part of your solution.

For example, you can ask:

  • “Does anyone know how I can create a logistic regression in tidymodels without a recipe?”
  • “Has anyone figured out how to do time-aware cross-validation, grouped per country?”

You are allowed to share ‘aesthetic’ elements of your code if they are not part of the core of the solution. For example, suppose you find a really cool new way to generate a plot. You can share the code for the plot, using a generic df as the data frame, but you should not share the code for the data wrangling that led to the creation of df.
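For instance, a snippet along these lines, with a generic df and generic column names (placeholders, not taken from any solution), would be fine to share:

```r
library(ggplot2)

# Purely aesthetic: a generic data frame and columns, no task-specific wrangling
ggplot(df, aes(x = year, y = value)) +
  geom_line(colour = "steelblue", linewidth = 1) +
  geom_point(size = 2) +
  theme_minimal() +
  labs(x = NULL, y = "Value", title = "A cleaner default look for line charts")
```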

If we find that you posted something on Slack that violates this principle without realising it, don’t worry: you won’t be penalised, but we will delete your message and let you know.
