UCL ELEC0136 Data Acquisition and Processing Systems
1. Overview
You are a junior Data Scientist at “Money Money Money!”, a UK investment company.
Your product manager, David, asks you to design and build a machine learning pipeline that informs the company on the trend of the S&P 500, a stock market index (itself tradable, like a stock) that tracks the stock performance of 500 of the largest companies listed on US stock exchanges.
Your duty is to build an AI pipeline that allows David to study the market trend of S&P 500 and ultimately inform the company of whether they should buy, hold, or sell this stock.
David advises you to follow the company guidelines for building AI-informed decision-making products, which suggest using the following process:
1. Find a source of your data.
2. Acquire stock data for S&P 500.
3. Collect any other additional data that may impact the accuracy of your model (e.g., stock data of other companies in the S&P 500 index, climate data, news, pandemic data).
4. Choose the storing strategy that most efficiently supports the upcoming data analysis.
5. Analyse your data. For example (at a minimum, though further analyses are welcome): is there any trend? Is there missing data? Are there outliers? Then pre-process the data accordingly.
6. If you collected additional data, study the relationship between this data and the
S&P 500 time series. Are they correlated? What is the impact of additional data on
the model accuracy?
7. Provide useful visualisations of the data, exploiting patterns you might find.
8. Formulate your machine learning problem. You can choose to model the task using any formulation – regression, classification, sequence modelling – or even all three. The more evidence you provide, the stronger David's argument can be.
9. Train a model on this data and evaluate its accuracy.
Details for each task are described in Section 2.
You can choose the strategy you prefer for any of these actions. However, each relevant choice must be justified and supported by evidence; failing to do so will lower your score. Your choices will be marked as per Section 4.
You are expected to deliver a written report describing your work, in the form of a research paper, and the code that accompanies it as described in Section 3.
2. Task details
Below are some details about the task. Most of these details are guidance rather than requirements, but they are sufficient to solve the assignment. You can do even better than these directions. For example, you can choose more than one strategy to address a problem and show the impact of each strategy on the final model with ablation studies.
Note. This description deliberately contains some details that are suboptimal and could be done better. For this reason, we encourage you to deviate from these suggestions when you believe you can do better. Always justify your choices. Data science is science, after all.
2.1 Data Acquisition
You will first have to acquire the necessary data for conducting your study. One essential type of data that you will need is stock prices. David asks you to collect data from 1st April 2017 to 1st April 2024.
Since S&P 500 data is public, it is freely available online. Find an adequate source for the data and use the best strategy (according to the criteria discussed in the course) to acquire it. A good place to look is platforms that provide free stock market data, such as Google Finance or Yahoo! Finance.
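For illustration, a minimal acquisition sketch is shown below. It assumes the free yfinance package as the source and uses ^GSPC, Yahoo! Finance's ticker for the S&P 500; both are examples, not requirements, and you should verify the terms of use of whichever source you pick.

    # Acquisition sketch (assumption: the free yfinance package; swap in
    # whichever source you choose and justify the choice).
    import os

    import yfinance as yf

    os.makedirs("data/raw", exist_ok=True)
    # "^GSPC" is Yahoo! Finance's ticker for the S&P 500 index.
    prices = yf.download("^GSPC", start="2017-04-01", end="2024-04-01")
    prices.to_csv("data/raw/sp500.csv")  # keep the raw download so the step is repeatable

Saving the raw download to disk separates acquisition from the later storage and preprocessing steps, which makes each step independently reproducible.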
Consider if you want to collect additional data. There are many other sources of data that can have an impact on your predictions. Consider, for example:
a) Supplementary stocks: the S&P 500 summarises the trend of many other stocks. Can the price of these stocks be a useful feature to inform the price of the S&P 500?
b) Social media: this can be used to uncover the public's sentiment towards the stock market.
c) Financial reports: these can help explain which factors are likely to affect the stock market the most.
d) News: this can be used to draw links between current affairs and the stock market.
e) Climate data: weather data is sometimes directly correlated with some companies' stock prices and should therefore be taken into account in financial analysis.
f) Others: anything that can justifiably support your analysis.
2.2 Data Storage
Once you have found a way to acquire the relevant data, choose the appropriate strategy to store it. In particular:
- Design your object model: what does a data point look like?
- Choose your format and justify your choice. You should choose a format that allows efficient read access, enabling the training of a parametric model. The data corpus should also be easy to inspect.
Data must be stored on a remote server. Make sure that the server is up when the report is automatically scored (i.e., when you merge the pull request), as this will impact your reproducibility score. This choice will impact your score as per Section 4.1. You must not use any paid services.
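One option that satisfies these constraints is a free-tier MongoDB Atlas cluster. The sketch below assumes that choice; the URI, database name, and collection name are placeholders, and the flat one-document-per-trading-day object model is only one possible design.

    # Storage sketch (assumptions: a free MongoDB Atlas cluster; placeholder
    # URI, database, and collection names).
    import pandas as pd
    from pymongo import MongoClient

    # Placeholder URI; in practice read it from a committed credentials
    # file, as described in Section 3.2.
    uri = "mongodb+srv://<user>:<password>@<cluster>.mongodb.net"
    collection = MongoClient(uri)["elec0136"]["sp500_daily"]

    # Object model: one flat document per trading day (date plus OHLCV
    # fields), which keeps reads simple when training a parametric model.
    df = pd.read_csv("data/raw/sp500.csv", parse_dates=["Date"])
    collection.insert_many(df.to_dict(orient="records"))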
2.3 Data Preprocessing
Now that you have stored your data, you can start preprocessing it.
Create your dataset using the definition of the course and describe it. Think about what features to keep, which ones to transform, combine or discard. Support these decisions with evidence. Make sure your data is clean and consistent (e.g., are there many outliers? Any missing values?). You are expected, but not limited, to:
1. Formulate your problem and describe it: is it a regression, classification, or sequence modelling problem? Or maybe all three?
2. Label your data accordingly and design your dataset.
3. Clean the data from missing values and outliers, if any.
4. Provide useful visualisations of the data. Plots should be saved to disk, not displayed on screen, which would break automation.
5. Transform your data (e.g., using normalisation, dimensionality reduction, etc.) to
improve the forecasting performance.
6. Choose the sampling strategy, and justify your choice.
Remember that data preparation and pre-processing are very important and time-consuming tasks. Motivate your steps carefully in the report.
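As a sketch of steps 2, 3, and 5 only, the fragment below forward-fills missing business days, flags z-score outliers in daily returns, derives a binary next-day up/down label for a classification formulation, and min-max normalises the close price. The 3-sigma threshold and the labelling rule are assumptions to replace with your own justified choices.

    # Preprocessing sketch (assumptions: forward-fill imputation, a 3-sigma
    # outlier rule, and a binary next-day label).
    import os

    import pandas as pd

    df = pd.read_csv("data/raw/sp500.csv", parse_dates=["Date"], index_col="Date")

    # Reindex to business days so gaps become explicit NaNs, then forward-fill.
    df = df.reindex(pd.bdate_range(df.index.min(), df.index.max())).ffill()

    # Flag returns more than 3 standard deviations from the mean as outliers.
    returns = df["Close"].pct_change()
    outliers = (returns - returns.mean()).abs() > 3 * returns.std()

    # Label for a classification formulation: 1 if tomorrow's close is higher.
    df["label"] = (df["Close"].shift(-1) > df["Close"]).astype(int)

    # Min-max normalisation of the close price.
    close = df["Close"]
    df["close_norm"] = (close - close.min()) / (close.max() - close.min())

    os.makedirs("data/processed", exist_ok=True)
    df.to_csv("data/processed/sp500.csv", index_label="Date")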
2.4 Feature engineering and data exploration
After ensuring that the data is well preprocessed, it is time to start exploring it to develop hypotheses and intuitions about possible patterns that might be inferred. Depending on the data, different Exploratory Data Analysis (EDA) techniques can be applied, and a large amount of information can be extracted.
For example, you could do the following analysis:
- Time series data might be a combination of several components:
- Trend represents the overall tendency of the data to increase or decrease over time.
- Seasonality is related to the presence of recurrent patterns that appear after regular intervals (like seasons).
- Random noise is often hard to explain and represents all those changes in the data that seem unexpected. Sometimes sudden changes are related to fixed or predictable events (e.g., public holidays).
- Features correlation provides additional insight into the data structure. Scatter plots and boxplots are useful tools to spot relevant information.
- Explain unusual behaviour.
- Explore the correlation between stock price data and other data that you collected.
- Use hypothesis testing to better understand the composition of your dataset and its representativeness.
At the end of this step, provide key insights on the data. This data exploration procedure should inform the subsequent data analysis/inference procedure, allowing one to establish a predictive relationship between variables.
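For instance, a seasonal decomposition and a correlation check against a supplementary series might look like the sketch below; statsmodels, the additive model, the 252-day period, and the AAPL_Close column are all assumptions.

    # EDA sketch (assumptions: statsmodels available, additive decomposition,
    # a period of ~252 trading days, and a placeholder supplementary column).
    import os

    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    df = pd.read_csv("data/processed/sp500.csv", parse_dates=["Date"], index_col="Date")

    # Split the close price into trend, seasonal, and residual components.
    decomposition = seasonal_decompose(df["Close"], model="additive", period=252)
    os.makedirs("figures", exist_ok=True)
    fig = decomposition.plot()
    fig.savefig("figures/decomposition.png")  # saved to disk, never shown on screen

    # Pearson correlation with a supplementary series (placeholder column).
    print(df["Close"].corr(df["AAPL_Close"]))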
2.5 Training and inference
Train a model that helps you inform the decision of whether “Money Money Money!” should long (buy) or short (sell) S&P 500 stocks.
Train your model on the acquired data except the last two months (1st February to 31st March 2024), which you must use for testing. Use appropriate plots to show the training and testing procedure.
As a word of advice, do not use overly complex model architectures, for example models with a huge number of parameters; this is usually unnecessary. Whatever your choice of model, make sure you justify it.
Show the impact of the choices you made in the previous steps of the process with ablation studies.
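A minimal sketch of the chronological split and a deliberately simple baseline is given below; the lagged-return features, the scikit-learn logistic regression, and the label column from the preprocessing sketch are assumptions, not the required design.

    # Training sketch (assumptions: lagged-return features, a logistic-
    # regression baseline, and the "label" column from preprocessing).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("data/processed/sp500.csv", parse_dates=["Date"], index_col="Date")

    # Hypothetical features: the previous five daily returns.
    for lag in range(1, 6):
        df[f"ret_lag{lag}"] = df["Close"].pct_change().shift(lag)
    df = df.dropna()
    features = [f"ret_lag{lag}" for lag in range(1, 6)]

    # Chronological split: train up to 31st January 2024, test on the
    # held-out 1st February to 31st March 2024 window.
    train = df.loc[:"2024-01-31"]
    test = df.loc["2024-02-01":"2024-03-31"]

    model = LogisticRegression().fit(train[features], train["label"])
    print("test accuracy:", accuracy_score(test["label"], model.predict(test[features])))

Note that the split is by date, not random: shuffling would leak future information into training.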
3. Deliverables
You are expected to deliver your assignments with:
1. A written report in the form of an academic paper and,
2. The code-base to support it, which allows you to reproduce the experiments and the figures in the report.
You are allowed to discuss ideas with peers, but your code, experiments and report must be done solely based on your own work.
We provide details on each deliverable below.
3.1. Report
Page limit. Submissions cannot exceed 8 pages (not counting any Appendices or References). Exceeding this limit will impact your score negatively. The submission PDF may include an Appendix beyond the 8-page limit, placed after the references (for proofs, derivations, or complementary results).
Format. Submissions must be PDF files generated using the TMLR LaTeX style file and template (attached on Moodle). No other formats are allowed. The easiest way to produce a LaTeX-generated PDF is to use Overleaf (https://www.overleaf.com/). You can use this tutorial to get started.
The paper must include the following sections:
Abstract. The abstract is a short paragraph (max 300 words) introducing the problem, its significance, the methodology used to address it, the main results and the conclusions drawn from them.
Introduction. This section introduces the problem with an emphasis on the motivations and the end goal of the work. Contextualise your work, describing the broader scientific or application area it sits in. Please make this introduction meaningful: be short but impactful, and keep descriptions clear.
Data description. Use this section to describe the data that was used for this study. You should clearly describe the content and the size of the dataset, and the object model and the format of the data. Justify your choices clearly.
Data acquisition. Use this section to present the data acquisition process. Explain the methods you used to acquire the data, and why you chose a specific acquisition method.
Data storage. Use this section to explain your data storage strategy. Justify your choices.
Data preprocessing. This section should describe in detail all the preprocessing steps that were applied to the data. A justification for each step should also be provided. In case very little or no preprocessing was chosen, this section should clearly justify why. It is really important for you to clearly motivate and explain your reasoning.
Data Exploration. Use this section to describe any analysis you performed on your dataset. Is there any particular pattern? Are features correlated? How do different datasets interact with each other? Do they contain redundant information? Put a strong emphasis on the reasoning you used to design your analyses and on the conclusions that came from them.
Training and inference. Use this section to describe the training procedure. What strategy did you use to sample the data? Why? What model did you use? Why? What are the main results? Consider enriching this section with ablation studies to show the impact of your choices.
Conclusions. Use this section to provide a summary of your problem definition, its significance, the methods, and the findings. Highlight any challenges or limitations that you encountered during the study and provide directions for potential improvements. Please do not forget the main goal of this project: what should you learn from the inference? What is the actual conclusion of your study? Should we long (buy) or short (sell) S&P 500 stocks?
Make sure you support your descriptions with relevant equations, diagrams, and figures as you see fit. Figures must be readable and informative to be considered. Note that your work will be evaluated solely based on the report and your code. Information not appearing in the report cannot be deduced, so please provide reasoning and motivation behind each step.
3.2. Code
In addition to the report, you should also provide all the code that was used for your study. You are not allowed to use Jupyter notebooks, and you must use only plain Python files. The assignment will be released with GitHub Classroom, as we have shown in the formative assessments throughout the course.
The code you submit must be:
Reproducible. We will run automated procedures to check if your code reproduces the results described in the report. This procedure is a GitHub Action, triggered when you merge your pull request.
Reproducibility is tested on an ubuntu machine, so please make sure your code is portable to Linux, if you develop on a different platform.
The code will be considered reproducible if:
a. The automated procedure succeeds and
b. The code returns the same results included in the report.
The code-base should contain an environment.yml file and a requirements.txt file, as described during the course.
The code should never ask for manual input, and should not pause its execution to, for example, display plots (do not call plt.show(); save figures to disk instead).
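A minimal non-interactive plotting pattern (the Agg backend is one way to guarantee that no window can open):

    # Non-interactive plotting sketch: render off-screen and save to disk.
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend: no window can ever open
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])
    fig.savefig("figures/example.png")  # instead of plt.show()
    plt.close(fig)  # free the figure so long runs do not leak memory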
Documented. We will evaluate the quality of the documentation against section 3.8 of the Google Python Style Guide.
Of good quality. We will evaluate the quality with respect to the PEP8 style guide. Please, if time complexity is not an issue, aim for readability and clarity when writing your code.
Well organised. Please, organise your code following the guidelines we provided during the course. Aim at not repeating yourself, and at grouping together methods and classes that are shared across tasks. Use a folder structure that allows you to immediately and clearly understand how the code is structured. You are free to choose the criteria to organise your code, as long as the choices are justified.
Version controlled. Please, use GitHub to submit your code following the best practices suggested in the course. Prefer modular commits to one big final commit, and aim at pairing each commit with a specific purpose, e.g. “add module to acquire data”. Remember to use Conventional Commit messages.
If you have local passwords, follow the pattern we discussed in class to store them: save them to a file and have your code read them from it. Unlike in class, commit your credential files, as we will need them to reproduce your results.
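A sketch of that pattern (the file path is a placeholder):

    # Credentials pattern sketch: secrets live in a committed file and are
    # read at runtime, never hard-coded in source.
    def read_secret(path: str = "secrets/mongo_uri.txt") -> str:
        """Return the secret stored in `path`, stripped of whitespace."""
        with open(path) as f:
            return f.read().strip()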
4. Marking Scheme
The mark will be based on both the report (70% of the final mark) and the corresponding code (30% of the final mark). In particular, we will mark according to the following scheme. Note that each name refers to the procedure, not to the report section: for example, if relevant details of data pre-processing appear in the "Data Inference" section, the mark for "Data Preprocessing" will also evaluate that part.