
COMP90049 Introduction to Machine Learning


    1 Overview

    The goal of this project is to build and critically analyse Machine Learning methods to predict the rating of movies extracted from the TMDB database. This assignment offers an opportunity to delve into machine learning concepts within a research context and to enhance your skills in data analysis and problem-solving.

    The objective is to critically evaluate and analyse the efficacy of various machine learning algorithms for predicting movie ratings and to communicate your findings in an academic report. The technical aspect involves implementing machine learning algorithms to solve the task, while the report focuses on interpreting observations and drawing meaningful conclusions. Your report should showcase your understanding of the subject matter in a manner accessible to a reasonably informed reader. 

    2 Data

    The information on movies is collected from the TMDB website, which is a platform that allows users to search its database of movies, rate them and write reviews. The data files for this project are available via Canvas and are described in a corresponding README file. In our datasets, each movie contains:

    • Movie features: such as `release_year`, `runtime`, `budget`, `revenue`, `original_language` and more.

    • Text features: There are four text features in this dataset: `title`, `overview`, `tagline` and `production_companies`. We have used various text encoding methods to encode these features for you.

      You are provided with:

    Labelled datasets (these include movies’ ratings, as explained in Section 3.1):
    o TMDB_train.csv: Consists of movie features of 100,000 movies with their rating labels that you can use to train your supervised Machine Learning models.

    o TMDB_evaluate.csv: Consists of movie features of 20,000 movies with their rating labels that you can use to evaluate your supervised Machine Learning models.

    Unlabelled datasets:
    o TMDB_unlabelled.csv: Consists of details of 254,701 movies that you can use to train your unsupervised or semi-supervised Machine Learning models.

    o TMDB_test.csv: Consists of details of 20,000 movies that you should use to TEST the performance of your Machine Learning models and report the results on the Kaggle page.

    Pre-processed datasets:
    o TMDB_text_features_*.zip: The preprocessed text features for the training and test sets, one zipped file for each text encoding method. Details about using these text features are provided in the README file.

    3.1 Target Labels

    These are the labels that your model should predict (y). We provide this label in two forms:

    • the average rating (float; in the column named `average_rate` in the Train and Evaluate CSV files); and

    • a categorical label indicating the rating band, where we binned the rating of the movies into 6 categories as follows (integer; in the column named `rate_category` in the Train and Evaluate CSV files):

    o average_rate < 4 → 0
    o 4 <= average_rate < 5 → 1
    o 5 <= average_rate < 6 → 2
    o 6 <= average_rate < 7 → 3
    o 7 <= average_rate < 8 → 4
    o average_rate >= 8 → 5
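
    For concreteness, this banding can be reproduced from `average_rate` with a few lines of pandas. A minimal sketch, assuming the file and column names described in this section:

```python
import pandas as pd

# Sketch: derive the 6 rating bands of Section 3.1 from `average_rate`.
train = pd.read_csv("TMDB_train.csv")

# Bin edges follow the table above: <4 -> 0, [4,5) -> 1, ..., >=8 -> 5.
bins = [-float("inf"), 4, 5, 6, 7, 8, float("inf")]
derived = pd.cut(train["average_rate"], bins=bins,
                 right=False, labels=range(6)).astype(int)

# If the files follow the description above, this check should pass.
assert (derived == train["rate_category"]).all()
```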

    You may use either of these label representations in your experiments, but different representations might call for different machine-learning approaches.


    3.2 Features

    To aid in your initial experiments, we have created different feature representations from the given datasets. You may use any subset of the text feature representations described below in your experiments, and you may also engineer your own features from the raw descriptions if you wish. The provided representations are:

    1. BoW (Bag of Words)

    We applied the CountVectorizer to transform the instances into vectors of (Token_ID, count) pairs. For example, using CountVectorizer, the title “Sudani from Nigeria” will be transformed into the following vector:

    [(39785, 1), (28688, 1)]

    Where 39785 is the Token_ID for the word “Sudani” and 28688 is the Token_ID for the word “Nigeria”. In this example and the provided files, we (1) removed all stopwords and (2) only retained the 1000 words with the highest counts in the full data set. There are many other modifications you can experiment with to test hypotheses of your own; for example, how removing very frequent and/or very infrequent words affects the behaviour of your Machine Learning models.
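
    As an illustration, a CountVectorizer configuration along these lines might look as follows. This is a sketch with toy inputs; the exact settings used to produce the provided files are documented in the README:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sketch of the BoW encoding: English stopwords removed, vocabulary capped
# at the 1000 highest-count tokens.
titles = ["Sudani from Nigeria", "The Godfather"]  # toy examples

vectorizer = CountVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(titles)  # sparse matrix of token counts

# Print (Token_ID, count) pairs for the first title. Token IDs depend on
# the fitted vocabulary, so they will differ from the example above.
row = X[0].tocoo()
print(list(zip(row.col, row.data)))
```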


    2. TFIDF

    We applied term frequency-inverse document frequency pre-processing (TfidfVectorizer) to transform the text features into vectors of values that measure each term's importance, using the following formula:

        tfidf(t, d) = f_{t,d} × log(N / f_t)

    where f_{t,d} is the frequency of term t in document d, f_t is the number of documents containing t, and N is the total number of documents in the collection. You can learn more about TFIDF in (Qaiser, Ali 2018). Using TFIDF, the above example title is transformed into a corresponding vector of (Token_ID, weight) pairs.
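
    A corresponding sketch with sklearn's TfidfVectorizer is below. Note that sklearn computes a smoothed variant of the formula above (log((1 + N) / (1 + f_t)) + 1) and L2-normalises each row by default, so its weights will not exactly match a hand computation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of the TFIDF encoding with toy inputs; see the README for the
# settings used to produce the provided files.
docs = ["Sudani from Nigeria", "The Godfather"]  # toy examples

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(docs)  # sparse matrix of TFIDF weights

row = X[0].tocoo()
print(list(zip(row.col, row.data)))  # (Token_ID, weight) pairs
```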


    4 Stage I
    4.1 Task Basics

    You'll use the TMDB_train.csv dataset and, if you haven't pre-processed the data yourself, you can use the pre-processed files (TMDB_text_features_*.zip) to train machine learning models. Then, you'll evaluate these models using TMDB_evaluate.csv. Depending on your research question, you might also use TMDB_unlabelled.csv to improve your models. Once your models are ready, you'll use them to predict the 'rate_category' for all the movies in TMDB_test.csv. You'll submit these predictions to our Kaggle competition page.
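
    A minimal end-to-end sketch of this workflow is below. The numeric feature subset, the `fillna(0)` handling of missing values, and the `id` column in the submission are assumptions; check the README and the Kaggle page for the real column names and submission format.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the datasets described in Section 2.
train = pd.read_csv("TMDB_train.csv")
evaluate = pd.read_csv("TMDB_evaluate.csv")
test = pd.read_csv("TMDB_test.csv")

# An assumed subset of numeric movie features; engineer your own as needed.
features = ["release_year", "runtime", "budget", "revenue"]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(train[features].fillna(0), train["rate_category"])

# Training-evaluation phase: score on the labelled evaluation set.
preds = model.predict(evaluate[features].fillna(0))
print("accuracy:", accuracy_score(evaluate["rate_category"], preds))

# Test phase: predict rate_category for the test set and write a Kaggle
# submission (the format here is assumed; see the competition page).
submission = pd.DataFrame({
    "id": test["id"],
    "rate_category": model.predict(test[features].fillna(0)),
})
submission.to_csv("submission.csv", index=False)
```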

    4.2 Research Question

    You should formulate a research question and develop machine learning algorithms and appropriate evaluation metrics to address the research question. Here are some sample research questions:

    • Are big-budget films more popular than their low-budget counterparts?

    • Is there any relationship between production companies and the rating of their movies?

    • Is the rating of a movie correlated with its genre?

    • Is the rating of a movie correlated with its title?

    • Does the use of unlabelled data improve the performance of machine learning models in this dataset?

    • How does the use of text features impact the performance of machine learning models in this dataset?

    • Does using the ‘overview’ assist with the identification of very high-rated movies?

      There are many more possible questions. You can choose to use any as your research question.


      4.3 Feature Engineering

      The process of engineering or selecting features that are useful for discriminating among your target class set is inherently poorly defined. Most machine learning methods assume that the attributes are simply given, with no indication of where they came from. The question of which features are the best ones to use is ultimately an empirical one: just use the set that allows you to correctly classify the data.

      In practice, the researcher uses their knowledge about the problem to select and construct “good” features. What aspects of a movie’s details might indicate its rating? You can find ideas in published papers, e.g., (Saraee, White et al. 2004).

      You may use the provided features as they are, generate a new set of features, or select a subset of the features. Whichever you choose, you must use the features to train some models and run a few experiments on the given evaluation data.
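
      As one illustration of feature selection (a sketch; the candidate columns are assumptions), you could rank numeric features by their mutual information with the class label and keep only the strongest:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Sketch: keep the k numeric features most informative about the label.
train = pd.read_csv("TMDB_train.csv")
candidates = ["release_year", "runtime", "budget", "revenue"]  # assumed

selector = SelectKBest(mutual_info_classif, k=2)
selector.fit(train[candidates].fillna(0), train["rate_category"])
print(selector.get_feature_names_out())  # the retained feature names
```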


      4.4 Analysing Machine Learning Models

      Various machine learning techniques have been (or will be) discussed in this subject (0R, 1R, Naive Bayes, Decision Trees, k-NN, Logistic Regression, Neural Networks, etc.); many more exist. You may use any machine learning method you consider suitable for this problem. You are strongly encouraged to make use of machine learning software and/or existing libraries (such as sklearn) in your attempts at this project.
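
      For instance, a few of the classifiers named above can be compared in a handful of lines (a sketch; the feature subset and missing-value handling are assumptions):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier  # a 0-R-style baseline
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("TMDB_train.csv")
evaluate = pd.read_csv("TMDB_evaluate.csv")
features = ["release_year", "runtime", "budget", "revenue"]  # assumed

models = {
    "0-R": DummyClassifier(strategy="most_frequent"),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(train[features].fillna(0), train["rate_category"])
    preds = model.predict(evaluate[features].fillna(0))
    print(f"{name}: {accuracy_score(evaluate['rate_category'], preds):.3f}")
```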

      In this stage, your task has two phases:

      • The training-evaluation phase: The holdout approach should be applied to the training data provided. Check section 4.6 for the minimal expectations in this phase.

      • The test phase: the trained classifiers will be evaluated on the provided test data. The predicted labels of test cases should be submitted as part of the Stage I deliverable on Kaggle. Check section 7 for details.

        Based on your research question, and after training different models and running a few experiments, you are expected to develop some understanding of why you obtain the results you do, and some hypotheses about how you could change them. Here are a few examples:


        Example 1
        Hypothesis: Removing `release_year` from the features can reduce the noise and increase the performance of models X and Y.

        Test: Compare the performance of models X and Y before and after removing `release_year`, and comment on the observation. Did the observation support the hypothesis? Why or why not?
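
        A sketch of how this ablation could be run, with a decision tree standing in for models X and Y and an assumed feature list:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("TMDB_train.csv")
evaluate = pd.read_csv("TMDB_evaluate.csv")

full = ["release_year", "runtime", "budget", "revenue"]  # assumed features
ablated = [f for f in full if f != "release_year"]

# Compare holdout accuracy with and without `release_year`.
for features in (full, ablated):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(train[features].fillna(0), train["rate_category"])
    acc = accuracy_score(evaluate["rate_category"],
                         model.predict(evaluate[features].fillna(0)))
    print(features, f"accuracy={acc:.3f}")
```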


        Example 2
        Hypothesis: The fact that machine learning model A runs faster than model B has something to do with the structure of the instances in this dataset.

        Test: Change the structure of the instances (by sub-sampling, feature engineering or any other means) and test the performance of the two models before and after the changes. Did the observation support the hypothesis? Why or why not?
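
        A sketch of such a timing experiment, with illustrative model choices and an assumed feature subset:

```python
import time

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("TMDB_train.csv")
features = ["release_year", "runtime", "budget", "revenue"]  # assumed

# Time fit + predict for two models at increasing sub-sample sizes.
for n in (1000, 10000, 100000):
    sample = train.sample(n=min(n, len(train)), random_state=0)
    X, y = sample[features].fillna(0), sample["rate_category"]
    for name, model in (("k-NN", KNeighborsClassifier()),
                        ("Naive Bayes", GaussianNB())):
        start = time.perf_counter()
        model.fit(X, y)
        model.predict(X)
        print(f"n={len(sample)} {name}: {time.perf_counter() - start:.2f}s")
```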


        You should then test these hypotheses with more experiments. When explaining your results, you are expected to use examples from the dataset as well as theories and findings from the lectures and published literature. You are also expected to use appropriate visualization tools (e.g., tables or diagrams) to communicate your findings professionally and academically.


        4.5 Report

        Your main submission for this assignment is your report. The report should follow the structure of a short research paper, as will be discussed in the guest lecture on Academic Writing. It should describe your approach and observations, both in engineering features, and the machine learning algorithms you tried. Its main aim is to provide the reader with knowledge about the problem, in particular critical analysis of your results and discoveries. The internal structure of well-known classifiers (discussed in the subject) should only be mentioned if it is important for connecting the theory to your practical observations.

        The following is the expected structure of the report for this assignment.

        Introduction: a short description of the problem, data set and research question. Your report should clearly state your research question. Remember that addressing more than one research question does not necessarily lead to higher marks; we value the depth and quality of your critical analysis of methods and results over simply covering more content or material.

        Literature review: a short summary of some related literature, including the data set reference and at least two additional relevant research papers of your choice.

        Method: Introduce the feature(s) used and the rationale for including them. Explain and justify the Machine Learning models you have used and their hyperparameters. You also need to explain the evaluation method(s) and metric(s) you have used (and why you have used them). This should be at a conceptual level; a detailed description of the code is not appropriate for this report. The description should be similar to what you would see in a machine learning conference paper.

        Results: Present the results, in terms of evaluation metric(s) and, ideally, illustrative examples and diagrams.

        Discussion / Critical Analysis: Contextualise the systems' behaviour, based on your understanding of the subject materials (this is the most important part of the task in this assignment). "Contextualise" implies that we are more interested in seeing evidence that you have thought about the task and determined reasons for the relative performance of the different methods than we are in the raw scores of the methods you selected. This is not to say that you should ignore the relative performance of different runs over the data, but rather that you should think beyond simple numbers to the reasons that underlie them and connect them back to your research question. You can also add complementary experiments and their results in this section.


        Conclusion: Demonstrate your identified knowledge about the problem and suggest the next steps.

        Bibliography: references to any related work you used in your project. You are encouraged to use the APA 7 citation style but may use different styles as long as you are consistent throughout your report.

        We will provide LaTeX and docx-style files that we would prefer that you use in writing the report. Reports are to be submitted in the form of a single PDF file. If a report is submitted in any format other than PDF, we reserve the right to return the report with a mark of 0.

        Your name and student ID should not appear anywhere in the report, including any metadata (filename, etc.). If we find any such information, we reserve the right to return the report with a mark of 0. 

       
