Python Machine Learning - Project
Overview
Evaluating an individual's risk of drug consumption and misuse is an important first step toward addressing this serious global problem. According to [1], a number of factors are correlated with initial drug use, including psychological, social, individual, environmental, and economic factors, and these factors are likewise associated with a number of personality traits.
In [1], an online survey methodology was employed to collect data including Big Five personality traits (NEO-FFI-R), impulsivity (BIS-11), sensation seeking (ImpSS), and demographic information, resulting in the Drug Consumption dataset that contained information on the consumption of eighteen (18) central nervous system psychoactive drugs.
The following paper contains a detailed description of the data and the process of data quantification: https://link.springer.com/book/10.1007/978-3-030-10442-9 or https://arxiv.org/abs/1506.06297.
You are tasked to analyze this data through the construction of machine learning models. For this assignment, please use the Drug Consumption dataset from the UCI Machine Learning Repository.
Topic: Supervised learning – Binary classification
The aim of this learning task is to identify the profiles of people who are prone to consuming drugs, as contrasted with those who are not; i.e., the class label is Consume (Non-User, User).
In terms of the dataset, this means that we are converting the original multi-class learning problem into a binary learning problem. Specifically, for feature number 32 in the original dataset, we combine classes C1 (Never Used) and C2 (Used over a Decade Ago) into “Non-User”, while the data for the remaining four (4) classes are grouped together and labeled as “User”.
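The relabeling described above can be sketched in a few lines of plain Python. This is only an illustration: the codes `CL0` and `CL1` are an assumption about how your copy of the UCI file encodes "Never Used" and "Used over a Decade Ago", and `binarize_consumption` and the sample values are hypothetical names and data, not part of the dataset.

```python
# Codes that are merged into "Non-User".
# ASSUMPTION: "CL0"/"CL1" stand for "Never Used" / "Used over a Decade Ago";
# replace them with whatever codes appear in your copy of the data.
NON_USER_CODES = {"CL0", "CL1"}


def binarize_consumption(codes, non_user_codes=NON_USER_CODES):
    """Map each multi-class consumption code to 0 (Non-User) or 1 (User)."""
    return [0 if code.strip() in non_user_codes else 1 for code in codes]


# Made-up sample values, only to show the mapping.
raw_codes = ["CL0", "CL1", "CL2", "CL5", "CL0", "CL6"]
labels = binarize_consumption(raw_codes)
print(labels)
```

Any code outside `NON_USER_CODES` is treated as "User", so the same function works regardless of how many "User" classes the file contains.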
Note that this dataset is imbalanced: most individuals surveyed never consumed drugs. For now, we are not considering any form of data rebalancing prior to learning. Please follow the steps below.
Import the data into your machine learning environment and conduct feature engineering. Next, construct models using the following four (4) types of algorithms: a single decision tree (DT), a random forest (RF) learner, a support vector machine (SVM), and a k‐nearest neighbor (k-NN) classifier. You should use the holdout method of evaluation, namely use 67% of the data for training, and 33% for testing.
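A minimal sketch of the 67%/33% holdout split using Scikit-Learn's `train_test_split` follows. The data here is a synthetic, imbalanced stand-in generated with `make_classification`, not the Drug Consumption dataset. The `stratify=y` argument and the fixed `random_state` are choices made for this sketch (they keep the class ratio similar in both parts and make the split repeatable); the assignment itself does not require them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, 12 numeric features, roughly 80% class 0.
X, y = make_classification(
    n_samples=300, n_features=12, weights=[0.8, 0.2], random_state=0
)

# Holdout evaluation: 67% of the rows for training, 33% for testing.
# stratify=y is an assumption of this sketch, not an assignment requirement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

print("train rows:", len(X_train), "test rows:", len(X_test))
print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test: ", y_test.mean())
```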
A Programming
a. Feature engineering: feature transformation and feature selection. Feature transformation and feature selection are pre-processing steps followed before conducting machine learning. Refer to the reference paper for more details [1]. Feature transformations are useful to prepare the data for learning and include converting categorical data to numerical data, or normalizing numerical data prior to training. Feature selection techniques remove unnecessary features prior to training. To this end, you may use (any) one (1) feature selection algorithm as available in Scikit-Learn.
b. Model construction: Use the four (4) algorithms - DT, RF, SVM, k-NN - to construct four (4) models against the data.
c. Evaluation: Show the four (4) confusion matrices corresponding to the four (4) models and calculate the recalls and precisions.
d. Evaluation: Draw a figure to show the ROC Curves for the four (4) models.
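Steps a to d above can be sketched as one small Scikit-Learn workflow. This is an illustration under stated assumptions, not a reference solution: the data is synthetic, `StandardScaler` and `SelectKBest` with `f_classif` and `k=8` are just one possible transformation and one possible feature selection algorithm, all hyperparameters are arbitrary, and `positive_scores` and `plot_roc` are hypothetical helper names. Plotting assumes matplotlib is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the engineered features and binary labels.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# b. The four learners; hyperparameters are illustrative, not tuned.
classifiers = {
    "DT": DecisionTreeClassifier(max_depth=4, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}


def positive_scores(model, X):
    """Return a score for class 1: probabilities if available, else margins."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


results = {}
for name, clf in classifiers.items():
    # a. Scaling and univariate feature selection are fitted on training data only.
    model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=8), clf)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores = positive_scores(model, X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    results[name] = {
        # c. Rows are true classes, columns are predicted classes.
        "cm": confusion_matrix(y_test, y_pred, labels=[0, 1]),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "fpr": fpr, "tpr": tpr, "auc": roc_auc_score(y_test, scores),
    }
    print(name, "precision=%.2f recall=%.2f" % (
        results[name]["precision"], results[name]["recall"]))
    print(results[name]["cm"])


def plot_roc(results, path="roc_curves.png"):
    """d. Draw all ROC curves in one figure and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    for name, r in results.items():
        ax.plot(r["fpr"], r["tpr"], label="%s (AUC = %.2f)" % (name, r["auc"]))
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_roc(results)` writes the figure to `roc_curves.png`. Wrapping the preprocessing and the learner in one pipeline is a design choice that keeps the test set out of the scaling and feature selection steps.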
B Explainable AI
You are asked to explore the most accurate model obtained by the decision tree algorithm against the Drug Consumption dataset as used in assignments 1 and 2. Answer the following questions. Your answers should focus on feature importance, the properties of the dataset, the characteristics of the algorithm, and the usefulness of the resultant model.
1. Display/visualize the resultant model created by the decision tree.
2. Explain how, and why, the algorithm made a specific decision.
3. Explain why the algorithm didn’t do something else.
4. Discuss when the algorithm succeeded and when it failed.
5. Explain how the algorithm could potentially improve its predictions.
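As a possible starting point for the questions above, the sketch below prints a fitted tree as text, ranks features by importance, and traces the tests applied to one sample. It runs on synthetic data; the feature names `f0` to `f5`, the depth limit, and the choice of sample are all assumptions made for illustration, and the tree is not the assignment's model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names and synthetic data, for illustration only.
feature_names = ["f%d" % i for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Question 1: a plain-text rendering of the learned rules.
print(export_text(tree, feature_names=feature_names))

# Feature importance: impurity-based scores, highest first.
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%s: %.3f" % (name, score))

# Question 2: which tests led to the prediction for one specific sample?
sample = X[:1]
node_ids = tree.decision_path(sample).indices  # nodes visited, root to leaf
leaf_id = tree.apply(sample)[0]
for node in node_ids:
    if node == leaf_id:
        print("leaf %d -> predicted class %d" % (node, tree.predict(sample)[0]))
        break
    feat = tree.tree_.feature[node]
    thr = tree.tree_.threshold[node]
    direction = "<=" if sample[0, feat] <= thr else ">"
    print("node %d: %s = %.3f %s %.3f" % (
        node, feature_names[feat], sample[0, feat], direction, thr))
```

The untaken branch at each printed node is one way to approach question 3: it shows which feature value would have had to differ for the sample to reach another leaf.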