For this homework, you’ll build a KNN Classifier to predict how a given state would vote in a specific year based on its demographic similarity to other states.
● Features: demographic data for each state
● Label: the party each state votes for in a presidential election (“party_detailed” column)
We recommend using scikit-learn for this homework. We’ll explore some of its functionality in lab (and in class), but it will be helpful for you to read up on the documentation. Some useful functions:
● KNeighborsClassifier(), knn.fit(), knn.predict()
● KFold(), cross_validate()
● train_test_split()
● metrics.classification_report()
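If you haven’t used these functions before, here is a minimal sketch of how they fit together. The feature values, labels, test_size, and k = 1 below are placeholders for illustration only, not the settings to use for the homework:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Toy data: two made-up features per "state" and a party label for each
X = [[0.49, 0.51], [0.50, 0.50], [0.48, 0.52], [0.51, 0.49]]
y = ["DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "REPUBLICAN"]

# Hold out part of the data for testing (random_state=0, as the assignment requires)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1 only for this toy example
knn.fit(X_train, y_train)                   # learn from the training split
y_pred = knn.predict(X_test)                # predict labels for the test split

# Per-class precision and recall, plus overall accuracy
print(metrics.classification_report(y_test, y_pred, zero_division=0))
```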
High-Level Steps
● Read in, clean, organize data. Make sure your dataframe has the states in alphabetical order.
● For the year 2000, use cross-fold validation to choose the best value of k for a KNN Classifier
○ There could be many “best” values of k depending on whether you care about accuracy, precision, or recall
● Now that you’ve chosen a good k, set up a new classifier for 2004. This is your “real” one. Your classifier should predict the party a given state voted for in 2004.
● Run the “real” classifier and see how it does!
Data for this Homework
● 1976-2020-president.tab (right-click and choose “save link as” to download)
○ We’ll spend some time breaking down and cleaning up this file in class on Tuesday, 3/19!
● demographics.csv
The first file contains all presidential election years from 1976 through 2020. For each year, it lists every state and how many votes that state cast for each candidate on the ballot. The second file contains some demographic information gathered about every state from the 2000 census.
Some ideas for the data cleaning you’ll need for this homework -- these are starting points to consider; you may need more or different steps to make it all work!
● We need to know which party (“party_detailed” column) won each state in a given year; currently, the file only has the raw number of votes per candidate. This will be our classifier label.
● Use this demographic data as features for each state:
○ Percent of population identified as male (relevant columns: TOT_MALE, TOT_POP)
○ Percent of population identified as female (relevant columns: TOT_FEMALE, TOT_POP)
● You’ll probably want to merge the two datasets into one dataframe. State names are what they have in common, but be careful with upper/lowercase, and make sure you put the states in alphabetical order before creating your model (see the sketch after this list).
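Here is one possible sketch of that cleanup with pandas. The election-file column names (year, state, party_detailed, candidatevotes) follow the MIT Election Lab file, and “STNAME” is only a guess at the demographics file’s state column; your file may also need aggregating first if it is not already one row per state. Adjust all of these to match the actual data.

```python
import pandas as pd

def winners_by_state(elections, year):
    """One row per state: the party ("party_detailed") with the most votes in `year`."""
    rows = elections[elections["year"] == year]
    # For each state, keep the row whose candidate received the most votes
    idx = rows.groupby("state")["candidatevotes"].idxmax()
    winners = rows.loc[idx, ["state", "party_detailed"]].copy()
    winners["state"] = winners["state"].str.upper()   # normalize case for merging
    return winners

def state_features(demo):
    """Percent-male / percent-female features per state from the census columns."""
    feats = pd.DataFrame()
    feats["state"] = demo["STNAME"].str.upper()       # guessed column name; check your file
    feats["pct_male"] = demo["TOT_MALE"] / demo["TOT_POP"]
    feats["pct_female"] = demo["TOT_FEMALE"] / demo["TOT_POP"]
    return feats

# The .tab file is tab-separated; demographics.csv is comma-separated
elections = pd.read_csv("1976-2020-president.tab", sep="\t")
demo = pd.read_csv("demographics.csv")

# Merge on the normalized state names and sort alphabetically before modeling
merged = state_features(demo).merge(winners_by_state(elections, 2000), on="state")
merged = merged.sort_values("state").reset_index(drop=True)
```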
Choose a Value of k (Year: 2000)
To be sure we all get the same answers for the auto-graded questions, please make sure that you...
● Set the optional parameter random_state to 0 in train_test_split()
● Set the optional parameter random_state to 0 and shuffle to True in KFold()
● Put your merged dataframe in alphabetical order by state
For the first four questions below, your job is to find the best value of k (number of neighbors) for a given year, depending on whether we care most about accuracy, precision, or recall.
How will you find the best k? By setting k = 4, 5, 6, ..., 10 and using cross-fold validation to evaluate how good each value of k is. Set the number of splits to 5 (using the n_splits parameter). Scikit-learn will split the data into 5 groups and use four of them for training and one of them for testing. Each group is used as the testing group once, and scikit-learn can compute the mean accuracy, precision, and recall (among other things) across all 5 folds.
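A sketch of that loop is below, assuming X and y are the year-2000 feature matrix and labels produced by your cleaning step. Macro-averaged precision and recall are used here as one reasonable choice; use whatever averaging the assignment asks for.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

def score_k_values(X, y, k_values=range(4, 11), n_splits=5):
    """Mean cross-validated accuracy/precision/recall for each candidate k."""
    # shuffle=True and random_state=0, as the assignment requires
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    metrics_used = ["accuracy", "precision_macro", "recall_macro"]
    results = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_validate(knn, X, y, cv=kf, scoring=metrics_used)
        results[k] = {m: np.mean(scores["test_" + m]) for m in metrics_used}
    return results

# e.g., pick the k with the highest mean recall for the year-2000 data:
# results = score_k_values(X_2000, y_2000)
# best_k_recall = max(results, key=lambda k: results[k]["recall_macro"])
```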
Create the “Real” Classifier (Year: 2004)
Now that you’ve found good k values, set up a new classifier, but using data from 2004 instead of 2000.
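One way to organize this step is sketched below, assuming X_2004 and y_2004 are built the same way as the 2000 data and best_k_recall is the k you chose; these names are placeholders for your own variables.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def evaluate_year(X, y, k):
    """Train a KNN classifier with the chosen k and report how it does."""
    # random_state=0 so everyone gets the same train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(metrics.classification_report(y_test, y_pred, zero_division=0))
    return y_test, y_pred

# y_test_2004, y_pred_2004 = evaluate_year(X_2004, y_2004, k=best_k_recall)
```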
Part 1 - Questions about the Data
This part of your solution will be auto-graded. When you find this assignment on Gradescope, the first part will ask you the following questions; type or select your answers. You must compute the answers to these questions programmatically. Gradescope will confirm whether your answers are correct or incorrect.
Check the output! Gradescope can be a little picky about formatting, and we don’t want you to lose points for putting extra characters or whitespace in an answer. Make sure you’ve got the correct answer to each question for full credit.
Answer these questions (make sure you compute these answers in your Python solution):
For year 2000: What is the optimal value of k if we care most about recall?
For year 2000: What is the lowest mean recall for any value of k?
For year 2000: What is the optimal value of k if we care most about overall precision?
Part 2 - Visualization
Create two Python plots and upload them as screenshots/downloads.
● Plot #1: A heatmap showing the confusion matrix when the value of k is optimal if we care most about recall, for the year 2004.
● Plot #2: A plot showing why you picked those optimal values of k for the year 2000 (see the sketch after this list).
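A sketch of both plots, assuming the y_test/y_pred values returned by the 2004 classifier above and the results dictionary from the cross-validation sketch, might look like this; treat the titles and file names as placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion_heatmap(y_test, y_pred, k):
    """Plot #1: confusion matrix for the 2004 predictions, drawn as a heatmap."""
    disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap="Blues")
    disp.ax_.set_title(f"2004 KNN predictions (k = {k})")
    plt.tight_layout()
    plt.savefig("confusion_2004.png")
    plt.show()

def plot_k_scores(results, metric="recall_macro"):
    """Plot #2: mean cross-validation score vs. k for the year-2000 data."""
    ks = sorted(results)
    plt.plot(ks, [results[k][metric] for k in ks], marker="o")
    plt.xlabel("k (number of neighbors)")
    plt.ylabel("mean " + metric)
    plt.title("Choosing k for year 2000")
    plt.tight_layout()
    plt.savefig("k_selection_2000.png")
    plt.show()
```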
Part 3 - Code Quality
Submit on Gradescope the code you developed to compute your answers to the Part 1 questions and to generate the plots for Part 2. Your code will be graded on modularity, readability, and reusability.