
IEOR 4501 Tools for Analytics - Coursework

  1. Problem Overview

    Let’s say your apartment lease is ending at the end of the year, and you need to find a new apartment. There are a lot of criteria you can use to help find a neighborhood you’d like to live in. One thing you care a lot about is a quiet neighborhood with a lot of greenery.

     

    Using NYC Open Data datasets and Zillow’s historic monthly rent averages, you will be creating a single Jupyter notebook to download, clean, and store data, as well as defining a set of SQL queries and visualizations to help answer questions of yours in search of a great area to live within your budget.

    Datasets

    ● Shapefiles of NYC’s zip codes with geometric boundary data from here.

    ● Historical monthly average rents by zip code from Zillow from here.

    ● Historical data from NYC Open Data on 311 complaints and the 2015 tree census, which you will download programmatically in your notebook.

    Setup

    PostgreSQL & PostGIS

    This project includes storing and querying geographic data. To support this, you will use PostgreSQL (“Postgres”) instead of SQLite3. Using Postgres allows us to use a popular extension for geographic data called PostGIS, which supports storing spatial data. Neither Postgres nor PostGIS comes pre-installed on your computer, but both are easy to install.

     

    Install & set up PostgreSQL & PostGIS before starting your project in a notebook. Follow the instructions here.

     

    NYC Open Data Account

    Many municipalities publish datasets open to the public, including New York City. “Open data” is content made freely available to use and redistribute, so long as there is attribution to the original source.

     

    While open data is free to download and work with, limits are placed on how much someone can download at one time. Data sources want to provide access to anyone who wants it, so they throttle requests to keep any one person from “hogging” resources. Often, anonymous requests are restricted more than authenticated ones.

     

    NYC Open Data handles this by having you create an account and then what’s called an Application Token. This token is sent along with requests for data to identify who or where the request is coming from.

     

    Note: technically, only one person needs to set up an NYC Open Data account for your group’s project. You may both create accounts, but you only need to use one when executing requests to download data in Part 1.

     

    First, create a free account. Sign up with any email address; it doesn’t have to be your Columbia email.

     

    Once you’ve signed up, log in and go to your developer settings. Click on the button that says “Create New App Token” (not “Create New API Key”). Fill in the Application Name - it can be anything, like “Final Project Fall 2023”. Then fill in the Description - it can also be anything, like “Project for Fall 2023 for Tools for Analytics”. Do not fill out any other fields. Click Save.

     

    Copy the value of your App Token; it should be roughly a 25-character string of random letters and numbers. Set this string to a variable in your project’s notebook so it’s easy to use when you need it.

     

    When you’re ready to download the required datasets, learn how to use your Application Token here.
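
    For reference, here is a minimal sketch of passing the token along with a request. The requests package, the placeholder dataset ID, and the endpoint shown are assumptions for illustration only - use the dataset ID listed on the dataset’s API documentation page.

    import requests

    APP_TOKEN = "your-25-character-app-token"  # the value copied from your developer settings

    # Placeholder dataset ID; substitute the ID from the dataset's API documentation page.
    url = "https://data.cityofnewyork.us/resource/DATASET_ID.json"

    # Sending the token in the X-App-Token header identifies your requests and
    # raises the throttling limits applied to anonymous requests.
    response = requests.get(url, headers={"X-App-Token": APP_TOKEN}, params={"$limit": 10})
    response.raise_for_status()
    print(response.json()[:2])  # peek at the first couple of records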

    Specifications

    Read these specifications closely. Failure to follow them will result in points being deducted.

     

    Create a single Jupyter Notebook that is divided into four parts: Data Preprocessing, Storing Data, Understanding Data, and Visualizing Data.

     

    For each part, use Markdown cells to describe what is going on. Construct and write your notebook as if you could hand it to a (not necessarily Python-inclined) friend, and they could understand it without needing to read the actual code.

     

    When implementing your solutions to each part, write separate functions for the specific tasks that help accomplish the goals of each part. This will make it easier to test different parts of the logic independently. Then, in separate cells, execute the functions (so, don’t just define functions; call them as well). I should be able to clear the output, then run all cells successfully to generate everything needed below.

     

    Of course, you’re encouraged to experiment, understand, and play around with the data and your code’s implementation. Be as messy as you need. Create multiple notebooks if you want as you’re exploring. The final result should be clean, readable, and tell a logical story in a single notebook.

    Part 1: Data Preprocessing

    Overview: For Part 1, you will be downloading Parquet files (both by hand and with Python code), cleaning and filtering for the relevant data, filling in missing data, and generating samples of these datasets.

     

    Downloading: For the zip code data and Zillow data, manually download the data from the links above and save it to a directory called data that lives alongside your notebook. For the 311 and tree data, you will need to programmatically download the datasets in your notebook using the Application Token you set up earlier.

     

    Cleaning & filtering: You must use the pandas and geopandas packages to load and clean all the datasets. The process of cleaning & filtering the data should include the following (a minimal sketch follows this list):

    ● Removing unnecessary columns, keeping only the columns needed to answer questions in the other parts of this project;

    ● Removing invalid data points (use your discretion!);

    ● Normalizing column names & column types where needed;

    ● Normalizing the Spatial Reference Identifiers (SRID) of any geometry.
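
    As a rough illustration (not a required approach), here is a minimal sketch of one cleaning step with geopandas. The column names are hypothetical and EPSG:4326 as the target SRID is an assumption - adjust both to your actual datasets.

    import geopandas as gpd

    def load_zipcodes(path: str) -> gpd.GeoDataFrame:
        """Load the zip code shapefile, keep only needed columns, and normalize the SRID."""
        gdf = gpd.read_file(path)
        # Hypothetical column names; inspect gdf.columns for the real ones.
        gdf = gdf[["zipcode_col", "geometry"]].rename(columns={"zipcode_col": "zipcode"})
        # Remove invalid rows (missing or empty geometry); use your discretion for other cases.
        gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty]
        # Normalize column types and the spatial reference system.
        gdf["zipcode"] = gdf["zipcode"].astype(str)
        gdf = gdf.to_crs(epsg=4326)  # assumed target SRID; use whatever you standardize on
        return gdf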

     

    Tips for Part 1:

    ● Get familiar with what SRIDs are and how they’re used. Here is a walkthrough, and an even longer walkthrough with a handy introduction to PostGIS.

    ● When downloading data from NYC Open Data, be sure to use SoQL query filters to (1) get the correct date range of data needed, and (2) get all the data within the date range, since the default is only 1,000 results (see the sketch after these tips).

    ● Make use of your .gitignore file to avoid committing the large datasets and other unnecessary files to your repo.

    ● Read ahead to figure out which columns are absolutely necessary for each dataset.

    ● Be mindful & consistent with the data types for each column, which will make it easier for yourself when storing and filtering data later on.

    ● Read about how to handle a lot of data below.
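
    As a concrete illustration of the SoQL tip above, here is a hedged sketch of paging through a filtered download. The dataset ID and the created_date column name are placeholders - check the dataset’s API documentation for the real identifiers.

    import io
    import pandas as pd
    import requests

    APP_TOKEN = "your-app-token"
    URL = "https://data.cityofnewyork.us/resource/DATASET_ID.csv"  # placeholder dataset ID

    def download_filtered(start_date: str, end_date: str, page_size: int = 50_000) -> pd.DataFrame:
        """Download every row in a date range, paging past the 1,000-row default limit."""
        frames, offset = [], 0
        where = f"created_date between '{start_date}' and '{end_date}'"  # hypothetical column name
        while True:
            params = {
                "$where": where,
                "$limit": page_size,
                "$offset": offset,
                "$order": "created_date",  # a stable sort keeps paging consistent
            }
            resp = requests.get(URL, headers={"X-App-Token": APP_TOKEN}, params=params)
            resp.raise_for_status()
            page = pd.read_csv(io.StringIO(resp.text))
            if page.empty:
                break
            frames.append(page)
            offset += page_size
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()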

     

    Part 2: Storing Data

    Overview: For Part 2, you will be taking the datasets downloaded & cleaned from Part 1, and populating a PostgreSQL database with tables generated from the datasets.

     

    First, create a new database in PostgreSQL into which you’ll load your preprocessed datasets, and turn on the PostGIS extension. Do so by running the following commands in a Jupyter Notebook - one command per notebook cell, filling in YOUR_DATABASE_NAME with the desired name for your database:

     

    !createdb YOUR_DATABASE_NAME

    !psql --dbname YOUR_DATABASE_NAME -c 'CREATE EXTENSION postgis;'

     

    Both createdb and psql are command-line tools installed along with PostgreSQL.

     

    Create a schema.sql file that defines each table’s schema. You can use SQLAlchemy within the notebook to help generate this file, use another programmatic approach, or write the schema file by hand.
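
    If you go the SQLAlchemy route, here is one possible (not required) sketch of emitting DDL to schema.sql, assuming a simplified trees table and GeoAlchemy2 for the geometry column; the column set and SRID are assumptions.

    from geoalchemy2 import Geometry
    from sqlalchemy import Column, Integer, MetaData, Table, Text
    from sqlalchemy.dialects import postgresql
    from sqlalchemy.schema import CreateTable

    metadata = MetaData()
    # Simplified example table; your real tables will have more columns.
    trees = Table(
        "trees",
        metadata,
        Column("id", Integer, primary_key=True),
        Column("species", Text),
        Column("health", Text),
        Column("status", Text),
        Column("geometry", Geometry("POINT", srid=4326)),  # SRID 4326 is an assumption
    )

    # Compile the table definition to PostgreSQL DDL and write it to schema.sql.
    ddl = str(CreateTable(trees).compile(dialect=postgresql.dialect()))
    with open("schema.sql", "w") as f:
        f.write(ddl + ";\n")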

     

    Finally, programmatically in your notebook, create & populate four tables: one for NYC zip codes, one for 311 complaints, one for trees, and one for historical average rents. Use appropriate data types for each column.
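
    A minimal sketch of what populating the tables could look like, assuming a local database named YOUR_DATABASE_NAME, table names matching your schema.sql, and cleaned dataframes from Part 1 (all of these names are placeholders):

    import geopandas as gpd
    import pandas as pd
    from sqlalchemy import create_engine

    def populate_tables(rents: pd.DataFrame, trees: gpd.GeoDataFrame, zipcodes: gpd.GeoDataFrame) -> None:
        """Append the cleaned dataframes to tables already created from schema.sql."""
        # Connection string is an assumption; adjust user/host/port if your setup differs.
        engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")
        # Plain tabular data can be written with pandas...
        rents.to_sql("rents", engine, if_exists="append", index=False)
        # ...while geodataframes with geometry columns can use geopandas' to_postgis.
        trees.to_postgis("trees", engine, if_exists="append", index=False)
        zipcodes.to_postgis("zipcodes", engine, if_exists="append", index=False)
        # ...and similarly for the 311 complaints table.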

     

    Tips for Part 2:

    ● Take a look at the Appendix for an example of a schema file.

    ● You’ll most likely need to use psycopg2 to connect to the database, which is a 3rd-party package you will need to pip install. psycopg2 is similar to Python’s sqlite3 library (Module 9) but for PostgreSQL databases instead of SQLite3 databases.

    ● I should be able to run this schema file to create the tables in a database via the psql CLI tool. That is, I should be able to run the following 3 commands in a Jupyter notebook cell to create a database (in this example, groupNproject) with the four required tables (it is not expected that you do this yourself for the project, but this is a good sanity check for it to succeed without error):

    !createdb groupNproject

    !psql --dbname groupNproject -c 'CREATE EXTENSION postgis;'

    !psql --dbname groupNproject -f schema.sql

     

    ● In anticipation of some of the spatial queries in Part 3, consider setting indices on your tables for the geometric columns. This StackOverflow post will be helpful when using SQLAlchemy, particularly within __table_args__, using postgresql_using when defining an index. PostGIS’s documentation can also be helpful. Be sure to include the CREATE INDEX statements in the schema as well (see the sketch after these tips).

    ● Double-check the SRIDs are all the same for the geometry columns in every table.

    ● Some pandas functionality that might be helpful to look into, and not just for Part 2 (this is not an exhaustive list):

    ○ pd.read_sql_query - read data from querying a SQL table

    ○ pd.read_sql_table - read entire SQL table

    ○ pd.read_sql - read a query or an entire table

    ○ df.to_sql - add data from the dataframe to a SQL table

    ○ pd.to_numeric - Convert argument to a numeric type

    ● Some geopandas functionality that might be helpful as well:

    ○ gpd.read_file - read any type of spatial data file (shapefiles, GeoJSON files, etc)

    ○ gpd.read_postgis - read spatial-based data from a SQL query
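
    As referenced in the spatial index tip above, here is a minimal SQLAlchemy sketch of a GiST index on a geometry column, declared as a standalone Index rather than inside __table_args__ (the table, column, and index names are placeholders):

    from geoalchemy2 import Geometry
    from sqlalchemy import Column, Index, Integer, MetaData, Table
    from sqlalchemy.dialects import postgresql
    from sqlalchemy.schema import CreateIndex

    metadata = MetaData()
    trees = Table(
        "trees",
        metadata,
        Column("id", Integer, primary_key=True),
        # spatial_index=False so the index can be declared explicitly below
        Column("geometry", Geometry("POINT", srid=4326, spatial_index=False)),
    )

    # PostGIS geometry columns are typically indexed with GiST rather than the default B-tree.
    geom_index = Index("idx_trees_geometry", trees.c.geometry, postgresql_using="gist")

    # Compile to DDL so the statement can be appended to schema.sql alongside CREATE TABLE.
    print(str(CreateIndex(geom_index).compile(dialect=postgresql.dialect())))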

    Part 3: Understanding Data

    Overview: In Part 3, you will be crafting a set of SQL queries to develop a better understanding of the datasets we’re working with.

     

    For this part, define a SQL query for each of the following questions - one query per question. Save each query as a .sql file, naming it something illustrative of what the query is for, e.g. top_10_zipcodes_by_population.sql.

     

    For each query, be sure to execute it in the notebook so we can see your answers to the question.

     

    The query results can be either a pandas/geopandas dataframe, or simple Python types (like a list of tuples).

     

    Note: we will not be checking for exactness with the results. We’re only concerned with the construction of your queries.

    Query 1: Which area might be more calm to live in?

    Between October 1st, 2022 and September 30th, 2023 (inclusive), find the number of 311 complaints per zip code.

     

    The query result should have two columns, one row per zip code, with the number of complaints in descending order.

     

    Query 2: Where has the most greenery?

    Using just the trees table, which 10 zip codes have the most trees?

     

    The query result should have two columns, 10 rows. The rows should be sorted by the total number of trees, descending.

     

    Query 3: Can I afford a place in the areas with the most trees?

    Of the 10 zip codes with the most trees, for the month of August 2023, what is the average rent by zip code?

     

    The query should have a JOIN statement. The query result should have two columns (not three) and 10 rows. The rows should be sorted by the total number of trees, descending. “Humanize” the rent numbers, meaning format the results as 2,879.58 instead of 2879.575128.

     

    Query 4: Could there be a correlation between an area’s rent, the number of its trees, and the number of 311 complaints?

    For the month of January 2023, return the 5 zip codes with the lowest average rent and the 5 zip codes with the highest average rent, and include the tree count and complaint count for each zip code by using JOIN statements.

     

    The query result should have 4 columns (zip code, average rent, tree count, and complaint count) and 10 rows: five with the highest average rent, and five with the lowest average rent. “Humanize” the rent numbers, meaning format the results as 2,879.58 instead of 2879.575128.

     

    Query 5: Where has the most greenery (take 2)?

    Rewrite Query 2 to use both the trees table and the zipcodes table. Join both tables where the coordinate point of the tree is inside the polygon boundary of the zipcode as defined in the zipcode table.

     

    The query should have a JOIN statement. The query results should match exactly the results of Query 2.

     

    Query 6: What is the immediate area like?

    Using the following coordinate pair on campus, which trees are within a ½-mile radius of this point?

    Latitude: 40.80737875669467, Longitude: -73.96253174434912


    The result should have 5 columns (ID, species, health, status, and coordinate location of each tree).

     

    You may use regular Python to help construct the geometric object needed for setting the radius in the query.
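
    A small sketch of the “regular Python” option mentioned above, using shapely to build the point as WKT that can be interpolated into your SQL (the mile-to-meter conversion is shown; the surrounding query is left to you):

    from shapely.geometry import Point

    # Note: shapely points are (x, y), i.e. (longitude, latitude).
    campus = Point(-73.96253174434912, 40.80737875669467)
    campus_wkt = campus.wkt  # 'POINT (-73.96253174434912 40.80737875669467)'

    # Half a mile expressed in meters, handy for distance functions that expect meters.
    half_mile_in_meters = 0.5 * 1609.344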

     

    Tips for Part 3:

    ● You may wish to use psycopg2, SQLAlchemy, or GeoAlchemy2 (or all three!) within the notebook to help craft these queries and query files. You can also use pandas and geopandas to help check the validity of your queries.

    ● To help with formatting numbers, check out PostgreSQL’s built-in formatting functions.

    ● You may want to familiarize yourself with the WITH <name> AS expressions in SQL. This is a good resource with a lot of examples to help get familiar (it uses SQLite3 but it will work in PostgreSQL as well).

    ○ There are quite a few examples using UNION ALL - this will help "flatten" tables that have common columns (e.g. average rent) into a singular shared column (for example, lowest and highest average rents in one result table).

    ● This StackExchange post will be helpful in figuring out the coordinates of a radius around a single point. Or this PostGIS function may help too.

    ● For a couple of queries, you’ll need to do what are called spatial joins. This PostGIS function can be used to check if one geometry object (point, polygon, etc.) is inside the boundary of another geometry object. This other helpful PostGIS function can be used to convert a string of coordinate pairs to a geometry object. GeoAlchemy2 can help too (see the sketch after these tips).

    ● I should be able to run each SQL file to query the database via the psql CLI tool. That is, I should be able to run the following command for each of the 6 query files (it is not expected that you do this yourself for the project, but this is a good sanity check for it to succeed without error):

    !psql --dbname groupNproject -f query.sql
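
    To illustrate executing these queries from the notebook (including the spatial-join queries referenced in the tips), here is a hedged sketch of reading a saved .sql file and loading the result into a geodataframe. The file name, database name, and geometry column name are assumptions.

    import geopandas as gpd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")  # placeholder database name

    # Keeping each query in its own .sql file means the same SQL can be run via
    # `psql -f` (as in the sanity check above) or from the notebook.
    with open("my_spatial_query.sql") as f:  # placeholder file name
        sql = f.read()

    # For queries that return a geometry column, read_postgis yields a GeoDataFrame;
    # for purely tabular results, pd.read_sql(sql, engine) works instead.
    result = gpd.read_postgis(sql, engine, geom_col="geometry")  # geometry column name is an assumption
    result.head()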

     

    Part 4: Visualizing Data

    Overview: For Part 4, you will be creating visualizations to enhance your understanding of the datasets.

     

    For this final part, you will be creating a bunch of visualizations embedded in your notebook using matplotlib and/or other visualization libraries of your choice.

     

    This is where you can get creative with the look and feel of each visual. All that is required is that each visualization is immediately understandable without necessarily needing to read its associated function (i.e. labeled axes, titles, appropriate plot/graph/visual type, etc). You’re welcome to use Markdown cells to introduce and/or explain each visualization.

     

    You can use pandas and geopandas to help parse data before generating a visualization, but you must read the data from your PostgreSQL database; do not read data directly from a file. That is, you may use pandas/geopandas dataframes to help with your visualizations, but you should create those dataframes by querying the SQL tables you need, not by reading from a file or reusing dataframes from Part 1.

     

    Create the following 6 visualizations. You must define a function for each visualization, then call these functions in separate cells to render each visualization. Pay close attention to the time frames.
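
    Here is a minimal sketch of that function-per-visualization pattern, assuming the database from Part 2 and a hypothetical complaints table and query (replace the SQL and labels with what each visualization actually needs):

    import matplotlib.pyplot as plt
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://localhost/YOUR_DATABASE_NAME")  # placeholder database name

    def plot_complaints_per_zipcode():
        """Example skeleton: query the database (not a file), then plot the result."""
        # Placeholder query and table name; substitute the SQL your visualization needs.
        df = pd.read_sql("SELECT zipcode, COUNT(*) AS n FROM complaints GROUP BY zipcode", engine)
        fig, ax = plt.subplots(figsize=(10, 4))
        df.plot.bar(x="zipcode", y="n", ax=ax, legend=False)
        ax.set_xlabel("Zip code")
        ax.set_ylabel("Number of 311 complaints")
        ax.set_title("311 complaints per zip code")
        return fig

    # Then, in a separate cell:
    # plot_complaints_per_zipcode()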

    Visualization 1: What can I expect to put up with in NYC?

    First, find the top 3 complaint types for October 1st, 2022 to September 30th, 2023 (inclusive).

     

    Then, create an appropriate visualization of the number of complaints per day over that time frame (October 1st, 2022 to September 30th, 2023, inclusive) for those complaint types.

     

    Visualization 2: What are the most common in the immediate area?

    Create a visualization that shows the number of complaints by complaint type for the top 10 complaints in zip code 10027 for October 1st, 2018 to September 30th, 2023 (inclusive).

     

    Visualization 3: Is there any correlation between rent, trees, and complaints at all?

    Between January 1st, 2015 and September 30th, 2023 (inclusive), create a visualization using 2 subplots that share the x-axis where one subplot shows rent compared to the number of trees by zip code, and the other subplot shows rent compared to the number of complaints by zip code.

     

    Visualization 4: If I can afford more in rent, will there be fewer issues & complaints?

    Create a boxplot where the x-axis is average rent in September 2023, separated into $1000 bins (i.e. $0-1000, $1001-2000, etc.), and the y-axis is the number of 311 complaints observed in each zip code between October 1, 2022 and September 30, 2023 (inclusive).

     

    Visualization 5: Where were the recent 311 incidents reported from in the immediate area?

    Create a geospatial plot of the coordinates of reported 311 incidents that happened between January 1st, 2023 and September 30th, 2023 (inclusive) within a 1 kilometer radius of the same coordinate from Query 6 in Part 3.

     

    Visualization 6: Are areas trying to improve the number of trees in the neighborhood?

    Create a geospatial plot of two sets of data: the coordinates of trees in NYC, and the coordinates of "New Tree Request" 311 complaint types that were made from October 1st, 2018 to September 30th, 2023 (inclusive).

     

    Tips for Part 4:

     

    ● Some handy resources for various pandas functionality:

    ○ Group by two columns

    ○ Creating bins/buckets (see the sketch after these tips)

    ● Check out matplotlib’s gallery, geopandas’ gallery, and geoplot’s gallery for inspiration on which visualization to choose. Click on a visualization for sample code that generates it.

    ● You’re not limited to the above visualization libraries – feel free to use others, like seaborn, folium, Rasterio, or anything else that interests you.
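
    For the bins/buckets tip above, a small sketch of cutting average rent into $1000 bins with pandas (the dataframe and column names are made up for illustration):

    import numpy as np
    import pandas as pd

    # Hypothetical dataframe: one row per zip code with its average rent and complaint count.
    df = pd.DataFrame({"avg_rent": [950, 1800, 2600, 3400], "complaints": [120, 340, 80, 55]})

    # Build $1000-wide bin edges that cover the data, then label each row with its bin.
    edges = np.arange(0, df["avg_rent"].max() + 1000, 1000)
    df["rent_bin"] = pd.cut(df["avg_rent"], bins=edges)

    # The bins can then serve as the x-axis groups, e.g. for a boxplot of complaint counts.
    print(df.groupby("rent_bin", observed=True)["complaints"].describe())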
