Monash
FIT5137 Advanced Database
You have been hired as a data analyst at Public Transport Victoria (PTV), the Victoria Government authority responsible for public transport in the state. Some of your duties are data extraction, integration and analysis to provide good understanding regarding the public transportation condition in Victoria to the stakeholders.
After the COVID-19 restrictions were lifted, most companies are switching the workstyle from work-from-home to face-to-face. Therefore, transportation infrastructure and network is one of the most important aspects. While some people prefer to drive to work, some other people prefer to use the public transportation network as their main transportation mode. PTV as the sole provider for the public transportation network reduced their services during the lockdown period. Now, PTV has restored the services to cover as many areas as possible in the whole region. However, some questions remained mysteries. How good is the current PTV coverage? Are there any uncovered spots? Which area has the best public transportation options?
Therefore, as a data analyst, your task is to evaluate the data and provide the spatial data analysis to the stakeholders of PTV. The data should be presented in an area level, such as municipality, suburbs or postcode. For example, you may present “The number of bus services in Bundoora” or "The number of Trains or Trams network in Bundoora”.
There are two datasets that you have to obtain in this assignment, which are the PTV/GTFS dataset and Australian Boundary data.
The General Transit Feed Specification (GTFS) is a data specification that allows public transit agencies to publish their transit data in a format that can be consumed by a wide variety of software applications. Today, the GTFS data format is used by thousands of public transport providers.
GTFS is split into a schedule component that contains schedule, fare, and geographic transit information and a real-time component that contains arrival predictions, vehicle positions and service advisories. A GTFS feed is composed of a series of text files collected in a ZIP file. Each file models a particular aspect of transit information: stops, routes, trips, and other schedule data.
For more detailed information about GTFS, you can refer to the official documentation provided by Google at https://developers.google.com/transit/gtfs. Additionally, You can read further explanation about the PTV-GTFS data from https://transitfeeds.com/p/ptv/497. For this assignment, we will be using the 17th March 2023 version of the dataset.
The Australian digital boundary is defined by the Australian Bureau of Statistics using the Australian Statistical Geography Standard (ASGS). The ASGS is a classification of Australia into a hierarchy of statistical areas. It is a social geography, developed to reflect the location of people and communities. It is used for the publication and analysis of official statistics and other data. The ASGS is updated every 5 years to account for growth and change in Australia’s population, economy and infrastructure. For the 2021 release, the ASGS will be re-named to the Australian Statistical Geography Standard (ASGS) Edition 3.
The ASGS is split into two parts, the ABS and Non ABS Structures. The ABS Structures are geographies that the ABS designs specifically for the release and analysis of statistics. This means that the statistical areas are designed to meet the requirements of statistical collections as well as geographic concepts relevant to those statistics. This helps to ensure the confidentiality, accuracy and relevance of ABS data.The Non ABS Structures generally represent administrative regions which are not defined or maintained by the ABS, but for which the ABS is committed to directly providing a range of statistics.
The Main Structure is developed by the ABS and is used to release and analyse a broad range of social, demographic and economic statistics. It is a nested hierarchy of geographies, and each level directly aggregates to the next level. Mesh Blocks (MBs) are the smallest geographic areas defined by the ABS and form the building blocks for the larger regions of the ASGS. Most Mesh Blocks contain 30 to 60 dwellings.
Below is the simplified ABS and Non ABS Structure. You can read further explanation about the structure here
Your assignment consists of several parts. Always read the instruction one by one. Do not move to the step without completing the previous step:
● Task 1: Data Restoration - Restore the data to the database. Monitor the success indicator to ensure successful restoration of the data.
● Task 2: Data Preprocessing - Perform necessary structure maintenance and create result tables for further processing.
● Task 3: Data Analytics and Visualization - Develop SQL queries to analyze the data and evaluate performance & Create visualizations to present the results of the data analytics.
● No data cleaning required for this assignment.
● For more information, see the FAQ for Assignment 3.
For simplicity, all the data required for this assignment is readily available in the PostGIS Docker container. You can access these datasets within the container by navigating to the /data/adata folder. If you don’t know how to do it, refer to the labs 10 activities.
Task 1: Data Restoration
Before you can start the data analytic processes, the first thing you have to do is to restore the external data to your database. Make sure you prepare a destination schema to restore your data. The destination schema for your assignment is “ptv”.
Note:
● Before initiating the data restoration process, it is essential to thoroughly explore the dataset. This exploration involves identifying appropriate data types, determining field lengths, and making other relevant considerations that will inform the creation of the table structure.
● Ensure that you restore the data into the PTV schema using regular (local) tables. Do not utilise foreign tables, as the data must be stored directly within the PostgreSQL database.
● Ensure that all tables are successfully restored, including 8 tables from GTFS and 3 tables for MB_2021, LGA_2021 and SAL_2021 respectively.
Task 2: Data Preprocessing for Melbourne Metropolitan area
The purpose of this section is to manipulate the data into a suitable format for the following task analysis. This task has two parts: Mandatory requirement and Optional requirements.
Mandatory requirement
[You must meet the mandatory requirements described in this section.]
In this assignment, we aim to explore the transportation accessibility [Topic of report] of the Melbourne Metropolitan area exclusively [Scope of report]. The mb_2021 table
contains mesh blocks for the entire country of Australia. To minimise query costs, ensure that you only use the mesh blocks within the Melbourne Metropolitan area for this assignment. The Melbourne Metropolitan’s mesh blocks can be identified from the gcc_name21. If the column contains “Greater Melbourne”, this mesh block is located in Melbourne Metropolitan. As a result, you need to create a table called "mb2021_mel" that contains ONLY the mesh blocks in Melbourne Metropolitan.
Task 3: Data Analytics and Visualisation
In this section you will need to perform data analysis on the tables you have restored, focusing on transport accessibility in metropolitan Melbourne. Use the techniques you have learned in the spatial database part to carry out your analysis. You are free to choose any specific perspectives or aspects of data analysis relevant to your dataset, but ensure that your analysis relates to the main topic: transport accessibility in metropolitan Melbourne.
This could include exploring different statistical measures or carrying out other relevant analyses. Present your findings clearly and concisely, demonstrating your understanding of the dataset and highlighting any notable observations or patterns.
As part of this data visualisation, you will also need to create at least one map-based headmap using QGIS to present your findings related to the main topics. These visualisations will be used in the next section of the assignment, the summary report. To support your analysis, you can include screenshots of the visualisations directly in the report.
Be sure to include the script or code used for data analysis and data visualisation in the appendix of your report. The script should provide clear instructions on how the analysis was performed and any necessary calculations or transformations applied to the data. This will ensure that your analysis can be reproduced and verified. Remember to include appropriate labels, titles, and legends in your visualisations to make them easy to understand. The visualisations should be of sufficient quality and clarity to effectively convey your analysis findings.
Note:
● Use SQL queries to investigate the restored tables.
● Conduct a thorough descriptive analysis to uncover insights within the data.
● Summarise and Visualise your findings clearly and concisely.
● Highlight key observations and patterns discovered during the analysis.
● Ensure your findings reflect a deep understanding of the data.
Summary Report for Task 1 to 3
As a professional data analyst, your task is to consolidate all the previous tasks, including data restoration, processing, analysis, and visualisations, into a comprehensive written report. The report should adhere to a word limit of 2000 words and follow a structured format, consisting of an introduction, methodology, results, conclusion, and appendix. Please note that a question-and-answer format is not acceptable for this assignment, and marks will be deducted for using such a format.
咨询 Alpha 小助手,获取更多课业帮助