-
Introduction
In this assignment, you will be given a working program and are asked to improve its execution time. Without breaking it, of course!
In completing this assignment, you will learn how to:
-
● Identify inefficiencies in existing code
-
● Apply techniques to make code more efficient
Note that this is only Part 1 of this assignment; be sure to complete Part 2 in Gradescope by the due date as well.
Description
In this assignment, you are asked to analyze and improve the execution time of a Java program that attempts to detect plagiarism in a corpus of plain-text documents.
Plagiarism detection is a very difficult problem to solve, but a simple approach is to just look for common words and phrases between documents. If two documents contain many of the same phrases, then there is a good possibility that one author copied from the other.
The program you will improve detects common phrases of size windowSize in a corpus of documents, and then report pairs of documents for which the number of common phrases is equal to or greater than some threshold, sorted by number of common phrases.
Getting Started
Download the PlagiarismDetector code, which contains the source of the PlagiarismDetector class that you need to analyze and improve, a PlagiarismDetectorTest JUnit class to help make sure that changes don’t break the code, and the corpus of documents we will use to evaluate the execution time of your program.
then this means that the unit test could not read the corpus of documents. Be sure that the “docs” directory is in the right place; you may also change line 19 of the PlagiarismDetectorTest class to point to the right directory.
The PlagiarismDetector.detectPlagiarism method is a public, static method that takes the name of the directory containing the text files, as well as the windowSize and threshold, and returns a Map with pairs of file names as the key and the number of matching phrases as the value.
Aside from the general detectPlagiarism method, the PlagiarismDetector also has helper methods to:
● read a file and store its contents in a List of Strings
● create the distinct phrases, each of which is of size windowSize ● find the common phrases between two documents
● sort the resultsA modified version of this problem and the corpus of documents was originally used at the University of Chicago and presented as a Nifty Assignment at SIGCSE 2008 (http://nifty.stanford.edu/2008/franke-catch-plagiarists/), and the implementation you are using in this assignment was written by the instruction staff of a previous Software Engineering course here at Penn. We will assume that this implementation is correct for our purposes, but if you believe you have found a critical bug, please notify the instructor.
Specification
Before you start improving the efficiency of the code, you should analyze it and identify the places in which it is inefficient.
If you have a question about the intent of any part of the code, it is okay to post a public note in Ed asking for clarification, but keep in mind that we are assuming that this code is correct, and please do not post public questions regarding the efficiency of any part of the code, including whether or not the code is necessary, since that’s pretty much the whole purpose of the assignment!
Once you’ve reviewed the code and thought about some of its inefficiencies, then you need to improve the execution time of the code without affecting its functionality.
Initial Execution Time
First, run the DetectPlagiarismTest JUnit test that we provided. If the test fails or does not compile, please post a public note in Ed right away!
Depending on your hardware, the version of Java that you’re using, the other programs running on your computer, etc., it may take anywhere from 1-4 minutes for the test case to finish. The test case should pass, and you should see the execution time (in milliseconds) reported in the Console.
Before you begin modifying the code, make a note of the execution time so that you have a baseline for your improvements.
You will notice that the time will vary slightly on each execution since the time measurement includes other things that are happening on your computer and within the Java VM, but try to get a sense of the approximate time it takes to run the program on your computer.
Also, be sure to make a backup copy of the original code so that you can compare the two versions as you go along and then compare them once you’re done. You may want to do this by creating a GitHub repo for this assignment.
Objectives
Your goal in this part of the assignment is to improve the execution time of the code, using the various techniques seen in class.
Keep in mind that you are asked to focus only on improving the execution time of the code. It is okay if your changes have a negative effect on things like memory usage or other aspects of quality, except for correctness, of course!
Please put all code in PlagiarismDetector.java. If you need to create new classes, implement them as inner classes so that your entire program is just the one file.
Restrictions
You may change anything you’d like in the PlagiarismDetector class, as long as you do not violate any of the following restrictions:
First, you cannot change the signature of the detectPlagiarism method, i.e. the PlagiarismDetectorTest class that we provided must still compile. You may, however, change or remove other methods in the class.
It is okay, however, if:
-
● you change the order in which the file names are listed for a given pair, e.g.
either bwa242.txt-sra42.txt or sra42.txt-bwa242.txt would be considered correct as long as you only list one and not both, and as long as they still have 10 matches and are listed first overall
-
● you change the order in which pairs with the same number of matches are listed, e.g. it is okay if bwa242.txt-bwa249.txt is listed before bwa0.txt-bwa242.txt as long as they still have 8 matches and follow the pair with 10 matches
The PlagiarismDetectorTest should be able to handle these variations, but please contact a member of the instruction staff (either in office hours or via a private Ed post) if you feel that your output is correct but the test is failing.
Also, in making changes to the code, you may not assume that:
-
● the program will always use the same set of documents that was provided to you
-
● the number and size of documents in the corpus will always be the same as the one that was provided to you
-
● the windowSize and threshold will always be the same as in the test case provided to you
-
● the code is always executed on a multi-core/multi-processor machine (so you can’t assume that threads will necessarily be run concurrently)
We will determine the correctness of your program using the test case that we provided, as well as some other test cases (which will not be made available in advance) to make sure that bugs were not introduced.
-
In some cases, you may need to make a decision that is somewhat dictated by the input, e.g. it may be better to use one data structure for small window sizes and a different one for large ones. In such cases, choose the one that works best for the test case we provided, but ask a member of the instruction staff if you need help.
If you have any questions about what you can and cannot do to the code and what you can and cannot assume, please post a private note in Ed so as not to inadvertently give away solutions.
Completion CriteriaPart of the challenge of an assignment like this is knowing when you’re done, and since execution time can vary based on hardware, operating system, Java version, other tasks running on the computer, etc., it’s important to standardize the execution environment as much as possible.
The “official” execution time of your code will be measured via Gradescope. To measure the execution time, just upload your PlagiarismDetector.java file to the “Homework #5 Part 1” assignment in Gradescope and after a while you’ll see the result.
We have measured the average execution time of the code we’ve distributed as around 120 seconds on Gradescope. In order to receive 100% of the credit for this part of the assignment, you need to get the execution time to be two seconds or less.
This may sound daunting but there are at least two major inefficiencies in the code that, if addressed, should each reduce the code by as much as 30 seconds compared to the original; one of these changes should reduce the code down to nearly 20 seconds on its own, so if you can find and fix that, you’re in really good shape!
If you are seeing differences of 20 seconds or more between your computer and Gradescope, please notify the instructor. More details on grading are provided below.
-
-
CIS 5730 Software Engineering
咨询 Alpha 小助手,获取更多课业帮助