CIT 5940 Data Structures

  1. Background

    Below is an actual graph of lab reports submitted for an introductory Physics course at a large university. This graph represents a portion of the data collected for about 800 lab reports. In the graph, each node represents a lab report and each edge indicates the number of 6-word phrases shared between two documents. To reduce visual clutter, a threshold of 200 common phrases has been set, so a document that shares fewer than 200 6-word phrases with every other document is not shown. The “Lab Manual” is a style guide for the lab report, and the two brown boxes are sample lab reports that were distributed; as you can see, many students apparently “borrowed liberally” from these help materials. Particularly suspicious are clusters like the one in the top-right corner: those documents have an inordinate number of 6-word phrases in common with one another. It is likely that those students turned in essentially the same lab report or copied large portions from one another.

    In order to detect likely cases of plagiarism, you will find the common word sequences among a closed set of documents. Your input will be a set of plain-text documents and a number n; your output will be a representation showing the number of sequences of n words (called n-grams) that each document has in common with every other document in the set. Using this representation, you will identify suspicious groups of documents that share many common n-grams among themselves but not with others.
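    To make the core operation concrete, here is a minimal sketch of n-gram extraction for a single document. It is written in Java (the handout does not mandate a language; this choice, and the assumption that words are whitespace-delimited and compared case-insensitively, are ours):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class NGramExtractor {
        // Returns every sequence of n consecutive words in the text, in order.
        // Assumption (ours, not the handout's): words are whitespace-delimited
        // and matching is case-insensitive.
        public static List<String> extractNGrams(String text, int n) {
            String[] words = text.trim().toLowerCase().split("\\s+");
            List<String> nGrams = new ArrayList<>();
            for (int i = 0; i + n <= words.length; i++) {
                nGrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
            return nGrams;
        }
    }

    For example, extractNGrams("The quick brown fox jumps", 3) yields ["the quick brown", "quick brown fox", "brown fox jumps"].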

    Output

    We could first think of our intended output as a D × D matrix, where D is the total number of documents. Each cell of the matrix will represent the number of shared n-grams between the corresponding pair of documents:
    From this table, we can conclude that the writers of documents A, C, and D share a high number of similar 6-word phrases. We can be fairly certain that A and D collaborated on this assignment. However, printing and storing a D × D matrix is unmanageable for large sets: look at how many wasted cells we have even for D = 5. You should instead produce a list of documents ordered by number of “hits”. For example:

    We’ve gone from 25 entries to 5 without losing any information! For even larger sets of documents, it may be important to report only those document pairs whose number of shared n-grams exceeds a certain threshold.
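    As a sketch of that thresholded pair report (the pair-key encoding, method name, and threshold parameter are our own illustration, not the required format), one might keep only the above-threshold pairs and sort them by hit count:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class HitList {
        // Given a count of shared n-grams per document pair (keyed here by a
        // string such as "A|D"; the encoding is our invention), return the
        // pairs above the threshold, ordered from most to fewest hits.
        public static List<Map.Entry<String, Integer>> orderedHits(
                Map<String, Integer> pairCounts, int threshold) {
            List<Map.Entry<String, Integer>> hits = new ArrayList<>();
            for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
                if (e.getValue() > threshold) {
                    hits.add(e);
                }
            }
            hits.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
            return hits;
        }
    }

    With a threshold of 0, only the nonzero pairs survive, which is exactly the kind of collapse from 25 matrix cells to a handful of meaningful entries described above.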

    The Documents

    We will provide you with different folders of documents that represent the sets. One set will be small (~25 documents); its size will allow you to calculate expected values for unit tests using this set. The other sets will be larger: one has 75 documents and the other has over 1300 documents (the documents came from www.freeessays.cc, a repository of really bad high school and middle school essays on a variety of topics).

    Strategy Discussion

    For this discussion, we’ll assume that we have k distinct n-grams across all of our documents. The straightforward matrix solution, which requires comparing each n-gram in a document against every n-gram in every other document, has O(k²) time complexity. This is polynomial time: fast in some contexts, but the growth will be prohibitive here. For example, if the 25-document set takes 10 seconds to process, the 1300-document set has roughly 52 times as many documents (and thus roughly 52 times as many n-grams), so quadratic growth predicts about 52² ≈ 2700 times the work: several hours, provided that you can even store the matrix representation in memory!

    We’ll need to improve on this initial strategy in terms of both time and space complexity. The use of hash tables will allow us to speed up the computation dramatically, but for large numbers of documents, we’ll still run out of program memory. Our solution will be to create supplementary data files that we can store on/load from the disk as needed.
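    The following is a minimal in-memory sketch of the hash-table idea (the class and method names, the pair-key encoding, and the all-in-memory maps are our assumptions; a full solution still needs the disk-backed files described above): build an inverted index from each n-gram to the documents containing it, then charge one hit to every pair of documents that shares the n-gram.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class PlagiarismIndex {
        // Counts shared n-grams per document pair via an inverted index.
        // nGramsByDoc maps each document name to its set of distinct n-grams;
        // the result maps a pair key such as "docA|docB" (names in sorted
        // order) to the number of n-grams that pair shares.
        public static Map<String, Integer> sharedNGramCounts(
                Map<String, Set<String>> nGramsByDoc) {
            // Inverted index: n-gram -> documents that contain it.
            Map<String, List<String>> docsByNGram = new HashMap<>();
            for (Map.Entry<String, Set<String>> e : nGramsByDoc.entrySet()) {
                for (String gram : e.getValue()) {
                    docsByNGram.computeIfAbsent(gram, g -> new ArrayList<>())
                               .add(e.getKey());
                }
            }
            // Each n-gram charges one hit to every pair of documents sharing it.
            Map<String, Integer> pairCounts = new HashMap<>();
            for (List<String> docs : docsByNGram.values()) {
                Collections.sort(docs);
                for (int i = 0; i < docs.size(); i++) {
                    for (int j = i + 1; j < docs.size(); j++) {
                        pairCounts.merge(docs.get(i) + "|" + docs.get(j),
                                         1, Integer::sum);
                    }
                }
            }
            return pairCounts;
        }
    }

    The pairwise loop runs only over documents that actually share a given n-gram, so the all-pairs comparison largely disappears; the remaining hot spot is an n-gram common to very many documents, which produces a long list to pair up.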



