Unsupervised Parsing

In a corpus of words, a pair of adjacent words XY can be replaced by a rule R -> XY. If the pair XY occurs frequently, replacing each occurrence of XY with a reference to the rule R can reduce the number of characters needed to represent the corpus. Heuristics can be used to decide which rules to try, and the resulting rules resemble grammar rules. It has been shown that the Mutual Information and Relative Entropy heuristics do a decent job of parsing a corpus. The purpose of this project is to try other heuristics besides Mutual Information and Relative Entropy, and to show a visual representation of the results so one can compare and contrast which heuristics do well in which circumstances. Here is a screenshot of the application:

The application uses the Penn Treebank as its corpus. The learned parse tree is shown on top and the real, hand-annotated parse on the bottom.
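To make the compression idea concrete, here is a minimal sketch of one greedy pass: find the most frequent adjacent pair of symbols and replace every occurrence with a fresh rule symbol. The class and names (PairCompression, the "r1" rule format) are illustrative, not the application's actual code, and the sketch uses modern Java collections for brevity even though the application itself targets Java 1.4.

```java
import java.util.*;

public class PairCompression {
    // One greedy pass: find the most frequent adjacent pair of symbols
    // and replace every occurrence with a new rule symbol (e.g. "r1").
    public static List<String> replaceBestPair(List<String> corpus, Map<String, String[]> rules) {
        // Count all adjacent pairs.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < corpus.size(); i++) {
            counts.merge(corpus.get(i) + " " + corpus.get(i + 1), 1, Integer::sum);
        }
        // Pick the most frequent pair; a rule only saves space if the pair repeats.
        String best = null;
        int bestCount = 1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        if (best == null) return corpus;
        String[] pair = best.split(" ");
        String ruleName = "r" + (rules.size() + 1);
        rules.put(ruleName, pair); // record R -> XY
        // Rewrite the corpus, replacing each occurrence of the pair with the rule.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            if (i + 1 < corpus.size() && corpus.get(i).equals(pair[0]) && corpus.get(i + 1).equals(pair[1])) {
                out.add(ruleName);
                i++; // skip the second symbol of the pair
            } else {
                out.add(corpus.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>(Arrays.asList(
            "the", "dog", "ran", "and", "the", "dog", "barked"));
        Map<String, String[]> rules = new LinkedHashMap<>();
        System.out.println(replaceBestPair(corpus, rules)); // [r1, ran, and, r1, barked]
    }
}
```

Repeating this pass until no pair repeats yields a set of nested rules that doubles as a (binary) parse of the corpus; the heuristics below decide which pair to prefer instead of raw frequency.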


The following heuristics were tried:

- Mutual Information
- Relative Entropy
- Conditional Probability
- Selectional Preference
- A combined algorithm

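As a sketch of how such a heuristic scores candidate pairs, pointwise mutual information of an adjacent word pair can be computed as below; a high score suggests the two words form a unit worth turning into a rule. The class and method names are illustrative, not the application's actual code.

```java
import java.util.*;

public class MutualInfo {
    // Pointwise mutual information of an adjacent word pair:
    //   MI(x, y) = log( P(x y) / (P(x) * P(y)) )
    // where P(x y) is estimated from adjacent-pair counts and
    // P(x), P(y) from unigram counts.
    public static double score(List<String> words, String x, String y) {
        Map<String, Integer> uni = new HashMap<>();
        Map<String, Integer> bi = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            uni.merge(words.get(i), 1, Integer::sum);
            if (i + 1 < words.size()) {
                bi.merge(words.get(i) + " " + words.get(i + 1), 1, Integer::sum);
            }
        }
        double n = words.size();
        double pairs = words.size() - 1;
        double pxy = bi.getOrDefault(x + " " + y, 0) / pairs;
        double px = uni.getOrDefault(x, 0) / n;
        double py = uni.getOrDefault(y, 0) / n;
        // Unseen pairs get the worst possible score.
        if (pxy == 0 || px == 0 || py == 0) return Double.NEGATIVE_INFINITY;
        return Math.log(pxy / (px * py));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "dog", "ran", "and", "the", "cat", "ran");
        System.out.println(score(words, "the", "dog"));
    }
}
```

The other heuristics plug into the same slot: each one assigns a score to a candidate pair, and the pair with the best score becomes the next rule.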
Here is the application.

Here is the source.

Installing and executing the application

The application will work with Java 1.4 or above. It has been tested on both Windows and Macintosh. After downloading the application, unzip the file to a directory of your choice. Run the application with the following command:

java -jar UnsupervisedParsing.jar

Load a file from the Penn Treebank by going to "Load Text from Penn Tree Bank" in the File menu. A window will pop up showing all of the Penn Treebank files. Choose the files you would like to load. The application will then load the text and run the algorithms. This may take a while: it took over two hours on a 1.8 GHz machine with 512 MB of RAM to run all of the files that come with the application. The application also has a tendency to run out of memory when given too much data.

Once the files are loaded, a list of all sentences, each represented by a rule number (for example: r1042), appears on the left. Clicking on a sentence in that list causes its parse trees to be shown on the right. The top parse tree can be changed by going to the Algorithm menu and choosing a different algorithm. The actual parse tree is plotted with the open-source library Prefuse and thus can take quite a bit of time to draw. If it is too slow, it can be turned off by going to the View menu and unchecking the check box.

After loading files, the user can save the session to a file and load it again later, which saves the time of re-running the algorithms. Unfortunately, the hand-parsed tree does not currently get saved with the session.


There are some minor problems parsing the Penn Treebank files, which sometimes cause the sentences in the top and bottom panels to get out of sync and some words to not be shown.


The application does a good job of showing the parse trees so they can be analyzed and compared to the actual parse. Conditional Probability and Selectional Preference do not seem to work very well as heuristics compared to Mutual Information and Relative Entropy. The combined algorithm appears to be about as good at parsing as Mutual Information and Relative Entropy. I attempted to compute an actual score of how well each algorithm parsed the sentences, but this was not accomplished because of the form of the Penn Treebank's parse trees. The algorithm I used assumes the parse trees are binary; the Penn Treebank makes no such assumption and often attaches more than two branches to a single node. The Penn Treebank also contains many traces, which the MDL algorithm used here does not take into account. Thus I was unable to get a useful measure of how close the induced parse trees are to the actual parses; it would be comparing apples to oranges.
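One way the branching mismatch could be narrowed, sketched here as a hypothetical normalization the application does not implement, is to left-binarize the n-ary treebank nodes so that (A B C D) becomes nested binary nodes under primed intermediate labels. The Node class and labels below are illustrative only.

```java
import java.util.*;

public class Binarize {
    // Minimal n-ary tree node; a leaf has no children.
    static class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Left-binarize in place: repeatedly merge the two leftmost children
    // under an intermediate node (labeled with a prime) until the node
    // has at most two children.
    static Node binarize(Node n) {
        for (int i = 0; i < n.children.size(); i++) {
            n.children.set(i, binarize(n.children.get(i)));
        }
        while (n.children.size() > 2) {
            Node merged = new Node(n.label + "'");
            merged.children.add(n.children.remove(0));
            merged.children.add(n.children.remove(0));
            n.children.add(0, merged);
        }
        return n;
    }

    // Render a tree as a bracketed string for inspection.
    static String show(Node n) {
        if (n.children.isEmpty()) return n.label;
        StringBuilder sb = new StringBuilder("(" + n.label);
        for (Node c : n.children) sb.append(" ").append(show(c));
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        Node np = new Node("NP", new Node("the"), new Node("big"),
                           new Node("red"), new Node("dog"));
        System.out.println(show(binarize(np))); // (NP (NP' (NP' the big) red) dog)
    }
}
```

Even with binarization, traces would still have to be stripped before the induced and gold trees could be scored against each other.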

Future Work

In the future I would like to find and implement a useful way to compare the results of each algorithm to the actual parse tree and show that comparison visually, so I can easily see where each algorithm parses well and where it does not. It would also be nice to color-code bad rules in red and good rules in green, to make the problems and the successes visually apparent.