(Last Update: June 10, 2014)
The following are descriptions of research projects that I would like students to work on. Most of them are related to document versioning and software configuration management, which is currently my main area of research.
Most Wikipedia pages change over time. Pages about living persons and current events get updated with new information. Pages on historical and technical topics get refined over time. Pages get hacked once in a while. These changes are reflected in version histories for the articles.
This research project would try to extract interesting information from the version histories of Wikipedia pages. By analyzing those histories and using statistical techniques such as cluster analysis, it may be possible to do some of the following
This project has already been the basis of two capstones and one MS thesis. We have software that can pull data out of the Wikipedia version histories and have done some simple cluster analyses on that data. What is needed now is more serious data analysis using better statistical measures. A student working on this project is going to need to look closely at Wikipedia data, and may have to do considerable content analysis of pages in order to build a suitable corpus for analysis. There are also performance issues in getting the version data that need to be addressed and may require effort to get higher privileges with Wikipedia.
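As one illustration of the kind of signal a version history can yield, the sketch below flags "identity reverts": revisions whose text exactly matches an earlier revision, which often happens when a damaging edit is undone. It is a minimal example, not the software mentioned above. The `Revision` record and the `find_identity_reverts` function are hypothetical names, and the revisions are assumed to have been fetched already.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Revision:
    """Hypothetical record for one revision of a page (not a Wikipedia API type)."""
    rev_id: int
    text: str


def find_identity_reverts(revisions):
    """Return (reverting_id, restored_id) pairs for a history in chronological order.

    A revision counts as a revert when its text hashes to the same digest as
    an earlier revision and differs from the revision immediately before it.
    """
    seen = {}           # digest -> id of the earliest revision with that text
    reverts = []
    prev_digest = None
    for rev in revisions:
        digest = hashlib.sha1(rev.text.encode("utf-8")).hexdigest()
        if digest in seen and digest != prev_digest:
            reverts.append((rev.rev_id, seen[digest]))
        seen.setdefault(digest, rev.rev_id)
        prev_digest = digest
    return reverts


history = [
    Revision(1, "Alan Turing was a mathematician."),
    Revision(2, "Alan Turing was a mathematician and logician."),
    Revision(3, "PAGE REPLACED WITH NONSENSE"),
    Revision(4, "Alan Turing was a mathematician and logician."),
]
print(find_identity_reverts(history))
```

Counts of such reverts per page could serve as one feature among several in a cluster analysis. Exact-match hashing misses partial reverts, so it should be read as a lower bound.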
Dr. Cheng Thao, now at UW-Whitewater, developed a great new technology called Version Aware XML Documents. These are XML documents that include an embedded version history. The version history is included in a way that is designed not to interfere with existing XML editing applications and, in general, it works well. We've run some promising experiments with SVG editors that show the potential of the technology.
Unfortunately, XML editors often "interfere" with our approach: they can read Version Aware files without any trouble, but they throw away the versioning information. The LibreOffice suite is an example of this, and there are others.
There are multiple projects possible under this topic.
We would like to convince the world that document systems should integrate versioning support from the beginning. To do this, we need to make the Version Aware framework into a library with a lightweight API that is convenient to code and that imposes a minimal burden on the system that is using it.
This project would design and implement the library. We would probably choose an open-source XML application and modify it to use the library.
LibreOffice is a fully open-source fork of the OpenOffice office application suite. I have had two MS students (one capstone and one thesis) work to make LibreOffice compatible with the Version Aware Documents framework. We have gotten pretty close to succeeding, but have yet to finish the job.
Available projects with Version Aware LibreOffice include:
This project may be more ambitious than the previous two.
For this project, the goal would be to make it easier for users to merge changes made by multiple authors. An example use case is when a PhD student gives her dissertation to the members of her dissertation defense committee. Each professor proceeds to mark up the dissertation with possible corrections. The professors do this in parallel, so it's quite likely that multiple professors will try to correct the same problems, but will do so in conflicting ways. It would be nice to have a system that can recognize when this is happening and help the user choose the best correction.
There are other similar cases. For example, the spelling and grammar checker of an office suite could be used to notice that corrections are being made where there were grammar issues. Suggested changes around those areas probably don't require much attention, and even if they do merit some attention, the system could provide an interface that lets the user check them more efficiently. In contrast, users may want to be very careful with corrections that would actually change the meaning of the text. Good natural language processing software should be able to spot these cases, and a good user interface could call out such changes for special attention.
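The committee scenario can be sketched with a word-level diff of each reviewer's copy against the original. The example below reports places where two reviewers changed the same words in different ways. It is only a rough illustration using Python's standard `difflib`. The function names are hypothetical, and pure insertions are ignored for brevity.

```python
from difflib import SequenceMatcher


def changed_spans(base_words, revised_words):
    """Return (start, end, replacement_words) for each edited span of the base."""
    matcher = SequenceMatcher(a=base_words, b=revised_words, autojunk=False)
    return [(i1, i2, revised_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def find_conflicts(base_text, reviews):
    """Find base spans that two reviewers rewrote differently.

    `reviews` maps a reviewer name to that reviewer's full revised text.
    Pure insertions (empty base spans) never overlap here, so this sketch
    does not report them.
    """
    base = base_text.split()
    edits = []
    for reviewer, text in reviews.items():
        for start, end, replacement in changed_spans(base, text.split()):
            edits.append((reviewer, start, end, replacement))

    conflicts = []
    for i, (r1, s1, e1, t1) in enumerate(edits):
        for r2, s2, e2, t2 in edits[i + 1:]:
            overlaps = s1 < e2 and s2 < e1
            if r1 != r2 and overlaps and t1 != t2:
                shared = " ".join(base[max(s1, s2):min(e1, e2)])
                conflicts.append((shared, {r1: " ".join(t1), r2: " ".join(t2)}))
    return conflicts


base = "the data shows a clear trend"
reviews = {
    "A": "the data show a clear trend",
    "B": "the data indicate a clear trend",
    "C": "the data show a clear trend",
}
for words, options in find_conflicts(base, reviews):
    print(repr(words), options)
```

Reviewers A and C agree, so only the pairs involving B are reported. A real system would also need to group agreeing reviewers and rank the alternatives, which is where the natural language processing would come in.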
This project would require using both versioning software and natural language processing software. It would be a novel integration of the two domains. I don't know of any examples of systems (commercial or research) that do this.
This project would be a collaboration with the laboratory of Dr. Brian Armstrong in Electrical Engineering. It would be a mix of user interface design, computational geometry, and maybe some computer graphics.
Dr. Armstrong has a system called Moire Phase Tracking that tracks the position and orientation of objects with very high accuracy and only needs a single video camera. Most other systems are not as accurate and require at least two cameras. Systems like this are used in medicine (e.g. remote control of surgery or functional MRI), biomechanics research, and computer generated imagery (as in film special effects).
The video cameras used in Moire Phase Tracking must be calibrated very carefully. A PhD student developed a novel technique for doing this that involves a special calibration device. The person calibrating a camera must collect many dozens of images of this device in different positions and orientations. It remains challenging to collect good sets of calibration images. We believe that one issue is that it is hard to get an image set that really covers ("samples") the range of possible positions and orientations well.
This project would develop a user interface application that would help the human user generate a set of calibration images that is a high-quality sample of the full six-dimensional space of positions and orientations. A good sample could either be a systematically specified set based on a high-dimensional grid or it might use a more ad hoc approach based on Voronoi diagrams. In the Voronoi diagram approach, the system would find large regions in the diagram and then help the user take more samples in those regions.
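Computing an exact Voronoi diagram in six dimensions is expensive, so the sketch below approximates the idea instead: it scores random candidate poses by their distance to the nearest existing sample and suggests the one sitting in the largest gap. This is an assumption-laden illustration, not the laboratory's method. Poses are assumed to be normalized to the unit cube, angles are treated as ordinary Euclidean coordinates (which ignores wraparound), and `largest_gap` is a hypothetical name.

```python
import math
import random


def largest_gap(samples, n_candidates=2000, seed=0):
    """Suggest a pose far from all existing samples.

    Each sample is a sequence of normalized coordinates in [0, 1], for
    example (x, y, z, roll, pitch, yaw). Returns the candidate whose nearest
    existing sample is farthest away, together with that distance.
    """
    rng = random.Random(seed)
    dim = len(samples[0])
    best_point, best_distance = None, -1.0
    for _ in range(n_candidates):
        candidate = [rng.random() for _ in range(dim)]
        nearest = min(math.dist(candidate, s) for s in samples)
        if nearest > best_distance:
            best_point, best_distance = candidate, nearest
    return best_point, best_distance


# Existing calibration images clustered in one corner of pose space.
existing = [[0.1] * 6, [0.15] * 6, [0.2, 0.1, 0.1, 0.2, 0.1, 0.1]]
suggestion, distance = largest_gap(existing)
print([round(c, 2) for c in suggestion], round(distance, 2))
```

A user interface could call such a function after every captured image and guide the user toward the suggested pose, repeating until the largest remaining gap falls below a chosen threshold.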
Note: This project is "deprecated", which means that I've had someone work on it and am not sure that it merits further effort at this time.
Semantic Designs builds a variety of software development tools. One of them is a lexical search system for source code. "Lexical search" means that the search knows about the syntax of the programming language, so you can search for all the occurrences of "foo", but only where it is the name of a function in a function call.
The tool achieves very high performance by first processing all of the files and creating a set of indices. This processing is time-consuming, but it needs to be done only once as long as the files don't change.
This project would extend the lexical search system so that it would be integrated with version repositories. Rather than requiring the developer to remember to create the indices, the system would create them automatically and incrementally as part of the check-in process. Not only would this make it much easier to do searches on short notice, but it would also permit cross-version searches; for example, "Find the earliest version where the 'foo' function was present". This could be very useful for
This project will require learning the internal structure of the Subversion source code control system and, possibly, enhancing it to support the desired functionality.
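The incremental indexing and cross-version search described above can be illustrated with a toy index that is updated at each check-in. This is only a sketch under stated assumptions: it does not use Subversion or the Semantic Designs tool, the `VersionedIndex` class is hypothetical, and a crude regular expression stands in for real syntax-aware indexing (it would also match keywords such as `if (`).

```python
import re
from collections import defaultdict

# Crude stand-in for syntax-aware indexing: an identifier followed by "(".
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


class VersionedIndex:
    """Toy index that is updated incrementally at each check-in."""

    def __init__(self):
        self.names_by_file = {}                     # path -> names found in the file
        self.revisions_by_name = defaultdict(list)  # name -> revisions containing it

    def check_in(self, revision, changed_files):
        """Re-index only the files changed in this revision.

        `changed_files` maps a path to its new text, or to None if the file
        was deleted. Revisions must be checked in in increasing order.
        """
        for path, text in changed_files.items():
            if text is None:
                self.names_by_file.pop(path, None)
            else:
                self.names_by_file[path] = set(CALL_PATTERN.findall(text))
        present = set().union(*self.names_by_file.values())
        for name in present:
            self.revisions_by_name[name].append(revision)

    def earliest_revision(self, name):
        """Return the first revision in which `name` appears, or None."""
        revisions = self.revisions_by_name.get(name)
        return revisions[0] if revisions else None


index = VersionedIndex()
index.check_in(1, {"a.c": "int main() { bar(); }"})
index.check_in(2, {"b.c": "void f() { foo(1); }"})
index.check_in(3, {"b.c": None})
print(index.earliest_revision("foo"))
```

In a real integration, the `check_in` step would presumably be triggered by the repository's commit mechanism, so the index would never fall behind the repository.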