Background
In recent years, the inclusion of an evaluation component has become almost obligatory in any publication in the field of natural language processing. For complete systems, user-based and task-oriented evaluations are used in both the natural language understanding (NLU) and natural language generation (NLG) communities. A third, more competitive form of evaluation has become increasingly popular in NLU: the shared-task evaluation campaign (STEC), in which different approaches to a well-defined problem are compared on their performance on the same task. Many research communities within NLP, such as Question Answering, Machine Translation, Document Summarization, Word Sense Disambiguation, and Information Retrieval, have adopted a shared evaluation metric and, in many cases, a shared-task evaluation campaign.
The NLG community has so far resisted this trend towards a joint evaluation metric and a competitive evaluation task, but the idea has surfaced in a number of discussions, most intensely at the 2005 European Natural Language Generation Workshop in Aberdeen, Scotland, and the 2006 International Natural Language Generation Conference in Sydney, Australia. A significant number of researchers in the community believe that some form of shared task, and a corresponding evaluation framework, would help to enhance the wider NLP community's view of work in NLG and provide a focus for research in the field. However, there is no clear consensus on what such a shared task should be, whether there should be several such tasks, or what the evaluation metrics should be.
We believe the time is ripe for an exploratory workshop on this topic. Because the issue has been raised at these recent meetings, there is a sizeable amount of pent-up intellectual energy ready to be directed at the questions that evaluation in NLG raises; we aim to harness that energy before it dissipates.