Named Entity Recognition on Large Collections from Python

Named Entity Recognition (NER) is the problem of locating and categorizing chunks of text that refer to…well…entities. “Entities” usually means things like people, places, organizations, or organisms, but can also include things like currency, recipe ingredients, or any other class of concepts to which a text might refer. NER has a wide range of applications in text mining. Recently, for example, I worked with researchers at the Max Planck Institute for the History of Science to extract names of people from meeting minutes in their institutional archives, in order to identify patterns of cooperation in committees and subcommittees. I also frequently use NER to identify organisms mentioned in scientific publications.

The first step in NER is typically to train a model that can predict categories (e.g., people, places) for words or phrases. The Stanford Natural Language Processing Group provides a wonderfully robust set of tools for this. They also provide a few classification models that are already trained. In this post I’ll describe how to use one of their pre-trained models to perform NER on a large collection of texts, using Python.
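
To make the goal concrete, here is a minimal sketch of what NER output can look like once a model has assigned a category to each token. It assumes the tagger returns (token, label) pairs, with "O" marking tokens that are not part of any entity. The sample data and the `group_entities` helper are hypothetical, written only for illustration, and are not part of the Stanford tools.

```python
from itertools import groupby

# Hypothetical tagger output: (token, label) pairs, with "O" marking
# tokens that do not belong to any entity. Labels are illustrative.
tagged = [
    ("Max", "PERSON"), ("Planck", "PERSON"), ("worked", "O"),
    ("in", "O"), ("Berlin", "LOCATION"), (".", "O"),
]

def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share a non-"O" label."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            text = " ".join(token for token, _ in run)
            entities.append((text, label))
    return entities

print(group_entities(tagged))
# [('Max Planck', 'PERSON'), ('Berlin', 'LOCATION')]
```

One caveat with this simple grouping: two distinct entities of the same category that sit right next to each other would be merged into a single chunk.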