Code Review: How the AllenNLP Vocabulary indexes your text

Xinzhe Li, PhD in Language Intelligence
Published in Level Up Coding · 3 min read · Jun 12, 2022


Although AllenNLP provides a good guide for almost all the modules in the library, I was still confused about how the vocabulary is constructed and used. Specifically, this post answers the following questions:

  • Construction: how are token/id pairs added to Vocabulary?
  • Text fields: how is the vocabulary consumed for text indexing?
  • Labels: how is the vocabulary used for label indexing?
  • Any caveats?

Section 1: How are token/id pairs added to Vocabulary?

From instances: Normally, the from_instances method constructs a counter: Dict[str, Dict[str, int]], where the outer key is reserved for each namespace and the inner dictionary stores token/count pairs. The counter is then decomposed into the self._token_to_index and self._index_to_token attributes by the _extend method.
Specifically, it calls count_vocab_items on each Instance object, which in turn calls count_vocab_items on each Field object, which in turn (again) calls count_vocab_items on each TokenIndexer object. The functional code for counting items therefore lives in each TokenIndexer. Its count_vocab_items matches the indexer's namespace against the outer key of counter to add items or increase their counts. Below is the code in SingleIdTokenIndexer.

def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
    if self.namespace is not None:
        text = self._get_feature_value(token)
        if self.lowercase_tokens:
            text = text.lower()
        counter[self.namespace][text] += 1
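To make the two construction steps concrete, here is a pure-Python sketch (no AllenNLP dependency) of counting tokens per namespace and then decomposing the counter into a token-to-index map. The function names and `@@PADDING@@`/`@@UNKNOWN@@` entries mirror AllenNLP's defaults, but this is a simplified illustration: the real _extend also handles options such as minimum counts and frequency ordering.

```python
from collections import defaultdict
from typing import Dict

# Outer key = namespace, inner dict = token -> count.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

def count_vocab_item(token: str, namespace: str = "tokens") -> None:
    """Toy stand-in for SingleIdTokenIndexer.count_vocab_items (lowercasing on)."""
    counter[namespace][token.lower()] += 1

for tok in ["The", "cat", "sat", "the"]:
    count_vocab_item(tok)

# Simplified _extend: ids 0 and 1 are reserved for padding and the unknown
# token in non-label namespaces, matching AllenNLP's defaults.
token_to_index: Dict[str, Dict[str, int]] = {}
for namespace, counts in counter.items():
    mapping = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    for token in counts:
        mapping[token] = len(mapping)
    token_to_index[namespace] = mapping

print(token_to_index["tokens"])
# {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'the': 2, 'cat': 3, 'sat': 4}
```

Note how "The" and "the" collapse into a single entry with count 2 because the indexer lowercases before counting.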

Section 2: How is the vocabulary consumed for text indexing?

The Vocabulary coordinates with TokenIndexer to index the tokens in Field objects. Specifically, the TokenIndexer.tokens_to_indices method takes a Vocabulary object as an argument to match the namespace and index the tokens.

As above, the functional code is actually in TokenIndexer. Below is the code in SingleIdTokenIndexer.

def tokens_to_indices(
    self, tokens: List[Token], vocabulary: Vocabulary
) -> Dict[str, List[int]]:
    indices: List[int] = []
    for token in itertools.chain(self._start_tokens, tokens, self._end_tokens):
        text = self._get_feature_value(token)
        if self.namespace is None:
            indices.append(text)  # type: ignore
        else:
            if self.lowercase_tokens:
                text = text.lower()
            indices.append(vocabulary.get_token_index(text, self.namespace))
    return {"tokens": indices}

The tokens_to_indices method is called by the TextField.index method, as shown below. Notice the difference between TextField.index and LabelField.index, discussed in the next section.

def index(self, vocab: Vocabulary):
    self._indexed_tokens = {}
    for indexer_name, indexer in self.token_indexers.items():
        self._indexed_tokens[indexer_name] = indexer.tokens_to_indices(self.tokens, vocab)
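The lookup behavior can be sketched in plain Python. The toy vocabulary below is a hypothetical stand-in, not an AllenNLP object, but its fallback to the unknown id mirrors what Vocabulary.get_token_index does for out-of-vocabulary tokens.

```python
from typing import Dict, List

# Hypothetical vocabulary for the "tokens" namespace: ids 0 and 1 are
# reserved for padding and the unknown token, as in AllenNLP's defaults.
token_to_index: Dict[str, Dict[str, int]] = {
    "tokens": {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "the": 2, "cat": 3, "sat": 4}
}

def get_token_index(token: str, namespace: str = "tokens") -> int:
    """Toy stand-in for Vocabulary.get_token_index: unknown id as fallback."""
    mapping = token_to_index[namespace]
    return mapping.get(token, mapping["@@UNKNOWN@@"])

def tokens_to_indices(tokens: List[str], namespace: str = "tokens") -> Dict[str, List[int]]:
    """Toy stand-in for SingleIdTokenIndexer.tokens_to_indices (lowercasing on)."""
    return {"tokens": [get_token_index(t.lower(), namespace) for t in tokens]}

print(tokens_to_indices(["The", "cat", "barked"]))
# {'tokens': [2, 3, 1]}  -- 'barked' is out of vocabulary, so it maps to 1
```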

Section 3: The vocabulary for label indexing

It differs from text indexing in both construction and consumption.

  • No IDs are reserved for padding and the unknown token during construction.
  • No TokenIndexer: the LabelField itself contains a namespace (commonly hardcoded as "labels") and extracts the label index from the token-to-index dictionary (i.e., Vocabulary._token_to_index["labels"]), as shown in the following code in LabelField.

def index(self, vocab: Vocabulary):
    if not self._skip_indexing:
        self._label_id = vocab.get_token_index(
            self.label, self._label_namespace  # type: ignore
        )
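A quick sketch of the difference: in a hypothetical "labels" namespace, ids start at 0 with no padding or unknown entries, so indexing a label is a plain dictionary lookup with no fallback. The label names below are made up for illustration.

```python
from typing import Dict

# Hypothetical "labels" namespace: unlike token namespaces, no ids are
# reserved for padding or unknown entries, so real labels start at 0.
label_to_index: Dict[str, int] = {"positive": 0, "negative": 1, "neutral": 2}

def index_label(label: str) -> int:
    """Toy stand-in for LabelField.index: direct lookup, no unknown fallback."""
    return label_to_index[label]

print(index_label("negative"))  # 1
```

An unseen label would raise a KeyError here rather than fall back to an unknown id, which is why label vocabularies must cover every label in the data.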

Section 4: Caveats

Forgetting to apply a TokenIndexer to a TextField: this is one of the most common mistakes when processing data with AllenNLP, because logically the token/id mapping in Vocabulary seems sufficient for indexing the tokens in a TextField. However, there are several reasons why we still need a TokenIndexer.

  • One common reason is to add special tokens (start or end tokens).
  • Another reason is that the tokens in a TextField may not match the granularity of the token/id mapping in Vocabulary. For example, we may tokenize text into words while the vocabulary contains a token/id mapping for characters. This sounds weird: why tokenize text into words rather than characters if we want to index at the character level? As far as I can tell, it lets us combine word indexing (using SingleIdTokenIndexer) and character indexing (using TokenCharactersIndexer) on the same tokens.

So, remember to apply a TokenIndexer. Below is the code to use both word indexing and character indexing within one text field.

text_field.token_indexers = {
    "tokens": SingleIdTokenIndexer(namespace="token_vocab"),
    "token_characters": TokenCharactersIndexer(namespace="character_vocab"),
}
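To see why combining the two is useful, here is a hypothetical sketch of the nested output such a field produces: one entry per indexer name, each holding that indexer's result. The toy vocabularies and the index_text_field helper are made up for illustration; they are not AllenNLP APIs.

```python
from typing import Dict, List

# Toy stand-ins for the two namespaces; id 1 plays the unknown-token role.
word_vocab: Dict[str, int] = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "the": 2, "cat": 3}
char_vocab: Dict[str, int] = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1,
                              "t": 2, "h": 3, "e": 4, "c": 5, "a": 6}

def index_text_field(tokens: List[str]) -> Dict[str, Dict[str, list]]:
    """Mimics looping over token_indexers in TextField.index: each indexer
    contributes its own entry, keyed by the indexer name."""
    return {
        "tokens": {"tokens": [word_vocab.get(t, 1) for t in tokens]},
        "token_characters": {
            "token_characters": [[char_vocab.get(c, 1) for c in t] for t in tokens]
        },
    }

print(index_text_field(["the", "cat"]))
# {'tokens': {'tokens': [2, 3]},
#  'token_characters': {'token_characters': [[2, 3, 4], [5, 6, 2]]}}
```

Word-level tokens thus feed both indexers: the first maps each word to one id, while the second breaks the same word into per-character ids.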
