Code Review: How the AllenNLP Vocabulary indexes your text
Although AllenNLP provides good guides for almost all the modules in the library, I was still confused about how the vocabulary is constructed and used. Specifically, this post answers the following questions:
- Construction: how are token/id pairs added into Vocabulary?
- Text fields: how is the vocabulary consumed for text indexing?
- Label fields: how is the vocabulary consumed for label indexing?
- Any caveats?

Section 1: How are token/id pairs added into Vocabulary?
From instances: normally, the from_instances method constructs a counter: Dict[str, Dict[str, int]], where the outer key is the namespace and the inner dictionary stores token counts. The counter is then decomposed into the self._token_to_index and self._index_to_token attributes by the _extend method.
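To make the counter structure concrete, here is a minimal sketch in plain Python (not AllenNLP itself, and with made-up tokens) of the nested Dict[str, Dict[str, int]] described above, with one namespace per outer key:

```python
from collections import defaultdict
from typing import Dict

# Nested counter: outer key = namespace, inner dict = token -> count.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

# Count tokens under the "tokens" namespace and labels under "labels".
for tok in ["the", "cat", "sat", "the"]:
    counter["tokens"][tok] += 1
counter["labels"]["positive"] += 1

print(counter["tokens"]["the"])  # 2
print(list(counter.keys()))      # ['tokens', 'labels']
```

Each namespace gets its own independent id space later, which is why the outer key matters.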
Specifically, it calls count_vocab_items of each Instance object, which in turn calls count_vocab_items of each Field object, which in turn (again) calls count_vocab_items of each TokenIndexer object. Therefore, the actual counting logic lives in each TokenIndexer. The count_vocab_items method in TokenIndexer matches its namespace against the outer key of counter to add new items or increase existing counts. Below is the code in SingleIdTokenIndexer.
```python
def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
    if self.namespace is not None:
        text = self._get_feature_value(token)
        if self.lowercase_tokens:
            text = text.lower()
        counter[self.namespace][text] += 1
```
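To illustrate the decomposition step mentioned earlier, here is a hedged, toy version of what _extend does conceptually (this is my simplified re-implementation, not AllenNLP's actual code): assign ids per namespace, reserving ids 0 and 1 for the padding and unknown tokens:

```python
from typing import Dict, Tuple

def build_mappings(
    counter: Dict[str, Dict[str, int]], min_count: int = 1
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, Dict[int, str]]]:
    """Toy sketch of Vocabulary._extend: one id space per namespace."""
    token_to_index: Dict[str, Dict[str, int]] = {}
    index_to_token: Dict[str, Dict[int, str]] = {}
    for namespace, counts in counter.items():
        # Reserve ids for padding and unknown tokens (AllenNLP skips this
        # for label namespaces; simplified here).
        mapping = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
        for token, count in sorted(counts.items(), key=lambda kv: -kv[1]):
            if count >= min_count:
                mapping[token] = len(mapping)
        token_to_index[namespace] = mapping
        index_to_token[namespace] = {i: t for t, i in mapping.items()}
    return token_to_index, index_to_token

t2i, i2t = build_mappings({"tokens": {"the": 2, "cat": 1}})
print(t2i["tokens"]["the"])  # 2 (after padding=0, unknown=1)
print(i2t["tokens"][3])      # cat
```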
Section 2: How is the vocabulary consumed for text indexing?
The Vocabulary coordinates with TokenIndexer to index tokens in Field objects. Specifically, the TokenIndexer.tokens_to_indices method takes a Vocabulary object as an argument to match the namespace and index tokens. As above, the actual logic lives in TokenIndexer. Below is the code in SingleIdTokenIndexer.
```python
def tokens_to_indices(
    self, tokens: List[Token], vocabulary: Vocabulary
) -> Dict[str, List[int]]:
    indices: List[int] = []
    for token in itertools.chain(self._start_tokens, tokens, self._end_tokens):
        text = self._get_feature_value(token)
        if self.namespace is None:
            indices.append(text)  # type: ignore
        else:
            if self.lowercase_tokens:
                text = text.lower()
            indices.append(vocabulary.get_token_index(text, self.namespace))
    return {"tokens": indices}
```
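The key call is vocabulary.get_token_index(text, self.namespace), which falls back to the unknown-token id for out-of-vocabulary words. A toy stand-in (my own minimal mimic, not the real class) illustrates that behavior:

```python
class ToyVocabulary:
    """Minimal stand-in for Vocabulary.get_token_index with OOV fallback."""

    def __init__(self, token_to_index):
        self._token_to_index = token_to_index

    def get_token_index(self, token: str, namespace: str = "tokens") -> int:
        mapping = self._token_to_index[namespace]
        # Unseen tokens map to the reserved unknown id instead of failing.
        return mapping.get(token, mapping["@@UNKNOWN@@"])

vocab = ToyVocabulary({"tokens": {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "cat": 2}})
print(vocab.get_token_index("cat"))  # 2
print(vocab.get_token_index("dog"))  # 1 (unknown)
```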
The tokens_to_indices method is called by the TextField.index method, as shown below. Notice the difference between TextField.index and LabelField.index, discussed in the next section.
```python
def index(self, vocab: Vocabulary):
    self._indexed_tokens = {}
    for indexer_name, indexer in self.token_indexers.items():
        self._indexed_tokens[indexer_name] = indexer.tokens_to_indices(self.tokens, vocab)
```
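So each registered indexer contributes one entry keyed by its name. A plain-Python sketch of that loop, with a hypothetical function standing in for tokens_to_indices:

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for one TokenIndexer.tokens_to_indices (toy vocab).
def single_id_indexer(tokens: List[str]) -> Dict[str, List[int]]:
    toy_vocab = {"the": 2, "cat": 3}
    return {"tokens": [toy_vocab.get(t, 1) for t in tokens]}  # 1 = unknown

# Mimic TextField.index: one result per registered indexer name.
token_indexers: Dict[str, Callable] = {"tokens": single_id_indexer}
indexed_tokens = {
    name: indexer(["the", "cat"]) for name, indexer in token_indexers.items()
}
print(indexed_tokens)  # {'tokens': {'tokens': [2, 3]}}
```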
Section 3: The vocabulary for label indexing
It differs from text indexing in both construction and consumption.
- No IDs reserved for padding and unknown tokens during construction.
- No TokenIndexer: the LabelField itself holds the namespace (commonly hardcoded as "labels") and uses it to look up the label index in the token-to-index dictionary (i.e., Vocabulary._token_to_index["labels"]), as shown in the following code in LabelField.
```python
def index(self, vocab: Vocabulary):
    if not self._skip_indexing:
        self._label_id = vocab.get_token_index(
            self.label, self._label_namespace  # type: ignore
        )
```
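Because label namespaces reserve no padding/unknown ids, label ids start at 0 and there is no out-of-vocabulary fallback. A toy illustration (assumed label names and ids, plain Python):

```python
# Label namespaces skip padding/OOV, so indexing is a direct lookup from id 0.
label_vocab = {"labels": {"negative": 0, "positive": 1}}

def index_label(label: str, namespace: str = "labels") -> int:
    # No fallback: an unseen label raises KeyError instead of mapping to OOV.
    return label_vocab[namespace][label]

print(index_label("positive"))  # 1
print(index_label("negative"))  # 0
```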
Section 4: Caveats
Forgetting to apply TokenIndexer to TextField: this is one of the most common mistakes when processing data with AllenNLP, because logically the token/id mapping in Vocabulary seems sufficient for indexing the tokens in a TextField. However, there are several reasons why we still need TokenIndexer.
- One common reason is to add special tokens (start or end tokens).
- Another reason is that the tokens in a TextField may not match the granularity of the token/id mapping in Vocabulary. For example, we may tokenize text into words while the vocabulary contains a token/id mapping for characters. This sounds weird: why tokenize text into words rather than characters if we want to index at the character level? As far as I can tell, it lets us combine word indexing (using SingleIdTokenIndexer) with character indexing (using TokenCharactersIndexer).
So, remember to apply TokenIndexer
. Below is the code to use both word indexing and character indexing within one text field.
```python
text_field.token_indexers = {
    "tokens": SingleIdTokenIndexer(namespace="token_vocab"),
    "token_characters": TokenCharactersIndexer(namespace="character_vocab"),
}
```
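Under these two indexers, the same tokens get both a word-level and a character-level view. A plain-Python sketch of the resulting id shapes (toy vocabularies, not actual AllenNLP output):

```python
word_vocab = {"the": 2, "cat": 3}                       # toy token_vocab
char_vocab = {"t": 2, "h": 3, "e": 4, "c": 5, "a": 6}   # toy character_vocab

tokens = ["the", "cat"]
word_ids = [word_vocab.get(t, 1) for t in tokens]               # one id per token
char_ids = [[char_vocab.get(c, 1) for c in t] for t in tokens]  # ids per character

print(word_ids)  # [2, 3]
print(char_ids)  # [[2, 3, 4], [5, 6, 2]]
```

Note the difference in shape: one id per token versus one list of ids per token, which is exactly why the two indexers use separate namespaces.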