Live Talk Open to the Public
Watch a video of the live presentation of this NLP++ analyzer for Brazilian addresses, given at the 2024 HPCC Systems Community Summit. The talk took place online on Wednesday, October 8, at 9:00 am EST (USA). It was free to register and attend.
Using the NLP++ Plugin for HPCC Systems
While I was at LexisNexis, I implemented an NLP++ plugin for the HPCC Systems Platform that allows for the creation of trustworthy, glass box NLP systems that can be constantly improved. After I gave a talk to the Brazilian team at LexisNexis, a manager contacted me about building a prototype system to replace their current Portuguese address cleaner. I was then paired with a bright young Brazilian programmer named Guilherme da Silva, with whom I worked for several months.
Replacing Black Box with NLP++ Glass Box
One of the biggest attractions of NLP++ is that it is rule and knowledge based, and its code is 100% transparent. Whereas NLP systems built on Machine Learning, Neural Networks, and Large Language Models are statistical black boxes, NLP++ analyzers are 100% customizable computer code that can be built logically and enhanced logically when the system fails.
The idea of the Brazilian Address Cleaner prototype was to replace the black box statistical system that was cleaning or “parsing” the incoming addresses with a glass box system. Guilherme had tried to “squash” errors coming from the current address cleaner, but because it was a black box system, more often than not the only way to correct a problem was for a human to completely reparse the errant address from scratch. It was decided to build a glass box system using the NLP++ plugin.
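To make the glass box idea concrete, here is a minimal sketch in Python (not NLP++, and not the actual prototype). Every name in it is hypothetical: the tiny dictionaries, the rule names, and the `parse_address` function are assumptions made only for illustration. The point is that each field in the output can be traced back to the named rule that produced it, so a wrong result points directly at the rule or dictionary entry to fix.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
LOGRADOUROS = {"rua", "avenida", "travessa"}
STATES = {"SP", "RJ", "MG"}


def parse_address(text):
    """Assign each token to a field and record which rule made the decision."""
    fields = {"logradouro": None, "street": [], "number": None,
              "city": [], "state": None}
    trace = []  # list of (token, rule_name) pairs, in input order
    for tok in re.findall(r"[^\s,]+", text):
        if fields["logradouro"] is None and tok.lower() in LOGRADOUROS:
            fields["logradouro"] = tok
            rule = "logradouro_dictionary"
        elif fields["number"] is None and tok.isdigit():
            fields["number"] = tok
            rule = "first_numeric_token"
        elif tok in STATES:
            fields["state"] = tok
            rule = "state_dictionary"
        elif fields["number"] is None:
            fields["street"].append(tok)
            rule = "before_number_is_street"
        else:
            fields["city"].append(tok)
            rule = "after_number_is_city"
        trace.append((tok, rule))
    fields["street"] = " ".join(fields["street"])
    fields["city"] = " ".join(fields["city"])
    return fields, trace


fields, trace = parse_address("Rua das Flores, 123, São Paulo, SP")
for token, rule in trace:
    print(f"{token!r} handled by {rule}")
```

If an address comes out wrong, the trace shows which rule fired on which token, which is the kind of inspection a statistical black box does not offer.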
Building Portuguese Dictionaries and Knowledge
During my time working on the prototype with Guilherme, I created various dictionaries and knowledge bases for the Portuguese language. These included states, municipalities, number words, “logradouros” (which are part of Brazilian addresses), and more. To do this, I found webpages with the necessary data, including a webpage for the Brazilian postal system, and used NLP++ to parse the data into NLP++ dictionary and knowledge base files. You can see the list of dictionaries and KB files currently available for everyone to use in the NLP++ language extension for VSCode.
You can see many of the NLP++ analyzers I wrote to create the Portuguese dictionaries on GitHub, in the pt folder of the VisualText/dehilster-analyzers repository (github.com).
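The general shape of that dictionary-building step can be sketched in Python. This is only an illustration, not the NLP++ analyzers linked above: the function name, the sample data, and the simplified "entry attribute=value" line format are assumptions, and the real dictionary files may be laid out differently.

```python
def build_dictionary_lines(raw_entries, attribute, value):
    """Normalize scraped names and emit one 'entry attribute=value' line each.

    The output line format is a simplified assumption for illustration.
    """
    seen = set()
    lines = []
    for raw in raw_entries:
        # Collapse runs of whitespace and lowercase the entry.
        entry = " ".join(raw.split()).lower()
        if not entry or entry in seen:
            continue  # skip blanks and duplicates
        seen.add(entry)
        lines.append(f"{entry} {attribute}={value}")
    return lines


# Hypothetical scraped input with stray whitespace, a duplicate, and a blank.
raw_states = ["São Paulo", "  Rio de Janeiro ", "Minas Gerais", "são paulo", ""]
for line in build_dictionary_lines(raw_states, "type", "state"):
    print(line)
```

The cleanup steps (trimming, lowercasing, dropping duplicates) are the part that matters here, since scraped web data rarely arrives in a consistent form.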
Talk Description
Here is the description from the webpage for Guilherme's presentation.
by Guilherme da Silva, LexisNexis Risk Solutions
NLP++ is a new programming language specially designed to build deep text parsers. The main objective of this approach is to build a Brazilian address analyzer and cleaner that is capable of improving the current cleaning process, with the advantage of being a transparent process with easy problem identification and correction, demonstrating great potential for future use in production.