Building an NLP++ Brazilian Address Cleaner for HPCC Systems 

Live Talk Open to the Public

This NLP++ analyzer for Portuguese addresses was presented live during the 2024 HPCC Systems Community Summit. The talk took place online on Wednesday, October 8, at 9:00 am EST (USA), and it was free to register and attend.

Watch Guilherme’s presentation of his Brazilian Address cleaner using NLP++

Using the NLP++ Plugin for HPCC Systems

While I was at LexisNexis, I implemented an NLP++ plugin for the HPCC Systems Platform that allows for the creation of trustworthy, glass box NLP systems that can be constantly improved. After I gave a talk to the Brazilian team at LexisNexis, a manager contacted me about building a prototype system to replace their current Portuguese address cleaner. I was then paired with a bright young Brazilian programmer named Guilherme da Silva, with whom I worked for several months.

Replacing Black Box with NLP++ Glass Box

One of the biggest attractions of NLP++ is that it is rule- and knowledge-based, and it is 100% transparent code. Whereas statistical NLP systems such as machine learning models, neural networks, and large language models are black boxes, NLP++ analyzers are fully customizable computer code that can be built logically and enhanced logically when the system fails.

The idea of the Brazilian Address Cleaner prototype was to replace the black box statistical system that was cleaning or “parsing” the incoming addresses with a glass box system. Guilherme had tried to “squash” errors coming from the current address cleaner, but because it was a black box system, more often than not the only way to correct a problem was for a human to completely reparse the errant address from scratch. It was therefore decided to build a glass box system using the NLP++ plugin.

Building Portuguese Dictionaries and Knowledge

During my time working on the prototype with Guilherme, I created various dictionaries and knowledge bases for the Portuguese language. These included states, municipalities, number words, “logradouro” words (which are part of Brazilian addresses), and more. To do this, I found webpages with the necessary data, including a webpage for the Brazilian postal system, and used NLP++ to parse the data into NLP++ dictionary and knowledge base files. You can see the list of dictionaries and KB files currently available for everyone to use in the NLP++ language extension for VSCode.

Just some of the Portuguese dictionaries and knowledge bases now available for free to everyone using NLP++

You can see many of the NLP++ analyzers I wrote to create the Portuguese dictionaries on GitHub, in the pt folder on the main branch of VisualText/dehilster-analyzers (github.com).
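
To give a feel for why dictionary-driven parsing is easy to inspect and correct, here is a minimal sketch in Python. It is not NLP++ code and not the analyzer Guilherme and I built. The two small dictionaries and the `tag_address` function are hypothetical stand-ins, and the tagging rules are assumptions chosen only for illustration.

```python
# Hypothetical, tiny stand-ins for the real Portuguese dictionaries.
LOGRADOUROS = {"rua", "avenida", "travessa", "praça"}
STATES = {"sp": "São Paulo", "rj": "Rio de Janeiro", "mg": "Minas Gerais"}


def tag_address(address):
    """Label each token using plain dictionary lookups and one digit rule."""
    tagged = []
    # Treat commas as separators, then split on whitespace.
    for token in address.replace(",", " ").split():
        key = token.lower()
        if key in LOGRADOUROS:
            label = "logradouro"
        elif key in STATES:
            label = "state"
        elif token.isdigit():
            label = "number"
        else:
            label = "word"  # unknown tokens stay visible instead of being guessed
        tagged.append((token, label))
    return tagged


print(tag_address("Rua Augusta 1500, SP"))
```

Every label can be traced back to one dictionary entry or one rule, so when a token is tagged incorrectly the fix is to add an entry or adjust a rule rather than retrain a model.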

Talk Description

Here is the description from the webpage for Guilherme’s presentation.

by Guilherme da Silva, LexisNexis Risk Solutions

NLP++ is a new programming language specially designed to build deep text parsers. The main objective of this approach is to build a Brazilian address analyzer and cleaner that is capable of improving the current cleaning process, with the advantage of being a transparent process with easy problem identification and correction, demonstrating great potential for future use in production.
