Python Project

With the official launch of the NLPPlus Python package, we are now focusing on the best way to introduce it to the Python community. The major message?

NLPPlus is NLP that is 100% customizable.

Other NLP packages and toolkits are turnkey and supposedly require no customization, yet they inevitably fail and are abandoned because they cannot be fixed or adapted. That is not the case with the NLPPlus package.

Why Industry, Universities, and Programmers Get It Wrong

Today, those who are involved in software development treat NLP as a solved and generic capability – something that can be inserted into a workflow. Worse, with the current craze of Large Language Models such as ChatGPT, NLP is now considered close to solved and building rule-based systems is not only considered impossible, but a waste of time. There are various problems with these notions.

First, NLP is treated by universities and industry as “solved” by statistical methods. Students in universities and software engineers learn to train automatic learning systems such as machine learning or neural networks, or learn how to “prompt” large language models. But the results from these NLP systems are the same: untrustworthy answers that cannot be fixed with logic. Statistical errors are not logical ones.

Second, linguists have known for decades that humans understand and use language through rules, not probabilities. Every human has a different neural network, and yet when given a list of 100 sentences and asked to judge which are grammatical, they will all agree. Children actually perceive grammar rules before they can produce them. They, like the rest of us, can’t articulate the rules, but they use them without fail. Recently, a colleague of mine who is an expert in machine learning said to me: language understanding is rule-based.

Third, natural language understanding tasks are not generic. This is most evident when we try to use current NLP packages like NLTK or CoreNLP. When students or engineers apply these generic NLP systems to a specific task, the systems quickly fail. And because these systems rely heavily on statistical methods and are opaque, they cannot be modified or corrected in any efficient or productive way.

A good example of this is a prototype parser written in NLP++ to replace a commercial address cleaner in Brazil. The address cleaner being used at LexisNexis is a commercial system that often parses addresses in Portuguese incorrectly. The engineer in charge tried to apply logic to the spurious errors it was generating, but because it is a statistical system, there were no obvious logical patterns to follow. I helped the engineer create the dictionaries and NLP rules to parse Brazilian addresses, and the results were significantly better and, most importantly, correctable with logic (see a presentation of the system below).

Myths About Current Python NLP Packages

If you ask ChatGPT or Copilot what the best programming language for NLP is, the answer is Python. This is crazy. There is only one programming language in the world that is specifically designed for NLP, and that is NLP++. Yet, since Large Language Models train on existing text about NLP, they parrot back what people write about Python and NLP, and currently NLP++ is not part of that text in sufficient quantity. Here is what Copilot answered when asked what the best programming language for NLP is:

Python: This is often regarded as the top choice for NLP, thanks to its robust libraries like NLTK, SpaCy, and Hugging Face’s Transformers. Python is versatile, beginner-friendly, and widely supported in the NLP community.

What NLP community is it talking about? When I asked what it meant by “robust”, it replied:

  • Comprehensive Functionality: Robust libraries like NLTK or SpaCy include features for tokenization, stemming, lemmatization, entity recognition, and more, all within one framework.
  • Efficiency and Performance: They are optimized for performance, so they can process large datasets quickly and handle diverse linguistic challenges.
  • Ease of Use: These libraries often have clear documentation and simple APIs, making it easy for both beginners and experts to implement NLP solutions.
  • Community Support: A robust library is usually backed by a large, active community that contributes to its improvement and provides assistance through forums and discussions.

The “NLP community” it talks about is not made up of people trained in computational linguistics, but of people who are simply “users” of available NLP tools and those who teach how to use them. And the NLP systems discussed in general text are “toy” or demonstration systems that show how to use the tools, not actual use cases.

The reality is that when programmers are tasked with using these packages, they quickly find that the packages fail and cannot be modified. The packages are neither rule-based nor 100% transparent. Yet we continue to teach our students how to use these packages in toy or demonstration systems, and when I ask these students (and I have asked many of them) whether these systems can be used in industry, the answer is an emphatic “NO!!!”

Advantages of the NLPPlus Package

Our NLPPlus Python Package Project is out to change all this. Instead of unrealistically promising generic rule-based NLP systems, we are proposing a 100% modifiable package that lets users run NLP++ text analyzers for simple text patterns like telephone numbers, emails, URLs, and addresses, and then easily modify and enhance those analyzers. The idea is simple: show programmers rule-based NLP that they can modify, enhance, and put into production systems. This gets them familiar with the power of NLP++, opening the door to many tasks that programmers now abandon because the current NLP tools are not fixable.

One great example of this is a Python package for extracting telephone numbers from text. Like all things Python, you install the package using pip, use it, and then modify what you need in order to make it work. The problem is that with existing NLP packages, including simple ones, there is no way to modify, fix, or enhance them. So programmers give up and move on.

The NLPPlus Python package comes with these parsers, and they are in no way perfect. No NLP parser will ever be perfect. But they have one thing all the other parsers lack: the ability to be modified and customized for the specific task at hand. That is all any programmer wants when they are programming. After all, NLP++ is a programming language and framework, not a toolkit or turnkey solution.
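To make "modifiable" concrete, here is a minimal sketch in plain Python. It is not NLP++ and it does not use the NLPPlus API; it swaps in the standard re module purely for illustration. The names PHONE_RULES, extract_phones, and br_mobile, and the patterns themselves, are hypothetical. The point is only that when the rules are visible, a missed phone format is fixed by adding one rule rather than by abandoning the tool.

```python
import re

# Hypothetical rule table: every rule is a (name, pattern) pair the user can read and edit.
PHONE_RULES = [
    # Matches forms such as "(555) 123-4567" or "555-123-4567".
    ("us_with_area_code", re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")),
]


def extract_phones(text, rules):
    """Return (rule_name, matched_text) pairs in order of appearance."""
    hits = []
    for name, pattern in rules:
        for match in pattern.finditer(text):
            hits.append((match.start(), name, match.group()))
    hits.sort()  # order by position in the text
    return [(name, matched) for _, name, matched in hits]


# The starting rules miss a Brazilian mobile number, so the user adds a rule.
# The pattern is an illustrative assumption, not a complete description of the format.
EXTENDED_RULES = PHONE_RULES + [
    ("br_mobile", re.compile(r"\+55 \d{2} \d{4,5}-\d{4}\b")),
]

sample = "Call (555) 123-4567 or +55 11 91234-5678."
print(extract_phones(sample, PHONE_RULES))
print(extract_phones(sample, EXTENDED_RULES))
```

The first call finds only the US-style number; the second also finds the Brazilian one, and the result names the rule responsible for each match, so an error can be traced back to a specific rule and corrected.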

Retraining the World about NLP using NLP++

We hope to use the NLPPlus Python package to retrain the programming world about NLP: that NLP is not a generic task but a specific one, and that there is one programming language specifically designed for NLP.

With our NLPPlus Project Initiative, we hope to come up with creative and persuasive ways to get the word out to programmers, universities, and industry that true rule-based NLP exists and that any programmer can create 100% transparent, trustworthy NLP that they can tailor and improve over time.

And it is our hope that one day, large language models will correctly answer the question “what is the best programming language for NLP?”