In Progress
This project is still in progress. If you would like to contribute, please email us at contact@nluglob.org.
Project Description
The more languages Natural Language Processing can handle, the more effective it becomes as a whole. My goal was therefore to expand the NLP++ dictionary to include Tamil, the fifth largest language in India, the most populous country in the world. This project was special to me because Tamil is the language my family speaks. Although I speak it fluently, I can’t read or write it, so for the language research portion I worked closely with my father, who is from India and read and wrote Tamil for most of his schooling.
As a summer intern for HPCC Systems, I worked on creating the world’s first and most advanced Tamil dictionary with parts of speech for NLP++. My plan was to use the Tamil Wiktionary pages and leverage the existing English Wiktionary parser to create my own parser for Tamil.
This project was heavy on research, since it had never been done before, and there were more than a few roadblocks along the way. For example, when I used Python to process the Tamil Wiktionary pages, my dad and I thought the pages were a good source. But when I was writing the NLP++ analyzers, I noticed that the pages didn’t share a common format. Reviewing them with my dad, we found that most of the parts of speech and definitions on the Wiktionary pages were nonsensical or incomprehensible. We had to look for a new source of Tamil words and parts of speech, and eventually found a tagging project that had words and their parts of speech labeled correctly. It showed how new natural language processing is that even Wiktionary wasn’t a reliable source for the project, and how important it is to build and expand on it so that more people from across the world can be part of this new wave, with NLP++ as the medium.
The end result is the most thorough Tamil dictionary for NLP++ to date, but my hope is that more people will come along to build on it and make it more complete, and that the same will be done for more languages.
First Attempts
The initial phase of this project was to parse the Tamil Wiktionary pages for linguistic information. It turned out, however, that the Tamil Wiktionary pages were not that useful. So Shyamaa, with the help of his father, sought out an online Tamil source that could be parsed into an NLP++ dictionary.
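To give a rough idea of what turning a tagged word list into dictionary entries can look like, here is a minimal Python sketch. It is not the project's actual code. The tab-separated `word<TAB>tag` input, the tag names in `TAG_TO_POS`, and the `word pos=...` output line are all assumptions made for illustration, and the real source and dictionary format may differ.

```python
from collections import defaultdict

# Hypothetical mapping from corpus tags to part-of-speech labels (assumed, not from the project).
TAG_TO_POS = {"NN": "noun", "VB": "verb", "JJ": "adj", "RB": "adv"}


def build_entries(tagged_lines):
    """Collect the set of parts of speech seen for each word."""
    entries = defaultdict(set)
    for line in tagged_lines:
        line = line.strip()
        # Skip blank lines and lines that are not in the assumed word<TAB>tag form.
        if not line or "\t" not in line:
            continue
        word, tag = line.split("\t", 1)
        pos = TAG_TO_POS.get(tag.strip())
        # Unknown tags are dropped rather than guessed.
        if pos:
            entries[word.strip()].add(pos)
    return entries


def format_entries(entries):
    """Emit one 'word pos=...' line per word and part of speech, in sorted order."""
    lines = []
    for word in sorted(entries):
        for pos in sorted(entries[word]):
            lines.append(f"{word} pos={pos}")
    return lines


# Tiny made-up sample: two valid words, one blank line, one malformed line, one unknown tag.
sample = ["மரம்\tNN", "ஓடு\tVB", "ஓடு\tNN", "", "bad line", "நல்ல\tXX"]
dictionary_lines = format_entries(build_entries(sample))
for entry in dictionary_lines:
    print(entry)
```

Collecting parts of speech into a set per word keeps a word that appears with several tags as several entries without duplicates, and dropping unknown tags avoids putting unverified labels into the dictionary.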
Video Presentation of the Project
Here is a presentation by Shyamma at the HPCC Systems 2023 Community Day.