Every day we use tools such as search engines to find information; we talk to voice assistants like Siri, Alexa or Google Home; we use automatic translation systems to understand other languages; and, less frequently, we come across applications such as those that automatically summarize a document. Behind all these developments is linguistic engineering, also known as computational linguistics. In this post we will talk about its uses, linguistic corpora, natural language processing, and NLP work in Mexico.
Uses of linguistic engineering
As marketers, and especially if we study brand reputation, we use specialized software that helps us understand the emotions behind data sets of human language taken from social networks like Twitter or Facebook. Analyzing this information has become invaluable for gauging the acceptance of a politician or the popularity of artists and all kinds of brands. But commercial applications of linguistic engineering are not the only ones. Others focus on research: for example, automatically detecting plagiarism, detecting diseases such as Alzheimer’s, or even recognizing from what someone writes that they may be at risk of suicide.
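To make the brand-reputation use case a little more concrete, here is a minimal sketch of lexicon-based sentiment scoring in Python. It is only an illustration: the word lists, the example posts and the function names are all invented for this post, and real tools rely on much larger lexicons or on trained models.

```python
import re

# Hypothetical mini-lexicon, invented for illustration only.
POSITIVE = {"love", "great", "excellent", "happy", "good"}
NEGATIVE = {"hate", "terrible", "awful", "sad", "bad"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-záéíóúüñ']+", text.lower())


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    tokens = tokenize(text)
    positives = sum(1 for token in tokens if token in POSITIVE)
    negatives = sum(1 for token in tokens if token in NEGATIVE)
    return positives - negatives


def label(text):
    """Turn the numeric score into a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for social media data.
posts = [
    "I love this brand, great service!",
    "Terrible experience, I hate waiting.",
    "The store opens at nine.",
]

for post in posts:
    print(label(post), "|", post)
```

A word-counting approach like this misses negation, irony and context, which is part of why the area poses so many challenges.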
Whenever engineers work with human language, that is, with our way of expressing ourselves and communicating with others, in order to build something, we are in the area of linguistic engineering. It is a fascinating and constantly evolving area that poses many challenges: human language is not static but constantly changing, and machines have begun to learn from it.
Linguistic Corpus
Linguists work with language samples, whether written or spoken, called corpora (singular: corpus). If they are transcripts of how we speak, they are oral corpora; if they are written documents or texts extracted from a platform such as a blog, a magazine or a social network, they are written corpora. They can be synchronic, if they capture the language at a specific moment in its history, or diachronic, if they cover language phenomena over a broad historical period. When a linguistic investigation begins, a corpus is compiled or selected. According to linguists Joan Torruella and Joaquim Llisterri, a corpus must meet the following characteristics:
- Be made up of real texts
- Show on a small scale the functioning of natural language
- Be correctly selected to be representative
- Have a finite size
- Be computer manageable
Previously, linguists compiled corpora and then annotated and analyzed them by hand, recording their observations on worksheets and counting words or linguistic patterns manually. Now, of course, computers can do this job with greater speed and precision. This is also where the creativity of engineers comes in: they build systems to analyze specific aspects of each corpus.
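As a rough idea of what those manual counts look like once a computer does them, here is a small Python sketch that counts words and two-word patterns (bigrams). The three-sentence "corpus" and the function names are invented for illustration; a real corpus would be far larger and would usually need more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat is not a dog.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count how often each word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_frequencies(texts):
    """Count pairs of adjacent words within each text."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


freqs = word_frequencies(corpus)
bigrams = bigram_frequencies(corpus)

print(freqs["the"], freqs["cat"])   # how often "the" and "cat" occur
print(bigrams[("the", "cat")])      # how often "the cat" occurs
```

The same counting that once filled worksheets becomes a few lines of code, and it scales to millions of words without changes to the logic.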
Many corpora have already been compiled, and linguists can analyze them directly. For example, the Real Academia Española has created the Reference Corpus of Current Spanish (CREA), the Diachronic Corpus of Spanish (CORDE) and the Corpus of 21st Century Spanish (CORPES XXI). It bases its dictionaries and other studies of the language on them.
Natural Language Processing
The automatic analysis of these corpora for specific purposes is known as Natural Language Processing (NLP). This discipline is considered an area of artificial intelligence, computer science and linguistics. As we have said, it studies the interactions between computers and human language. In recent decades the discipline has surged thanks to algorithms and machine learning. All this is possible, to a large extent, thanks to the amount of information we generate daily on the Internet, in search engines and on social media. Some of its applications are:
- Speech synthesis
- Language Analysis
- Language understanding
- Speech recognition
- Voice synthesis
- Natural language generation
- Automatic translation
- Question answering
- Information retrieval
- Information extraction
NLP in Mexico
In Mexico, the Linguistic Engineering Group at UNAM has worked for more than 25 years on developing teaching materials, building corpora and developing computer systems in the area of computational linguistics. The founder and head of the group, Gerardo Sierra Martínez, has advised dozens of students, has disseminated information about this area of knowledge, and is the author of books such as Introduction to Linguistic Corpora. He and his collaborators, undergraduate and graduate students among them, have published more than a hundred academic articles related to NLP. Among the linguistic corpora developed in Mexico by this group and other institutions such as COLMEX are:
- Sinaloa Speech Corpus
- Corpus of Mexican Criminal Law
- Parallel Corpus of Mexican Languages
- Baja California Speech Corpus
- Puebla Speech Corpus
- Linguistic Corpus in Engineering
- Historical Corpus of Spanish in Mexico
- Corpus of Sexualities in Mexico
- Corpus of Defining Contexts
- Electronic Corpus for the Study of Written Language
- Corpus on human trafficking
- Axolotl: Nahuatl-Spanish Parallel Corpus
- Annotated Corpus with Discursive Relations (RST Spanish Treebank)
- Corpus of Contemporary Mexican Spanish, COLMEX
- Basic Scientific Corpus of the Spanish of Mexico, COLMEX
- Electronic Corpus of Mexican Colonial Spanish, IIF-UNAM
- Digital Library of Novohispanic Thought, FFyL-UNAM
In conclusion…
- Linguistic engineering has many applications in the commercial field and in the field of research.
- One of its applications helps us, as marketers, to better understand our audiences on social media.
- NLP sits at the intersection of linguistics, artificial intelligence and computer science.
- Without NLP, many tools that we use on a daily basis, such as automatic translation systems or software that analyzes sentiment towards a brand or person, would not be possible.
Want to know more about these topics? Leave us your opinion in the comments.