Turkish Named Entity Recognition with Zemberek in Python




How to train the model and test it in Python

Screenshot by Ayşe Kübra Kuyucu

Zemberek is a natural language processing library for the Turkish language written in Java by Ahmet A. Akın.

In this article, we will use the Named Entity Recognition module of this library in Python.

As Ahmet A. stated at this link, the Zemberek library has a Named Entity Recognition module, but it does not currently ship with a model.

Therefore, we need to train our own model with our own data set (I will share a link to download one).

In this article we will do 2 main things:

  • We will train a Turkish Named Entity model
  • We will test our model in Python

There are 3 requirements in addition to Java:

  • JPype1

You can check this link to install this library, or, if you use Linux, you can simply type pip install JPype1 in your terminal.

If you already have it, this line should work in Python: import jpype.

  • zemberek-full.jar

We will download this file manually. Simply go to the distributions/0.17.1 folder at this link and download zemberek-full.jar.

  • A training data set for Turkish Named Entities

If you have a training data set in ENAMEX style, you can use it. If you don't, I found one; simply click here to download it.

You should now have a nerdata.txt file. I found it under this issue.
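
Before training, it can help to check what the data set actually contains. The sketch below counts entity types in ENAMEX-annotated text. It assumes entities are marked as `<b_enamex TYPE="...">text<e_enamex>`, which is my understanding of the ENAMEX variant Zemberek reads; open your own file and adjust the pattern if it looks different. The name count_entity_types and the sample line are illustrative only.

```python
import re
from collections import Counter

# Assumed tag layout: <b_enamex TYPE="PERSON">Cavit Çağlar<e_enamex>
# Adjust this pattern if your file uses a different ENAMEX variant.
ENAMEX_PATTERN = re.compile(
    r'<b_enamex\s+TYPE="(?P<type>[^"]+)">(?P<text>.*?)<e_enamex>',
    re.IGNORECASE,
)


def count_entity_types(text):
    """Return a Counter mapping each entity type to how often it occurs."""
    return Counter(m.group("type").upper() for m in ENAMEX_PATTERN.finditer(text))


# Illustrative sample line, not taken from nerdata.txt.
sample = (
    '<b_enamex TYPE="ORGANIZATION">NTV<e_enamex>, 1996 yılında '
    '<b_enamex TYPE="PERSON">Cavit Çağlar<e_enamex> tarafından kuruldu.'
)
print(count_entity_types(sample))
```

To run it on the real file, read the file as UTF-8 first, for example count_entity_types(open("nerdata.txt", encoding="utf-8").read()).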

Note: I will run some commands on Linux, and this article does not give their Windows or Mac equivalents.

Are you ready to start?

If you have these 2 files, zemberek-full.jar and nerdata.txt (or your own txt file as your data set), can run import jpype in Python, and have Java on your system, you are ready! (You also need to be on Linux or know the corresponding commands for your operating system.)

Let’s start.

Turkish Named Entity Model Training

We will use a ready-made Zemberek module to train our model.

Open the terminal and copy-paste this line.

java -jar path/to/file/zemberek-full.jar TrainNerModel -s ENAMEX -t path/to/file/nerdata.txt -o test-model

Here, you need to replace path/to/file/ parts with the location of your files.
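
If you would rather start the training from Python than from the terminal, the same command can be assembled as an argument list. This is only a sketch: the helper names build_train_command and run_training are my own, the flags are copied from the command above, and the paths are placeholders you still have to replace.

```python
import subprocess


def build_train_command(jar_path, data_path, output_dir="test-model"):
    """Build the TrainNerModel command from the article as an argument list."""
    return [
        "java", "-jar", str(jar_path),
        "TrainNerModel",
        "-s", "ENAMEX",
        "-t", str(data_path),
        "-o", str(output_dir),
    ]


def run_training(jar_path, data_path, output_dir="test-model"):
    """Run the training command; raises CalledProcessError if Java exits with an error."""
    subprocess.run(build_train_command(jar_path, data_path, output_dir), check=True)


# Placeholder paths: replace them with the real locations of your files.
print(" ".join(build_train_command("path/to/file/zemberek-full.jar",
                                   "path/to/file/nerdata.txt")))
```

Passing a list to subprocess.run avoids shell quoting problems when a path contains spaces.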

After you hit Enter and wait a while, you should have a “test-model” folder. If you do, congratulations! You have trained your model!

The inside of the “test-model” folder looks like this:

test-model folder by Ayşe Kübra Kuyucu
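
The Python code later in this article loads the model from test-model/model, so a quick check that this subfolder exists and is not empty can save a confusing Java error. The helper name model_dir_ready is hypothetical, and the check makes no assumption about the file names inside the folder.

```python
from pathlib import Path


def model_dir_ready(output_dir="test-model"):
    """Return True if output_dir contains a non-empty 'model' subfolder."""
    model_dir = Path(output_dir) / "model"
    return model_dir.is_dir() and any(model_dir.iterdir())


print(model_dir_ready("test-model"))
```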

If you are curious about the details of the training or TrainNerModel.java class, here is the link.

If you are not that curious and have a folder like that one, let’s continue with testing.

Testing the Model in Python

  • Import the necessary modules from the JPype library.
from jpype import JClass, getDefaultJVMPath, shutdownJVM, startJVM, JString
  • Start Java Virtual Machine with the jar file.
# path to jar file 
ZEMBEREK_PATH = r"jars/zemberek-full.jar"
# start jvm
startJVM(getDefaultJVMPath(), "-ea", "-Djava.class.path=%s" % (ZEMBEREK_PATH))
  • Obtain necessary Java classes and objects.
# Getting necessary classes and objects
TurkishMorphology: JClass = JClass("zemberek.morphology.TurkishMorphology")
PerceptronNer: JClass = JClass("zemberek.ner.PerceptronNer")
Paths: JClass = JClass("java.nio.file.Paths")
modelRoot = Paths.get('test-model/model')
morphology = TurkishMorphology.createWithDefaults()
ner = PerceptronNer.loadModel(modelRoot, morphology)
sentences = "NTV, 1996 yılında Cavit Çağlar tarafından kurulan, Türkiye'nin ilk haber kanalıdır. Ocak 1999'da Doğuş Yayın Grubu bünyesine katılarak yakaladığı başarı ile Türk medya endüstrisini değiştiren NTV, Türkiye'de tematik kanal dönemini başlattı."
result = ner.findNamedEntities(JString(str(sentences)))
namedEntities = result.getNamedEntities()
for e in namedEntities:
    print(e.toString())
Output of the code by Ayşe Kübra Kuyucu

It worked quite well, catching all the entities in the sentences. Notice that it couldn’t recognize the word Türk as a nation.

An Important Note about Lower and Upper Cases

I noticed a problem with lowercase. If I write all these entity names in lowercase, the algorithm fails.

sentences2 = "ntv, 1996 yılında cavit çağlar tarafından kurulan, türkiye'nin ilk haber kanalıdır. Ocak 1999'da doğuş yayın grubu bünyesine katılarak yakaladığı başarı ile türk medya endüstrisini değiştiren ntv, türkiye'de tematik kanal dönemini başlattı."
result = ner.findNamedEntities(JString(str(sentences2)))
namedEntities = result.getNamedEntities()
for e in namedEntities:
    print(e.toString())
Output of the code by Ayşe Kübra Kuyucu

It caught only Türkiye as a location when I wrote the entities in lowercase.

However, it is totally okay if you write all non-entity words in uppercase. So, we can convert the whole string to uppercase and then run the algorithm.

I appended .replace('i', 'İ').upper() to the string to convert all letters to Turkish uppercase. The replace call comes first because Python's upper() turns i into I rather than the dotted İ that Turkish uses.

result = ner.findNamedEntities(JString(str(sentences2.replace('i', 'İ').upper())))
namedEntities = result.getNamedEntities()
for e in namedEntities:
    print(e.toString())
The output of the code by Ayşe Kübra Kuyucu

So, it got everything right, as it did the first time.
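
The uppercase trick is easy to get wrong, so it may be worth isolating it in a small helper. The name turkish_upper is my own; the logic is the same replace-then-upper step used above.

```python
def turkish_upper(text):
    """Uppercase text using Turkish casing for the letter i."""
    # Python's str.upper() maps "i" to "I"; Turkish needs the dotted "İ".
    # Dotless "ı" already uppercases to "I", so it needs no special handling.
    return text.replace("i", "İ").upper()


print(turkish_upper("türkiye'nin ilk haber kanalıdır"))
# prints: TÜRKİYE'NİN İLK HABER KANALIDIR
```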

Let’s make it a function

I converted our code into a function that receives a string and prints the entities in it.

def entities_in_string(a_string):
    result = ner.findNamedEntities(JString(str(a_string.replace('i', 'İ').upper())))
    namedEntities = result.getNamedEntities()
    for e in namedEntities:
        print(e.toString())

Let’s try it with our sentences.

entities_in_string(sentences2)
The output of the code by Ayşe Kübra Kuyucu
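
If you want to use the entities in later code rather than only print them, a variant that returns a list may be handier. This is a sketch under a few assumptions: find_entities is a hypothetical name, ner is the PerceptronNer object loaded earlier, and passing a plain Python str relies on JPype converting it to a Java String automatically (wrap it in JString as above if you prefer to be explicit).

```python
def find_entities(ner, text):
    """Return the entities found in text as a list of strings instead of printing them."""
    # Same Turkish-aware uppercasing as in the article.
    prepared = text.replace("i", "İ").upper()
    result = ner.findNamedEntities(prepared)
    # toString() is the only entity method the article uses, so the sketch sticks to it.
    return [str(entity.toString()) for entity in result.getNamedEntities()]


# Usage, once the model is loaded as shown earlier:
# entities = find_entities(ner, sentences2)
```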

All is well, but I would like to try something different. I am curious what will happen if we try single words.

Does it check the word or the ‘word in the sentence’?

I put entities in a list word by word.

words = ['ntv', 'cavit', 'çağlar', "Türkiye'nin", 'doğuş', 'yayın', 'grubu']

Then, I ran the function in a for loop.

for w in words:
    entities_in_string(w)
The output of the code by Ayşe Kübra Kuyucu

It caught several of them, so the result seems to depend on the whole sentence as well. You can go and try different things yourself.

I will end by sharing the full code.

Full Code

# import
from jpype import JClass, getDefaultJVMPath, shutdownJVM, startJVM, JString
# path to jar file
ZEMBEREK_PATH = r"jars/zemberek-full.jar"
# start jvm
startJVM(getDefaultJVMPath(), "-ea", "-Djava.class.path=%s" % (ZEMBEREK_PATH))
# Getting necessary classes and objects
TurkishMorphology: JClass = JClass("zemberek.morphology.TurkishMorphology")
PerceptronNer: JClass = JClass("zemberek.ner.PerceptronNer")
Paths: JClass = JClass("java.nio.file.Paths")
modelRoot = Paths.get('test-model/model')
morphology = TurkishMorphology.createWithDefaults()
ner = PerceptronNer.loadModel(modelRoot, morphology)
sentences2 = "ntv, 1996 yılında cavit çağlar tarafından kurulan, türkiye'nin ilk haber kanalıdır. Ocak 1999'da doğuş yayın grubu bünyesine katılarak yakaladığı başarı ile türk medya endüstrisini değiştiren ntv, türkiye'de tematik kanal dönemini başlattı."

def entities_in_string(a_string):
    result = ner.findNamedEntities(JString(str(a_string.replace('i', 'İ').upper())))
    namedEntities = result.getNamedEntities()
    for e in namedEntities:
        print(e.toString())

entities_in_string(sentences2)
The output of the code by Ayşe Kübra Kuyucu

Note: If you run this full code twice in the same Python session, you will get this error, because the JVM is already running.

Error by Ayşe Kübra Kuyucu

Just comment out the startJVM line when you run the code a second time to avoid this error.
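
An alternative to commenting the line out is to start the JVM only when none is running. The sketch below uses JPype's isJVMStarted() for that; the wrapper name ensure_jvm is my own, and the jar path is the one used in this article.

```python
def ensure_jvm(jar_path):
    """Start the JVM with the given jar on the classpath unless one is already running.

    Returns True if this call started the JVM, False if it was already up.
    """
    import jpype  # imported lazily so the function can be defined without JPype installed

    if jpype.isJVMStarted():
        return False  # starting a second JVM in the same process would raise an error
    jpype.startJVM(
        jpype.getDefaultJVMPath(),
        "-ea",
        "-Djava.class.path=%s" % jar_path,
    )
    return True


# Usage: call this instead of startJVM(...) at the top of the script.
# ensure_jvm(r"jars/zemberek-full.jar")
```

With this in place, the rest of the full code can be re-run in the same session without edits.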

I hope you enjoy this article.
