Case study: automatic Chinese text word segmentation and translation
Executive summary
Within a month, a new feature was delivered for the Words-In-Memory
mobile app. The feature segments Chinese text into separate words,
translates each word into English, extends it with Pinyin, cleans up the
final dataset, and adds it to storage so it shows up on screen. The
feature is integrated into the mobile app but is currently turned off,
because I don't have the resources to pay for hosting or to allocate one
of my machines for 24/7 server work.
Words-In-Memory is my personal project, which I developed almost a
year ago to help me learn Chinese. It is simply a vocabulary trainer.
Active usage revealed the need for the app to automatically split a
newly entered sentence into separate words and to give a translation
for each of them.
During the R&D phase, several options were considered:
1. ML Kit from Google, to automatically recognize separate Chinese
words
2. a BERT-based model, as the leader of the past 5 years in NLP tasks
3. different options, with Chaquopy as the leader, to integrate
Python-based models into a native Android app
4. options to serve a Python-based model over a REST API
5. translation via the Google Translate Android app, which is common on
mobile devices, for the final Chinese-to-English translation
6. the cloud-based Google Translation API
7. the on-device ML Kit translation model
8. open source solutions for extending the original text with Pinyin
(Chinese characters are called Hanzi; their transliteration is
called Pinyin)
The final solution includes a Python backend which uses a tokenizer
to recognize separate Chinese words in the text, and a mobile client
which under the hood uses ML Kit to translate each word into English
in the background. Pinyin transliteration of the original text was
also added with the help of an open source library.
It is worth noting that the other directions, such as an on-device
ML Kit model/tokenizer or integration via Chaquopy, are worth exploring
further. The reason they were left behind is that I had to minimize
time-to-market, and the time I had allocated to explore each of these
technologies had run out.
Task decomposition
The sentence the user enters into the app should be translated into
English. But before that, the text should be split into sentences and
the sentences into words. For Russian or English this is a trivial
task, but for languages such as Chinese, which has no spaces and where
each new combination of symbols may have a new meaning, it becomes
another challenge.
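To make the challenge concrete, here is a toy sketch (not the approach used in the app) of a greedy maximum-matching segmenter over a tiny hypothetical dictionary. It works on this example, but real segmenters need context, because the longest dictionary match is not always the right split.

```python
# Hypothetical mini-dictionary; a real one has tens of thousands of entries.
DICTIONARY = {"我", "喜欢", "喜", "欢", "学习", "中文"}

def max_match(text: str) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest match."""
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character word.
            words.append(text[i])
            i += 1
    return words

print(max_match("我喜欢学习中文"))  # ['我', '喜欢', '学习', '中文']
```

This is exactly the kind of ambiguity a neural segmenter resolves better than a dictionary lookup.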
The task is split into two sub-tasks: Chinese word segmentation and
automatic Chinese-to-English translation.
Word segmentation
Android ML kit
Quite a while ago Google introduced ML Kit on Android, which lets you utilise on-device machine learning for common ML tasks. Translation is one of them, and out of the box I was able to translate Chinese text to English from a screenshot (note: this combines two models, one for OCR and another for translation). It is interesting, and in a way it solves one of the two parts of my task, but I needed to go further.
BERT model
Further research showed that for the past 5 years the leaders in NLP
tasks have been BERT and BERT-based models. Pre-trained models are
available for public use, and some of them even support Chinese.
The original BERT model has a limit of 512 tokens as the maximum
input length. There are known BERT-based models which accept up to
2048 tokens; for longer text some workarounds would be required,
e.g. splitting the text into smaller blocks that fit within the
model's allowed input size (although this might have a negative
impact on inference quality, since a BERT model uses the context
around a word to produce better results).
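A minimal sketch of that splitting workaround, assuming a character-based limit and sentence-ending punctuation as preferred cut points (the function name and delimiter set are illustrative, not project code):

```python
def chunk_text(text: str, max_len: int = 512, delims: str = "。！？!?") -> list[str]:
    """Split text into chunks no longer than max_len, preferring to cut
    right after sentence-ending punctuation so each chunk keeps as much
    local context as possible."""
    chunks: list[str] = []
    start = 0
    while len(text) - start > max_len:
        window = text[start:start + max_len]
        # Prefer the last sentence delimiter inside the window.
        cut = max((window.rfind(d) for d in delims), default=-1)
        cut = cut + 1 if cut >= 0 else max_len  # fall back to a hard cut
        chunks.append(text[start:start + cut])
        start += cut
    chunks.append(text[start:])
    return chunks
```

Each chunk can then be fed to the model independently, at the cost of losing cross-chunk context.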
Unfortunately, the BERT model itself is not what I needed for my task.
What I need is a tokenizer: the component which does word
segmentation. It is used inside BERT, but it is just one of its stages.
A few notes on how ML Kit works
The Android ML Kit framework allows you to run custom models on the device. Before that, they should be converted into the TensorFlow Lite format, which is a version of TensorFlow optimized for mobile devices. Another popular format is PyTorch, and the conversion PyTorch -> TensorFlow -> TensorFlow Lite is also possible. However, these conversions restrict the set of framework operators the model may use, which might be a blocker for such a conversion. You might also lose pre-trained weights during the conversion, but this topic requires further discovery.
Solution
Within the time I had allocated for this task I did not get TensorFlow
or PyTorch models working on Android, for several reasons. I went back
to Python notebooks just to solve the word tokenization and
classification task, and I did it by using a combination of several
classifiers from ckip-transformers.
At that point I had a working classifier, but it was in a format which
doesn't work on Android. To make it work, I needed to find a way to port
it into a supported format or to integrate the Python script into a
Kotlin/Java Android app.
Chaquopy: integrate your python code into native Android app
Further research showed that only a few products offer this, and
Chaquopy looks like the most promising one.
The idea of simply porting your existing Python code into a native
Android app is really tempting. The Chaquopy samples show how to import
and run some popular Python packages, but when I started working on
importing my Python code, I found that out of the box it misses some
packages required by my tokenizers.
Chaquopy is an open source product, so I went into the code looking for
a way to import the packages I needed. I found an instruction, and also
some caveats I might face along the way. I was already behind schedule,
and when I ran into environment conflicts on my machine that were not
easily solved, I decided to postpone this and switch to serving the ML
model via an API. It was the right move, because I later discovered that
my tokenizer uses a ~400 MB PyTorch model under the hood, which would be
an extreme memory overhead for my mobile app.
There is still the option to try to shrink the model (MobileBERT from
Google has a ~100 MB memory footprint), and there is still the option
to try to import the packages and push them into the original repo or
into your own fork. However, all of this takes extra time, which means
these are possible tasks for the future, but not for now.
Backend API on Python to serve your ML model
I exported my Jupyter notebook into a Python script, and refactored the
script into a concise module to be used by the server:
```python
"""
Chinese text classifier
@ref https://ckip-transformers.readthedocs.io/en/stable/main/readme.html#git
"""
import asyncio

from ckip_transformers.nlp import CkipWordSegmenter


class ChineseTextClassifier:
    def __init__(self):
        self.ws_driver = CkipWordSegmenter(model="bert-base")
        self.lock = asyncio.Lock()

    async def run_single_word_segmentation(self, text):
        async with self.lock:
            assert isinstance(text, list)
            ws = self.ws_driver(text, use_delim=True)
            return ws
```
Python has several interesting frameworks for writing a server, and FastAPI looked the most appealing for a simple one. Python web servers are classified by their WSGI or ASGI type, and there are even converters from one to the other and back. In my case this became important later, when I was looking for free hosting: many hostings with a free tier support only WSGI servers. The market overview from my search for free hosting for this project is available here
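For context on the WSGI/ASGI distinction mentioned above: a WSGI app is a plain synchronous callable, while the app FastAPI produces is an asynchronous ASGI callable, so a WSGI-only host needs an adapter (e.g. the a2wsgi package) in between. The smallest possible WSGI application, for illustration only:

```python
# A WSGI callable receives the request environ and a start_response
# callback, and returns an iterable of byte strings as the body.
def wsgi_app(environ, start_response):
    body = b'{"status": 200}'
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

An ASGI app, by contrast, is `async def app(scope, receive, send)`, which is why the two hosting types are not interchangeable without an adapter.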
The REST backend code is below:
```python
from typing import Any

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel

from model import ChineseTextClassifier

app = FastAPI()
model = ChineseTextClassifier()


class TextForClassification(BaseModel):
    text: str


@app.post("/classify")
async def classify(payload: TextForClassification):
    result = await model.run_single_word_segmentation([payload.text])
    return get_response(True, result)


# ref. https://pypi.org/project/fastapi-queue/
def get_response(success_status: bool, result: Any) -> JSONResponse | dict:
    if success_status:
        return {"status": 200, "data": result}
    if result == -1:
        return JSONResponse(status_code=503, content="Service Temporarily Unavailable")
    return JSONResponse(status_code=500, content="Internal Server Error")
```
I was not familiar with the nuances of how synchronisation primitives work in Python, so I left this code for review by sharing my considerations. If you have any constructive feedback or proposals, the repository is open for pull requests.
The Docker container for this looks like this:
```dockerfile
FROM python:3.11

WORKDIR /deployment

COPY ./requirements.txt /deployment/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /deployment/requirements.txt

ENV PYTHONPATH /deployment/src
COPY ./src /deployment/src

CMD ["uvicorn", "src.start:app", "--host", "0.0.0.0", "--port", "80"]
```
Translation
Google Translate Android app
Google developed a good translation app, available to download and
install from the Play Store. The Android ecosystem allows your app
to launch 3rd-party apps to execute some task and even return the
result back to your app. The code below opens the Google Translate
app as a pop-up:
```kotlin
val intent = Intent()
intent.action = Intent.ACTION_PROCESS_TEXT
intent.type = "text/plain"
intent.putExtra(Intent.EXTRA_PROCESS_TEXT_READONLY, true)
intent.putExtra(Intent.EXTRA_PROCESS_TEXT, "hello")
startActivity(intent)
```
This is the Android-native way. Unfortunately, I didn't find a way to do such a translation in the background without user involvement. Maybe it isn't possible; let me know if such an option exists. For now, I had to search for an alternative solution.
Google cloud translation API
The Google Translation API offered 500k characters per month of free translation. However, at the moment they have put a cross sign next to this offer, which usually means it is no longer available. I didn't succeed in contacting their sales team to clarify, but the idea that the API could be cancelled in the future pushed me back to the ML Kit solution.
ML kit translate
ML Kit also provides a translation API.
Its translation quality is lower than that of the original native app,
but good enough for my case. The ML model used for this has a quite
small storage footprint of ~20 MB.
Before usage, the model should be downloaded to the device. I have not
found a way to download the model separately and put it into assets so
it could be packaged into the final APK. The code is:
```kotlin
fun prepare(listener: ITranslationListener?) {
    val conditions = DownloadConditions.Builder()
        .requireWifi()
        .build()
    chineseToEnglishTranslator.downloadModelIfNeeded(conditions)
        .addOnSuccessListener {
            isTranslationModelReady = true
            listener?.onModelDownloaded()
        }
        .addOnFailureListener { exception -> listener?.onModelDownloadFail(exception) }
}

fun translateChineseText(text: String, listener: ITranslationListener) {
    chineseToEnglishTranslator.translate(text)
        .addOnSuccessListener { translatedText -> listener.onTranslationSuccess(translatedText) }
        .addOnFailureListener { exception -> listener.onTranslationFailed(exception) }
}
```
Altogether, on the first launch it takes about 30 seconds to download the model and translate a Chinese sentence of 16 words. All this work has been moved into the background, which fits my UX.
It was interesting to discover that this translation API also uses tokenizers. Unfortunately, I have not found a public API to leverage them. It would be good to have them opened up in future releases of the ML Kit translation API.
Extension with Pinyin
While learning, I focus first on the Pinyin versions of Chinese words
instead of memorizing the characters. That's why having a Pinyin
transliteration of the original word or sentence is important.
Quick research showed several available open source products on the
market, and I chose this one.
It is quite old and does not keep the tone marks on top of the letters,
which are important for recognizing tones, but for a first release it
was good enough.
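As a toy illustration of what "extending with Pinyin" means (the two-entry mapping below is hypothetical sample data, not the library used in the app), the ideal output annotates each Hanzi with its transliteration, tone marks included:

```python
# Hypothetical readings with tone marks; a real library ships a full table.
TONE_MARKED = {"你": "nǐ", "好": "hǎo"}

def add_pinyin(text: str) -> str:
    # Fall back to the character itself when no reading is known.
    return " ".join(f"{ch}({TONE_MARKED.get(ch, ch)})" for ch in text)

print(add_pinyin("你好"))  # 你(nǐ) 好(hǎo)
```

The library I picked produces the same shape of output, just without the tone diacritics.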
More work on ML:
House price prediction