Case study: automatic Chinese text word segmentation and translation
Executive summary
Within a month, a new feature was delivered for the Words-In-Memory
mobile app. The feature segments Chinese text into separate words,
translates each word into English, extends it with Pinyin, cleans up the
final dataset, and adds it to storage so it shows up on screen. The
feature is integrated into the mobile app but is currently turned off,
because I don't have the resources to pay for hosting or to allocate one
of my machines for 24/7 server work.
Words-In-Memory is my personal project, which I developed almost a
year ago to help me learn Chinese. It is simply a vocabulary trainer.
Active usage revealed the need for the app to automatically split a
newly entered sentence into separate words and to give a translation
for each of them.
During the R&D phase, several options were considered:
1. ML Kit from Google, to automatically recognize separate Chinese
words
2. a BERT-based model, as the leader of the past 5 years in NLP tasks
3. different options, with Chaquopy as the leader, to integrate
Python-based models into a native Android app
4. options to serve a Python-based model over a REST API
5. translation via the Google Translate Android app, which is common on
mobile devices, for the final Chinese-to-English translation
6. the cloud-based Google Translation API
7. the on-device ML Kit translation model
8. open source solutions for extending the original text with Pinyin
(Chinese characters are called Hanzi; their transliteration is
called Pinyin)
The final solution includes a Python backend which uses a tokenizer
to recognize separate Chinese words in the text, and a mobile client
which under the hood uses ML Kit to translate each word into English
in the background. Pinyin transliteration of the original text was
also added with the help of an open source library.
It is worth noting that the other directions, such as an on-device
ML Kit model/tokenizer or integration via Chaquopy, are worth exploring
further. The reason they were left behind is that I had to minimize
time-to-market, and the time I had allocated to explore each of these
technologies had run out.
Task decomposition
The sentence the user enters into the app should be translated into
English. But before that, the text should be split into sentences and
the sentences into words. For Russian or English this is a trivial
task, but for languages such as Chinese, which has no spaces and where
each new combination of symbols may have a new meaning, it becomes
another challenge.
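To make the challenge concrete, here is a toy sketch (not the approach used in the app) of a greedy maximum-matching segmenter over a tiny hypothetical dictionary. It works on this example, but real segmenters need context, because the longest dictionary match is not always the right split.

```python
# Hypothetical mini-dictionary; a real one has tens of thousands of entries.
DICTIONARY = {"我", "喜欢", "喜", "欢", "学习", "中文"}

def max_match(text: str) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest match."""
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character word.
            words.append(text[i])
            i += 1
    return words

print(max_match("我喜欢学习中文"))  # ['我', '喜欢', '学习', '中文']
```

This is exactly the kind of ambiguity a neural segmenter resolves better than a dictionary lookup.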
The task is split into two sub-tasks: Chinese word segmentation and
automatic Chinese-to-English translation.
Word segmentation
Android ML kit
Quite a while ago Google introduced ML Kit on Android, which lets you utilise on-device machine learning for common ML tasks. Translation is one of them, and out of the box I was able to translate Chinese text to English from a screenshot (note: this combines two models, one for OCR and another for translation). It is interesting, and in a way it solves one of the two parts of my task, but I needed to go further.
BERT model
Further research showed that for the past 5 years the leaders in NLP
tasks have been BERT and BERT-based models. Pre-trained models are
available for public use, and some of them even support Chinese.
The original BERT model has a limit of 512 tokens as the maximum
input length. There are known BERT-based models which accept up to
2048 tokens; for longer text some workarounds would be required,
e.g. splitting the text into smaller blocks that fit within the
model's allowed input size (although this might have a negative
impact on inference quality, since a BERT model uses the context
around a word to produce better results).
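A minimal sketch of that splitting workaround, assuming a character-based limit and sentence-ending punctuation as preferred cut points (the function name and delimiter set are illustrative, not project code):

```python
def chunk_text(text: str, max_len: int = 512, delims: str = "。！？!?") -> list[str]:
    """Split text into chunks no longer than max_len, preferring to cut
    right after sentence-ending punctuation so each chunk keeps as much
    local context as possible."""
    chunks: list[str] = []
    start = 0
    while len(text) - start > max_len:
        window = text[start:start + max_len]
        # Prefer the last sentence delimiter inside the window.
        cut = max((window.rfind(d) for d in delims), default=-1)
        cut = cut + 1 if cut >= 0 else max_len  # fall back to a hard cut
        chunks.append(text[start:start + cut])
        start += cut
    chunks.append(text[start:])
    return chunks
```

Each chunk can then be fed to the model independently, at the cost of losing cross-chunk context.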
Unfortunately, the BERT model itself is not what I needed for my task.
What I need is a tokenizer: the component which does word
segmentation. It is used inside BERT, but it is just one of its stages.
A few notes on how ML Kit works
The Android ML Kit framework allows you to run custom models on the device. Before that, they should be converted into the TensorFlow Lite format, which is a version of TensorFlow optimized for mobile devices. Another popular format is PyTorch, and the conversion PyTorch -> TensorFlow -> TensorFlow Lite is also possible. However, these conversions restrict the set of framework operators the model may use, which might be a blocker for such a conversion. You might also lose pre-trained weights during the conversion, but this topic requires further discovery.
Solution
Within the time I had allocated for this task I did not get TensorFlow
or PyTorch models working on Android, for several reasons. I went back
to Python notebooks just to solve the word tokenization and
classification task, and I did it by using a combination of several
classifiers from ckip-transformers.
At that point I had a working classifier, but it was in a format which
doesn't work on Android. To make it work, I needed to find a way to port
it into a supported format or to integrate the Python script into a
Kotlin/Java Android app.
Chaquopy: integrate your python code into native Android app
Further research showed that only a few products offer this, and
Chaquopy looks like the most promising one.
The idea of simply porting your existing Python code into a native
Android app is really tempting. The Chaquopy samples show how to import
and run some popular Python packages, but when I started working on
importing my Python code, I found that out of the box it misses some
packages required by my tokenizers.
Chaquopy is an open source product, so I went into the code looking for
a way to import the packages I needed. I found an instruction, and also
some caveats I might face along the way. I was already behind schedule,
and when I ran into environment conflicts on my machine that were not
easily solved, I decided to postpone this and switch to serving the ML
model via an API. It was the right move, because I later discovered that
my tokenizer uses a ~400 MB PyTorch model under the hood, which would be
an extreme memory overhead for my mobile app.
There is still the option to try to shrink the model (MobileBERT from
Google has a ~100 MB memory footprint), and there is still the option
to try to import the packages and push them into the original repo or
into your own fork. However, all of this takes extra time, which means
these are possible tasks for the future, but not for now.
Backend API on Python to serve your ML model
I exported my Jupyter notebook into a Python script, and refactored the
script into a concise module to be used by the server:
```python
"""
Chinese text classifier
@ref https://ckip-transformers.readthedocs.io/en/stable/main/readme.html#git
"""
import asyncio

from ckip_transformers.nlp import CkipWordSegmenter


class ChineseTextClassifier:
    def __init__(self):
        self.ws_driver = CkipWordSegmenter(model="bert-base")
        self.lock = asyncio.Lock()

    async def run_single_word_segmentation(self, text):
        async with self.lock:
            assert isinstance(text, list)
            ws = self.ws_driver(text, use_delim=True)
            return ws
```
Python has several interesting frameworks for writing a server, and FastAPI looked the most appealing for a simple one. Python web servers are classified by their WSGI or ASGI type, and there are even converters from one to the other and back. In my case this became important later, when I was looking for free hosting: many hostings with a free tier support only WSGI servers. The market overview from my search for free hosting for this project is available here
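For context on the WSGI/ASGI distinction mentioned above: a WSGI app is a plain synchronous callable, while the app FastAPI produces is an asynchronous ASGI callable, so a WSGI-only host needs an adapter (e.g. the a2wsgi package) in between. The smallest possible WSGI application, for illustration only:

```python
# A WSGI callable receives the request environ and a start_response
# callback, and returns an iterable of byte strings as the body.
def wsgi_app(environ, start_response):
    body = b'{"status": 200}'
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

An ASGI app, by contrast, is `async def app(scope, receive, send)`, which is why the two hosting types are not interchangeable without an adapter.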
The REST backend code is below:
```python
from typing import Any

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel

from model import ChineseTextClassifier

app = FastAPI()
model = ChineseTextClassifier()


class TextForClassification(BaseModel):
    text: str


@app.post("/classify")
async def classify(payload: TextForClassification):
    result = await model.run_single_word_segmentation([payload.text])
    return get_response(True, result)


# ref. https://pypi.org/project/fastapi-queue/
def get_response(success_status: bool, result: Any) -> JSONResponse | dict:
    if success_status:
        return {"status": 200, "data": result}
    if result == -1:
        return JSONResponse(status_code=503, content="Service Temporarily Unavailable")
    return JSONResponse(status_code=500, content="Internal Server Error")
```
I was not familiar with the nuances of how synchronisation primitives work in Python, so I left this code for review by sharing my considerations. If you have any constructive feedback or proposals, the repository is open for pull requests.
The Docker container for this looks like this:
```dockerfile
FROM python:3.11

WORKDIR /deployment

COPY ./requirements.txt /deployment/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /deployment/requirements.txt

ENV PYTHONPATH /deployment/src
COPY ./src /deployment/src

CMD ["uvicorn", "src.start:app", "--host", "0.0.0.0", "--port", "80"]
```
Translation
Google Translate Android app
Google developed a good translation app, available to download and
install from the Play Store. The Android ecosystem allows your app
to launch 3rd-party apps to execute some task and even return the
result back to your app. The code below opens the Google Translate
app as a pop-up:
```kotlin
val intent = Intent()
intent.action = Intent.ACTION_PROCESS_TEXT
intent.type = "text/plain"
intent.putExtra(Intent.EXTRA_PROCESS_TEXT_READONLY, true)
intent.putExtra(Intent.EXTRA_PROCESS_TEXT, "hello")
startActivity(intent)
```
This is the Android-native way. Unfortunately, I didn't find a way to do such a translation in the background without user involvement. Maybe it isn't possible; let me know if such an option exists. For now, I had to search for an alternative solution.
Google cloud translation API
The Google Translation API offered 500k characters per month of free translation. However, at the moment they have put a cross sign next to this offer, which usually means it is no longer available. I didn't succeed in contacting their sales team to clarify, but the idea that the API could be cancelled in the future pushed me back to the ML Kit solution.
ML kit translate
ML Kit also provides a translation API.
Its translation quality is lower than that of the original native app,
but good enough for my case. The ML model used for this has a quite
small storage footprint of ~20 MB.
Before usage, the model should be downloaded to the device. I have not
found a way to download the model separately and put it into assets so
it could be packaged into the final APK. The code is:
```kotlin
fun prepare(listener: ITranslationListener?) {
    val conditions = DownloadConditions.Builder()
        .requireWifi()
        .build()
    chineseToEnglishTranslator.downloadModelIfNeeded(conditions)
        .addOnSuccessListener {
            isTranslationModelReady = true
            listener?.onModelDownloaded()
        }
        .addOnFailureListener { exception -> listener?.onModelDownloadFail(exception) }
}

fun translateChineseText(text: String, listener: ITranslationListener) {
    chineseToEnglishTranslator.translate(text)
        .addOnSuccessListener { translatedText -> listener.onTranslationSuccess(translatedText) }
        .addOnFailureListener { exception -> listener.onTranslationFailed(exception) }
}
```
Altogether, on the first launch it takes about 30 seconds to download the model and translate a Chinese sentence of 16 words. All this work has been moved into the background, which fits my UX.
It was interesting to discover that this translation API also uses tokenizers. Unfortunately, I have not found a public API to leverage them. It would be good to have them opened up in future releases of the ML Kit translation API.
Extension with Pinyin
While learning, I focus first on the Pinyin versions of Chinese words
instead of memorizing the characters. That's why having a Pinyin
transliteration of the original word or sentence is important.
Quick research showed several available open source products on the
market, and I chose this one.
It is quite old and does not keep the tone marks on top of the letters,
which are important for recognizing tones, but for a first release it
was good enough.
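As a toy illustration of what "extending with Pinyin" means (the two-entry mapping below is hypothetical sample data, not the library used in the app), the ideal output annotates each Hanzi with its transliteration, tone marks included:

```python
# Hypothetical readings with tone marks; a real library ships a full table.
TONE_MARKED = {"你": "nǐ", "好": "hǎo"}

def add_pinyin(text: str) -> str:
    # Fall back to the character itself when no reading is known.
    return " ".join(f"{ch}({TONE_MARKED.get(ch, ch)})" for ch in text)

print(add_pinyin("你好"))  # 你(nǐ) 好(hǎo)
```

The library I picked produces the same shape of output, just without the tone diacritics.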
More work on ML:
House price prediction