Over 15 years ago, I wrote a Chinese-English translator in PHP. I learned a lot from the project (including a lot of Chinese slang). It was frustrating getting the encoding conversions right, but the end result was an understandable translation. I stopped working on it when Google rolled out their own translator, which appeared to be just as good.
My translator was not very efficient. I’m not sure how Google Translate works (maybe similarly, maybe not), but I will describe my approach here to help anyone interested in writing translation software.
There are two major components to the system.
The first component is the front-end. The front-end is the web page where you access the translator. It is essentially a big text-box where you can paste Chinese text, plus a “Submit” button. Once you hit “Submit”, the entered text is sent to the second major component, the back-end.
The back-end consists of the translation algorithm and the database. The database contains pairs mapping Chinese words and phrases to their English translations. The translation algorithm will:
- Split all of the text into sentences.
- For each sentence of length n:
  - Check if the first n characters are in the database. If they are, output the translation. If not,
  - Check if the first n-1 characters are in the database. If they are, output the translation. If not, repeat this step with one fewer character each time until a translation is found or the length reaches 0 (the text cannot be translated).
  - Move on to the next set of untranslated characters and repeat.
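The steps above amount to a greedy longest-match lookup. Here is a minimal sketch in Python rather than the original PHP. The `PHRASES` dictionary is a hypothetical stand-in for the database, and the function names and the choice to pass untranslatable characters through unchanged are my assumptions for illustration, not a record of the original code.

```python
import re

# Hypothetical toy dictionary standing in for the Chinese-English database.
PHRASES = {
    "你好": "hello",
    "世界": "world",
    "我": "I",
    "爱": "love",
    "你": "you",
}


def split_sentences(text):
    """Split text into sentences, keeping any trailing 。！？ with its sentence."""
    return re.findall(r"[^。！？]+[。！？]?", text)


def translate_sentence(sentence, phrases):
    """Greedy longest-match: try the longest remaining prefix first, then shrink."""
    output = []
    i = 0
    while i < len(sentence):
        remaining = len(sentence) - i
        for n in range(remaining, 0, -1):
            chunk = sentence[i:i + n]
            if chunk in phrases:
                output.append(phrases[chunk])
                i += n  # move on to the next untranslated characters
                break
        else:
            # No prefix matched (the "n = 0" case): assumption is to pass
            # the character through unchanged and continue after it.
            output.append(sentence[i])
            i += 1
    return " ".join(output)


def translate(text, phrases=PHRASES):
    return [translate_sentence(s, phrases) for s in split_sentences(text)]
```

Calling `translate_sentence("你好世界", PHRASES)` matches "你好" before it ever tries "你", which is the point of starting from the longest prefix. It also shows why this is slow: in the worst case every position in a sentence of length n triggers up to n lookups.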

Some further notes:
- This method of translating requires a massive database of translated words, phrases, names, and slang.
- It would be most useful to allow users to enter a web address to translate.
- It may be more effective to pre-process the text by rearranging different types of words (verbs, adjectives, nouns) before translation, since word order differs between Chinese and English.
- Loading the translation progressively via AJAX, as each piece is translated, would make the wait less painful.
- The prototype I originally wrote doesn’t work anymore (encoding problems), and I had concerns about the security of user input (cross-site scripting), so I will not put it back online anytime soon.
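
For anyone rebuilding something like this, the two problems in the last note (encoding and cross-site scripting) can be handled roughly as follows. This is a sketch in Python, not the original PHP. The function names are hypothetical, and `gb18030` as the default input encoding is an assumption, since I have not said which encodings the prototype dealt with.

```python
import html


def decode_submission(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode submitted bytes to text.

    The default encoding is an assumption; undecodable bytes are replaced
    with U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


def render_result(translation: str) -> str:
    """Escape the text before placing it in HTML so markup in the input is inert."""
    return "<p>" + html.escape(translation) + "</p>"
```

The idea is to decode once at the boundary, work with Unicode strings everywhere inside, and escape only at the moment the text is written back into a page.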