Back in 2003, when I was in graduate school getting an MBA in Information Technology, I wrote a Chinese-to-English translation web app on a LAMP stack (Linux, Apache, MySQL, PHP). It worked very well, but I never made it public because I had concerns about its security and efficiency.
I tinkered with it from time to time, mostly just adding words and phrases to the database. Then I completely gave up soon after Google released their own Chinese to English translation with Google Translate in 2007. After all, there’s no point in developing an app if I can’t do better than what’s already out there.
Recently, I dug up the old code and cleaned it up, thinking I might revive the project. But then I remembered why I abandoned it in the first place, and saw that there’s no point in reviving it. So… I’m just going to release the code in case anyone is interested in one approach to language translation. You can use this approach to translate any language, but it probably works best with Asian languages where one character generally equals one word.
How this Chinese to English translator works
It translates one sentence at a time. It starts at the first character and reads until it finds an ending character such as a period, an end of line, or other punctuation that typically marks the end of a sentence.
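As a rough illustration of that first step, here is a minimal sketch that splits text into sentence-sized chunks. It is not taken from the code further down: the function name splitSentences and the list of end marks are assumptions for illustration, and it works on UTF-8 text, whereas the original reads Big5 bytes and breaks at punctuation as it goes.

```php
<?php
// Hypothetical helper: split UTF-8 text into sentences, keeping each end mark.
// The set of end marks (。！？；.!? and newline) is an assumption for this sketch.
function splitSentences(string $text): array
{
    // The lookbehind splits right after an end mark without consuming it.
    $parts = preg_split('/(?<=[。！？；.!?\n])/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $sentences = [];
    foreach ($parts as $part) {
        $part = trim($part); // drop the newline and surrounding ASCII whitespace
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

print_r(splitSentences("你好嗎？你好。"));
```

Each chunk can then be translated on its own, which keeps the strings sent to the dictionary short.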
The code looks up and substitutes the English equivalent of the Chinese words and phrases it finds. The database ideally contains thousands or tens of thousands of entries… the more the better. The database entries look like this:
Index | Chinese | English |
1 | 你 | you |
2 | 好 | good |
3 | 你好 | hello |
4 | 好嗎 | ok |
5 | 你好嗎 | how are you |
For each sentence, it first queries the database to see if the entire sentence is in there. If not, it removes the last character from the string and queries the database with the new, shorter string. It keeps popping the last character off and querying until it either finds the phrase in the database or runs out of characters (which means the first character isn’t in the database and can’t be translated, so it is printed as-is). It then repeats the same process on the rest of the sentence.
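To make that loop concrete, here is a minimal sketch of the same longest-match idea. It is a simplification under stated assumptions, not the code listed further down: a plain PHP array stands in for the database table, the text is UTF-8 rather than Big5, and the function name translateSentence is made up for the example.

```php
<?php
// Stand-in for the ss_chinese table, using the sample rows shown above.
$dictionary = [
    '你'     => 'you',
    '好'     => 'good',
    '你好'   => 'hello',
    '好嗎'   => 'ok',
    '你好嗎' => 'how are you',
];

// Hypothetical helper: translate one sentence by greedy longest-prefix matching.
function translateSentence(string $sentence, array $dictionary): string
{
    // Split the UTF-8 string into an array of single characters.
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $output = [];
    while (count($chars) > 0) {
        $length = count($chars);
        // Try the whole remainder first, then drop one character per miss.
        while ($length > 0) {
            $candidate = implode('', array_slice($chars, 0, $length));
            if (isset($dictionary[$candidate])) {
                break;
            }
            $length--;
        }
        if ($length === 0) {
            // Not even the first character is known: pass it through untranslated.
            $output[] = $chars[0];
            $length = 1;
        } else {
            $output[] = $dictionary[$candidate];
        }
        // Continue with whatever follows the matched (or skipped) prefix.
        $chars = array_slice($chars, $length);
    }
    return implode(' ', $output);
}

echo translateSentence('你好嗎', $dictionary), "\n"; // how are you
echo translateSentence('好你好', $dictionary), "\n"; // good hello
```

In the second call, 好你好 and 好你 both miss, 好 matches, and the remaining 你好 is then found in a single lookup.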
As you can imagine, this results in many, many queries to the database, so it may not be practical for allowing many people to translate many large documents.
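To put a rough number on that, the sketch below counts the worst case, where every multi-character candidate misses and each pass only consumes one character. The function name worstCaseLookups and the optional cap on candidate length are assumptions for illustration; the cap is one possible mitigation, not something the original code does.

```php
<?php
// Hypothetical helper: worst-case number of dictionary lookups for one sentence,
// assuming every multi-character candidate misses, so each pass consumes one character.
// $maxPhraseLength is an assumed optional cap on candidate length (null = no cap).
function worstCaseLookups(int $sentenceLength, ?int $maxPhraseLength = null): int
{
    $lookups = 0;
    for ($remaining = $sentenceLength; $remaining > 0; $remaining--) {
        $longest = $maxPhraseLength === null ? $remaining : min($remaining, $maxPhraseLength);
        $lookups += $longest; // one lookup for each candidate length from $longest down to 1
    }
    return $lookups;
}

echo worstCaseLookups(20), "\n";    // 210
echo worstCaseLookups(20, 4), "\n"; // 74
```

Without a cap the count grows with the square of the sentence length, so a cap set to the longest phrase stored in the dictionary would cut the worst case considerably without changing the results.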
The Code
Word of warning: I’ve never considered myself a “good” coder. Any seasoned professional would probably see all the problems with my code. That said, here is all of the code for my Chinese to English Translator web app.
Note that there are security issues with this code. Since I wrote it for my own use, I never added any code to check that the user isn’t inputting something they really shouldn’t be inputting.
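For anyone who wants to harden it, here is a minimal sketch of the kind of checks that are missing: read the form field explicitly, reject oversized input, and escape the text before echoing it back into the page. The helper names and the 10,000-byte limit are assumptions for illustration. htmlspecialchars() does accept BIG5 as an encoding, which matches the charset the pages declare.

```php
<?php
// Assumed limit for this sketch; the original code has no limit.
const MAX_INPUT_BYTES = 10000;

// Hypothetical helper: return the field only if it is a string of acceptable size.
function readSubmittedText(array $post, string $field): ?string
{
    if (!isset($post[$field]) || !is_string($post[$field])) {
        return null;
    }
    if (strlen($post[$field]) > MAX_INPUT_BYTES) {
        return null;
    }
    return $post[$field];
}

// Hypothetical helper: escape text before echoing it into a Big5-encoded page.
function escapeForHtml(string $text, string $charset = 'BIG5'): string
{
    return htmlspecialchars($text, ENT_QUOTES, $charset);
}

// In a real page the first argument would be $_POST.
$text = readSubmittedText(['text' => '<script>alert(1)</script>'], 'text');
echo escapeForHtml($text), "\n"; // &lt;script&gt;alert(1)&lt;/script&gt;
```

This only covers what gets printed back to the browser. The database queries need their own protection, since they build SQL by string concatenation and rely on the old mysql_* functions, which were removed in PHP 7.0.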
The MySQL database setup used for this code is extremely simple. Probably too simple. I simply created one table called “ss_chinese”. The table has three fields: “index”, “chinese”, and “english”. “index” is simply to ensure that each entry is unique. “chinese” contains a Chinese character or phrase. “english” contains the English translation of the Chinese character or phrase.
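Here is a minimal sketch of that table and an exact-match lookup using PDO prepared statements, which keep user input out of the SQL string. Several things here are assumptions: SQLite stands in for MySQL so the sketch needs no server, the column types and the UNIQUE constraint are guesses (only the three field names come from my setup), and the function names are made up for the example.

```php
<?php
// Assumption: SQLite stands in for MySQL so this sketch runs without a server.
// Column types and the UNIQUE constraint are guesses; only the field names are original.
function createDictionaryTable(PDO $pdo): void
{
    $pdo->exec('CREATE TABLE ss_chinese (
        "index" INTEGER PRIMARY KEY,
        chinese TEXT NOT NULL UNIQUE,
        english TEXT NOT NULL
    )');
}

// Hypothetical helper: add one entry; placeholders keep the values out of the SQL text.
function addEntry(PDO $pdo, string $chinese, string $english): void
{
    $statement = $pdo->prepare('INSERT INTO ss_chinese (chinese, english) VALUES (?, ?)');
    $statement->execute([$chinese, $english]);
}

// Hypothetical helper: exact lookup, returning null when the phrase is not stored.
function lookupEntry(PDO $pdo, string $chinese): ?string
{
    $statement = $pdo->prepare('SELECT english FROM ss_chinese WHERE chinese = ?');
    $statement->execute([$chinese]);
    $english = $statement->fetchColumn();
    return $english === false ? null : $english;
}

// Usage, assuming the pdo_sqlite extension is installed:
// $pdo = new PDO('sqlite::memory:');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// createDictionaryTable($pdo);
// addEntry($pdo, '你好', 'hello');
// echo lookupEntry($pdo, '你好');
```

With MySQL the same prepare() and execute() calls should work through PDO's MySQL driver, though the CREATE TABLE statement would need MySQL syntax.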
index.php
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Chinese to English Translator</title>
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
</head>
<body>
<h1>Chinese to English Translator</h1>
Input Traditional Chinese text (Big5 encoding) to translate to English:<br />
<form method="post" action="translate.php">
<p><textarea rows="10" cols="50" name="text"></textarea></p>
<p><input type="submit" name="translate" /></p>
</form>
</body>
</html>
translate.php
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>Chinese to English Translator</title>
<meta http-equiv="Content-Language" content="zh" />
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
</head>
<body>
<?php
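// read the form field explicitly instead of relying on register_globals
$text = isset($_POST['text']) ? stripslashes($_POST['text']) : '';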
$text = str_replace("\n","<br />",$text);
printf("<h1>Traditional Chinese Text:</h1>\n");
printf("<p>%s</p>\n<hr />\n<h1>English Translation:</h1>", $text);
// connect to dbase
$db = mysql_connect("localhost", "username", "password");
mysql_select_db("databasename",$db);
$uppercaseNext = false; // keep track of whether to uppercase a word
$uppercaseNow = true;
// loop until no more characters in text
while ( strlen($text) )
{
$englishChar = "";
$chineseQuery = "";
// build a query
while ( strlen($text) && !strlen($englishChar) )
{
if ( ord($text)<128 )
{
$englishChar = substr($text,0,1);
$text = substr($text,1);
if ($englishChar=='.' || $englishChar==':')
{
$uppercaseNext = true;
}
}
// else if's break translation strings at chinese punctuation, making translation faster
else if ( substr($text,0,2)=="\xA1\x41" ) // Big5 0xA141, the fullwidth comma ，
{
$englishChar = ',';
$text = substr($text,2);
}
else if ( substr($text,0,2)=="\xA1\x42" ) // Big5 0xA142, the ideographic comma 、
{
$englishChar = ',';
$text = substr($text,2);
}
else if ( substr($text,0,2)=="\xA1\x43" ) // Big5 0xA143, the ideographic full stop 。
{
$uppercaseNext = true;
$englishChar = '.';
$text = substr($text,2);
}
else if ( substr($text,0,2)=="\xA1\x47" ) // Big5 0xA147, the fullwidth colon ：
{
$uppercaseNext = true;
$englishChar = ':';
$text = substr($text,2);
}
else if ( substr($text,0,2)=="\xA1\x75" ) // Big5 0xA175, the opening corner bracket 「
{
$uppercaseNext = true;
$englishChar = '"';
$text = substr($text,2);
}
else if ( substr($text,0,2)=="\xA1\x76" ) // Big5 0xA176, the closing corner bracket 」
{
$uppercaseNext = true;
$englishChar = '"';
$text = substr($text,2);
}
else
{
$chineseQuery = $chineseQuery . substr($text,0,2);
$text = substr($text,2);
}
}
// query
while ($chineseQuery)
{
$chineseQuery = addslashes($chineseQuery); // add slashes because they were added in the database
$result = mysql_query("SELECT * FROM ss_chinese WHERE 1 AND STRCMP(binary `chinese`, '$chineseQuery')=0",$db);
// binary to make query case-sensitive
$chineseQuery = stripslashes($chineseQuery); // remove extraneous slashes from the results
$myrow = mysql_fetch_array($result);
if ($myrow)
{
if ($uppercaseNow == true)
{
printf(" %s", ucfirst($myrow['english']));
$chineseQuery = "";
$uppercaseNow = false;
if ($uppercaseNext == true)
{
$uppercaseNow = $uppercaseNext;
$uppercaseNext = false;
}
}
else if ($uppercaseNext == true)
{
printf(" %s", $myrow['english']);
$chineseQuery = "";
$uppercaseNow = $uppercaseNext;
$uppercaseNext = false;
}
else
{
printf(" %s", $myrow['english']);
$chineseQuery = "";
$uppercaseNow = $uppercaseNext;
}
}
else
{
if ($englishChar!="")
{
$text = $englishChar . $text;
$englishChar = "";
}
if ( strlen($chineseQuery)==2 )
{
printf(" %s ", $chineseQuery);
$chineseQuery = "";
}
else
{
$text = substr($chineseQuery, strlen($chineseQuery)-2) . $text;
$chineseQuery = substr($chineseQuery, 0, strlen($chineseQuery)-2);
}
}
}
// print the ending english character if any
if ($englishChar!="")
{
printf("%s", $englishChar);
$englishChar = "";
}
}
?>
</body>
</html>
chinese.php (tool for adding new entries to database)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Chinese to English Translator Admin</title>
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
</head>
<body>
<h1>Add to the Chinese Dictionary</h1>
<form method="post" action="chinese_admin.php">
<p>Chinese: <input type="text" name="chineseText" size="30" /></p>
<p>English: <input type="text" name="englishText" size="30" /></p>
<p><input type="submit" name="add" value="Add to Dictionary" /></p>
</form>
</body>
</html>
chinese_admin.php
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Chinese Administration</title>
<meta http-equiv="Content-Type" content="text/html; charset=Big5" />
</head>
<body>
<?php
# DATABASE CONNECTION #
$HOST_ADDR = "localhost";
$DB_USER = "username";
$DB_PASS = "password";
$DB_NAME = "databasename";
$db = mysql_connect($HOST_ADDR, $DB_USER, $DB_PASS);
mysql_select_db($DB_NAME,$db);
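// Read the submitted fields explicitly instead of relying on register_globals.
$add = isset($_POST['add']);
$replace = isset($_POST['replace']);
$indexNumber = isset($_POST['indexNumber']) ? (int) $_POST['indexNumber'] : 0;
$chineseText = addslashes(isset($_POST['chineseText']) ? $_POST['chineseText'] : '');
$englishText = addslashes(isset($_POST['englishText']) ? $_POST['englishText'] : '');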
$result = mysql_query("SELECT * FROM `ss_chinese` WHERE 1 AND STRCMP(binary `chinese`, '$chineseText')=0",$db);
$chineseText = stripslashes($chineseText);
$englishText = stripslashes($englishText);
$myrow = mysql_fetch_array($result);
if ($myrow && $add) // already exists
{
echo "<p>Already exists in the dictionary</p>";
$englishTextNoSlashes = stripslashes($englishText);
echo "<form method=\"post\" action=\"chinese_admin.php\">\n";
echo "<input type=\"hidden\" name=\"chineseText\" value=\"$chineseText\" />\n";
echo "<input type=\"hidden\" name=\"englishText\" value=\"$englishText\" />\n";
echo "<input type=\"hidden\" name=\"indexNumber\" value=\"$myrow[index]\" />\n";
echo "<p>Replace $myrow[chinese]=$myrow[english] with $chineseText=$englishTextNoSlashes?\n";
echo "<input type=\"submit\" name=\"replace\" value=\"Replace\" /></p>\n";
echo "</form>\n";
}
else if ($add)
{
$result = mysql_query("INSERT INTO `ss_chinese` ( `index` , `chinese` , `english` ) VALUES ('', '$chineseText', '$englishText');",$db);
$englishText = stripslashes($englishText);
$chineseText = stripslashes($chineseText);
echo "<p>Terms \"$chineseText\" = \"$englishText\" added</p>";
}
else if ($replace)
{
$result = mysql_query("UPDATE `ss_chinese` SET `english` = '$englishText' WHERE `index` = '$indexNumber' LIMIT 1 ;",$db);
echo "<p>Dictionary updated</p>";
}
else
{
echo "<p>LOGIC ERROR: passed button value is unknown</p>";
}
?>
<p><a href="chinese.php">Return</a></p>
</body>
</html>