Before we can discuss the “ideal” data, we need realistic expectations about SMT’s capabilities. Our LinkedIn group’s 10-sentence workshop discussions are designed to help with this.
SD users must understand how to use high-level components. Metaphorically, it's like the knowledge a car driver needs to operate his car. Drivers don't need to be engineers who can optimize components inside the internal combustion engine. They need more practical, everyday knowledge about the ignition, shifter, gas pedal, brake pedal, steering wheel, gas tank, oil changes, radiator coolant, etc.
Any given driver can choose to delve deeper into any or all of the detailed components. Granny needs skills to drive back and forth to the grocery store, identify an empty gas tank, fill the tank, and change the oil every 3 months (because she’ll never exceed 3,000 miles). Speed Racer must have a deeper understanding of the inner workings of all the components, but he still has a pit crew to optimize the car. A mechanical engineer might redesign spark plugs or refit a turbo boost, but he might be an average driver behind the wheel.
Until Slate Desktop, all SMT “drivers” also had to be mechanical engineers. As we move forward, we’re defining the boundaries between various driver skill levels. I have expertise (I hesitate to call myself an expert), but what works for me might not work for everyone. So, I propose a boundary. Your feedback helps us refine and redefine these boundaries.
Back to our subject… “ideal” data.
Once you know the capabilities of each SMT component and how those components interact, you can start designing your data sets. Based on traditional SMT publications, there’s a natural inclination to simply add more data and clean it better. Several SD users have learned that blindly adding data often degrades the linguistic performance they were getting from their smaller engines.
The SMT model and its components (translation model and language model) are called models for a reason. SMT is classified as machine learning, a subclass of artificial intelligence. Big deal! Let me introduce a more practical concept that’s often overlooked. SMT is a predictive modeling system. It uses statistical models, created from historical events (e.g. you typing is an event), to predict an event that might occur under similar conditions in the future.
It’s like a weather model. The model receives sensor data and predicts the most likely size and path of a storm. Sensor data from the Philippines have little bearing on a weather model’s prediction of an Atlantic storm’s size and path.
Another metaphor? George Clooney exists in the real world. Then, look at these pictures of George: a studio photo portrait and two caricatures.
The photo portrait is a close and proportionate representation of George in the real world. It's the ideal 2-D model of George. Now, imagine that one of the caricatures is your SMT model. You recognize it’s George. The eyes or nose or mouth might be perfect, but the brow or ears or cheeks are a bit “off.” Now you want to morph your caricature to look like the real George, or more accurately, like the portrait 2-D model. Blindly adding BIG DATA won’t work. You’ll inflate the entire picture. The result is bigger. Maybe some parts have greater resolution, but it’s still a distorted caricature. Alternatively, you take a surgical approach. You don’t want to change the perfect parts. You want to erase, over-draw or somehow change the parts that are “off.”
Again, let’s bring this back to SMT.
Your “ideal” corpus is a proportionate (balanced) and representative subset of the real-world entity. First you need to decide what is the real-world entity you want to represent… George Clooney. Then, you have the proportionate and representative version… the portrait. Then, you have your disproportionate but mostly representative version… the caricature.
When an SMT model does a good job for some sentences, it means the corpus has sufficient, balanced data for those token pairs and their interrelationships. Adding more TUs like the ones that succeeded will not improve anything, and it might actually make things worse. All these statistics are interrelated. Adding repetitive TUs to the already good balance might degrade performance for other sentences.
I don’t know of any academic studies that advocate what I’m about to propose, but I’ll propose it here. A suggested segment that scores an edit distance of zero means the engine succeeded. Therefore, the corpus doesn’t need more of these segment pairs. They should be excluded from the TUs you feed back into your inventory for retraining. It’s a revolutionary idea that academia somehow missed, or maybe it’s not so revolutionary and I just missed their studies.
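The exclusion rule proposed above can be sketched in a few lines of Python. This is only an illustration, not an SD feature: it assumes you can export each segment as a (source, suggestion, final translation) triple, it tokenizes on whitespace, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def segments_to_feed_back(segments):
    """Keep only (source, final) pairs whose suggestion needed a correction.

    `segments` is an iterable of (source, suggestion, final) strings.
    Pairs where the suggestion already matched the final translation
    (edit distance zero) are dropped from the retraining set.
    """
    kept = []
    for source, suggestion, final in segments:
        if edit_distance(suggestion.split(), final.split()) > 0:
            kept.append((source, final))
    return kept
```

Only the segments you actually corrected come back out, so those are the ones that would go into the inventory for the next rebuild.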
There are academic studies that discourage adding “close enough” post-edited segments to your inventory. For me, that’s a no-brainer. I disavow any post-editing because it always results in “close enough.” I believe (and sincerely hope) that our customers always use SD with the intention to proofread and correct the suggestions, not for “post-editing” as promoted by traditional MT experts.
Regarding SD updates, mixing, weighting and otherwise crafting the TM inventory is high on my priority list. Until then, it can be done, but we need to work with you to help you manually set up a workflow to include your crafted inventory. Here’s a hint of what’s in the 10-sentence workshop:
Problem: SD’s suggestions have source language tokens (words).
Cause: The source language vocabulary is missing from your TU pairs. It’s possible that it can also be present but rare. The underlying Moses uses advanced math and data techniques that ignore rare tokens to optimize memory usage.
Resolution:
- Update your terminology file. It’s an easy, fast and temporary fix.
- Add more TU pairs with the missing vocabulary to your TM inventory and rebuild your engine. One advantage of working with small TMs (70K to 150K?) is that fewer TU pairs are needed to have a pronounced effect. Collect (or even craft) 10 to 50 sentence pairs for each desired vocabulary pair. Make sure these sentence pairs have a variety of usage, grammar and syntax. *see below
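Before collecting new sentence pairs, it can help to check how often the problem vocabulary actually occurs on the source side of your TU pairs. The sketch below is one rough way to do that outside of SD. It assumes your TUs are available as (source, target) string pairs, handles single-word terms only, and uses whitespace tokenization. The `min_pairs` threshold of 10 simply mirrors the low end of the 10 to 50 range above; it is not a documented Moses cutoff.

```python
from collections import Counter


def source_term_counts(tu_pairs, terms):
    """Count how many TU pairs contain each term on the source side."""
    counts = Counter()
    for source, _target in tu_pairs:
        tokens = set(source.lower().split())
        for term in terms:
            if term.lower() in tokens:
                counts[term] += 1
    return {term: counts[term] for term in terms}


def missing_or_rare(tu_pairs, terms, min_pairs=10):
    """Return the terms that appear in fewer than `min_pairs` TU pairs."""
    counts = source_term_counts(tu_pairs, terms)
    return {term: n for term, n in counts.items() if n < min_pairs}
```

Terms reported with a count of zero are missing; terms with a small count are the “present but rare” case and are candidates for extra crafted sentence pairs.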
Problem: SD’s suggestions have the correct vocabulary in the wrong order.
Cause: The vocabulary is missing or rare in the target language data, or it occurs within fragments that contradict the order you want.
Resolution: Add more sentences with the missing target vocabulary to your LM inventory (manual process – ask for details). Then, rebuild your engine. Again, a few sentences can have a huge effect when working with smaller data. You can collect or craft 10 to 50 target language sentences for each target vocabulary item.
Problem: SD’s suggestions have the wrong vocabulary.
Cause: There are two possible causes:
- The desired target vocabulary is missing (or rare) from TU pairs and SD learned a different source-to-target mapping. In this case, the translation model creates a pool of candidate target sentences that don’t have the desired target vocabulary.
- The desired target vocabulary is missing (or rare) from your target language data. In this case, the translation model may be creating target sentences that go into the pool, but the language model scores them poorly and they’re never selected for the final translation.
Resolution: The resolution depends on the cause. To diagnose the most likely cause, review the suggestions output from a variety of source sentences that contain the source term that should generate the target term. If the target term never appears in the suggestions, then #1 is likely the cause. If the target term sometimes appears in the suggestions, but not where you want it, then #2 is likely the cause. There could be other corpus management tools to help you diagnose this problem, but they are not included in SD. Check eBay’s machine translation group on LinkedIn. They have published some good articles about corpus processing tools that I was not aware of.
- If #1 is the diagnosed cause, start by updating your terminology file. Then, acquire/craft more TU pairs and rebuild your engine. Adding collected/crafted LM data with the right term also helps.
- If #2 is the diagnosed cause, add collected/crafted LM data and rebuild.
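The never/sometimes test described above can be automated roughly as follows. This is a heuristic sketch, not an SD tool: it assumes you have saved a list of (source sentence, suggestion) pairs from test translations, it matches single-word terms after whitespace tokenization, and the function name and return labels are hypothetical.

```python
def diagnose_wrong_vocabulary(samples, source_term, target_term):
    """Guess which cause is more likely from a set of test suggestions.

    `samples` is a list of (source_sentence, suggestion) string pairs.
    Returns "cause_1" if the target term never appears, "cause_2" if it
    appears only sometimes, "ok" if it always appears, and "no_data" if
    no sample contains the source term.
    """
    src = source_term.lower()
    tgt = target_term.lower()
    # Only look at suggestions whose source sentence contains the term.
    relevant = [sugg for source, sugg in samples
                if src in source.lower().split()]
    if not relevant:
        return "no_data"
    hits = sum(tgt in sugg.lower().split() for sugg in relevant)
    if hits == 0:
        return "cause_1"  # term likely missing or rare in the TU pairs
    if hits < len(relevant):
        return "cause_2"  # term reachable, but the LM often scores it down
    return "ok"
```

The result is only a starting point. A small or unvaried sample can point at the wrong cause, so test with as many different source sentences as you can.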
A final note about LM data. Be very careful with anything you learn by reading the traditional SMT experts about LM or monolingual corpora. These sources generally advocate collecting large quantities of monolingual data by scraping Internet websites for target language text in related subject domains. They explain this works because it’s easier to collect monolingual data than paired TU data. Isn’t that logic reversed? It works because it’s easy? Yes, they collect unpaired target language data because it’s easier. Unfortunately, that ease significantly increases the risk that those segments are not related to your personal translation style.
It’s impossible to overstate how powerful the language model is when creating the final suggestion. If the wrong vocabulary usage and style predominate in your LM data, they will dominate your results, even to the extreme of selecting the least likely result offered by your paired TM data. When working with small TMs, you’ll get better results by hand-selecting or hand-crafting a set of sentences with your desired usage.
Ultimately, the only real test is building the engine and testing its performance. I.e. the proof of the pudding is in the eating.