As I read our first Slate Desktop support ticket, one of the comments struck me...
build a test engine based on one large TM (as an easy start)...
I was reminded of a comment from Kenneth Heafield, the brilliant computational linguist who wrote the pivotal language modeling component inside Moses. I had written to him with a question about configuring his open source tool for different sizes of data. I explained that we expect our customers (translators) will
have LM corpora averaging 20 million tokens
Essentially, "tokens" are "words". In a translator's world, I think 20 million words is a respectable, possibly "large" size. Ken's reply,
... 20 million is tiny and cute...
Hmm, clearly there's a disconnect. You see, typical computational linguists and MT experts live in a world that counts sentences by the billions and words by the trillions... really. I'm not joking. From Ken's perspective working with trillions, "tiny and cute" really does describe 20 million words.
So as our new community moves forward, it will help if we describe our experiences with objective measures rather than subjective labels like "large" or "tiny". These two measures will be most helpful as we talk about our TM experiences:
- token (word) count in your TMs
- segment (sentence) count in your TMs
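
If you would like to get these two numbers yourself, here is a minimal sketch in Python. It rests on assumptions that are mine, not anything Slate Desktop does internally: the TM has been exported as a TMX file, a "token" is approximated by splitting text on whitespace, and one language's `<seg>` elements are counted as the segments. The function name `count_tm` and the `lang_prefix` parameter are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_tm(tmx_source, lang_prefix="en"):
    """Return (segment_count, token_count) for one language in a TMX export.

    tmx_source may be a file path or an open file object.
    Tokens are approximated by whitespace splitting (an assumption),
    so the result is a rough word count, not a linguistic tokenization.
    """
    segments = 0
    tokens = 0
    tree = ET.parse(tmx_source)
    for tuv in tree.iter("tuv"):
        # TMX 1.4 uses xml:lang; some older exports use a plain lang attribute.
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if not lang.lower().startswith(lang_prefix.lower()):
            continue
        seg = tuv.find("seg")
        if seg is None:
            continue
        # itertext() also collects text that sits around inline tags.
        text = "".join(seg.itertext())
        segments += 1
        tokens += len(text.split())
    return segments, tokens
```

Calling `count_tm("my_memory.tmx", "en")` (the file name is a placeholder) returns the segment count and the approximate token count for the English side. Other tools may tokenize differently, so treat the token figure as a ballpark number rather than an exact match for what any particular engine reports.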