September 1, 2005

Autonomous language learning

I was hoping David would write this up, but I’ll do my best. He said he was going to try to get the full text of the paper. I hope he does and decides to write an article on it. (*nudge* *nudge*)

Researchers have created a program that can infer the rules of grammar from raw text in just about any language. It can do the same for music, gene sequences, and protein data.

“This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical new sentences and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics,” he said.

Unlike previous attempts at developing computer algorithms for language learning, the new method, called Automatic Distillation of Structure (ADIOS), successfully identifies complex patterns in raw texts. The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.

For example, the sentences “I would like to book a first-class flight to Chicago,” “I want to book a first-class flight to Boston,” and “Book a first-class flight for me, please” may give rise to the pattern “book a first-class flight,” if this candidate pattern passes the novel statistical significance test that is the core of the algorithm.
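To make the "overlapping parts" idea concrete, here is a rough Python sketch of how a candidate pattern could be pulled out of those three sentences. This is my own toy illustration, not the ADIOS algorithm: the function names (`tokenize`, `candidate_patterns`) are made up, and it simply keeps the longest word sequences shared by every sentence. The statistical significance test, which the article says is the core of the real method, is left out entirely.

```python
from collections import Counter


def tokenize(sentence):
    # Lowercase and drop commas/periods; hyphenated words stay whole.
    return sentence.lower().replace(",", " ").replace(".", " ").split()


def ngrams(tokens, n):
    # All contiguous word sequences of length n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains(longer, shorter):
    # True if `shorter` occurs as a contiguous run inside `longer`.
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def candidate_patterns(sentences, min_len=2):
    # Toy stand-in for pattern discovery: word sequences that occur in
    # every sentence, reduced to the maximal (longest) ones.
    token_lists = [tokenize(s) for s in sentences]
    max_len = min(len(t) for t in token_lists)
    counts = Counter()
    for tokens in token_lists:
        seen = set()
        for n in range(min_len, max_len + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)  # count each sequence once per sentence
    shared = [g for g, c in counts.items() if c == len(token_lists)]
    maximal = [g for g in shared
               if not any(len(h) > len(g) and contains(h, g) for h in shared)]
    return sorted(maximal, key=len, reverse=True)


sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
]
print(candidate_patterns(sentences))
# [('book', 'a', 'first-class', 'flight')]
```

In the real system a candidate like this would only be kept if it passed the significance test; here anything shared by all the sentences counts.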

If the system also encounters the sentences “I need to book a direct flight from New York to Tel Aviv” and “I would like to book an economy flight,” it may infer that the phrases “first-class,” “direct” and “economy” are equivalent in the context of the new pattern. “Because such equivalence sets can contain other patterns (in turn containing further patterns, and so on), the resulting body of knowledge grows recursively, as a sort of forest of branching trees of possibilities,” said Edelman.
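The equivalence idea can be sketched the same way. Again, this is only my guess at the flavor of it, not the published method: the template format and the names `match` and `slot_fillers` are hypothetical. A template is a sequence of literal words, sets of interchangeable words (an equivalence set nested inside the pattern), and one open slot marked `None`. Whatever words turn up in the open slot across the sentences are collected as a candidate equivalence set.

```python
def tokenize(sentence):
    # Lowercase and drop commas/periods; hyphenated words stay whole.
    return sentence.lower().replace(",", " ").replace(".", " ").split()


def match(window, template):
    # Return the word filling the open slot if the window fits the
    # template, otherwise None.
    filler = None
    for word, slot in zip(window, template):
        if slot is None:
            filler = word            # the open slot accepts any word
        elif isinstance(slot, (set, frozenset)):
            if word not in slot:     # nested equivalence set
                return None
        elif word != slot:           # literal word must match exactly
            return None
    return filler


def slot_fillers(sentences, template):
    # Collect every word seen in the open slot across all sentences.
    fillers = set()
    width = len(template)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for start in range(len(tokens) - width + 1):
            filler = match(tokens[start:start + width], template)
            if filler is not None:
                fillers.add(filler)
    return fillers


sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
    "I need to book a direct flight from New York to Tel Aviv",
    "I would like to book an economy flight",
]
# "book", then "a" or "an", then an open slot, then "flight"
template = ("book", {"a", "an"}, None, "flight")
print(sorted(slot_fillers(sentences, template)))
# ['direct', 'economy', 'first-class']
```

Note that the `{"a", "an"}` entry is itself a small equivalence set sitting inside the pattern, which is a tiny version of the recursion Edelman describes.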

He added, “ADIOS relies on a statistical method for pattern extraction and on structured generalization — two processes that have been implicated in language acquisition. Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents’ speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue.”

The word “unsupervised” sounds dangerous, almost Skynet-like, but I think we’re still quite a ways from that. I bet such an algorithm could be adapted to scan for malicious code, viruses, or almost anything else. I could see implications when it comes to anti-terrorism and plain old espionage. How about a computer that could realistically “learn” various languages and be taught to scan for certain types of conversations? A higher-level, more abstract understanding of language could certainly flag fewer false positives and be more accurate than keyword density scans of random conversation. Of course, I’m just speculating on how agencies like the CIA, NSA, and FBI actually do this information processing; they probably have other, better ways of doing it.

Getting more pragmatic, I bet this could even be used to build more capable spam filters and firewalls. Since the algorithm doesn’t seem to care whether its input is written language or binary data, it could conceivably scan and learn anything, and check for any sort of anomaly.

I would think that the next step is moving from passive information processing to active interaction with another human being or problem: how long before a program using this algorithm can solve mathematical word problems, or perhaps even pass the Turing test? I think this is a huge step in both of those directions, but I remain leery of the theoretical capabilities of such an algorithm being used for unethical purposes.

| 7:15 pm |
