AI: The Power of the Masses
Doug Lenat started Cyc with the idea of gathering a massive amount of data from newspapers, magazines, etc. The plan: gather little bits of knowledge about the human condition, link them all up, and hope that some reasoning system can make use of them to produce human-like behavior. Since you probably haven't heard of Cyc, you'd guess that the technology hasn't reached its original goal. But the more I think about the web and companies like Google, the more I think that the basic idea behind Cyc has come of age. The web is being used to solve tough problems by intelligently processing huge amounts of data. The case study for this is Google.
I came across an old blog entry on the GooOS, the Google Operating System (via yubnub.com, Jon Aquino's excellent Rails Day entry). If you listen to Google tech heads talk about what they're doing with their uber-computer, you'll hear how they are harvesting real-world knowledge through analysis of web documents. Because there's so much stuff out there, they can be choosy about which sentences they extract knowledge from -- what the reputation of the source is, how easily a particular sentence can be parsed, how likely it is to give a nugget of the author's "truth", etc. You can see the results whenever you search for "JEK assassination" and Google helpfully asks if you really mean "JFK."
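To make the "did you mean" idea concrete, here's a minimal sketch in Python. It is not Google's actual method, just the simplest version of the principle: count how often words show up in a body of text, then suggest the most common word that is one typo away from the query. The tiny corpus string and the names edits1 and suggest are made up for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, counts):
    """Suggest the most frequent known word within one edit of word."""
    word = word.lower()
    if word in counts:
        return word  # already a word the corpus knows
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better to offer
    return max(candidates, key=lambda w: counts[w])


# Hypothetical stand-in for "the web": a few sentences of text.
corpus = (
    "jfk assassination in dallas . "
    "the jfk assassination was investigated . "
    "jfk was president"
)
counts = Counter(corpus.split())

print(suggest("JEK", counts))
```

The point is that nobody hand-wrote a rule saying "JEK" should be "JFK"; the suggestion falls out of the word counts. Feed it more text and it gets better without anyone touching the code.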
Language translation, one of those extremely difficult AI problems, can also benefit from processing a large number of documents by reputable authors. Example #2: Google's use of United Nations documents to power a translation system. It's amazing how researchers are able to tap into the knowledge embedded in the web, and now in material off the web as well. Aside from UN documents, Google will also digitize a vast collection of library books, which should give their knowledge extraction systems high-quality historical data. Makes me wonder if the system that finally passes the Turing Test will be suckled on the milk of repurposed human books and documents.