With my teammates Jean-Marie, Ahmed and Owen, we proudly won the first fHACKtory hackathon by implementing a system whose origins go back to my PhD thesis, completed in 2004. The idea behind TESS is that many of the searches we make on the web return poorly adapted results because of the way things are indexed. Search engines search for documents (because that is what is indexed), when often what we are really looking for are objects. This is why a search for “software engineer” doesn’t return software engineers. Instead, it returns documents in which the words “software” and “engineer” appear. For such queries to be answered more as expected, the idea is to change what we index: objects (in a broad sense, and people count as objects in this scope) rather than documents.
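To make the difference concrete, here is a minimal sketch in Python. It is a toy illustration, not the TESS code: the names `ObjectRecord`, `build_document_index` and `build_object_index`, as well as the sample data, are assumptions made up for this example. It contrasts a classic term-to-document index with an index whose entries are object records.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_document_index(documents):
    """Classic inverted index: each term maps to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_documents(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


@dataclass
class ObjectRecord:
    """Hypothetical object record: a typed entity with named attributes."""
    type: str
    name: str
    attributes: dict = field(default_factory=dict)


def build_object_index(records):
    """Object index: each normalized attribute value maps to the records having it."""
    index = defaultdict(list)
    for record in records:
        for value in record.attributes.values():
            index[" ".join(tokenize(value))].append(record)
    return index


def search_objects(index, query):
    """Return the records that have an attribute value equal to the query."""
    return index.get(" ".join(tokenize(query)), [])


# Made-up sample data, for illustration only.
documents = {
    "job-ad": "We are hiring a software engineer in Lyon.",
    "blog": "Why every engineer should read about software testing.",
}
records = [
    ObjectRecord("person", "Alice", {"job": "Software Engineer"}),
    ObjectRecord("person", "Bob", {"job": "Gardener"}),
]

doc_index = build_document_index(documents)
obj_index = build_object_index(records)

print(search_documents(doc_index, "software engineer"))  # documents mentioning both words
print(search_objects(obj_index, "software engineer"))    # people who are software engineers
```

With the document index, both sample texts match simply because the two words appear in them. With the object index, the same query returns the record of a person whose job is software engineer, which is closer to what was actually being asked.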
The difficulty is that this requires a little more work when processing documents while crawling the web. The task is no longer simply to extract the terms of the documents: we now need to identify the type of object(s), if any, and to wrap the appropriate documents into objects (or rather object records). That is the job of a wrapper (or scraper). Unfortunately, there is no straightforward way to build a wrapper that works on any site, so the idea is to include some machine learning to do both the page identification job and the object record extraction job.
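The two jobs of a wrapper can be sketched as follows. This is only an illustration under strong assumptions: the cue-word table `TYPE_CUES` and the "Label: value" extraction rule are hand-written stand-ins for the learned models mentioned above, and every name in the snippet is hypothetical rather than taken from TESS.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the text chunks of a page, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical cue words per object type. In a real system, a trained
# classifier would replace this hand-written table.
TYPE_CUES = {
    "person": {"profile", "experience", "skills", "job"},
    "product": {"price", "cart", "shipping"},
}


def identify_page_type(chunks, min_score=2):
    """Page identification: pick the type whose cue words appear most often."""
    words = set(re.findall(r"\w+", " ".join(chunks).lower()))
    scores = {t: len(words & cues) for t, cues in TYPE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


def extract_record(chunks, page_type):
    """Record extraction: turn 'Label: value' chunks into attributes."""
    attributes = {}
    for chunk in chunks:
        match = re.match(r"(\w[\w ]*?)\s*:\s*(.+)", chunk)
        if match:
            attributes[match.group(1).lower()] = match.group(2)
    return {"type": page_type, "attributes": attributes}


def wrap(html):
    """Run both steps; return None when the page describes no known object."""
    chunks = page_text(html)
    page_type = identify_page_type(chunks)
    if page_type is None:
        return None
    return extract_record(chunks, page_type)


# Made-up page, for illustration only.
sample_page = (
    "<html><body><h1>Profile</h1>"
    "<p>Name: Alice Martin</p>"
    "<p>Job: Software engineer</p>"
    "<p>Skills: Python, search</p>"
    "</body></html>"
)
print(wrap(sample_page))
```

Hand-written rules like these break as soon as a site lays out its pages differently, which is exactly why both steps are candidates for machine learning.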
During the hackathon we did a little bit of all this, and you can try out the result. With Jean-Marie, Ahmed and Owen, and maybe some additional students from INSA Lyon, we’ll keep working on this project, make it evolve, and see what happens!
Here is the presentation of TESS (in French) that we gave at the first Blend Web Mix conference.
[niceyoutubelite id=”Y2r9-0Lhdus” end=”50”]