Scraping Techniques to Extract Advertisements from Web Pages @ EuroPython 2011


On January 22, 2011, Mirko Urru contacted me about his thesis, and on June 24 at 14:30 we presented his work together at EuroPython: a very nice experience!


Slides - Scraping Techniques to Extract Advertisements from Web Pages



Online Advertising is an emerging research field at the intersection of Information Retrieval, Machine Learning, Optimization, and Microeconomics. Its main goal is to choose the right ads to present to a user engaged in a given task. Its two main forms are Sponsored Search Advertising and Contextual Advertising. The former puts ads on the page returned by a Web search engine in response to a query. The latter puts ads within the content of a generic, third-party Web page. The ads themselves are selected and served by automated systems based on the content displayed to the user.

Web scraping is the set of techniques used to automatically extract information from a website instead of copying it manually. In particular, we’re interested in studying and adopting scraping techniques for: (i) accessing tags as object members; (ii) finding tags whose name, contents or attributes match selection criteria; (iii) accessing tag attributes by using a dictionary-like syntax.
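To make the three techniques listed above concrete, here is a minimal sketch using only the Python standard library. The `Node`, `TreeBuilder` and `parse` names are hypothetical and were written for this post; they are not the code from the talk. Libraries such as Beautiful Soup offer the same three features in a far more robust form. The sketch assumes reasonably well-formed HTML and stores only the text found directly inside each tag.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A parsed tag: its name, attributes, child tags and direct text."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []
        self.text = ""

    def __getattr__(self, tag_name):
        # (i) Tags as object members: node.a is the first <a> descendant.
        # Python calls this only when normal attribute lookup fails.
        if tag_name.startswith("_"):
            raise AttributeError(tag_name)
        found = self.find_all(name=tag_name)
        return found[0] if found else None

    def __getitem__(self, key):
        # (iii) Dictionary-like access to tag attributes: node["href"].
        return self.attrs[key]

    def find_all(self, name=None, text=None, attrs=None):
        # (ii) Descendants whose name, contents or attributes match.
        wanted = attrs or {}
        matches = []
        for child in self.children:
            if ((name is None or child.name == name)
                    and (text is None or text in child.text)
                    and all(child.attrs.get(k) == v for k, v in wanted.items())):
                matches.append(child)
            matches.extend(child.find_all(name, text, attrs))
        return matches


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = parse('<html><body><div id="ads">'
             '<a href="http://example.com/offer" rel="nofollow">Cheap flights</a>'
             '</div><a href="/about">About</a></body></html>')

first_link = page.a                                    # (i) member access
flight_links = page.find_all(name="a", text="flights")  # (ii) by name and contents
nofollow = page.find_all(attrs={"rel": "nofollow"})     # (ii) by attribute
href = first_link["href"]                               # (iii) dictionary syntax
```

Member access returns `None` when no such tag exists, which keeps exploratory code short at the cost of hiding typos in tag names.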

In this talk, we focus on the adoption of scraping techniques in the contextual advertising field. In particular, we present a system aimed at finding the most relevant ads for a generic web page p. Starting from p, the system selects a set of its inlinks (i.e., the pages that link to p) and extracts the ads contained in them. Selection is performed by querying the Google search engine, whereas extraction is done using suitable scraping techniques.
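The two stages described above (select the inlinks of p, then extract the ads they contain) can be sketched as follows. This is an illustration under stated assumptions, not the system presented in the talk: the search-engine query is not implemented and is replaced by an injected `find_inlinks` callable, page download is replaced by an injected `fetch` callable, and an ad is assumed to be any link whose host appears in the hypothetical `AD_HOSTS` set.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical ad-server hosts, used only for this illustration.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


class AdLinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs for links that point to AD_HOSTS."""

    def __init__(self):
        super().__init__()
        self.ads = []
        self._href = None   # href of the ad link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if urlparse(href).hostname in AD_HOSTS:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.ads.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_ads(html):
    parser = AdLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.ads


def ads_for_page(page_url, find_inlinks, fetch):
    """Gather ads from every inlink of page_url.

    find_inlinks(url) returns the URLs of pages linking to url;
    fetch(url) returns the HTML of a page. Both are supplied by the caller.
    """
    ads = []
    for inlink in find_inlinks(page_url):
        ads.extend(extract_ads(fetch(inlink)))
    return ads


# In-memory stand-ins for the search engine and the network.
PAGES = {
    "http://blog.example.org/post":
        '<p>See <a href="http://p.example.org/">p</a></p>'
        '<a href="http://ads.example.com/click?id=1">Buy hiking boots</a>',
    "http://news.example.org/item":
        '<a href="http://adserver.example.net/c/7"> Mountain tours </a>'
        '<a href="/home">Home</a>',
}

ads = ads_for_page("http://p.example.org/",
                   find_inlinks=lambda url: sorted(PAGES),
                   fetch=PAGES.__getitem__)
```

Passing `find_inlinks` and `fetch` as arguments keeps the extraction logic testable offline; a real deployment would plug in an actual inlink lookup and an HTTP client.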

More info on the talk page at EuroPython (webarchive).