A printer that automatically reformats pages to reduce clutter could save on ink and make printouts easier to read. But would you accept a few additional ads on your document in return? Researchers at HP’s labs think so.
“By some statistics, almost half of the printouts on HP’s printers come from the Web, but the experience is really terrible compared to office or PDF documents,” says Parag Joshi, a member of HP’s Multimedia Interaction and Understanding Lab in Palo Alto, California. Joshi and his colleague Sam Liu led the development of software that extracts the pertinent text and images from an online article and discards the rest of the page, making for a much cleaner printout. The system removes clutter such as navigation elements or online ads, so less ink and paper are needed.
At the same time, the system can insert advertisements chosen to match the content. The software “is a way HP could generate additional revenue,” says Liu, “but it can also provide a better experience to the user and save them money.”
The ads that appear on the printed version are chosen by algorithms that scan an article’s content, and they can be designed to better suit the printed page, says Joshi. For example, they may be more image-centric or include coupons to be taken to a store for discounts. Both tactics are more common in print advertising than online.
To determine which parts of a Web page to keep and which to discard, the HP software first renders the page in the same way as a Web browser. It then analyzes how text and images are spread across different sections of the page to extract the core text and images. Several clues make it possible to accurately exclude everything but the content of an article, says Liu. For example, advertisements are often labeled as such and lack captions, which makes them easy to spot.
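The kind of clue Liu describes can be illustrated with a short sketch. This is not HP's algorithm, which the article does not detail. It is a common readability-style heuristic that assumes the page has already been rendered and split into blocks. The `Block` fields, the thresholds, and the label keywords are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One region of a rendered page (hypothetical structure for this sketch)."""
    kind: str            # "text" or "image"
    text: str = ""       # visible text, or the caption for an image
    link_chars: int = 0  # how many of the text characters sit inside links
    label: str = ""      # nearby label such as "Advertisement", if any


def is_content(block: Block, min_chars: int = 80,
               max_link_ratio: float = 0.3) -> bool:
    """Guess whether a block belongs to the article body.

    The thresholds are illustrative assumptions, not values from HP.
    """
    label = block.label.lower()
    # Blocks explicitly labeled as ads are dropped.
    if "advert" in label or "sponsor" in label:
        return False
    # Images are kept only if they carry a caption.
    if block.kind == "image":
        return bool(block.text.strip())
    # Short text blocks are treated as navigation or other clutter.
    length = len(block.text.strip())
    if length < min_chars:
        return False
    # Text made up mostly of links is treated as navigation.
    return block.link_chars / length <= max_link_ratio


def extract_article(blocks: List[Block]) -> List[Block]:
    """Keep only the blocks that look like article content, in page order."""
    return [b for b in blocks if is_content(b)]
```

Given a page with a navigation bar, a long paragraph, a captioned photo, and a banner labeled "Advertisement", `extract_article` would return only the paragraph and the photo. A production system would need far more signals than these, such as the position and size of each block on the rendered page.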
This aspect of the software is similar to the workings of browser plugins like Readability (now built into Apple’s Safari browser), which strip away everything but the body text of a page and present it in a clean, easy-to-read layout. But HP’s system also preserves relevant images and has to do the extra work of formatting the printed page and inserting new advertisements.
Selecting the right ads for printouts involves extracting meaning from the text. “Once we identify the main content, we use machine learning to find matching semantic categories,” says Liu. Ads relevant to those categories are then selected for insertion into the document.
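The two-step flow Liu outlines, mapping text to categories and then categories to ads, might look roughly like the sketch below. HP uses machine learning for the first step. This example substitutes simple keyword counting as a stand-in so that it stays self-contained. The category names, keyword lists, and ad texts are invented for illustration.

```python
import re
from collections import Counter
from typing import List

# Hypothetical categories and keywords; a real system would learn these.
CATEGORY_KEYWORDS = {
    "cooking": {"recipe", "oven", "bake", "flour", "kitchen"},
    "technology": {"software", "printer", "browser", "laptop", "phone"},
    "travel": {"flight", "hotel", "airline", "beach", "tourist"},
}

# Hypothetical ad inventory, one ad per category.
ADS_BY_CATEGORY = {
    "cooking": "Coupon: 15% off bakeware",
    "technology": "Coupon: discount on ink cartridges",
    "travel": "Coupon: 10% off hotel bookings",
}


def categorize(text: str, top_n: int = 1) -> List[str]:
    """Return up to top_n categories whose keywords appear in the text.

    Keyword counting stands in for the machine-learning classifier
    mentioned in the article.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(words[w] for w in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    # Rank by score (highest first), breaking ties alphabetically.
    matched = [c for c, score in scores.items() if score > 0]
    matched.sort(key=lambda c: (-scores[c], c))
    return matched[:top_n]


def select_ads(text: str, top_n: int = 1) -> List[str]:
    """Pick the ads that belong to the article's top categories."""
    return [ADS_BY_CATEGORY[c] for c in categorize(text, top_n)]
```

Calling `select_ads` on the extracted text of a baking article would return the bakeware coupon, while text that matches no category yields an empty list, so no ad is inserted.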
The final layout is currently chosen from a small set of broadly similar templates that arrange an article into columns. “They produce documents that look like a news magazine,” says Liu. One feature in the works would automatically combine several articles to save paper, instead of printing them individually. “You could subscribe to an RSS feed and have a small magazine generated automatically, or combine articles from different sites,” says Joshi.