Lab of Advanced Algorithms and Applications
HighlightsWe are developing a new semantic-annotation technology for short textual fragments, called TAGME [ACM CIKM 2010, IEEE Software 2012]. This tool has been applied succesfully to many contexts concerning with the clustering, the classification and the similarity-comparison of short texts. We are also studying the design of of “Multi-objective” data compressor, a novel paradigm in which optimization techniques are plugged into the design of data compressors. For example, in the Bicriteria Data Compression ([ACM-SIAM SODA14], to appear), the compressor solves the following problem: "return a compressed file which minimizes the compressed space provided that the decompression time is less than T". Similarly, the compressor can "return a compressed file which minimizes the decompression time provided that the compressed space is less than S bits". A third line of research is a classic one for our group lasting for 20 years now and regarding the design of compressed indexes for big data (such as String B-tree [JACM 1999] and FM-index [JACM 2005]). Recently we have designed cache-oblivious compressed version of those indexes for dictionaries of strings [ESA 2013], and distribution-ware compressed indexes for data collections [Algorithmica 2013].
- Google Zurich
- Tiscali Italia
- [2013-2015] Regione Toscana, Net7, StudioFlu, SpazioDati
- [2013-2014] Bassilichi
- [2013-2014] Google Research Award
- [2012-2014] Italian MIUR-PRIN Project "ARS-Technomedia"
- [2011-2012] Telecom Italia Working Capital
-  Google Research Award
- [2010-2012] Italian MIUR-PRIN Project "The Mad Web"
- [2006-2011] Yahoo! Research
- [2009-2012] Italian MIUR-FIRB Project "Linguistica"
WikipediaSimilarity411 dataset released » January 23, 2014
Darth Vader is closely related to the Death Star, but totally unrelated to Homeopathy. A relatedness function is a function that, given two Wikipedia pages, returns their relatedness in a [0, 1] range. Many disambiguation techniques are based on a relatedness function.
But how would you measure a relatedness function? A possible way is, given a set of Wikipedia-pages pairs, to ask humans the relatedness between those pairs of pages, and check the output of the relatedness function against the relatedness found by humans.
Such a dataset is Milne&Witten’s WikipediaSimilarity353 dataset. Unfortunately, since Wikipedia is ever-growing, the dataset became obsolete, pointing to pages that don’t exist anymore. To address this issue, we cleaned it, updated a few references, and added a few pairs of pages.
We created a brand-new WikipediaSimilarity411 dataset anybody can use, which provides 411 Wikipedia-pages pairs. Please also note that the BAT-Framework has been updated and features the test of a relatedness function against this dataset.
“Bicriteria Data Compression” Accepted at SODA14! » December 1, 2013
We are pleased to announce that our paper “Bicriteria Data Compression” has been accepted at the ACM/SIAM Symposium On Discrete Algorithms (SODA) 14! The work will be illustrated on January 7, 4PM, Galleria North – Ballroom Level, Hilton Portland & Executive Tower.
TAGME service reaches 100M queries! » November 12, 2013
We are pleased to announce that our TAGME API service has been hit more than 100 millions in about 2 years. We are currently providing access for more than 100 users and sometimes the service has been able to handle more than 1 million queries per day.
Thank you very much to all users that have provided their valuable feedback.
bc-zip dataset » October 3, 2013
Here we publish the dataset used in our “Bicriteria Data Compression” paper (arxiv version).
Each file is a chunk of 1GB (2^30 bytes) extracted from the following sources:
- Wikipedia: Natural data. Extracted from a dump of English wikipedia, in XML format.
- U.S. Census: Database, statistical metrics of the U.S. population. Extracted from the U.S. Census database.
- DBLP: Bibliographic database in XML format. Extracted from the The DBLP Computer Science Bibliography project.
- PFAM: Biological data. Extracted from the PFAM database of of protein families.