The DTL Grants support research in tools, data, platforms and methodologies for shedding light on the use of personal data by online services, and for empowering users to take control of their personal data online.
The winners of the first DTL research grants are listed below. Each project received a lump sum of €50,000. Click here for more information on the call for proposals.
The remaining proposals that placed in the top third of all submissions were awarded a platform to present their work at the DTL2015 Conference, along with a corresponding travel grant. Click here to view these proposals.
Online behavioral advertising (OBA), the targeting of advertisements based on a user's web browsing, remains a major source of privacy invasion. Although a number of privacy tools (e.g., Ghostery, Lightbeam, and Privacy Badger) can help users control OBA, average users are left utterly confused about OBA even after using such tools. We propose moving beyond existing tools, which alert users to tracking occurring at the current moment, by designing and testing a tool that takes a data-driven, personalized approach to privacy awareness. We hypothesize that users can better understand OBA and resultant privacy threats if equipped with a tool that visualizes instances of them being tracked over time.
We will build and test such a data-driven privacy tool that enables users to explore precisely on which webpages different companies have tracked them, as well as what those companies may have inferred about their interests. Studies have shown benefits in notifying users about the collection of data by smartphone apps. Our proposal translates these insights to the OBA domain, yet makes further intellectual contributions by exploring the impact of presenting different abstractions and granularities of the information tracked (e.g., showing "Doubleclick knows you visited the following 82 pages" versus "Doubleclick has likely concluded that you like 'European travel' based on your visits to these 82 pages"). In addition to releasing our privacy tool as a fully functional, open-source project, we will conduct a 75-participant, 2-week field trial comparing visualizations of personalized tracking data.
The combination of rich sensors and ubiquitous connectivity makes mobile devices perfect vectors for invading the privacy of end users. We argue that improving privacy in this environment requires trusted third-party systems that enable auditing and control over PII leaks. However, previous attempts to address PII leaks fall short of enabling such auditing and control because they face two challenges: a lack of visibility into the network traffic generated by mobile devices, and an inability to control that traffic.
The proposed research will enable the auditing and control of PII leaks in network traffic from mobile devices, using indirection to improve both visibility and control. Specifically, we use natively supported mobile OS features to redirect all of a device’s Internet traffic to a trusted server that identifies and controls privacy leaks in network traffic.
We will address the key challenges of how to identify and control PII leaks when users’ PII is not known a priori, nor is the set of apps that leak this information. First, to enable auditing through improved transparency, we will investigate how to use machine learning to reliably identify PII from network flows, and identify algorithms that incorporate user feedback to adapt to the changing landscape of privacy leaks. Second, we will build tools that allow users to control how their information is (or not) shared with second and third parties. These tools will be deployed as free, open-source applications that can run in a number of deployment scenarios, including on a device in a user’s home network, or in a shared cloud-based VM environment.
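As a minimal illustration of the auditing step, the sketch below handles the simpler case in which the user's PII values are already known, flagging query-string parameters that carry them; the proposal's machine-learning component targets the harder case where PII is not known a priori. All names, values, and the URL are hypothetical.

```python
# Hypothetical sketch: flag query-string keys whose values match
# user-supplied PII samples (e.g., email addresses, device IDs).
# Assumes HTTP flows are visible at the trusted server.
from urllib.parse import urlparse, parse_qsl

def find_pii_leaks(url, pii_values):
    """Return (key, value) pairs in the URL query that carry known PII."""
    leaks = []
    for key, value in parse_qsl(urlparse(url).query):
        if value.lower() in pii_values:
            leaks.append((key, value))
    return leaks

pii = {"alice@example.com", "a1b2c3d4-device-id"}
url = "http://ads.example.net/track?uid=a1b2c3d4-device-id&v=2"
print(find_pii_leaks(url, pii))  # [('uid', 'a1b2c3d4-device-id')]
```

A deployed auditor would additionally handle PII embedded in headers, POST bodies, and obfuscated encodings, which is where the proposed learning-based identification comes in.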
A recent report by the Interactive Advertising Bureau revealed that online advertising generated $49.5B in revenue in 2014 in the US alone, a 16% increase over 2013, which in turn had exceeded 2012 revenue by 17%. A great advantage of online advertising over traditional print and TV advertising is its ability to target individuals with specialized advertisements tailored to their personal information. For instance, Facebook's (FB) ad campaign planner allows an audience to be defined using more than 13 different attributes related to the end user's personal information. An online advertiser can therefore launch a campaign targeting a well-defined audience based on personal-information attributes, and an important part of the FB business model is thus built on top of the personal information of its subscribers. Although there is no doubt about the legality of the business model implemented by FB and other major Internet players, some actors are calling for tools that let end users know the actual value of their personal information; in other words, how much money FB, Google, and other companies in the online advertising market make from it. Providing Internet users with simple and transparent tools that inform them of the value their personal data generates is not only a civil-society request, but also a demand from governmental bodies.
The goal of this project is to develop a tool that informs Internet end users, in real time, of the economic value generated by the personal information associated with their browsing activity. Due to the complexity of the problem, we narrow the scope of this tool to FB, i.e., we inform FB users in real time of the value they are generating for FB. We refer to this tool as the FB Data Valuation Tool (FDVT).
Our online browsing history is intensely personal. Our search terms and the webpages we visit reveal our fears, interests, illnesses, and secret ambitions. While many people are familiar with the concept of behavior tracking and cookies, there is significantly less public awareness of just how personal our online behavior is.
A few years ago, the immersion project originating at the MIT Media Lab received world-wide press coverage by visualizing the latent social information contained in our email header information. We aim to do something similar for web-browsing. Using topic models, we aim to design a simple dashboard that allows individuals to visualize the content of their browsing, and observe how these topics change over time. Crucially, we will combine this visualization with information on data trackers (how many tracking parties, how much outgoing information), thus allowing users to directly observe what the data tracking means for them.
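A toy stand-in for the dashboard's per-period summary might surface the dominant terms in each time bucket of browsing titles; the project itself proposes topic models, which generalize this idea. The data and names below are purely illustrative.

```python
# Illustrative sketch: top terms per time bucket from page titles.
# A real dashboard would use a topic model (e.g., LDA) rather than
# raw term frequencies; the history data here is made up.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "at"}

def top_terms(titles, k=3):
    counts = Counter(
        w for t in titles for w in t.lower().split() if w not in STOPWORDS
    )
    return [w for w, _ in counts.most_common(k)]

history = {
    "week-1": ["Cheap flights to Rome", "Rome travel guide", "Rome hotels"],
    "week-2": ["Symptoms of flu", "Flu treatment at home"],
}
for week, titles in history.items():
    print(week, top_terms(titles))
```

Pairing each bucket's summary with per-bucket tracker counts would then let a user see which topics were visible to how many third parties.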
Collected as well as computed data will be stored in safe, individualized ‘vaults’ in a storage system following the OpenPDS framework specification, thereby ensuring strict user sovereignty over the data.
Personal information loss has been a worrisome issue for researchers and regular users alike. Even though a great deal of research has been done in both the security and privacy communities, a personalized solution that addresses problems in both areas and is useful to end users is missing. In this work, we present Appu, a browser extension that automatically detects i) sensitive information of the user, ii) whether it is sufficiently secured, and iii) whether it is being leaked to third-party domains.
To automatically detect users’ sensitive information, we developed a scripting language to scrape this information from users’ existing accounts. Once the personal information store is populated, Appu passively monitors the user’s interaction with various accounts to detect further information spread. Appu also monitors whether any personal information is leaked to third parties. Over time, Appu presents the user with a complete picture of their personal information spread across the web. Appu also nudges the user to secure important but inadequately protected accounts.
The third-party online tracking ecosystem lacks transparency about (1) which companies track users, (2) what user data is being collected, (3) what technologies are being used for tracking, and (4) data flows between trackers. Automated measurement can enable transparency and has already resulted in greater privacy awareness, improved privacy tools, and, at times, regulatory enforcement actions.
At Princeton we have built OpenWPM, a platform for online tracking transparency. We have used it in several published studies to detect and reverse-engineer online tracking. We now aim to democratize web privacy measurement: to transform it from a niche research field into a widely available tool. We will do this in two steps. First, we will use OpenWPM to publish a "web privacy census": a monthly, web-scale measurement of tracking and privacy comprising 1 million sites. The census will detect and measure many or most of the types of known privacy violations reported by researchers so far: circumvention of cookie blocking, leakage of PII to third parties, canvas fingerprinting, and more. Second, we will build an analysis platform that allows anyone to analyze the census data with minimal expertise. The platform will offer "1-click reproducibility", allowing study data, scripts, and results to be packaged and distributed in a format that is easy to replicate and extend.
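Many census-style analyses reduce to simple aggregations over crawl records. As a hedged sketch (the record format here is an assumption, not OpenWPM's actual schema), ranking third parties by the share of crawled sites on which they appear might look like:

```python
# Hypothetical census aggregation: rank third-party domains by the
# fraction of crawled first-party sites on which they were observed.
# The (site, tracker) record format is assumed for illustration.
from collections import defaultdict

def tracker_prevalence(crawl):
    """crawl: iterable of (site, third_party_domain) observations."""
    sites_per_tracker = defaultdict(set)
    all_sites = set()
    for site, tracker in crawl:
        all_sites.add(site)
        sites_per_tracker[tracker].add(site)
    return sorted(
        ((t, len(s) / len(all_sites)) for t, s in sites_per_tracker.items()),
        key=lambda x: -x[1],
    )

crawl = [("a.com", "doubleclick.net"), ("b.com", "doubleclick.net"),
         ("b.com", "scorecardresearch.com")]
print(tracker_prevalence(crawl))
# [('doubleclick.net', 1.0), ('scorecardresearch.com', 0.5)]
```

The "1-click reproducibility" goal would amount to shipping queries like this together with the census snapshot they were run against.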
The following proposals are amongst the top-third of all proposals and were offered a platform to share their ideas and work with other members of the DTL community. They were offered a presentation slot at the forthcoming DTL workshop in November 2015 (location and details will be announced in due time) as well as a travel grant to attend it.
Online ad networks are a characteristic example of online services that massively leverage user data for the purposes of behavioral targeting. A significant problem of these technologies is their lack of transparency. For this reason, the problem of reverse-engineering the behavioral targeting mechanisms of ad networks has recently attracted significant research interest. Existing approaches query ad networks using artificial user profiles, each of which pertains to a single user category. Nevertheless, well-designed ad services may not rely on such simple user categorizations: a user assigned to multiple categories may be presented with a set of ads quite different from the union of the sets of ads pertaining to each of their individual interests. Even more importantly, user interests may change or vary over time. However, none of the existing reverse-engineering systems is capable of determining whether and how ad network targeting mechanisms adapt to such temporal dynamics.
The goal of this proposal is to develop a platform addressing these inadequacies by leveraging advanced machine learning methods. The proposed platform is capable of: (i) Intelligently creating a diverse set of (interest-based) user profiles to query ad networks with. It ensures that the (artificial) user profiles used to query the analyzed ad networks correspond to as diverse a set of combinations of user interests (characteristics) as possible. (ii) Obviating the need to rely on some publicly available tree of categories/user interests, as this can be restrictive to the analysis or even misleading. Instead, our platform is capable of reliably producing a tree-like content-based grouping (clustering) of websites into interest groups, in a completely unsupervised manner. (iii) Performing inference of the correlations between user characteristics and ad network outputs in a way that allows for large scale generalization. (iv) Determining whether and how temporal dynamics affect these correlations, and on how long temporal horizons.
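As an illustrative sketch of the unsupervised grouping step (ii), the toy code below clusters sites whose word-count vectors exceed a cosine-similarity threshold; the proposed platform would use far richer content models and produce a full tree-like hierarchy. All site names, texts, and the threshold are assumptions.

```python
# Toy sketch: group sites into interest clusters by cosine similarity
# of their word-count vectors (single-link via union-find). Illustrative
# only; a real system would build a hierarchy over much richer features.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def group_sites(pages, threshold=0.5):
    names = list(pages)
    vecs = {n: Counter(pages[n].lower().split()) for n in names}
    parent = {n: n for n in names}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vecs[a], vecs[b]) >= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

pages = {
    "travel1": "cheap flights hotels rome",
    "travel2": "flights hotels paris deals",
    "sports": "football scores league table",
}
print(group_sites(pages))  # [['travel1', 'travel2'], ['sports']]
```

Profiles for querying the ad network (step i) could then be assembled by sampling sites from diverse combinations of such clusters.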
We propose Alibi, a system that enables users to take direct advantage of the work online trackers do to record and interpret their behavior. The key idea is to use the readily available personalized content, generated by online trackers in real-time, as a means to verify an online user in a seamless and privacy-preserving manner. We propose to utilize such tracker-generated personalized content, submitted directly by the user, to construct a multi-tracker user-vector representation and use it in various online verification scenarios. The main research objectives of this project are to explore the fundamental properties of such user-vector representations, i.e., their construction, uniqueness, persistency, resilience, utility in online verification, etc. The key goal of this project is to design, implement, and evaluate the Alibi service, and make it publicly available.
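One minimal way to picture the user-vector idea, under assumed names and data, is to flatten each tracker's reported interest segments into a feature set and verify a claimed identity by set overlap:

```python
# Illustrative sketch (all names and data assumed): represent a user by
# the interest segments each tracker reports, and verify a claimed
# identity by Jaccard similarity against the enrolled representation.
def user_vector(tracker_segments):
    """tracker_segments: {tracker: set_of_segments} -> flat feature set."""
    return {f"{t}:{s}" for t, segs in tracker_segments.items() for s in segs}

def verify(enrolled, presented, threshold=0.6):
    union = enrolled | presented
    return len(union) > 0 and len(enrolled & presented) / len(union) >= threshold

enrolled = user_vector({"adnet": {"travel", "tech"}, "social": {"cycling"}})
presented = user_vector({"adnet": {"travel", "tech"}, "social": {"cycling", "news"}})
print(verify(enrolled, presented))  # True
```

The project's research questions, such as uniqueness and persistency of these representations, determine how such a threshold would be chosen in practice.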
Today’s systems produce a rapidly exploding amount of data, and the data further derives more data, forming a complex data propagation network that we call the data’s lineage. There are many reasons that users want systems to forget certain data including its lineage. From a privacy perspective, users who become concerned with new privacy risks of a system often want the system to forget their data and lineage. From a security perspective, if an attacker pollutes an anomaly detector by injecting manually crafted data into the training data set, the detector must forget the injected data to regain security. From a usability perspective, a user can remove noise and incorrect entries so that a recommendation engine gives useful recommendations. Therefore, we envision forgetting systems, capable of forgetting certain data and their lineages, completely and quickly.
In this proposal, we focus on making learning systems forget; we call this process machine unlearning, or simply unlearning. We present a general, efficient unlearning approach that transforms the learning algorithms used by a system into a summation form. To forget a training data sample, our approach simply updates a small number of summations, which is asymptotically faster than retraining from scratch. Our approach is general because the summation form derives from statistical query learning, in which many machine learning algorithms can be implemented. Our approach also applies to all stages of machine learning, including feature selection and modeling.
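The summation idea can be seen on a toy statistic: any model that depends on the training data only through per-class counts and per-feature sums can forget a sample by subtracting its contribution, with no retraining pass. This sketch is illustrative, not the proposal's actual algorithm:

```python
# Toy "unlearning" sketch: the model state is a set of summations, so
# forgetting a sample means subtracting its contribution in O(d) time
# instead of re-reading the whole training set.
class SummationModel:
    def __init__(self):
        self.n = {}      # class label -> sample count
        self.sums = {}   # class label -> per-feature sums

    def learn(self, x, y):
        self.n[y] = self.n.get(y, 0) + 1
        s = self.sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v

    def unlearn(self, x, y):
        self.n[y] -= 1
        for i, v in enumerate(x):
            self.sums[y][i] -= v

    def mean(self, y):
        return [v / self.n[y] for v in self.sums[y]]

m = SummationModel()
m.learn([1.0, 2.0], "a")
m.learn([3.0, 4.0], "a")
m.unlearn([3.0, 4.0], "a")
print(m.mean("a"))  # [1.0, 2.0] -- as if the sample was never seen
```

After `unlearn`, the per-class mean is identical to what retraining on the remaining data would produce, which is the property the summation form is designed to guarantee.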
In this project, we aim to bring greater transparency to algorithmic pricing implemented by mobile, on-demand services. Algorithmic pricing was pioneered in this space by Uber in the form of "surge pricing". While we applaud mobile, on-demand services for disrupting incumbents and stimulating moribund sectors of the economy, we also believe that the data and algorithms leveraged by these services should be transparent. Fundamentally, consumers and providers cannot make informed choices when marketplaces are opaque. Furthermore, black-box services are vulnerable to exploitation once their algorithms are understood, which creates opportunities for customers and providers to manipulate these services in ways that are not possible in transparent markets.
Users are currently given only very limited feedback from search providers as to what learning and inference of personal preferences is taking place. When a search engine infers that a particular advertising category is likely to be of interest to a user, and so more likely to generate click-throughs and sales, it will tend to use this information when selecting which adverts to display. This learning can be detected by analyzing changes in the choice of displayed adverts, and the user can then be informed of it. In this project we will develop a browser plugin that provides such feedback, essentially empowering the user with the kind of data-analytic techniques used by the search engines themselves.
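A hedged sketch of such detection (set representation, ad labels, and threshold are all assumptions): compare the set of adverts currently displayed against a baseline set and flag significant drift as a learning signal:

```python
# Illustrative drift detector: flag likely preference learning when the
# overlap between the current ad set and a baseline falls below a
# threshold. Ad labels and the threshold are made up for the example.
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def learning_signal(baseline_ads, current_ads, threshold=0.5):
    """True when the displayed-ad set has drifted away from the baseline."""
    return jaccard(baseline_ads, current_ads) < threshold

baseline = {"shoes", "laptops", "insurance", "flights"}
current = {"running shoes", "marathons", "flights", "fitness"}
print(learning_signal(baseline, current))  # True
```

A plugin would refine this by collecting the baseline from a fresh, untracked profile and by comparing ad categories rather than raw labels.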
In principle, data transparency tools follow strict privacy guidelines to protect customers’ data while revealing how this data is being used by others. But these objectives are often at odds. To take a simple example, answering a question like "which of my emails caused this ad to appear?" presents the user with the following dilemma: she can either (blindly) enjoy the (relative) privacy offered by a service like Gmail, or, if she decides to voice her concern, offer her data for a data-transparency experiment with various tools (e.g., Xray, AdFisher, Sunlight and other more specific ones). The latter involves either running the experiment herself entirely or providing the data in clear form to one of those tools run by a third party. Both increase privacy risks, because sensitive data are now being manipulated by other pieces of code, sometimes under someone else’s control. This explains why all of the tools mentioned above, and in fact, with almost no exception, all transparency research so far, are run and validated on synthetic datasets that are by nature not sensitive.
Here, our goal is to formally define zero-knowledge transparency, to reconcile the two needs of being informed and being safe when it comes to our data usage, and to experiment with tools that provide this dual protection. As in our prior research, we aim at generic tools that address a broad range of scenarios with the same underlying concepts. The first architecture we propose leverages differential correlation, as used in Xray for multiple services, to show that this tool can be made privacy-preserving with an additional, simple architectural layer. The second architecture we envision is far broader: it leverages a data bank with interactive queries, such as Aircloak, to solve privacy and transparency separately. We believe that most data transparency tools will require a similar complement, and we will experiment with the robustness of this solution in the face of scale and the other challenges posed.
Human and social data are an important source of knowledge, useful for understanding human behaviour and for developing a wide range of user services. Unfortunately, this kind of data is sensitive, because the activities it describes may allow re-identification of individuals in a de-identified database and thus can potentially reveal intimate personal traits, such as religious or sexual preferences. Therefore, before sharing such data, Data Providers must apply some form of anonymization to lower the privacy risks, while also remaining aware of, and able to control, the resulting data quality, since these two factors are often in tension. This project proposes a framework to support the Data Provider in the privacy risk assessment of data to be shared. The framework measures both the empirical (not theoretical) privacy risk associated with the users represented in the data and the data quality guaranteed when only users not at risk are retained. It provides a mechanism for exploring a repertoire of possible data transformations, with the aim of selecting one specific transformation that yields an adequate trade-off between data quality and privacy risk. The project will focus on mobility data, studying the practical effectiveness of the framework on the forms of mobility data required by specific knowledge-based services.
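A toy version of the empirical risk/quality trade-off (illustrative only, not the project's actual measure): call a record "at risk" when its quasi-identifier combination appears fewer than k times in the dataset, and measure quality as the fraction of records retained after removing at-risk users:

```python
# Toy empirical-risk sketch in the spirit of k-anonymity: a record is
# at risk if its quasi-identifier combination is shared by fewer than
# k records. The trip data below is made up for illustration.
from collections import Counter

def risk_and_quality(records, k=2):
    freq = Counter(records)
    at_risk = [r for r in records if freq[r] < k]
    risk = len(at_risk) / len(records)
    quality = 1.0 - risk          # fraction of records retained
    return risk, quality

trips = [("home->work",), ("home->work",), ("home->gym",)]
print(risk_and_quality(trips, k=2))
```

Sweeping k, or applying candidate transformations (e.g., spatial generalization) before measuring, reproduces in miniature the trade-off exploration the framework proposes.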
Despite the magnitude and severity of the PII-leakage problem, there is currently a dearth of usable, privacy-enhancing technologies that detect and prevent PII leakage. To restore users' control over their own personally identifiable information, we propose to design, implement, and evaluate LeakSentry, a browser extension that identifies leakage as it is happening and gives users contextual information about the leakage, as well as the power to allow or block it. Next to LeakSentry's stand-alone mode, users will be able to opt in to a crowd-wisdom program where they can learn from each other's choices. In addition, LeakSentry will be able to report the location of PII leakage, enabling us to create an observatory of PII-leaking pages, which can both apply pressure to websites caught red-handed and steer other users away from them.
Online privacy is an important topic, hotly researched and in high demand, that has gained even more relevance recently. However, existing mechanisms that protect users’ privacy online, such as Tor and VPN connections, are complex, introduce performance overhead, and, in the case of the latter, add costs; they are therefore unsuitable for widespread public use. Browser vendors have recently introduced so-called private browsing modes, which are largely misunderstood by users: users overrate the level of protection these modes offer, which can lead to insecure behaviour. We aim to study user misconceptions, enhance users' comprehension, and scientifically evaluate the usability and applicability of stronger privacy-enhancing services such as Tor.
Digitally sharing our lives with others is a captivating and often addictive activity. Nowadays 1.8 billion photos are shared daily on social media. These images hold a wealth of personal information, ripe for exploitation by tailored advertising business models, but placed in the wrong hands this data can lead to disaster. In this project, we want to see how increasing a person’s awareness of potential personal-data sensitivity issues influences their decisions about what and how to share, and moreover how valuable they perceive their personal data to be. To achieve this ambitious goal we aim to (i) develop a novel methodology, applied within a mobile app, to inform users about the potential sensitivity of their images; sensitivity will be modeled by exploiting automatic inferences from advanced computer vision and deep learning algorithms applied to personal photos and associated metadata; and (ii) perform user-centric studies within a living-lab environment to assess how users’ posting behaviours and monetary valuation of mobile personal data are influenced by awareness of content-sharing risks.
Targeted advertising largely contributes to the support of free web services. However, it is also increasingly raising concerns from users, mainly due to its lack of transparency. The objective of this proposal is to increase the transparency of targeted advertising from the user’s point of view by providing users with a tool to understand why they are targeted with a particular ad and to infer what information the ad engines possibly have about them. Concretely, we propose to build a browser plugin that collects the ads shown to a user and provides her with analytics about these ads.
We are in a ‘personal data gold rush’ driven by advertising being the primary revenue source for most online companies. These companies accumulate extensive personal data about individuals with minimal concern for us, the subjects of this process. This can cause many harms: privacy infringement, personal and professional embarrassment, restricted access to labour markets, restricted access to best-value pricing, and many others. There is a critical need for technologies that enable alternative practices, so that individuals can participate in the collection, management and consumption of their personal data. We are developing the Databox, a personal networked device (and associated services) that collates and mediates access to personal data, allowing us to recover control of our online lives. We hope the Databox is a first step toward re-balancing power between us, the data subjects, and the corporations that collect and use our data.