Research at Brave

Papers

BatteryLab: A Distributed Platform for Battery Measurements

Authors: Matteo Varvello (Brave Software), Kleomenis Katevas (Imperial College London), Mihai Plesa (Brave Software), Hamed Haddadi (Brave Software and Imperial College London), Ben Livshits (Brave Software and Imperial College London)

HotNets 2019: Eighteenth ACM Workshop on Hot Topics in Networks

Recent advances in cloud computing have simplified the way that both software development and testing are performed. Unfortunately, this is not true for battery testing, for which state-of-the-art test-beds simply consist of one phone attached to a power meter. These test-beds offer limited resources and access, and are overall hard to maintain; for these reasons, they often sit idle with no experiments to run. In this paper, we propose to share existing battery testing setups and build BatteryLab, a distributed platform for battery measurements. Our vision is to transform independent battery testing setups into vantage points of a planetary-scale measurement platform offering heterogeneous devices and testing conditions. In the paper, we design and deploy a combination of hardware and software solutions to enable BatteryLab’s vision. We then evaluate the accuracy of BatteryLab’s battery reporting, along with some system benchmarking. We also demonstrate how BatteryLab can be used by researchers to investigate a simple research question.
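
As a rough, hedged illustration of the kind of software a BatteryLab vantage point relies on, the sketch below polls an attached Android device's battery level over `adb` and records a simple trace while an experiment runs. The device serial, sampling interval, and CSV format are assumptions for illustration, not the platform's actual tooling.

```python
import csv
import subprocess
import time

def read_battery_level(serial: str) -> int:
    """Read the battery level (0-100) of an attached Android device via adb."""
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "dumpsys", "battery"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.strip().startswith("level:"):
            return int(line.split(":", 1)[1])
    raise RuntimeError("no battery level found in dumpsys output")

def record_trace(serial: str, duration_s: int, interval_s: float, path: str) -> None:
    """Sample (timestamp, level) pairs for the duration of an experiment."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "battery_level"])
        deadline = time.time() + duration_s
        while time.time() < deadline:
            writer.writerow([time.time(), read_battery_level(serial)])
            time.sleep(interval_s)

# Example (hypothetical device serial):
# record_trace("R58M12ABC", duration_s=600, interval_s=5.0, path="battery_trace.csv")
```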

 

AdGraph: A Machine Learning Approach to Automatic and Effective Adblocking

Authors: Umar Iqbal (The University of Iowa), Peter Snyder (Brave Software), Shitong Zhu (University of California Riverside), Benjamin Livshits (Brave Software and Imperial College London), Zhiyun Qian (University of California Riverside), Zubair Shafiq (The University of Iowa)

IEEE Symposium on Security and Privacy 2020

Filter lists are widely deployed by adblockers to block ads and other forms of undesirable content in web browsers. However, these filter lists are manually curated based on informal crowdsourced feedback, which brings with it a significant number of maintenance challenges. To address these challenges, we propose a machine learning approach for automatic and effective adblocking called AdGraph. Our approach relies on information obtained from multiple layers of the web stack (HTML, HTTP, and JavaScript) to train a machine learning classifier to block ads and trackers. Our evaluation on the Alexa top-10K websites shows that AdGraph automatically and effectively blocks ads and trackers with 97.7% accuracy. Our manual analysis shows that AdGraph has better recall than filter lists: it blocks 16% more ads and trackers with 65% accuracy. We also show that AdGraph is fairly robust against adversarial obfuscation by publishers and advertisers that bypass filter lists.
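
A minimal sketch of this kind of cross-layer classification is shown below, assuming a handful of illustrative features and a random-forest model; AdGraph's actual graph representation and feature set are described in the paper, so treat this only as the general shape of the approach.

```python
from urllib.parse import urlparse
from sklearn.ensemble import RandomForestClassifier

def featurize(request: dict) -> list:
    """Toy features drawn from the HTML, HTTP, and JavaScript layers."""
    url = urlparse(request["url"])
    return [
        len(request["url"]),                       # HTTP: URL length
        int(url.netloc != request["page_host"]),   # HTTP: third-party request?
        int("?" in request["url"]),                # HTTP: has a query string
        int(request["initiated_by_script"]),       # JavaScript: script-initiated load
        request["dom_depth"],                      # HTML: depth of the initiating node
        int("ad" in url.path.lower()),             # naive keyword signal
    ]

def train_blocker(requests: list, labels: list) -> RandomForestClassifier:
    """labels: 1 = ad/tracker (block), 0 = benign (allow)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit([featurize(r) for r in requests], labels)
    return clf

# decision = train_blocker(train_requests, train_labels).predict([featurize(req)])[0]
```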

 

SpeedReader: Reader Mode Made Fast and Private

Authors: Mohammad Ghasemisharif, Peter Snyder, Andrius Aucinas, Benjamin Livshits

WWW Conference 2019

Most popular web browsers include “reader modes” that improve the user experience by removing un-useful page elements. Reader modes reformat the page to hide elements that are not related to the page’s main content. Such page elements include site navigation, advertising-related videos and images, and most JavaScript. The intended end result is that users can enjoy the content they are interested in, without distraction.

In this work, we consider whether the “reader mode” can be widened to also provide performance and privacy improvements. Instead of using it as a post-render feature to clean up the clutter on a page, we propose SpeedReader, an alternative multi-step pipeline that is part of the rendering pipeline. Once the tool decides during the initial phase of a page load that a page is suitable for reader mode use, it directly applies document tree translation before the page is rendered.

Based on our measurements, we believe that SpeedReader can be continuously enabled in order to drastically improve the end-user experience, especially on slower mobile connections. Combined with our approach to predicting which pages should be rendered in reader mode with 91% accuracy, it achieves drastic speedups and bandwidth reductions of up to 27x and 84x, respectively, on average. We further find that our novel “reader mode” approach brings with it significant privacy improvements to users. Our approach effectively removes all commonly recognized trackers, issuing 115 fewer requests to third parties and interacting with 64 fewer trackers on average on transformed pages.
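
To make the two stages concrete, here is a toy sketch (in Python, using BeautifulSoup) of a cheap "is this page readable?" check followed by a pre-render strip of non-content subtrees. The thresholds and tag list are assumptions for illustration; SpeedReader's trained classifier and in-browser tree transformations are considerably more sophisticated.

```python
from bs4 import BeautifulSoup

STRIP_TAGS = ["script", "iframe", "nav", "aside", "footer", "form"]  # assumed list

def looks_readable(html: str) -> bool:
    """Cheap stand-in for SpeedReader's readability classifier."""
    soup = BeautifulSoup(html, "html.parser")
    text_length = len(soup.get_text(" ", strip=True))
    paragraph_count = len(soup.find_all("p"))
    return paragraph_count >= 5 and text_length > 2000  # illustrative thresholds

def to_reader_mode(html: str) -> str:
    """Drop non-content subtrees before the document is ever rendered."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(STRIP_TAGS):
        tag.decompose()
    return str(soup)

# html = fetch(url)
# render(to_reader_mode(html) if looks_readable(html) else html)
```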

 

Working Drafts

Filter List Generation for Underserved Regions

Authors: Alexander Sjosten, Peter Snyder, Antonio Pastor, Panagiotis Papadopoulos, Benjamin Livshits

Oct 16, 2019

Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g., ads, trackers, paywall libraries) so that they can be blocked by browsers and extensions.

Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on “popular” websites or resources that appear on a large number of “unpopular” websites. A crowd-sourcing strategy performs poorly, however, for parts of the web with small “crowds”, such as regions of the web serving languages with (relatively) few speakers.

This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages.

We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 2,270 ad and ad-related resources (1,901 unique) when applied to 6,475 pages targeting these three regions.

We hope that this work can be part of an increased effort at ensuring that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.
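
As a hedged sketch of the pipeline's final step, the snippet below turns classifier-flagged request URLs into simple domain-anchored Adblock Plus-style rules and keeps only those not already covered by an existing list. The rule format and deduplication are deliberate simplifications of real filter list syntax.

```python
from urllib.parse import urlparse

def to_rule(url: str) -> str:
    """Emit a simple domain-anchored network rule in Adblock Plus syntax."""
    return f"||{urlparse(url).netloc}^"

def complement_list(flagged_urls: list, existing_rules: set) -> list:
    """Return new rules that an existing filter list does not already contain."""
    return sorted({to_rule(u) for u in flagged_urls} - existing_rules)

# Example with hypothetical URLs:
# complement_list(["https://ads.example.lk/banner.js"], {"||doubleclick.net^"})
# -> ["||ads.example.lk^"]
```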

 

Percival: Making In-Browser Perceptual Ad Blocking Practical With Deep Learning

Authors: Zain ul Abi Din, Panagiotis Tigas, Samuel T. King, Benjamin Livshits

May 22, 2019

Online advertising has been a long-standing concern for user privacy and overall web experience. Several techniques have been proposed to block ads, mostly based on filter lists and manually written rules. While a typical ad blocker relies on manually curated block lists, these inevitably get out of date, compromising the ultimate utility of this ad-blocking approach. In this paper we present Percival, a browser-embedded, lightweight, deep learning-powered ad blocker. Percival embeds itself within the browser’s image rendering pipeline, which makes it possible to intercept every image obtained during page execution and to perform blocking by applying machine learning for image classification to flag potential ads. Our implementation inside both the Chromium and Brave browsers shows only a minor rendering performance overhead of 4.55%, demonstrating the feasibility of deploying traditionally heavy models (i.e., deep neural networks) inside the critical path of a browser’s rendering engine. We show that our image-based ad blocker can replicate EasyList rules with an accuracy of 96.76%. To show the versatility of Percival’s approach, we present case studies demonstrating that Percival (1) does surprisingly well on ads in languages other than English, and (2) also performs well on blocking first-party Facebook ads, which have presented issues for other ad blockers. Percival proves that image-based perceptual ad blocking is an attractive complement to today’s dominant approach of block lists.
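
The classification step can be pictured as a small image classifier sitting in the decode path. The PyTorch sketch below is an assumption-laden stand-in: Percival's real network, training data, and its integration inside the browser's image rendering pipeline are described in the paper.

```python
import torch
import torch.nn as nn

class TinyAdClassifier(nn.Module):
    """Small stand-in CNN; Percival's actual architecture and weights differ."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)  # classes: [not-ad, ad]

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def is_ad(model: nn.Module, image: torch.Tensor, threshold: float = 0.5) -> bool:
    """image: a (3, H, W) RGB tensor, already decoded, resized, and normalized."""
    with torch.no_grad():
        probs = torch.softmax(model(image.unsqueeze(0)), dim=1)
    return probs[0, 1].item() > threshold

# In the browser, an equivalent check runs as images are decoded, so frames
# classified as ads can be suppressed before they are painted.
```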

 

The Blind Men and the Internet: Multi-Vantage Point Web Measurements

Authors: Jordan Jueckstock, Shaown Sarker, Peter Snyder, Panagiotis Papadopoulos, Matteo Varvello, Benjamin Livshits, Alexandros Kapravelos

May 21, 2019

In this paper, we design and deploy a synchronized multi-vantage point web measurement study to explore the comparability of web measurements across vantage points (VPs). We describe in reproducible detail the system with which we performed synchronized crawls on the Alexa top 5K domains from four distinct network VPs: research university, cloud datacenter, residential network, and Tor gateway proxy. Apart from the expected poor results from Tor, we observed no shocking disparities across VPs, but we did find significant impact from the residential VP’s reliability and performance disadvantages. We also found subtle but distinct indicators that some third-party content consistently avoided crawls from our cloud VP. In summary, we infer that cloud VPs do fail to observe some content of interest to security and privacy researchers, who should consider augmenting cloud VPs with alternate VPs for cross-validation. Our results also imply that the added visibility provided by residential VPs over university VPs is marginal compared to the infrastructure complexity and network fragility they introduce.
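
A toy version of the comparison is sketched below: fetch the same page through per-vantage-point proxies and diff the third-party hosts referenced in the returned HTML. The proxy addresses are hypothetical, and the real study used full synchronized browser crawls (including a Tor gateway proxy), not simple HTTP fetches.

```python
import re
import requests
from urllib.parse import urlparse

# Hypothetical per-VP proxies; the study crawled from a research university,
# a cloud datacenter, a residential network, and a Tor gateway proxy.
VANTAGE_POINTS = {
    "university": None,  # direct connection
    "cloud": {"https": "http://cloud-proxy.example:3128"},
    "residential": {"https": "http://home-proxy.example:3128"},
}

SRC_RE = re.compile(r'src="(https?://[^"]+)"')

def observed_hosts(url: str, proxies) -> set:
    """Rough stand-in for 'third-party content seen': hosts referenced in the HTML."""
    html = requests.get(url, proxies=proxies, timeout=30).text
    return {urlparse(src).netloc for src in SRC_RE.findall(html)}

def missing_versus_university(url: str) -> dict:
    """Hosts seen from the university VP but absent from each other VP."""
    results = {vp: observed_hosts(url, p) for vp, p in VANTAGE_POINTS.items()}
    baseline = results["university"]
    return {vp: baseline - hosts for vp, hosts in results.items()}
```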

 

Privacy-Preserving Bandits

Authors: Mohammad Malekzadeh, Dimitrios Athanasakis, Hamed Haddadi, Ben Livshits

Oct 3, 2019

Contextual bandit algorithms (CBAs) often rely on personal data to provide recommendations. This means that potentially sensitive data from past interactions are utilized to provide personalization to end-users. Using a local agent on the user’s device protects the user’s privacy by keeping the data local; however, the agent takes longer to produce useful recommendations, as it does not leverage feedback from other users. This paper proposes a technique we call Privacy-Preserving Bandits (P2B): a system that updates local agents by collecting feedback from other agents in a differentially private manner. Comparisons of our proposed approach with a non-private system, as well as a fully private (local) one, show competitive performance on both synthetic benchmarks and real-world data. Specifically, we observed a decrease of 2.6% and 3.6% in multi-label classification accuracy, and a CTR increase of 0.0025 in online advertising, for a privacy budget ε ≈ 0.693. These results suggest P2B is an effective approach to problems arising in on-device privacy-preserving personalization.
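
The reported budget ε ≈ 0.693 equals ln 2, which is what a simple binary randomized-response mechanism achieves when the truth is reported with probability 2/3. The sketch below shows that textbook mechanism purely to illustrate the budget; P2B's actual encoding of contexts and actions before feedback is shared is more involved.

```python
import math
import random

def randomized_response(bit: int, p_true: float = 2 / 3) -> int:
    """Report the true bit with probability p_true, otherwise flip it.

    Textbook mechanism shown only to illustrate the privacy budget above;
    P2B's actual encoding of contexts and actions is different.
    """
    return bit if random.random() < p_true else 1 - bit

def epsilon(p_true: float) -> float:
    """Privacy budget of binary randomized response: ln(p_true / (1 - p_true))."""
    return math.log(p_true / (1 - p_true))

# epsilon(2 / 3) -> 0.6931..., i.e. the ε ≈ 0.693 reported above.
```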

 

Another Brick in the Paywall: The Popularity and Privacy Implications of Paywalls

Authors: Panagiotis Papadopoulos, Peter Snyder, Benjamin Livshits

Feb 18, 2019

Funding the production and distribution of quality online content is an open problem for content producers. Selling subscriptions to content, once considered passé, has been growing in popularity recently. Decreasing revenues from digital advertising, along with increasing ad fraud, have driven publishers to “lock” their content behind paywalls, thus denying access to non-subscribed users. How much do we know about the technology that may obliterate what we know as the free web? What is its prevalence? How does it work? Is it better than ads when it comes to user privacy? How well is the premium content of publishers protected? In this study, we aim to address all of the above by building a paywall detection mechanism and performing the first full-scale analysis of real-world paywall systems. Our results show that the prevalence of paywalls across the top sites reaches 4.2% in Great Britain, 4.1% in Australia, 3.6% in France, and 7.6% globally. We find that paywall use is especially pronounced among news sites, and that 33.4% of sites in the Alexa 1k ranking for global news sites have adopted paywalls. Further, we see a remarkable 25% of paywalled sites outsourcing their paywall functionality (including user tracking and access control enforcement) to third parties. Putting aside the significant privacy concerns, these paywall deployments can be easily circumvented, and are thus mostly unable to protect publisher content.
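
One signal such a detection mechanism can use is whether a page loads scripts from known third-party paywall providers. The snippet below is a hedged sketch of only that signal; the provider list is a small assumed example, not the paper's curated detection list.

```python
import re

# Assumed example provider domains; the paper's detector relies on a much
# larger curated set of signals.
PAYWALL_PROVIDERS = ["tinypass.com", "piano.io"]

SCRIPT_SRC = re.compile(r'<script[^>]+src="([^"]+)"', re.IGNORECASE)

def third_party_paywalls(html: str) -> list:
    """Return the known paywall providers whose scripts this page loads."""
    srcs = SCRIPT_SRC.findall(html)
    return sorted({p for p in PAYWALL_PROVIDERS if any(p in s for s in srcs)})

# A non-empty result indicates a site outsourcing its paywall functionality
# (and, typically, user tracking) to a third party.
```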

 

Evaluating the End-User Experience of Private Browsing Mode

Authors: Ruba Abu-Salma, Benjamin Livshits

Nov 20, 2018

Nowadays, all major web browsers have a private browsing mode. However, the mode’s benefits and limitations are not particularly understood. Through the use of survey studies, prior work has found that most users are either unaware of private browsing or do not use it. Further, those who do use private browsing generally have misconceptions about what protection it provides.

However, prior work has not investigated why users misunderstand the benefits and limitations of private browsing. In this work, we do so by designing and conducting a two-part user study with 20 demographically-diverse participants: (1) a qualitative, interview-based study to explore users’ mental models of private browsing and its security goals; (2) a participatory design study to investigate whether existing browser disclosures, the in-browser explanations of private browsing mode, communicate the security goals of private browsing to users. We asked our participants to critique the browser disclosures of three web browsers: Brave, Firefox, and Google Chrome, and then design new ones.

We find that most participants had incorrect mental models of private browsing, influencing their understanding and usage of private browsing mode. Further, we find that existing browser disclosures are not only vague, but also misleading. None of the three studied browser disclosures communicates or explains the primary security goal of private browsing. Drawing from the results of our user study, we distill a set of design recommendations that we encourage browser designers to implement and test, in order to design more effective browser disclosures.

 

Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking

Authors: Antoine Vastel, Peter Snyder, Benjamin Livshits

Oct 22, 2018

Ad and tracking blocking extensions are among the most popular browser extensions. These extensions typically rely on filter lists to decide whether a URL is associated with tracking or advertising. Millions of web users rely on these lists to protect their privacy and improve their browsing experience. Despite their importance, the growth and health of these filter lists are poorly understood. These lists are maintained by a small number of contributors, who use a variety of undocumented heuristics to determine what rules should be included. These lists quickly accumulate rules over time, and rules are rarely removed. As a result, users’ browsing experiences are degraded as the number of stale, dead, or otherwise not useful rules increasingly dwarfs the number of useful rules, with no attenuating benefit. This paper improves the understanding of crowdsourced filter lists by studying EasyList, the most popular filter list. We find that, over its 9-year history, EasyList has grown from several hundred rules to well over 60,000. We then apply EasyList to a sample of 10,000 websites and find that 90.16% of the resource blocking rules in EasyList provide no benefit to users in common browsing scenarios. Based on these results, we provide a taxonomy of the ways advertisers evade EasyList rules. Finally, we propose optimizations for popular ad-blocking tools that provide over 99% of the coverage of existing tools but are 62.5% faster.
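
The core measurement behind these numbers is whether each rule ever fires during real crawls. A minimal sketch of that bookkeeping is below, using naive substring matching in place of full Adblock rule semantics (||, ^, wildcards, $options), which is an intentional simplification.

```python
from collections import Counter

def load_network_rules(path: str) -> list:
    """Keep simple network-blocking rules; skip comments and element-hiding rules."""
    with open(path) as f:
        return [
            line.strip() for line in f
            if line.strip() and not line.startswith("!") and "##" not in line
        ]

def count_rule_hits(rules: list, request_urls: list) -> Counter:
    """Naive matcher: treat the stripped rule body as a substring of the URL."""
    hits = Counter()
    for url in request_urls:
        for rule in rules:
            body = rule.lstrip("|@").rstrip("^")
            if body and body in url:
                hits[rule] += 1
    return hits

# rules = load_network_rules("easylist.txt")
# hits = count_rule_hits(rules, crawl_request_urls)
# unused = [r for r in rules if hits[r] == 0]
```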

 

Blog Entries

Memory Savings in Brave: 33% to 66% memory reduction over Chrome

This research was conducted by Dr. Andrius Aucinas, performance researcher at Brave, and Dr. Ben Livshits, Brave’s Chief Scientist. We are continuing our series of posts evaluating Brave browser's performance. This time we look at one aspect that often frustrates web...

French regulator shows deep flaws in IAB’s consent framework and RTB

The French regulator’s decision against Vectaury confirms that the IAB “Transparency & Consent Framework” does not obtain valid consent, and illustrates how even tiny adtech companies can unlawfully gather millions of people’s personal data from the online advertising “real-time bidding” (RTB) system.

Evaluating the End-User Experience of Private Browsing Mode

Nowadays, all major web browsers have a private browsing mode. However, the mode's benefits and limitations are not particularly understood. Through the use of survey studies, prior work has found that most users are either unaware of private browsing or do not use it. Further, those who do use private browsing generally have misconceptions about what protection it provides. However, prior work has not investigated why users misunderstand the benefits and limitations of private browsing. In this work, we do so by designing and conducting a two-part user study with 20 demographically-diverse participants: (1) a qualitative, interview-based study to explore users' mental models of private browsing and its security goals; (2) a participatory design study to investigate whether existing browser disclosures, the in-browser explanations of private browsing mode, communicate the security goals of private browsing to users. We asked our participants to critique the browser disclosures of three web browsers: Brave, Firefox, and Google Chrome, and then design new ones. We find that most participants had incorrect mental models of private browsing, influencing their understanding and usage of private browsing mode. Further, we find that existing browser disclosures are not only vague, but also misleading. None of the three studied browser disclosures communicates or explains the primary security goal of private browsing. Drawing from the results of our user study, we distill a set of design recommendations that we encourage browser designers to implement and test, in order to design more effective browser disclosures.

SpeedReader: Reader Mode Made Fast and Private

Most popular web browsers include "reader modes" that improve the user experience by removing un-useful page elements. Reader modes reformat the page to hide elements that are not related to the page's main content. Such page elements include site navigation, advertising-related videos and images, and most JavaScript. The intended end result is that users can enjoy the content they are interested in, without distraction. In this work, we consider whether the "reader mode" can be widened to also provide performance and privacy improvements. Instead of using it as a post-render feature to clean up the clutter on a page, we propose SpeedReader, an alternative multi-step pipeline that is part of the rendering pipeline. Once the tool decides during the initial phase of a page load that a page is suitable for reader mode use, it directly applies document tree translation before the page is rendered. Based on our measurements, we believe that SpeedReader can be continuously enabled in order to drastically improve the end-user experience, especially on slower mobile connections. Combined with our approach to predicting which pages should be rendered in reader mode with 91% accuracy, it achieves drastic speedups and bandwidth reductions of up to 27x and 84x, respectively, on average. We further find that our novel "reader mode" approach brings with it significant privacy improvements to users. Our approach effectively removes all commonly recognized trackers, issuing 115 fewer requests to third parties and interacting with 64 fewer trackers on average on transformed pages.

Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking

Ad and tracking blocking extensions are among the most popular browser extensions. These extensions typically rely on filter lists to decide whether a URL is associated with tracking or advertising. Millions of web users rely on these lists to protect their privacy and improve their browsing experience. Despite their importance, the growth and health of these filter lists are poorly understood. These lists are maintained by a small number of contributors, who use a variety of undocumented heuristics to determine what rules should be included. These lists quickly accumulate rules over time, and rules are rarely removed. As a result, users' browsing experiences are degraded as the number of stale, dead, or otherwise not useful rules increasingly dwarfs the number of useful rules, with no attenuating benefit. This paper improves the understanding of crowdsourced filter lists by studying EasyList, the most popular filter list. We find that, over its 9-year history, EasyList has grown from several hundred rules to well over 60,000. We then apply EasyList to a sample of 10,000 websites and find that 90.16% of the resource blocking rules in EasyList provide no benefit to users in common browsing scenarios. Based on these results, we provide a taxonomy of the ways advertisers evade EasyList rules. Finally, we propose optimizations for popular ad-blocking tools that provide over 99% of the coverage of existing tools but are 62.5% faster.

Understanding Redirection-Based Tracking

This blog post describes ongoing work conducted at Brave by Peter Snyder and Ben Livshits. It is the third in a series of research-oriented posts that share both present investigations and future vision. We are constantly looking to improve and automate the privacy...

The Mounting Cost of Stale Ad Blocking Rules

This blog post describes ongoing work conducted at Brave by Antoine Vastel, Peter Snyder, and Ben Livshits. It is the second in a series of research-oriented posts that share both present investigations and future vision. We are constantly looking to improve and...

AdGraph: A Machine Learning Approach to Automatic and Effective Adblocking

Filter lists are widely deployed by adblockers to block ads and other forms of undesirable content in web browsers. However, these filter lists are manually curated based on informal crowdsourced feedback, which brings with it a significant number of maintenance challenges. To address these challenges, we propose a machine learning approach for automatic and effective adblocking called AdGraph. Our approach relies on information obtained from multiple layers of the web stack (HTML, HTTP, and JavaScript) to train a machine learning classifier to block ads and trackers. Our evaluation on the Alexa top-10K websites shows that AdGraph automatically and effectively blocks ads and trackers with 97.7% accuracy. Our manual analysis shows that AdGraph has better recall than filter lists: it blocks 16% more ads and trackers with 65% accuracy. We also show that AdGraph is fairly robust against adversarial obfuscation by publishers and advertisers that bypass filter lists.