Secure Web Applications Group

Critical errors in our recent MADweb paper

It recently came to our attention that our MADweb 2021 paper “First, Do No Harm: Studying the manipulation of security headers in browser extensions” contains two critical errors that cause our results to be incorrect.

Basic technique

To determine whether an extension manipulates a request, we use the Chrome DevTools Protocol (CDP) to monitor network requests and responses. The idea is to load each site twice: once without any extension (to gather ground-truth data) and once with the extension under test. We then extract the headers from both attempts and check whether they differ between the ground truth and the extension run. To minimize temporal drift, we collect the ground-truth headers for a given site and immediately afterwards test that site with the extension under test. However, there are still two major problems with this approach.
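The comparison step can be sketched as follows. This is a simplified illustration, not the paper's exact pipeline; the normalization rules (placeholder nonces, stripped report-uri endpoints) and function names are our own for this example.

```python
import re

def normalize(headers):
    """Normalize header values that are random by design before comparison.

    Illustrative sketch: only CSP nonces and report-uri endpoints are
    normalized here; a real pipeline would cover more cases.
    """
    normalized = {}
    for name, value in headers.items():
        name = name.lower()
        if name == "content-security-policy":
            # Nonces change on every response; replace them with a placeholder.
            value = re.sub(r"'nonce-[A-Za-z0-9+/=_-]+'", "'nonce-X'", value)
            # The concrete report-uri endpoint may embed per-request tokens.
            value = re.sub(r"report-uri [^;]+", "report-uri X", value)
        normalized[name] = value
    return normalized

def diff_headers(ground_truth, with_extension):
    """Return (dropped, injected, modified) header names between two runs."""
    gt = normalize(ground_truth)
    ext = normalize(with_extension)
    dropped = set(gt) - set(ext)
    injected = set(ext) - set(gt)
    modified = {h for h in set(gt) & set(ext) if gt[h] != ext[h]}
    return dropped, injected, modified
```

Note that this comparison implicitly assumes the server returns the same headers for both requests, which is exactly the assumption the two problems below break.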

Problem 1: Server-side randomness in headers

Something we did not consider at the time is that servers may omit or add certain headers for some, but not all, requests. This can be caused by load balancers or by A/B testing. As an example, https://www.linkedin.com/litms/api/metadata/user includes a report-uri in 4 out of 10 requests as of July 19, 2021. While we normalized parts of the headers that are random by design before comparison (such as nonces or the concrete URL in report-uri), we did not account for different headers coming back from the same URL within a few seconds.

Hence, if an extension was “unlucky” and received the version without report-uri for the above-mentioned URL, we would compare the ground truth (with report-uri) against the response observed during the test and incorrectly flag the extension as having manipulated the CSP. In the paper, we noted that 1,938 domains used report-uri in their CSP and that it was seemingly dropped on 25 domains, yet by 676 extensions. While our data does not allow us to pinpoint the exact number of incorrect detections (since the sites have changed since then), we strongly believe this to be a major contributor to incorrectly flagging many extensions as having manipulated a CSP.

There is also the opposite case, in which the ground-truth crawl did not observe a header, but the test with an extension did. This leads to extensions being flagged as having injected a header when in fact the header appeared by pure chance in the second response. We also found other headers, such as HSTS and X-Frame-Options (XFO), that cause similar problems. As an example, we receive different HSTS max-age values for pinterest.com (either 63072000 or 31536000, seemingly depending on the time of day).
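Such server-side randomness could in principle be detected by sampling the ground truth several times and excluding any header whose presence or value varies across samples. A minimal sketch of this idea (the function name and structure are hypothetical, not part of our original pipeline):

```python
def unstable_headers(samples):
    """Given several ground-truth header dicts captured for the same URL,
    return the names of headers whose presence or value varies across
    samples. Such headers cannot be compared reliably against a single
    response observed during an extension test.
    """
    lowered = [{k.lower(): v for k, v in s.items()} for s in samples]
    all_names = {name for s in lowered for name in s}
    unstable = set()
    for name in all_names:
        # None marks samples in which the header was absent entirely.
        values = [s.get(name) for s in lowered]
        if len(set(values)) > 1:
            unstable.add(name)
    return unstable
```

With the Pinterest and LinkedIn behavior described above, both HSTS (varying max-age) and the CSP (sometimes missing report-uri) would be marked unstable and could be excluded from the diff.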

Problem 2: CDP vs. Extension Race Condition

Through an extensive analysis, we discovered (unfortunately only recently) that there is a race condition between Chrome extensions and the Chrome DevTools Protocol (CDP). While we cannot fully validate our suspicion, our tests showed the following problem: if an extension registers an event handler to react to receiving or sending headers, the corresponding event is not necessarily also delivered to the CDP session. This means we may not collect any data for a given request. Hence, even if the visited site sent a security header in the ground-truth crawl and sends it again during the extension test, we may still record no headers for the given URL and therefore incorrectly flag the extension as having dropped the security header.
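The misclassification arises because an empty capture is indistinguishable from a dropped header unless the crawler records whether the CDP event arrived at all. A hypothetical guard (not part of our original pipeline) would distinguish the two cases and skip URLs lost to the race:

```python
def classify(ground_truth, captured):
    """Decide what to report for one URL in an extension test.

    `captured` is None when the CDP session never received the response
    event (e.g., lost to the race against the extension's handler), and
    a header dict otherwise. Hypothetical guard for illustration only.
    """
    if captured is None:
        # No CDP event at all: we cannot tell whether the extension
        # dropped headers or we simply lost the race, so skip the URL.
        return "no-data"
    missing = {h.lower() for h in ground_truth} - {h.lower() for h in captured}
    return ("dropped", missing) if missing else "unchanged"
```

Even with such a guard, losing the race still means losing the measurement for that URL, which is why this problem cannot be fully fixed within the outlined approach.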

Impact on our findings / Inability to correct them

We do not know the exact impact this had on our findings, since the work was conducted in late 2020. However, we analyzed the data and found that we flagged many more extensions with the all_urls privilege as potentially dangerous than extensions with specific host permissions. Given our (belated) insights, this makes perfect sense: since these extensions operate on all URLs, we tested them against the top 100 sites (which include, among others, Pinterest and LinkedIn), which could trigger Problem #1. Similarly, the chance of losing the race appears to be the same across all requests; hence, the more URLs we test, the higher the chance that Problem #2 occurs.

Given these insights, we are unable to stand by the results. Moreover, we cannot reproduce the results with the outlined methodology, as this would inevitably lead to the same problems again. While double-checking (i.e., collecting the ground truth multiple times) might alleviate Problem #1, we are unable to address Problem #2 with the outlined approach.

As a result of this, we have asked the chairs to withdraw our paper from the workshop.