What is Browser Fingerprinting?
- Published on
Device fingerprinting is the collection of hardware and software information about a remote computing device for the purpose of identification [1]. Browser fingerprinting is device fingerprinting performed through the browser.
Device fingerprinting is typically used to identify users when persistent cookies cannot be read, when the client IP is hidden, or when the same device is accessed through a different browser. While it can serve legitimate purposes, such as preventing identity theft or credit card fraud, it is also used in gray areas, such as collecting data at the expense of users' privacy in order to personalize marketing. In addition, the data teams of companies built around web applications routinely collect clickstream data, either with tooling built in-house or through marketing tools and event trackers such as Google Analytics, Segment.io, or Amplitude.
In response, most browsers are adding settings that let users control or block fingerprinting in order to provide a more private environment.
This article briefly covers the evolution of web browsers and then limits its scope to browser fingerprinting, the main form of device fingerprinting. Let's take a look at what browser fingerprinting techniques exist today and what defense techniques exist.
Web browser history
In this chapter, we will focus on how HTML rendering in web browsers has changed [4].
Signaling browser constraints through the User-Agent header
The philosophy of the early development of the web was to make it device-agnostic so that anyone could access the web, allowing it to run on devices with any structure. In the early 90s, HTTP and HTML were created to communicate between such devices, and web browsers designed by various teams to support them soon became standard. However, as the foundation of the web has evolved, pushing the boundaries of what is possible online, not all browsers and platforms have the latest features.
Some browsers only follow part of the specification and develop their own unique features. This direction ushered in the infamous “best supported on X” era.
To address this compatibility issue, HTTP includes the User-Agent request header. Browsers began listing their name and version, sometimes along with platform information, so that servers would not apply restrictions aimed at specific user agents. The User-Agent header has a very long history (https://webaim.org/blog/user-agent-string-history/), and it is still in use today because modern browsers carry the legacy of the earliest browsers. The information in this header grew more complex as browser vendors began copying their competitors' values to signal that their browsers were compatible with other rendering engines. For example, the User-Agent for Chrome version 68 running on Linux is as follows.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36
The only meaningful information above is "(X11; Linux x86_64)" and "Chrome/68.0.3440.75". "Gecko", "KHTML" or "Safari" exist to indicate compatibility with other layout engines.
As a result, the User-Agent header is the first thing developers can check: it describes the differences between devices and helps them work around browser restrictions.
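To make the structure of that header concrete, here is a minimal Python sketch of how a server might pull the platform comment and the product tokens out of the Chrome 68 string above. The function name parse_user_agent and the regular expressions are illustrative assumptions, not a standard parser, and real User-Agent strings contain many variations this sketch does not handle.

```python
import re

# The Chrome 68 on Linux example from the text.
UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36")


def parse_user_agent(ua: str) -> dict:
    """Extract the first parenthesised comment and all Name/Version tokens.

    Hypothetical helper for illustration only.
    """
    # The first parenthesised comment usually carries the platform details.
    platform = re.search(r"\(([^)]*)\)", ua)
    # Product tokens have the form Name/Version, e.g. Chrome/68.0.3440.75.
    products = dict(re.findall(r"([A-Za-z]+)/([\d.]+)", ua))
    return {
        "platform": platform.group(1) if platform else None,
        "products": products,
    }


info = parse_user_agent(UA)
print(info["platform"])            # X11; Linux x86_64
print(info["products"]["Chrome"])  # 68.0.3440.75
```

The remaining tokens (Mozilla, AppleWebKit, Safari) are still present in info["products"], which shows how much of the string exists only for compatibility.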
Bridging the gap between web browsers and native applications
In the early days of the web, live changes required pages to be reloaded. In 1995, Brendan Eich added JavaScript to the Netscape Navigator browser to make web pages more dynamic. Since then, JavaScript has gained traction and started being included in various browsers. In June 1997, the specification for the language under the name "ECMAScript" was officially published.
As the language grew, browsers offered users increasingly diverse features, and developers began building bridges between the browser and the underlying platform. The ultimate goal was to make the browser feel more like a native application by drawing on information from the user's environment. For example, the first edition of ECMAScript already exposed the time zone of the user's operating system through the "Date" object.
The Evolution of Modern APIs
Modern browsers have transformed from simple HTML display tools to multimedia platforms compatible with a variety of formats and devices. Many APIs closely related to fingerprinting techniques have been developed as web standards through W3C to provide users with a richer experience and support mobile browsing environments.
The Canvas API provides objects, functions, and properties to draw or manipulate graphics on the canvas surface. WebGL is a graphics API that allows 3D objects to be manipulated in the browser via JavaScript without any other plugins. Additionally, the Web Audio API provides functions for audio processing.
In addition, there are APIs such as WebRTC for real-time communication, Geolocation for real-time positioning, WebAssembly to improve browser performance, and WebPayments and WebXR (under review) to improve web functions.
Browser Fingerprinting Techniques
Most browser fingerprinting techniques are based on client-side scripting languages introduced in the late 1990s, but serious research began with Mayer's work in 2009. The 2010 Panopticlick experiment that followed can be regarded as the first large-scale study, and it established "browser fingerprinting" as an academic term.
Through a simple script running inside the browser, a server can collect a wide range of information through APIs and HTTP headers. An API is an interface that provides access to a specific object or function. Some APIs require permission, such as those for accessing the microphone or camera, but most are freely accessible through JavaScript [4].
Unlike cookies, which rely on IDs, browser fingerprinting is stateless.
Entropy for each property is as follows:
Canvas(and WebGL) Fingerprinting
Canvas fingerprinting is a browser fingerprinting technique first introduced in a 2012 paper by Mowery and Shacham [3]. The authors found that rendering the same text or WebGL scene through the browser's Canvas API yields a consistent fingerprint, and that this can be done in a fraction of a second without the user noticing.
The same text may be rendered differently on different computers because of the operating system, font library, graphics card, graphics driver, and browser. Rendering differences can come from font rasterization (anti-aliasing, hinting, or sub-pixel smoothing), from the set of system fonts, from API implementation differences, and even from the physical display. A fingerprinter can draw as many different characters as possible to vary the output and improve identification.
The entropy of canvas fingerprinting has not been investigated extensively. [3] measured 5.73 bits of entropy and, because the experiment was limited in size, estimated the true figure at around 10 bits. At 10 bits of entropy, roughly 1 in 1,000 users would share the same fingerprint.
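The link between bits of entropy and "1 in N users" can be sketched with a short Python calculation. The sample of observed fingerprint values below is made up for illustration, and shannon_entropy_bits is a hypothetical helper name. Entropy is an average over the whole population, so the "1 in 2^H" reading is only an approximation.

```python
import math
from collections import Counter


def shannon_entropy_bits(values) -> float:
    """Shannon entropy, in bits, of the observed distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up sample: eight visitors, four distinct canvas hashes.
observed = ["a"] * 4 + ["b"] * 2 + ["c", "d"]
entropy = shannon_entropy_bits(observed)
print(entropy)  # 1.75

# With H bits of entropy, a fingerprint is shared by about 1 in 2**H users.
print(2 ** 10)  # 1024, i.e. roughly 1 in 1,000 at 10 bits
```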
The image below shows the basic flow of canvas fingerprinting. When a user visits a page, the fingerprinting script draws text in a chosen font and size and adds a background color (1). The script then calls the Canvas API's toDataURL function to obtain the canvas pixel data as a data URL (the Base64-encoded binary pixel data). Finally, the script hashes the encoded pixel data and sends the hash to be stored for later use.
These hash values can be combined with high-entropy browser features such as plugin lists, font lists, or user agent values to be used as useful identifiers.
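The hashing and combining steps can be sketched in Python as follows. This only covers what happens once a data URL is available; the drawing itself runs in the browser. The function names, the choice of SHA-256, and the placeholder bytes standing in for real PNG data are all assumptions made for illustration, not the method of any particular library.

```python
import base64
import hashlib


def canvas_hash(data_url: str) -> str:
    """Hash the pixel payload of a 'data:image/png;base64,...' URL."""
    header, _, payload = data_url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return hashlib.sha256(base64.b64decode(payload)).hexdigest()


def combined_fingerprint(attributes: dict) -> str:
    """Hash attributes in sorted key order so the result is order-independent."""
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder bytes stand in for the PNG data a browser would return.
fake_png = b"\x89PNG placeholder pixel data"
data_url = "data:image/png;base64," + base64.b64encode(fake_png).decode("ascii")

fingerprint = combined_fingerprint({
    "canvas": canvas_hash(data_url),
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "fonts": "Arial,DejaVu Sans",
})
print(fingerprint)
```

Sorting the keys before hashing is a design choice that keeps the identifier stable even if the attributes are collected in a different order on a later visit.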
Most of the scripts used have a similar form, and such a fingerprinting library has been open sourced (fingerprintjs) [2].
WebGL is similar to Canvas and allows fingerprinting through a 3D surface.
AudioContext
The Web Audio API provides an interface for building audio processing pipelines. By connecting audio nodes, a script can generate an audio signal and apply operations such as compression or filtering to obtain a specific output. Scripts that process audio signals can be used for fingerprinting because, as with canvas fingerprinting, the output differs from device to device depending on the software and hardware of the environment. Relatively few websites use this technique.
Browser Extensions
Modern browsers have offered user customization since the beginning, and one way is through browser extensions. Examples include adding small, homemade add-ons or widely used ad blockers. It is difficult to extract a list of all browser extensions due to lack of API support, but some are easily accessible depending on how the add-ons are integrated into the browser.
According to Sjösten, it is possible to determine whether an extension is installed or not by accessing a specific URL. For example, in order to display the extension's logo, the browser needs to know where the logo is stored on the device and fetches it in the form of "extension://extensionID/pathToFile". However, because these resources can be accessed from any web page context, this mechanism can be exploited by scripts to determine whether a particular extension is installed.
A second line of research observes that an installed extension modifies the DOM in a characteristic way when it runs, so a script can infer which extensions are installed from the DOM changes it sees.
A third study uses a timing side-channel attack: it sends requests to both fake and existing extensions and measures the difference in response time. This method is reported to detect any browser extension.
The final study (on extension "bloat") shows that an extension's presence can be revealed by unnecessary behavior in its own logic, such as injecting empty placeholders, injecting script or style tags, or sending messages to the page. Of the 58,034 extensions in the Google Chrome store, 5.7% had this kind of bloat, and 61% of those could be uniquely identified.
JavaScript Standards Conformance
Mulazzani showed a way to reliably identify a browser through its JavaScript engine. The study tested how well each engine conforms to the JavaScript standard and which features it supports. From these tests, the authors derived a minimal set of test cases that uniquely identifies each browser and version combination, and with it they could tell JavaScript engines apart even when they differed by only one version.
Additionally, similar research was conducted through analysis of the mutability of navigator and screen objects and information at the OS or architecture level.
CSS Querying, Font metrics
Unger tested CSS properties that exist only in specific browsers and showed that the browser family can easily be determined from vendor-specific CSS property prefixes.
In addition, character glyphs can be used for fingerprinting, because the same character is rendered with a different bounding box depending on the browser or device.
Benchmarking
Another way to obtain information about a device is to benchmark its CPU and GPU. A script runs a series of tasks in JavaScript and measures how long they take to complete. The hardest part of benchmarking, however, is interpreting differences and volatility correctly: depending on momentary resource usage, results can vary significantly even on the same device.
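The measurement idea can be sketched in a few lines. Python stands in here for the JavaScript a real fingerprinting script would use, and the workload, the run count, and the use of the median to dampen volatility are all assumptions chosen for illustration.

```python
import statistics
import time


def workload(n: int = 50_000) -> int:
    """A fixed CPU-bound task: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def benchmark(runs: int = 7) -> float:
    """Time the workload several times and return the median in seconds.

    The median is less sensitive than the mean to a single run that was
    slowed down by momentary load on the machine.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


median_seconds = benchmark()
print(f"median run time: {median_seconds * 1000:.2f} ms")
```

Even with the median, two runs on the same machine will rarely agree exactly, which is the interpretation problem described above.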
Battery Status
The "Battery Status" specification describes an "API that provides information about the battery status of the hosting device." The API consists of the BatteryManager interface, which reports whether the device is charging, along with information such as the charge level and charging time. The API was intended to help web developers build power-efficient applications.
Despite these intentions, the Battery API is open to abuse. The charge level can serve as a short-term identifier, and repeated readings can reveal the battery capacity as well. In response, many browser vendors are removing the API or hiding the information.
Fingerprinting defense techniques
The purpose of fingerprinting defenses is to improve user privacy by preventing unwanted tracking. No approach is perfect, however; most strike a balance between limiting the functionality of a modern web browser and increasing privacy.
Increasing device diversity
Changing fingerprint content
The first way to prevent browser fingerprinting is to increase device diversity and hide the real fingerprint behind noise. The idea rests on the fact that a third party relies on fingerprint stability to map a fingerprint to a single device. If random or preset values are sent instead of the real ones, the collected fingerprints differ from visit to visit and are inconsistent, which makes it hard to identify the device on the web.
The inconsistency problem
This method is relatively powerful in a research setting, but the results are somewhat different in the real world (Paradox of Fingerprintable Privacy Enhancing Technologies). Instead of enhancing user privacy, certain tools make fingerprinting easier by making the fingerprint more distinctive. These tools are spoofers and switchers, available as extensions for Chrome and Firefox, that change the values a fingerprinting script collects. The best known is Random Agent Spoofer for Firefox, which offers "complete browser profiles (from real browsers / devices) at a user defined time interval".
Replacing one value with another seems reasonable at first glance, but it is not recommended: browsers evolve constantly and many of their properties are tightly linked, so accidental mismatches between attributes can occur (for example, the User-Agent reports Linux while the navigator.platform property reports Windows).
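A tracker can spot this kind of mismatch with very little logic. The Python sketch below compares the operating system claimed in the User-Agent with the one implied by navigator.platform. The function names and the simple substring rules are assumptions for illustration; a real detector would cross-check many more attributes.

```python
def os_from_user_agent(user_agent: str) -> str:
    """Guess the OS family claimed by a User-Agent string."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    if "macintosh" in ua or "mac os x" in ua:
        return "mac"
    if "linux" in ua or "x11" in ua:
        return "linux"
    return "unknown"


def os_from_platform(platform: str) -> str:
    """Guess the OS family from a navigator.platform value such as 'Win32'."""
    value = platform.lower()
    if value.startswith("win"):
        return "windows"
    if value.startswith("mac"):
        return "mac"
    if value.startswith("linux"):
        return "linux"
    return "unknown"


def is_inconsistent(user_agent: str, platform: str) -> bool:
    """True when both values are recognised and name different OS families."""
    claimed = os_from_user_agent(user_agent)
    actual = os_from_platform(platform)
    return "unknown" not in (claimed, actual) and claimed != actual


linux_ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
print(is_inconsistent(linux_ua, "Win32"))         # True: spoofed User-Agent
print(is_inconsistent(linux_ua, "Linux x86_64"))  # False: values agree
```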
Replacing the values of attributes
Several approaches exist. One defines separate profiles for the attributes that trackers use and presents those instead of the real values. Another scores fingerprinting intent and, once the score exceeds a threshold, runs fingerprinting defense code that applies a policy for changing specific attribute values. A third stores real attribute values for each software component (OS, browser, plugins, and so on) and serves randomly assembled combinations of components; this is more stable because it avoids the inconsistency problem described above.
Noise Injection
Most fingerprint attributes are plain strings, but the output of APIs such as Canvas or AudioContext has a more complex data structure. Instead of replacing such values with preset ones, noise can be injected into the API's processing. With noise injected, Canvas or AudioContext test results change slightly from run to run.
This can be implemented by making a script see a different value whenever it reads a property, or by modifying the Chromium or Firefox source code to change the API itself.
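The effect of noise injection on a pixel buffer can be sketched in Python. Flipping the least significant bit of a few bytes is one possible noise scheme, chosen here as an assumption for illustration; the byte string stands in for the RGBA data a canvas would return, and add_noise is a hypothetical helper name.

```python
import hashlib
import random


def add_noise(pixels: bytes, rng: random.Random, count: int = 4) -> bytes:
    """Flip the least significant bit of `count` randomly chosen bytes.

    Each touched byte changes by exactly 1, which is invisible to the eye
    but changes any hash computed over the buffer.
    """
    noisy = bytearray(pixels)
    for index in rng.sample(range(len(noisy)), k=min(count, len(noisy))):
        noisy[index] ^= 1
    return bytes(noisy)


pixels = bytes(range(64))  # stand-in for canvas pixel data
print(hashlib.sha256(pixels).hexdigest()[:16])  # hash without noise

# Each read gets fresh noise, so the hashes usually differ between reads.
for _ in range(2):
    noisy = add_noise(pixels, random.Random())
    print(hashlib.sha256(noisy).hexdigest()[:16])
```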
Challenges of changing fingerprint content
Changing fingerprint values can increase diversity, but in most cases a determined tracker can still see through it. Attributes are tightly coupled, and changing them carelessly can cause the browser to malfunction or leave mismatches between attributes that make the user identifiable. Changing attribute values does not necessarily make a user more identifiable, but modern web browsers are complex and constantly evolving, so in practice there is almost always some small detail that can defeat such defenses.
Switching browsers
Because most device fingerprints are made up of browser-specific information, using two different browsers makes a single device look like two. Third parties then receive two separate browser profiles, which makes tracking harder.
The method is simple, but studies do not clearly show how effective it is in practice, and some third parties can still identify the device through information from the OS or hardware layer.
Submit a homogeneous fingerprint
Another defense is to make every device on the web present the same fingerprint. This is the approach taken by Tor Browser, also known as the Tor Browser Bundle (TBB), which runs over the Tor network.
Tor Browser
In theory
Although the Tor network prevents attackers from discovering the client's real IP address, it does not change the content of HTTP requests, so a user can still be identified if a cookie ID or browser fingerprint is present in the payload. To prevent this, Tor Browser is designed to meet a set of requirements, one of which, Cross-Origin Fingerprinting Unlinkability, specifically targets browser fingerprinting. Although randomization can also be effective against fingerprinting, the Tor developers chose a strategy in which all Tor users share a single fingerprint.
The design document introduces 24 different modifications to the Tor browser, the main ones being blocking the Canvas and WebGL APIs, removing plugins, and including a default font bundle to prevent font enumeration. Regardless of whether you use Windows, Mac, or Linux, the Tor browser will show that your device is Windows.
In practice
Although Tor Browser is considered one of the strongest defenses against browser fingerprinting, it has several drawbacks. First, the fingerprint it presents is very well known. The User-Agent, screen resolution, and IP addresses of known Tor exit nodes are enough to distinguish Tor Browser from other browsers. This does not reveal who the user is, but it has a significant impact on the browsing experience: one study found that 3.67% of Alexa's top 1,000 sites either block Tor users altogether or offer them limited service.
The second problem is that differences can still arise between Tor Browser users in areas such as screen resolution. For a long time, Tor Browser displayed a warning that the user could become identifiable whenever they tried to change the window size. Likewise, a user with an unusual screen resolution may become identifiable.
The third problem is identification through OS-level information. The design document acknowledges this as well: "We tried to eliminate it as much as possible, but OS-level defense is not a priority."
Lastly, Tor Browser is safe by default because many users share the same fingerprint, but if any customization introduces even a slightly distinctive attribute, the user stands out noticeably.
UniGL
Research has shown that the WebGL API produces different results across devices when rendering complex 3D scenes because of differences in floating-point operations in the system's graphics layer. To make 3D rendering consistent, the researchers built software called UniGL, which explicitly or implicitly redefines the floating-point operations written in GLSL programs. As a result, a given rendering task produces the same WebGL fingerprint on every device.
Decreasing the surface of browser APIs
The final method is to reduce the amount of information a tracking script can collect by lowering the browser API footprint. One way is to disable plugins so that no additional fingerprint vectors, such as Flash or Silverlight, are present.
Other methods include blocking JavaScript execution altogether, using ad blockers (Adblock Plus, Ghostery, uBlock Origin, Disconnect), and disabling browser features to prevent specific APIs from being used. However, these methods may also limit the functionality of the browser.
Reference
[1] https://en.wikipedia.org/wiki/Device_fingerprint
[2] www.ftc.gov/system/files/documents/public_comments/2015/10/00064-98109.pdf
[3] Pixel Perfect: Fingerprinting Canvas in HTML5
[4] Browser Fingerprinting: A survey
[5] FP-Block : usable web privacy by controlling browser fingerprinting