Cookie monsters

Published on by Arnau Siches


Where I analyse how cookies are littering the world. For this exercise I'll be using Nushell to wrangle Firefox profile data enriched with data coming from the Multi-Account Containers add-on.

Let me start by defining two terms I will be using during the analysis:

  • cookie littering : The act, done by a website, of storing unsolicited information such as browser cookies or any other form of Web storage. This typically happens when doing some casual browsing: landing on a webpage and making the effort to reject all “non-essential cookies”. This includes the most pernicious websites that don't ask for consent at all, store unsolicited information first and then let you know that consent is assumed if you use their website.
  • casual browsing : The act of landing on a website after following a link or putting a URL directly in the browser navigation bar. This excludes recurrent use of a website where you have an account with them.

# Exploring how Firefox stores cookies

Let's start with a glance at what kind of files are stored in the profile. I'm excluding some temporary and backup files to reduce noise.

ls **/*
| where type == 'file'
| get name
| group-by { path parse | get extension }
| transpose extension value
| where extension !~ '-(wal|shm)$' and extension !~ '(jsonlz4-.+|baklz4)$'
| insert count { |row| $row.value | length }
| sort-by -r count
| reject value
│  # │ extension │ count │
│  0 │ final     │  1693 │
│  1 │ sqlite    │   216 │
│  2 │ jsonlz4   │   187 │
│  3 │           │    91 │
│  4 │ json      │    46 │
│  5 │ txt       │    44 │
│  6 │ xpi       │    10 │
│  7 │ bin       │     3 │
│  8 │ db        │     3 │
│  9 │ dylib     │     2 │
│ 10 │ mozlz4    │     2 │
│ 11 │ marker    │     1 │
│ 12 │ js        │     1 │
│ 13 │ sig       │     1 │
│ 14 │ info      │     1 │
│ 15 │ ini       │     1 │
│ 16 │ lz4       │     1 │

There is a lot in a profile; however, this analysis only needs a tiny fraction of the files, namely:

  • cookies.sqlite โ€” Where the cookies are stored
  • storage/**/ls/data.sqlite โ€” Where each local storage data is stored

My current session has 77 databases matching storage/**/ls/data.sqlite.
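As a cross-check of that number, here's a small Python sketch (the post's pipelines are Nushell; the profile path argument is an assumption) that counts the matching databases:

```python
from pathlib import Path

def count_ls_databases(profile: Path) -> int:
    # One ls/data.sqlite per site, following the storage/**/ls/data.sqlite
    # layout described above.
    return sum(1 for _ in profile.glob("storage/**/ls/data.sqlite"))
```

Running it against a profile directory should report the same count as the Nushell glob.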

# Multi-Account Containers

Before exploring how the data is laid out, there is a detour that will help filter out non-casual browsing: the Multi-Account Containers add-on.

The Multi-Account Containers add-on lets you segregate cookies by container. Each container can be configured with an allowed list of domains so trusted and untrusted websites can be kept separate, including during casual browsing.

This add-on stores some of its configuration in containers.json which looks something like:

  "identities": [
      {
          "userContextId": 6,
          "name": "github"
      },
      {
          "userContextId": 10,
          "name": "google"
      }
  ]

Any unknown website will go into the default container which has no userContextId. This fact is what will help identify and filter out non-casual browsing.
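Reading that mapping back is straightforward. A minimal Python sketch, assuming containers.json has the shape shown above (the guard for entries without a plain name is an assumption about built-in containers):

```python
import json

def container_names(path: str) -> dict[int, str]:
    # Map userContextId -> container name from containers.json.
    with open(path) as f:
        config = json.load(f)
    return {
        identity["userContextId"]: identity["name"]
        for identity in config["identities"]
        if "name" in identity  # built-in containers may lack a plain name
    }
```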

The cookies.sqlite database contains one table defined as:

CREATE TABLE moz_cookies (
  originAttributes TEXT NOT NULL DEFAULT '',
  name TEXT,
  value TEXT,
  host TEXT,
  path TEXT,
  expiry INTEGER,
  lastAccessed INTEGER,
  creationTime INTEGER,
  isSecure INTEGER,
  isHttpOnly INTEGER,
  inBrowserElement INTEGER DEFAULT 0,
  rawSameSite INTEGER DEFAULT 0,
  schemeMap INTEGER DEFAULT 0,
  CONSTRAINT moz_uniqueid UNIQUE (name, host, path, originAttributes)
);

This table keeps the data you would expect from a cookie but it also tracks which container a cookie belongs to in the originAttributes field.

open temp/cookies.sqlite
| query db "select host, originAttributes, name from moz_cookies"
| where host =~ "youtube"
| first
│ host             │                   │
│ originAttributes │ ^userContextId=10 │
│ name             │ CONSENT           │

This record shows that the host belongs to container 10. A quick lookup in containers.json shows that 10 is the identifier of the container I use for Google.

Websites not bound to a container have no userContextId. The following query shows the main filter I'll use to exclude non-casual browsing cookies.

open temp/cookies.sqlite
| query db "select host, originAttributes, name from moz_cookies"
| where originAttributes !~ "userContextId"
| first
│ host             │                                                     │
│ originAttributes │                                                     │
│ name             │ dd_cookie_test_4d98f155-3011-4887-8587-ee4c70b8765e │

The last bit worth knowing about the cookie database is that the originAttributes field can also carry a partitionKey, either alone or combined with a userContextId, which represents the domain of a 3rd-party cookie.
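Judging by the values seen so far, originAttributes looks like a '^'-prefixed, '&'-separated list of key=value pairs. A hypothetical Python helper built on that assumption (the exact grammar is not verified against the Firefox source):

```python
def parse_origin_attributes(suffix: str) -> dict[str, str]:
    # '' is the default container; '^userContextId=10&partitionKey=...'
    # is a containered and/or partitioned cookie.
    if not suffix.startswith("^"):
        return {}
    return dict(
        part.split("=", 1)
        for part in suffix[1:].split("&")
        if "=" in part  # tolerate a trailing '&'
    )

def is_casual(suffix: str) -> bool:
    # Casual browsing = default container, i.e. no userContextId.
    return "userContextId" not in parse_origin_attributes(suffix)
```

The partitionKey value in the test case below is illustrative only.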

For example, a 3rd party cookie for YouTube (container 10) would look like:

open temp/cookies.sqlite
| query db "select * from moz_cookies"
| select host originAttributes name
| where originAttributes =~ "userContextId=10" and originAttributes =~ "partitionKey"
| first
│ host             │                    │
│ originAttributes │ ^userContextId=10& │
│ name             │ CONSENT            │

# The Local Storage database

Not strictly cookies but definitely part of cookie littering. These are slightly more involved because storage is segregated per domain. The filesystem structure looks like:

│  └── ls
│     └── data.sqlite
   └── ls
      └── data.sqlite

Within each data.sqlite database there is a data table:

CREATE TABLE data (
  key TEXT PRIMARY KEY,
  utf16_length INTEGER NOT NULL,
  conversion_type INTEGER NOT NULL,
  compression_type INTEGER NOT NULL,
  last_access_time INTEGER NOT NULL DEFAULT 0,
  value BLOB NOT NULL
);

key, value and last_access_time are self-explanatory but the other three require a bit of digging. Based on my partial understanding after checking gecko-dev, these fields mean the following:

  • utf16_length is the combined byte length of key and value. We don't need it.
  • conversion_type is either 0 or 1 and 1 means the data is encoded in UTF-8. This is my educated guess given that all my data is of type 1 and I've been able to decode it as UTF-8.
  • compression_type signals whether the value is compressed.
    • 0 means uncompressed.
    • 1 means compressed with snappy.
    • 2 is CompressionType::NUM_TYPES. I don't have any data with it; it seems to represent the number of types in the CompressionType enum so it should be safe to ignore. Thanks Karl!
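Putting those fields together, decoding a single value blob could look like the following Python sketch. The UTF-8 reading of conversion_type 1 and the snappy reading of compression_type 1 are the same guesses as above, and the third-party python-snappy package stands in for szip:

```python
def decode_ls_value(value: bytes, conversion_type: int, compression_type: int) -> str:
    if compression_type == 1:
        import snappy  # third-party python-snappy, standing in for `szip -d --raw`
        value = snappy.decompress(value)
    elif compression_type != 0:
        # 2 (CompressionType::NUM_TYPES) should never appear in real rows.
        raise ValueError(f"unexpected compression_type {compression_type}")
    if conversion_type != 1:
        # Only UTF-8 (type 1) appears in my data; anything else is unknown.
        raise ValueError(f"unexpected conversion_type {conversion_type}")
    return value.decode("utf-8")
```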

To have a peek at the data stored in the value blob I'm using szip, a CLI similar to gzip that uses the Snappy algorithm.

glob storage/default/**/data.sqlite
| wrap path
| insert data { |row|
    open $row.path
    | query db "select key, value, compression_type from data"
  }
| flatten --all
| where compression_type != 2
| update value { |row|
    if ($row.value | is-empty) {
      null
    } else {
      match $row.compression_type {
        0 => { $row.value | decode utf-8 },
        1 => { $row.value | ^szip -d --raw },
      }
    }
  }
| explore -i

# Analysis

Armed with this we can start the analysis of casual browsing littering. This analysis is minimal given that I used data stored after a couple of hours of casual browsing. I plan to collect snapshots of this data regularly so I can do a more meaningful analysis in a few months' time.

To ensure I don't mess up my Firefox profile, I copied the relevant SQLite files into a temp/ directory, flattening the structure of local storage a bit.

cp cookies.sqlite $temp

glob storage/default/**/ls/data.sqlite
| where $it !~ "moz-extension"
| each { |path|
    let name = ($path | path dirname -n 2 | path basename)
    cp $path $"($temp)/storage/($name).sqlite"
  }

First, let's get a feel for the size of the data. I want to focus on casual browsing so I'll exclude any entry with a userContextId.

open temp/cookies.sqlite
| query db "select count(1) as cookies from moz_cookies where originAttributes not like '%userContextId%'"
| insert local_storage (
    ls temp/storage/
    | where name !~ "userContextId"
    | reduce -f 0 { |row, acc|
        open $row.name
        | query db "select count(1) as count from data"
        | get 0.count | $in + $acc
      }
  )
│ # │ cookies │ local_storage │
│ 0 │      55 │            93 │

With the queries below, we can see that more than half of the websites store just one cookie. Assuming all of them need that single cookie to capture consent, the majority of websites require no cookies at all to operate properly. For some value of “operate properly”.

open temp/cookies.sqlite
| query db "select host, name, originAttributes from moz_cookies"
| where originAttributes !~ "userContextId"
| group-by { $in.host | str replace -r '^(?:\.|www\.)' '' }
| transpose host value
| update value { |row| $row.value | length }
| sort-by value
| histogram value
│ # │ value │ count │ quantile │ percentage │                           frequency                            │
│ 0 │     1 │    18 │     0.62 │ 62.07%     │ ************************************************************** │
│ 1 │     2 │     5 │     0.17 │ 17.24%     │ *****************                                              │
│ 2 │     3 │     3 │     0.10 │ 10.34%     │ **********                                                     │
│ 3 │     5 │     1 │     0.03 │ 3.45%      │ ***                                                            │
│ 4 │     6 │     1 │     0.03 │ 3.45%      │ ***                                                            │
│ 5 │     7 │     1 │     0.03 │ 3.45%      │ ***                                                            │

value in the table above means "number of cookies per host". count is the number of hosts with that number of cookies. Note that I haven't done any reconciliation to merge hosts that belong to the same domain. For example, and are counted separately. Properly aggregating would probably yield a slightly more concerning picture.
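If I were to do that reconciliation, a naive Python sketch could look like this. The last-two-labels rule is a simplification: a proper version would consult the Public Suffix List (e.g. via tldextract) so hosts under suffixes like co.uk aren't merged incorrectly:

```python
import re
from collections import Counter

def normalize_host(host: str) -> str:
    # Same cleanup as the pipeline: drop a leading '.' or 'www.'.
    return re.sub(r"^(?:\.|www\.)", "", host)

def registrable_domain(host: str) -> str:
    # Naive "same domain" key: keep the last two labels only.
    return ".".join(normalize_host(host).split(".")[-2:])

def cookies_per_domain(hosts: list[str]) -> Counter:
    return Counter(registrable_domain(h) for h in hosts)
```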

The histogram clearly shows cookie littering. Most websites I visit casually are to read an article they have published or to check their offering, so it's hard to justify any so-called “required cookies” beyond a single consent one.

ls temp/storage/
| where name !~ "userContextId"
| each { |row| open $row.name | query db "select count(key) as count from data" | get 0.count }
| sort
| histogram
│ # │ value │ count │ quantile │ percentage │                   frequency                   │
│ 0 │     1 │    16 │     0.46 │ 45.71%     │ ********************************************* │
│ 1 │     2 │     8 │     0.23 │ 22.86%     │ **********************                        │
│ 2 │     4 │     3 │     0.09 │ 8.57%      │ ********                                      │
│ 3 │     3 │     2 │     0.06 │ 5.71%      │ *****                                         │
│ 4 │     6 │     2 │     0.06 │ 5.71%      │ *****                                         │
│ 5 │     5 │     1 │     0.03 │ 2.86%      │ **                                            │
│ 6 │     7 │     1 │     0.03 │ 2.86%      │ **                                            │
│ 7 │     8 │     1 │     0.03 │ 2.86%      │ **                                            │
│ 8 │    11 │     1 │     0.03 │ 2.86%      │ **                                            │

value in the table above means “number of local storage entries per domain”. count is the number of domains with the same number of local storage entries.

This whole table is hard to justify. Local storage for casual browsing is unnecessary and wasteful in most circumstances.

open temp/cookies.sqlite
| query db "select host, name, originAttributes from moz_cookies"
| where originAttributes !~ "userContextId"
| group-by { $in.host | str replace -r '^(?:\.|www\.)' '' }
| transpose host value
| insert count { |row| $row.value | length }
| reject value
| sort-by -r count
| first 5
│ # │         host         │ count │
│ 0 │                      │     7 │
│ 1 │                      │     7 │
│ 2 │                      │     4 │
│ 3 │                      │     3 │
│ 4 │                      │     3 │

# Top local storage litterers

With the following we can narrow down the websites storing more than 10KB for no apparent good reason.

ls temp/storage/
| reject type modified
| where name !~ "userContextId"
| insert count { |row|
    open $row.name
    | query db "select count(key) as count from data"
    | get 0.count
  }
| update name { |row| $row.name | path basename }
| sort-by -r count
| where size >= 10KB
│ # │                 name                 │   size   │ count │
│ 0 │                                      │ 405.5 KB │    11 │
│ 1 │                                      │  18.4 KB │     6 │
│ 2 │                                      │ 325.6 KB │     4 │

Eventbrite took me by surprise with these 405.5KB. I was expecting Medium to be at the very top but 18.4KB can't compare.

Now, Wikipedia storing 325.6 KB is both surprising and concerning, particularly because most of it seems to be chunks of JavaScript, as if they were using local storage as a cache. A cache with no expiry date, that is.

# Closing thoughts

The world of cookies, and by extension local storage, is a mess. Regulations like Directive (EU) 2018/1972 of the European Parliament and of the Council (the European Electronic Communications Code) claim to have improved the situation but it's frankly hard to see. We have lots of littering and on top of that we have to suffer a dreadful variety of cookie walls.

By the way, if you follow the EU directive link you'll observe that after you have rejected all non-essential cookies they still put 11 cookies in your browser. 11 cookies for dropping into a website you might not even want to navigate feels like abuse to me.