Privacy Respecting Analytics

Surely everyone has heard the term “Web(site) Analytics” or a similar variation by now. In today’s world we run into analytics techniques with every move we make online, be it on social media, news websites, your favourite search engine, your e-mail provider or, in some instances, even your employer. [1]

But let’s take a step back and ask not only what (web) analytics is, but also what its implications are and how much a single party can learn from it about every action we take and every desire we feel. And of course we should ask ourselves: Is this really necessary? Aren’t there better, privacy-friendly ways to achieve our goals?

What is Web-Analytics?

The Web Analytics Association (WAA) defines Web Analytics briefly as:

“the measurement, collection, analysis and reporting of web data to understand and optimize web usage.” [2]

While this definition gives a broad understanding of what web analytics entails, it does not clearly define which methods can and should be used, which tools are required and feasible, what the target unit of such analysis may be, which steps are to be taken to preserve the privacy of those being analysed (if that is desired at all…) and what “web data” is in the first place. Furthermore, beyond the generic goal to “understand and optimize web usage”, the WAA’s definition is missing a statement about the concrete purpose of such web analytics. [3, 4]

However, such considerations are crucial, as they have a vast impact on the user experience, on users' self-determination and, of course, on the effectiveness of the analytics a website operator runs.

Tracking Traffic versus Tracking People

Internet marketing experts traditionally try to gain knowledge about traffic on, and interaction with, a website or web application by analysing visitors' behaviour while on the site. Although this approach has been the standard for several decades, it has multiple issues that corporate marketing teams often disregard without deeper consideration, while also furthering the effect of “digital incapacitation”, as Mühlhoff puts it. [5, 6]

Although such behavioural analysis appears to be reasonably effective by most marketing measures (mostly established by advertising companies themselves), it is important to acknowledge that human behaviour and thinking are relatively complex, hard to distil into a theoretical or mathematical model, and influenced by a number of factors such as culture, age, sex, socio-economic background and many more. One way or another, regardless of how much data is gathered, human behaviour can only be explained or predicted to a certain extent. [3, 5, 7]

While it can make sense to have some information about overall traffic as well as one’s true (as opposed to targeted) audience, such information is usually greatly overvalued, particularly at the fine granularity that many web analytics tools offer. Information used for strategic decisions can more often than not be inferred from data aggregates. Ironically, collected micro-level information is often condensed and processed into summaries anyway, because these are easier to understand for decision makers in organisations or companies. [8]

What a website operator is often actually interested in are traffic measures, as well as some additional information about the broad, non-personal demographics of visitors. It is usually completely irrelevant to know the characteristics of each and every visitor individually. Instead, it is more important how many (or what share of) visitors come from country X, how many use the web client (browser) Y, how many look at a specific sub-page of the website, and so on. In other words: tables of characteristics that are independent of one another. [9]

To make it more comprehensible what is meant by “tracking traffic instead of tracking people”, consider running a website where seven unique visitors browse one or more pages at a given point in time. You could either track what each unique visitor is doing on your site (tracking people) or you can set up characteristics counters for the specific information you are interested in (tracking traffic).

Either way, the key information stays the same. However, while “tracking people” maps out the detailed profile and movement of each unique visitor on a site, thereby collecting substantial amounts of behavioural information, “tracking traffic” can be used to infer only distributions of characteristics.

From such aggregates generated by “tracking traffic”, a skilled analyst (who, in any case, is required to translate raw data into marketing strategy proposals) is still able to infer important and detailed information, without infringing on people’s privacy or, even worse, disclosing their identity.
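To make the difference a bit more tangible, here is a minimal sketch in Python. The seven visitors, their identifiers and all field names are made up for illustration; a real setup would receive these values from incoming requests. The first function builds a profile per visitor, the second one only increments independent counters and never stores who caused a page view.

```python
from collections import Counter, defaultdict

# Hypothetical page views of seven visitors: (visitor_id, country, browser, page)
HITS = [
    ("v1", "DE", "Firefox", "/"),
    ("v1", "DE", "Firefox", "/pricing"),
    ("v2", "FR", "Chrome", "/"),
    ("v3", "DE", "Safari", "/blog"),
    ("v4", "US", "Chrome", "/"),
    ("v5", "DE", "Firefox", "/blog"),
    ("v6", "FR", "Firefox", "/pricing"),
    ("v7", "US", "Chrome", "/"),
]


def track_people(hits):
    """Tracking people: one profile per visitor, listing everything they did."""
    profiles = defaultdict(list)
    for visitor_id, country, browser, page in hits:
        profiles[visitor_id].append((country, browser, page))
    return dict(profiles)


def track_traffic(hits):
    """Tracking traffic: independent page-view counters, no visitor identifier kept."""
    counters = {"country": Counter(), "browser": Counter(), "page": Counter()}
    for _visitor_id, country, browser, page in hits:
        counters["country"][country] += 1
        counters["browser"][browser] += 1
        counters["page"][page] += 1
    return counters


if __name__ == "__main__":
    print(track_traffic(HITS)["country"])
```

Note that the three counters cannot be joined back together: from the result alone nobody can tell which browser the visitors of a given page used, which is exactly the point.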

Ultimately, the topic of micro-level information versus aggregated information is closely related to the issue of Data Minimalism. Rather than collecting as much information as possible and trying to see what can be done with the data afterwards, an effective and privacy-respecting data collection procedure should be planned “from the back to the front”. That means that one should think about which information is needed and how it can be obtained, while evaluating a potential loss of users' privacy versus actual gains for one’s marketing and product strategy.

First-Party versus Third-Party tracking

Especially with Open Source solutions for web tracking, a website operator at some point has to decide whether the service is to be run by themselves or by a third party (a so-called provider). While the software stack in use typically remains identical, such a decision can still have a direct impact on users' privacy, on the effectiveness of the data collection (due to ad blockers and similar anti-tracking software) and on the maintenance and oversight of data storage and information transfer.

For instance, using a third-party data processor can be problematic under the European Union’s General Data Protection Regulation (GDPR) with regard to accountability, users' consent and compliance with automated and explicit requests to delete personal data held by the provider. [10, 11]

In contrast, when self-hosting the tracking solution (first-party), the website operator has full control over what is collected, how and where the data is stored and how long it is kept. For instance, a website operator acting on the principle of Data Minimalism is able to directly control how much (or how little) data they collect from visitors. Similarly, they can decide to honour a visitor’s wish not to be tracked at all, if the visitor asks for it. Considering the EU’s GDPR or similar privacy regulations around the world, a website operator might decide not to collect any Personally Identifiable Information (PII) whatsoever and not to set any tracking cookies, in order to further strengthen visitors' right to privacy (and avoid the need for annoying consent banners). [12, 13, 14, 15]
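As a rough sketch of what honouring such a wish could look like on a self-hosted setup: browsers can send the `DNT: 1` request header (Do Not Track), and some also send `Sec-GPC: 1` (Global Privacy Control). The function names and the plain dictionary used as a counter store below are our own assumptions for illustration and not part of any particular analytics product.

```python
def visitor_opted_out(headers):
    """Return True if the request carries a do-not-track style signal.

    `headers` is assumed to be a plain mapping of header names to values.
    """
    # Header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("dnt") == "1" or normalised.get("sec-gpc") == "1"


def record_page_view(headers, page, counters):
    """Count a page view unless the visitor asked not to be tracked.

    Returns True if the view was counted, False if it was skipped.
    """
    if visitor_opted_out(headers):
        return False
    counters[page] = counters.get(page, 0) + 1
    return True


if __name__ == "__main__":
    counters = {}
    record_page_view({"DNT": "1"}, "/", counters)            # skipped
    record_page_view({"User-Agent": "demo"}, "/", counters)  # counted
    print(counters)
```

Because the check happens before anything is written, a visitor who opts out leaves no trace in the statistics at all.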

Of course, for most proprietary solutions, a website operator does not have the option to self-host the analytics software on their own infrastructure. Furthermore, they have to trust their analytics provider not only with “their” metrics (which, particularly for digital businesses, can be used to infer their market value), but also with their visitors' data, i.e. they decide on the privacy of others. Using third-party content or services always implies that the third party could have access to any information about your visitors, while the website operator can hardly limit what information is and is not shared. [16]

Centralised Analytics in the scheme of Surveillance Capitalism

A point worth noting is the interaction between third-party web analytics and Digital Monopolies. By leveraging their market position and the (often hyped-up) demand and ever-growing “hunger” of so-called digital businesses for (user-)data-driven applications and services, globally acting advertising companies have succeeded in creating a state where close to the entire web uses and relies on their analytics services. [17]

Not only does this mean that this handful of advertising companies makes it hard for smaller competitors to succeed or push for innovation (particularly regarding users' privacy). These actors also have access to vast amounts of user data from all over the world and from a broad spectrum of people, learning about individuals' behaviour, desires and living situations, which in turn helps them increase the revenue of their actual business: micro-targeted advertising. [17, 18]

The effects of “Surveillance Capitalism” on this scale could easily be seen in several direct or indirect attempts (and successes) to manipulate democratic procedures and political opinion formation, for example before, during and even after the United Kingdom’s Brexit referendum or the 2016 presidential election in the United States of America. Both were accompanied by micro-targeted political campaigning, disinformation campaigns and the involvement of, among others, the political consulting company Cambridge Analytica.

It is easy to see that the amount and granularity of user data and knowledge about people’s behaviour and wishes lead to immense political and economic power as well as influence on public opinion. Decentralising the information on visitors and user data is, next to strict and precise rules on (online) privacy, an important first step away from the growing threat of market domination and Digital Monopolies.

Establishing guidelines for “Ethical Analytics”

The question now is: “Can we do better?” And of course the answer is “Yes!”. There are few precise guidelines for “ethical” or “privacy-respecting” web analytics out there; however, there are some hints and projects that implicitly follow such guidelines. So here is one attempt to build such guidelines, and also an invitation for others to chime in and contribute. [19, 20, 21]

TL;DR:

1. Is analytics required?

Firstly, as mentioned before, a website operator should consider whether they need any analytics in the first place. Additionally, they should define which problem they want to solve and what information is actually needed for that. Is user information required, or is it unnecessary?

To give a real-world example from our own considerations: scaling a cloud platform to a large number of users requires a lot of information about the utilised capacity of the server and network infrastructure. However, to gather this information, user tracking within our services is really unnecessary (and could even be misleading). Instead, we can directly measure server and service workloads (without knowing or caring who is producing them) and scale the infrastructure accordingly without ever tracking what anyone is doing.
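A scaling decision of this kind can be sketched with nothing but aggregate workload numbers. The following is a simplified illustration and not our actual tooling: the function name, the 60 % target and the proportional rule are assumptions chosen for the example. The input is a list of CPU utilisation samples in percent, with no information about who caused the load.

```python
import math
from statistics import mean


def desired_instances(current_instances, cpu_percent_samples, target_percent=60):
    """Suggest an instance count from aggregate CPU utilisation alone.

    Uses a simple proportional rule: if the average load is above the
    target, add capacity; if it is below, remove capacity (never below 1).
    """
    average = mean(cpu_percent_samples)
    return max(1, math.ceil(current_instances * average / target_percent))


if __name__ == "__main__":
    # Four instances running at roughly 90 % CPU: suggest scaling up.
    print(desired_instances(4, [90, 90, 90]))
```

The only inputs are machine-level measurements, so no visitor or user data ever enters the decision.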

2. Track Traffic, not People

If a website operator comes to the conclusion that they do require analytics, they should opt for tracking traffic instead of tracking users individually. Most, if not all, relevant information can be gathered this way, and decisions can be made on the resulting aggregates without infringing on visitors' privacy. A lot of information can also be gathered from server-log analysis, which also eliminates slowdowns caused by analytics code running in visitors' web browsers.
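As an illustration of server-log analysis, the sketch below reads lines in the widely used Common Log Format and reduces them to page and status counters. The sample lines use documentation-only IP addresses and are invented; the regular expression is a simplification that would need adjusting for other log formats. The client address is matched but deliberately never stored.

```python
import re
from collections import Counter

# Simplified pattern for Common Log Format lines. The client address,
# identity and user fields are matched but not captured.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Invented sample data for illustration only.
SAMPLE_LOG = """\
203.0.113.7 - - [24/Jul/2021:10:00:01 +0000] "GET / HTTP/1.1" 200 5120
203.0.113.7 - - [24/Jul/2021:10:00:09 +0000] "GET /blog HTTP/1.1" 200 20480
198.51.100.23 - - [24/Jul/2021:10:01:30 +0000] "GET / HTTP/1.1" 200 5120
198.51.100.23 - - [24/Jul/2021:10:01:31 +0000] "GET /missing HTTP/1.1" 404 312
"""


def summarise(log_text):
    """Return (page view counter, status code counter) for the given log text."""
    pages, statuses = Counter(), Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        statuses[match["status"]] += 1
        # Only successful GET requests count as page views.
        if match["method"] == "GET" and match["status"] == "200":
            pages[match["path"]] += 1
    return pages, statuses


if __name__ == "__main__":
    pages, statuses = summarise(SAMPLE_LOG)
    print(pages.most_common())
    print(statuses.most_common())
```

Since the web server writes these logs anyway, this kind of analysis needs no extra code in the visitor's browser.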

Also think about clever solutions for A/B-Testing service features or sub-page transitions with counters or “events”, instead of tracking every single movement of individual visitors.
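A counter-based A/B test could look roughly like the following sketch. The class and method names are hypothetical; the idea is simply that each variant gets two counters (how often it was shown, how often the desired event followed), and nothing is recorded per visitor.

```python
from collections import Counter


class ABCounters:
    """Aggregate counters for an A/B test; no per-visitor data is kept."""

    def __init__(self):
        self.exposures = Counter()    # how often each variant was shown
        self.conversions = Counter()  # how often the desired event followed

    def record_exposure(self, variant):
        self.exposures[variant] += 1

    def record_conversion(self, variant):
        self.conversions[variant] += 1

    def conversion_rate(self, variant):
        """Share of exposures that led to a conversion (0.0 if never shown)."""
        shown = self.exposures[variant]
        return self.conversions[variant] / shown if shown else 0.0


if __name__ == "__main__":
    test = ABCounters()
    for _ in range(4):
        test.record_exposure("A")
        test.record_exposure("B")
    test.record_conversion("A")
    test.record_conversion("B")
    test.record_conversion("B")
    print(test.conversion_rate("A"), test.conversion_rate("B"))
```

The two rates are all that is needed to compare the variants; whether a difference is statistically meaningful is a separate question that this sketch does not address.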

Furthermore, and this is a key aspect of any “ethical” tracking procedure, tracking cookies should not be part of a website operator’s analysis tooling. This goes hand in hand with the fourth point of these guidelines, because cross-site tracking is a valid concern, particularly with third-party tracking. There are privacy-respecting ways of counting unique visitors to your website or service that do not involve cookies or omnipresent surveillance.
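One cookie-free approach to counting unique visitors is to hash a few request properties together with a random salt that is thrown away every day. The sketch below is a simplified illustration under our own assumptions (class name, choice of IP address and user agent as inputs, SHA-256, in-memory storage); it is not a description of how any specific tool implements this. Keep in mind that the IP address is still processed for a moment, so this reduces rather than removes the privacy impact.

```python
import hashlib
import secrets


class DailyUniqueCounter:
    """Counts unique visitors per day without cookies or stored identifiers."""

    def __init__(self):
        self._salt = secrets.token_bytes(16)  # random, never written to disk
        self._seen = set()                    # salted hashes seen today

    def observe(self, ip_address, user_agent):
        """Register a request; raw IP address and user agent are not kept."""
        material = self._salt + ip_address.encode() + b"|" + user_agent.encode()
        self._seen.add(hashlib.sha256(material).hexdigest())

    def unique_count(self):
        return len(self._seen)

    def rotate(self):
        """Call once per day: return the count, then discard salt and hashes.

        Without the old salt, yesterday's hashes cannot be linked to today's.
        """
        count = len(self._seen)
        self._salt = secrets.token_bytes(16)
        self._seen.clear()
        return count


if __name__ == "__main__":
    counter = DailyUniqueCounter()
    counter.observe("203.0.113.7", "Firefox")
    counter.observe("203.0.113.7", "Firefox")   # same visitor again
    counter.observe("198.51.100.23", "Chrome")
    print(counter.rotate())
```

Because the salt changes daily, a returning visitor cannot be recognised across days, which is a deliberate trade-off: daily unique counts stay meaningful, long-term profiles become impossible.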

3. Prefer FOSS over proprietary software

There is an abundance of Free/Libre and Open Source analytics solutions out there, so there is hardly a reason to opt for proprietary web analytics. There are many reasons to choose Open Source software in general, some of which we have already discussed in the past:

“Digital Self-Determination”
“Platform Economy & Digital Monopolies”
“Data and information security”
“Free Software - What is it, why is it important?”

Furthermore, using Open Source solutions can build trust, as it implies (and delivers) a certain degree of transparency. Being open about what information you are collecting and why is not only often required by law (e.g. the EU’s GDPR), it also reflects a website operator’s care for visitors' right to privacy and anonymity and how they value users' freedoms in their business strategy.

4. Evaluate Third-Party offerings carefully

Particularly when using Open Source solutions (which you absolutely should), consider hosting the web analytics solution yourself. This prevents third parties from having access to your visitors' information and to your website’s statistics.

Hosting such a solution yourself may not always be feasible, however. Maintaining it requires time and knowledge, and running an outdated or otherwise insecure service may open up your infrastructure to intruders. This is true for any network-attached service, be it a website or something else. In such cases, using a third party can be suitable, particularly if you are not collecting any personally identifiable information, but only traffic aggregates. In such scenarios, choose a trustworthy provider.

What else?

There surely are more ideas out there from people who have been thinking about ethical and privacy-respecting traffic measurement and engagement analysis for years or even decades. So here is a call to action: Please chime in and help us establish a solid definition and guidelines for web service operators to follow.

Sources

[1] Mojeek Team (2021): Time to Ban Surveillance-Based Advertising. Online at mojeek.com (Visited on 2021-07-24)
[2] Web Analytics Association (2008): The Official WAA Definition of Web Analytics. Online at webanalyticsassociation.org (WebArchive)
[3] Jansen, B. J. (2009): Understanding User-Web Interactions via Web Analytics. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1), 1–102. DOI: 10.2200/s00191ed1v01y200904icr006
[4] Zheng, J. & Peltsverger, S. (2015): Web Analytics Overview. In Encyclopedia of Information Science and Technology, Chapter 756, URL: researchgate.net
[5] Mühlhoff, R. (2018): Digitale Entmündigung und User Experience Design. In: Leviathan – Berliner Zeitschrift für Sozialwissenschaft. 46(4), 551-574, DOI: 10.5771/0340-0425-2018-4-551
[6] Miyazaki, A. D. (2008): Online Privacy and the Disclosure of Cookie Use - Effects on Consumer Trust and Anticipated Patronage. Journal of Public Policy & Marketing, 27(1), 19-33. DOI: 10.1509/jppm.27.1.19
[7] Jarvis, P. (2020): Does targeted digital advertising work? Online at usefathom.com (Visited on 2021-07-23)
[8] Saric, M. (2020): How we use web analytics to measure our startup’s progress and make better decisions. Online at plausible.io (Visited on 2021-07-23)
[9] Taht, U. (2018): The analytics tool I want. Online at plausible.io (Visited on 2021-07-23)
[10] European Parliament (2016): REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. Online at europa.eu (PDF) (Visited on 2021-07-23)
[11] Ford, N. (2020): GDPR - Third party data processors' responsibilities. Online at itgovernance.eu (Visited on 2021-07-23)
[12] Electronic Frontier Foundation: Do Not Track. Online at eff.org (Visited on 2021-07-23)
[13] Matomo Team (2017): The new GDPR data protection regulation and potential consequences on Matomo. Online at matomo.org (Visited on 2021-07-23)
[14] Kohr, J. (2020): How to keep personally identifiable information safe. Online at matomo.org (Visited on 2021-07-23)
[15] Kohr, J. (2020): What is data anonymization in web analytics? Online at matomo.org (Visited on 2021-07-23)
[16] “Innocraft” (2017): 12 ways Matomo Analytics helps you to protect your visitor’s privacy. Online at matomo.org (Visited on 2021-07-23)
[17] Saric, M. (2020): Why you should stop using Google Analytics on your website. Online at plausible.io (Visited on 2021-07-23)
[18] Jarvis, P. (2020): Why digital privacy matters even more in 2021. Online at usefathom.com (Visited on 2021-07-23)
[19] Rezgui, A., Bouguettaya, A., & Eltoweissy, M. Y. (2003): Privacy on the web: Facts, Challenges, and Solutions. In IEEE Security & Privacy Magazine, 1(6), 40-49. DOI: 10.1109/msecp.2003.1253567
[20] Paolini, M. (2010): Twitter Chatter - Web Analytics Code of Ethics. Online at mpaolini.com (Archived version at Webarchive). (Visited on 2021-07-24)
[21] Request Metrics Team (2021): Privacy and Ethical Web Analytics. Online at requestmetrics.com (Visited on 2021-07-20)