Legal:Data publication guidelines


The right to privacy is at the core of how communities contribute to Wikimedia projects, and upholding this right is central to our human rights commitments. These data publication guidelines are the Wikimedia Foundation's best practices for managing risk in data publication. They complement our Data retention guidelines and contribute to our commitment to protect users' data as elaborated in our privacy policy.

Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in Wikimedia Foundation systems.

Data publication risk tiering grid

Data classification	Confidential	Restricted
Risk level	Tier 1: High risk	Tier 2: Medium risk	Tier 3: Low risk
Risk level	Data that could certainly be used to cause harm	Data that could likely or possibly be used to cause harm	Data that is unlikely to be used to cause harm or is private for administrative reasons
Examples (non-exhaustive list)	Data containing PII (see the data classification policy and privacy policy); granular analyses of country protection list countries or fundraising data; recurring data releases of medium risk data	High-level analyses of country protection list countries or fundraising data; granular analyses of non-country protection list countries, projects, editing data, interaction data, or reading data; recurring data releases of low risk data	High-level analyses of non-country protection list countries, projects, editing data, interaction data, or reading data; any analyses that utilize differential privacy^[1]; collations and combinations of already-public data that may be inconvenient or difficult for external parties to access
Response time goal	3 work weeks	5 work days	N/A
Expected % of requests (internal metric)	15%	35%	50%
What this means for Wikimedia Foundation teams
Follow-up actions	Do not upload this data to non-Wikimedia Foundation servers. Clear outputs before committing code, even to private GitLab repos. Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation.	Unsanitized data can be uploaded to private servers outside of the Wikimedia Foundation (private GitLab repos, Slack, Drive, etc.). Sanitized data is considered to be low risk and can be uploaded to public servers outside of the Wikimedia Foundation (GitLab, presentations, mailing lists, etc.). Data sanitization involves clearing all outputs that display raw data, and filtering out or obfuscating granular analyses as defined by the threshold table below. Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation.	This data can be uploaded to public servers outside of the Wikimedia Foundation (GitLab, presentations, mailing lists, etc.).

Note: the country protection list is a reference guide to countries that are potentially dangerous for internet freedom; it is not indicative of the Foundation's working relationship with each country.
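
As a rough illustration of how the grid can be read, the sketch below maps a few yes/no properties of a planned release to a suggested tier. It is one interpretation of the grid, not an official decision procedure: the function name, its parameters, and the `sensitive_subject` flag (standing for country protection list countries or fundraising data) are hypothetical, and the rule that a recurring release moves up one tier is an assumption drawn from the "recurring data releases" examples. Collations of already-public data are not modeled.

```python
def suggest_tier(contains_pii, granular, sensitive_subject,
                 recurring=False, uses_differential_privacy=False):
    """Return a suggested tier: 1 = high, 2 = medium, 3 = low risk.

    sensitive_subject is a hypothetical flag meaning the analysis covers
    country protection list countries or fundraising data.
    """
    if contains_pii:
        return 1  # data containing PII is listed under Tier 1

    if uses_differential_privacy:
        tier = 3  # analyses that utilize differential privacy are listed under Tier 3
    elif granular and sensitive_subject:
        tier = 1
    elif granular or sensitive_subject:
        tier = 2
    else:
        tier = 3

    # Assumption: a recurring release is treated as one tier riskier.
    if recurring:
        tier = max(1, tier - 1)
    return tier
```

For example, `suggest_tier(False, granular=True, sensitive_subject=False)` suggests Tier 2. The result is only a starting point for deciding whether to request review from Legal and Security.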

Frequently asked questions

Q: What is the Risk Tiering Grid used for? The Risk Tiering Grid is to help Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.
Q: What are the key risks the Tiering Grid measures? The key risks are overuse and underuse of the grid. If too many things are triaged to Legal and Security, those teams become a bottleneck for necessary workflows. On the other hand, if projects go live that would have been halted or mitigated under privacy review, the Foundation is exposed to privacy risks, including reputational, legal, and security risks.
Q: Who are the intended audiences of the Tiering Grid? Teams that work with data in product and tech.
Q: What has changed from the existing risk review process? The existing review process required every single schema and data project to undergo Legal review. That requirement was not being followed, and it was not practical for either data teams or Legal.
Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?
- Get Privacy approval
- Anyone can initiate an update/amendment but approval must be sought across the board before implementing
- Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)
Q: What should I do if I am unsure whether to reach out to the Legal and Security teams? When in doubt, it is better to err on the side of caution and submit an L3SC request.

Threshold table

Use this table to determine whether your analysis is granular or high-level, which in turn determines its tier/risk level. Note: thresholds are determined based solely on the statistics being released. For example, if you are only releasing information about edits, you do not need to account for how many editors generated the edits.

Data unit type	Classification of analysis based on counts
Data unit type	"Granular"	"High-level"
Users (including unique devices)	<25	≥25
Edits	<50	≥50
App interactions	<100	≥100
Views	<250	≥250
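
The table can be applied mechanically. The following is a minimal sketch, assuming the four data unit types are represented by the hypothetical keys shown; the dictionary and function names are illustrative and not part of any Wikimedia tooling.

```python
# Thresholds copied from the table above; the key names are hypothetical.
THRESHOLDS = {
    "users": 25,             # includes unique devices
    "edits": 50,
    "app_interactions": 100,
    "views": 250,
}


def classify_count(unit_type, count):
    """Return "granular" if count is below the threshold, else "high-level"."""
    threshold = THRESHOLDS[unit_type]
    return "granular" if count < threshold else "high-level"
```

For instance, `classify_count("users", 24)` returns `"granular"`, while `classify_count("users", 25)` returns `"high-level"`.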

For reverts, report the rate and a rough total if either the reverted edit count or the total edit count is below the threshold. For example:

If 8 out of 49 edits were reverted:
- "16.3% reverted (out of <50 edits)"
If 49 out of 49 edits were reverted:
- "100% reverted (out of <50 edits)"
If 20 out of 580 edits were reverted:
- "3.4% reverted (out of ~600 edits)"
- "3.4% reverted (out of >500 edits)"
If 50 out of 50 edits were reverted:
- OK to leave as-is (both counts meet threshold)

This guidance also applies to reporting below-threshold percentages for other data types.
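
The revert examples above could be produced by a small helper such as the sketch below. The function names are hypothetical, and two details are assumptions rather than part of the guidance: the rough total is rounded to one significant figure (so 580 becomes ~600), and the exact counts are shown in the form "N out of M edits" when both meet the threshold.

```python
def rough_total(n):
    """Round a positive integer to one significant figure, e.g. 580 -> 600."""
    magnitude = 10 ** (len(str(n)) - 1)
    return int(n / magnitude + 0.5) * magnitude


def format_revert_rate(reverted, total, threshold=50):
    """Describe a revert rate without exposing below-threshold counts."""
    if total <= 0 or not 0 <= reverted <= total:
        raise ValueError("expected 0 <= reverted <= total and total > 0")

    # One decimal place, dropping a trailing ".0" (100.0 -> "100").
    rate = f"{100 * reverted / total:.1f}".rstrip("0").rstrip(".")

    if total < threshold:
        # The total itself is granular, so only say it is below the threshold.
        return f"{rate}% reverted (out of <{threshold} edits)"
    if reverted < threshold:
        # Only the reverted count is granular, so give a rough total.
        return f"{rate}% reverted (out of ~{rough_total(total)} edits)"
    # Both counts meet the threshold, so exact figures can be shown.
    return f"{rate}% reverted ({reverted} out of {total} edits)"
```

With these assumptions, `format_revert_rate(8, 49)` gives "16.3% reverted (out of <50 edits)" and `format_revert_rate(20, 580)` gives "3.4% reverted (out of ~600 edits)".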

Publication risk mitigation checklist

This self-service checklist is intended to help data scientists and analysts lower the risk of a high or medium risk data publication and reduce unintentional disclosures of private information.

Before you post data publicly (which includes pushing a notebook to Gerrit or GitLab), have you

entered this data publication into the data publication log form?
cleared outputs that display raw data?
cleared outputs that display granular data (as defined in the threshold table above)?
obfuscated rows that display granular data? For example:

Python

# imagine we are doing an analysis of the number of *users* who tried a feature

# set constants
threshold = 25
col = "num_users"

# cast to object so the column can hold both numbers and the "<25" label
df[col] = df[col].astype(object)

# obfuscate rows below the threshold
df.loc[df[col] < threshold, col] = f'<{threshold}'

R

library(tidyverse)
library(glue)

# set constants
threshold <- 25

# obfuscate rows below the threshold (the column becomes character)
df <- df |>
  mutate(num_users = ifelse(num_users < threshold, glue("<{threshold}"), num_users))

filtered out rows that display granular data? For example:

Python

# imagine we are doing an analysis of the *app interactions* that users performed

# set constants
threshold = 100
col = "num_interactions"

# filter out rows below threshold
df = df[df[col] >= threshold]

R

library(tidyverse)

# set constants
threshold <- 100

# filter out rows below threshold
df <- df |>
  filter(num_interactions >= threshold)
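
Before publishing, it can also help to confirm that no below-threshold counts remain in a table. The following is a minimal pure-Python sketch, assuming the table is a list of dictionaries and that each count column is mapped to a data unit type; `find_granular_cells`, the column names, and the unit keys are hypothetical.

```python
# Thresholds from the threshold table; the key names are hypothetical.
THRESHOLDS = {"users": 25, "edits": 50, "app_interactions": 100, "views": 250}


def find_granular_cells(rows, column_units):
    """Return (row_index, column, value) for every count below its threshold.

    rows: list of dicts, one per table row.
    column_units: maps a count column name to a key of THRESHOLDS.
    """
    hits = []
    for index, row in enumerate(rows):
        for column, unit in column_units.items():
            value = row[column]
            # Already-obfuscated cells such as "<25" are strings and are skipped.
            if isinstance(value, (int, float)) and value < THRESHOLDS[unit]:
                hits.append((index, column, value))
    return hits


rows = [
    {"wiki": "a", "num_users": 30, "num_views": 200},
    {"wiki": "b", "num_users": "<25", "num_views": 900},
]
remaining = find_granular_cells(rows, {"num_users": "users", "num_views": "views"})
```

Here `remaining` holds one entry, for the 200 views in the first row, which would still need to be obfuscated or filtered out.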

General risk heuristics

Below, "X > Y > Z" means that X is riskier than Y, which is in turn riskier than Z.

Data type:
- Geography:
  - city > (sub-national) region > country > subcontinent > continent > global
  - country protection list > non-country protection list
- Device details:
  - raw User-Agent > browser or OS type > device type
  - raw IP > partially-redacted IP range
- Temporal:
  - dt (exact timestamp) > hourly > daily > monthly
- Combos of multiple keys > any key on its own (e.g. country + project > country or project)
User activity type:
- fundraising activity > editing activity > interaction activity > reading activity
Wikimedia Foundation activity type:
- data collection > data analysis
- granular analysis > high-level analysis
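
The "X > Y > Z" orderings above can be encoded as ordered lists when comparing two candidate levels of aggregation. This is a minimal sketch covering three of the dimensions; the dictionary keys, level labels, and function name are hypothetical, and the heuristics remain a guide rather than a scoring system.

```python
# Each list runs from riskiest to least risky, following the heuristics above.
RISK_ORDER = {
    "geography": ["city", "region", "country", "subcontinent", "continent", "global"],
    "temporal": ["dt", "hourly", "daily", "monthly"],
    "user_activity": ["fundraising", "editing", "interaction", "reading"],
}


def is_riskier(dimension, a, b):
    """Return True if level a is riskier than level b within a dimension."""
    order = RISK_ORDER[dimension]
    return order.index(a) < order.index(b)
```

For example, `is_riskier("geography", "city", "country")` is `True`, which suggests aggregating to country level where the analysis allows it.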

Contact us

If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.

Notes

[1] This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.