Worrying confessions: A look at data safety labels on Android

The Google Play Store recently introduced a data safety section in order to give users accessible insights into apps’ data collection practices. We analyzed the labels of 43,927 popular apps. Almost one third of the apps with a label claim not to collect any data. But we also saw frequently downloaded apps, including apps meant for children, admitting to collecting and sharing highly sensitive data like the user’s sexual orientation or health information for tracking and advertising purposes. To verify the declarations, we recorded the network traffic of 500 apps, finding more than one quarter of them transmitting tracking data not declared in their data safety label.

Stylized photo with a blue tint of food containers, above that the text: “Analysis: Data safety labels on Android”

At the end of April 2022, Google launched their new data safety section for Android apps, a feature meant to give users reliable information about how apps distributed through the Play Store handle their users’ data. App developers are required to list the types of data their apps process and the purposes each data type is used for. For each data type, they also need to state whether they collect this data for themselves or share it with third parties. In addition, developers have to declare whether users can ask for their data to be deleted.
This information is then displayed in the Play Store as the data safety label, with the stated goal of allowing users to decide themselves whether they are okay with an app’s privacy practices before installing it.

Screenshot of the “data safety” card on the Google Play Store page for the “Amazon Shopping” app. According to the card: “This app may share these data types with third parties: Location, Personal info and 7 others, This app may collect these data types: Location, Personal info and 10 others, Data is encrypted in transit, You can request that data be deleted” — Summary of the data safety label for the “Amazon Shopping” app

Google’s launch of the data safety labels follows a similar effort by Apple, which introduced the very similar privacy labels for iOS back in late 2020. In both cases, all information in the labels is self-declared by the app developers, and it is unclear whether and to what extent Google and Apple verify the details. There is a risk of intentionally or accidentally false declarations by developers misleading users into believing that an app is more privacy-friendly than it actually is. We have already contributed to a study into the honesty of privacy labels on iOS, which showed that some labels contain obvious inconsistencies, like claiming to collect user IDs not linked to the user, and that 16 % of the checked apps transmitted data not declared in their label.

Since it has now been a few months since the introduction of the data safety labels and many apps have provided one, it’s time for us to look into the situation on Android.

What do the labels say?

We’ll start by getting a general overview of what the apps say in their data safety labels. For that, we want to look at the most popular apps. The Play Store compiles top charts for each category. Through the website, one can only view the top 45 apps per category, but it is possible to access the full top charts using an internal API endpoint. For the following statistics, we looked at the data safety labels of the top apps across all categories, with 43,927 apps in total (after deduplicating those appearing in multiple charts).

According to Google’s documentation, all apps were supposed to provide a data safety label by July 20, 2022. Now, one and a half months after that deadline, more than one fifth of apps (9,255) have still not provided one. These apps can no longer publish updates and “may face additional enforcement actions in the future, such as the removal of [the] app’s store listing from Google Play”.

29.8 % (10,347) of the apps that do provide one say they neither share nor collect any data, and 57.2 % (19,848) claim to at least not share any data with third parties. Those numbers sound encouraging, as many apps can indeed function entirely locally on the phone without transmitting data. But remember that those are self-declarations by the developers, and we can’t tell yet whether these claims are actually truthful.
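As a quick sanity check, the shares quoted so far can be recomputed from the absolute counts. The following is only a sketch of that arithmetic; the constant and function names are made up for illustration and are not taken from our analysis code.

```python
# Absolute counts as quoted in the text above.
TOTAL_APPS = 43_927      # deduplicated apps across all top charts
WITHOUT_LABEL = 9_255    # apps that had not provided a label yet
NO_DATA_AT_ALL = 10_347  # labels claiming neither collection nor sharing
NO_SHARING = 19_848      # labels claiming at least not to share data


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The percentages in the text are relative to the apps that have a label.
with_label = TOTAL_APPS - WITHOUT_LABEL

print(with_label)                          # 34672
print(share(WITHOUT_LABEL, TOTAL_APPS))    # 21.1
print(share(NO_DATA_AT_ALL, with_label))   # 29.8
print(share(NO_SHARING, with_label))       # 57.2
```

Note the different denominators: the share of apps without a label is relative to all 43,927 apps, whereas the other shares are relative to the 34,672 apps that do have a label.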

But what about the apps that do say they process data? The situation is looking less privacy-friendly here: The four most commonly declared data types are all for tracking purposes: device IDs, crash logs, app interaction, and diagnostic data. Only after those do we see data types that some apps might actually need, like user IDs and the user’s name.

Bar graph plotting the different data types that can appear in a data safety label against the number of apps declaring the respective type in their label, distinguishing by “data collected” and “data shared”. The “number of apps” axis goes from 0 to 16,000. The five most common data types are (in descending order): Device or other IDs, Crash logs, App interactions, Diagnostics, Email address. The five least common data types are (in ascending order): Credit score, SMS or MMS, Political or religious beliefs, Calendar events, Race and ethnicity. For all data types, “data collected” is significantly more common than “data shared”. — Number of apps that collect and/or share the respective data type according to their data safety label.

65.5 % (22,728) of apps with a data safety label declare that they collect or share at least one data type that is only useful for tracking. That’s almost all of the apps that don’t claim not to collect or share any data! Meanwhile, only 53.8 % (18,661) declare that they collect or share at least one data type that can be used for purposes other than tracking. And 10 % (3,348) only share data with third parties but don’t collect any themselves. How generous of them.

These figures rely on the categories into which Google groups the different data types (full list). We consider the following categories only useful for tracking: App activity, App info and performance, Device or other IDs. We consider the following categories of data types potentially useful for purposes other than tracking: Location, Personal info, Financial info, Health and fitness, Messages, Photos and videos, Audio files, Files and docs, Calendar, Contacts.
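The split into “only useful for tracking” and “potentially useful for other purposes” can be sketched as a simple set lookup over the categories that a label declares. This is an illustrative sketch, not our actual analysis code: the function name and the input format (a plain list of category names) are assumptions, while the two category sets are the ones named in the text.

```python
# Categories we consider only useful for tracking (as listed in the text).
TRACKING_ONLY = {
    "App activity",
    "App info and performance",
    "Device or other IDs",
}

# Categories we consider potentially useful for purposes other than tracking.
OTHER_USES = {
    "Location", "Personal info", "Financial info", "Health and fitness",
    "Messages", "Photos and videos", "Audio files", "Files and docs",
    "Calendar", "Contacts",
}


def classify_label(categories):
    """Report which of the two groups a label's declared categories touch.

    `categories` is assumed to be an iterable of category names covering
    everything the label lists as collected or shared.
    """
    declared = set(categories)
    return {
        "tracking_only_data": bool(declared & TRACKING_ONLY),
        "other_data": bool(declared & OTHER_USES),
    }


print(classify_label(["Device or other IDs", "Location"]))
# {'tracking_only_data': True, 'other_data': True}
```

An app counts towards the 65.5 % if `tracking_only_data` is true and towards the 53.8 % if `other_data` is true, so one app can appear in both groups.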

The picture stays the same when looking at the purposes the labels give for the collected data types: Analytics is also the most commonly declared purpose, followed by App functionality and Advertising or marketing.

Bar graph plotting the different purposes that can appear in a data safety label against the number of apps declaring the respective purpose in their label, distinguishing by “data collected” and “data shared”. The “number of apps” axis goes from 0 to 17,500. The purposes are (in descending order): Analytics, App functionality, Advertising or marketing, Account management, Fraud prevention, security, and compliance, Personalization, Developer communications. For all purposes, “data collected” is significantly more common than “data shared”. — Number of apps that collect and/or share the data for the respective purposes according to their data safety label.

In addition to listing the data types and purposes, apps also need to declare whether users can request deletion of their data. We should expect this to be the case for all apps considering that it’s required by the GDPR. Nonetheless, 27.2 % (9,428) of apps with a label say that users cannot request deletion, but most of them at least declare that they neither collect nor share any data. Excluding those, 5.5 % (1,911) say that they collect and/or share data but users cannot request data deletion.

Worrying confessions

While looking at the data safety labels, we noticed a worrying number of apps declaring that they collect or even share highly sensitive data, including information about their users’ sexual orientation, political or religious beliefs, and health, for tracking or advertising purposes. Remember that these are self-declarations by the app developers, not allegations by us or third parties. The app developers themselves seem to have no problem with admitting to this incredibly problematic data use.

Here are just a few examples of well-known apps with many downloads doing this (a full list of such declarations is available as a CSV):

Facebook collects political or religious beliefs, the sexual orientation, and health info for analytics purposes
Amazon Shopping collects health info for analytics purposes
Roblox collects the sexual orientation for analytics purposes and shares it for analytics, and advertising or marketing purposes
SoundCloud: Play Music & Songs shares the sexual orientation for advertising or marketing purposes
My Little Pony: Magic Princess collects the sexual orientation for analytics, and advertising or marketing purposes and shares it for advertising or marketing purposes
FarmVille 2: Country Escape collects the sexual orientation for advertising or marketing purposes
9GAG: Funny GIF, Meme & Video shares the sexual orientation for analytics purposes
Zalando Lounge - Shopping Club collects and shares the sexual orientation for analytics, and advertising or marketing purposes
momox: Bücher & mehr verkaufen collects and shares the sexual orientation for advertising or marketing purposes
nebenan.de - your social network for neighbours collects the sexual orientation for advertising or marketing purposes

It’s unclear whether all the apps actually use the data in this way, but even if these were overzealous “just-in-case” declarations because developers don’t know what the trackers they include in their apps do, it shows a concerning disregard for their users’ privacy.

It is unclear why any of them would need to process this data in the first place, let alone for tracking or advertising purposes. This is especially true considering that all these data types fall under the “special categories of personal data” for which the GDPR affords additional protections (Art. 9 GDPR). Some companies like to claim a legitimate interest (Art. 6(1)(f) GDPR) for tracking to avoid having to ask the user for consent. That practice is questionable even for non-sensitive data, but definitely not applicable for special categories of personal data.

Especially shocking: Some of the apps listed above are explicitly and exclusively targeted at children. The GDPR rightfully recognizes that children need even stricter protection with regard to their personal data (Recital 38 GDPR) and thus sets even higher requirements for processing their data. Collecting and even sharing special categories of personal data about children for analytics or advertising purposes is absolutely unacceptable.

Checking labels against actual traffic

Finally, we ran a traffic analysis on the top 500 apps overall to check the truthfulness of the declarations in the labels. We installed and started each app in an Android emulator and left it running for a minute without any user input. In the background, we recorded the entire network traffic.

While we ran the analysis on all the apps, it was only successful for 442 of them. Of the remaining ones, seven could not be downloaded for our emulator due to specific device requirements, and 51 crashed during the traffic recording.

Here’s an overview of the data types we observed being transmitted:

Number of times that the observed data types were transmitted per app and tracker in the recorded network traffic, grouped by whether they were transmitted together with a unique user or device ID (i.e. pseudonymously) or without identifiers for the user or device (i.e. anonymously).

We can see apps commonly transmitting device parameters like Android version, phone model, screen size, carrier, battery status, and volume. As we didn’t interact with the apps at all, it’s not surprising that there isn’t really any traffic related to actual app functionality but rather tracking and advertising traffic for the most part. But it is worth noting that even benign data types like app ID and version or screen size are usually transmitted in conjunction with a unique ID for the user or device (i.e. pseudonymously), making them personal data under the GDPR (Recital 26(2) GDPR). We consider the data in a request pseudonymous if the request contains at least one unique identifier for the device or user, namely the device’s Google Advertising ID (including hashed forms thereof), the user’s public IP address, or a tracker-specific unique ID.
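The criterion for pseudonymous requests (at least one unique identifier for the device or user in the request) could be checked roughly like this. This is a minimal sketch under stated assumptions, not our actual implementation: the function names are hypothetical, the choice of MD5, SHA-1 and SHA-256 as “hashed forms” is an assumption, and plain substring matching is a simplification that a real analysis would replace with proper parsing of the request.

```python
import hashlib


def identifier_variants(advertising_id):
    """Return the advertising ID plus a few hashed forms of it.

    Assumption: trackers hash the ID with MD5, SHA-1 or SHA-256 and
    send the lowercase hex digest.
    """
    variants = {advertising_id.lower()}
    for algorithm in ("md5", "sha1", "sha256"):
        digest = hashlib.new(algorithm, advertising_id.encode()).hexdigest()
        variants.add(digest)
    return variants


def is_pseudonymous(request_text, advertising_id, public_ip, tracker_ids=()):
    """True if the request contains at least one known unique identifier."""
    haystack = request_text.lower()
    needles = identifier_variants(advertising_id)
    needles.add(public_ip.lower())
    needles.update(tracker_id.lower() for tracker_id in tracker_ids)
    # Ignore empty strings, which would match every request.
    return any(needle in haystack for needle in needles if needle)


# Example values only: a made-up advertising ID and a documentation IP.
aid = "38400000-8cf0-11bd-b23e-10b96e40000d"
print(is_pseudonymous("os=13&adid=" + aid, aid, "203.0.113.7"))    # True
print(is_pseudonymous("os=13&model=Pixel", aid, "203.0.113.7"))    # False
```

Anything that is transmitted in a request for which this check returns true would be counted as pseudonymous, even if the data type itself (say, the screen size) looks harmless.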

We can now compare the recorded network traffic with the declarations in the data safety labels. Of course, we can only check a small subset of the possible data types since we don’t interact with the apps at all. Similarly, we can only definitively say when data is transmitted but if we don’t observe data being transmitted, it doesn’t necessarily mean that it never is. Also note that Google is less strict in their requirements than the GDPR’s definition of “processing”. For example, according to Google’s policies, apps don’t need to list data sent to a server but deleted immediately after handling the request under “collected data”. We don’t (and can’t) consider these exceptions in our automated analysis.
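For each app and each data type or purpose, the comparison comes down to two yes/no questions: was it declared in the label, and was it observed in the traffic? The four possible outcomes can be sketched as follows. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def judge(declared: bool, observed: bool) -> str:
    """Map a (declared, observed) pair to one of four judgements."""
    if observed and not declared:
        return "not declared but observed"
    if declared and not observed:
        return "declared but not observed"
    if declared and observed:
        return "correctly declared"
    return "correctly not declared"


def tally(apps):
    """Count judgements over an iterable of (declared, observed) pairs."""
    return Counter(judge(declared, observed) for declared, observed in apps)


# Made-up sample of five apps for a single data type.
sample = [(True, True), (False, True), (True, False), (False, False), (False, False)]
counts = tally(sample)
print(counts["not declared but observed"])  # 1
print(counts["correctly not declared"])     # 2
```

Only “not declared but observed” is a definite finding. The “declared but not observed” and “correctly not declared” outcomes merely mean that we saw nothing during our one-minute recording.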

Stacked bar graph showing the distribution of whether apps correctly declared each analyzed data type and purpose. The “number of apps” axis goes from 0 to 400. The data types are: Location, SMS or MMS, Contacts, Diagnostics, Other app performance data, Device or other IDs. The purposes are: Analytics, Advertising or marketing. The possible judgements are: not declared but observed, declared but not observed, correctly declared, correctly not declared. Regarding the data types: More than half of apps correctly didn’t declare the Location data type, a handful correctly declared it, another handful didn’t declare it even though it was observed, and the rest declared it but it was not observed. For both SMS or MMS and Contacts, the vast majority of apps correctly didn’t declare the respective type, the rest declared it but it was not observed. For Diagnostics, Other app performance data, and Device or other IDs, around half of the apps either correctly did or didn’t declare the respective data type, around 12 % didn’t declare it, even though it was observed, and the rest declared it but it was not observed. Regarding the purposes: For both Analytics and Advertising or marketing, around one third of the apps either correctly did or didn’t declare the respective data type, around 6 % didn’t declare it, even though it was observed, and the rest declared it but it was not observed. — Evaluation of the correctness of the data types and purposes in the analyzed data safety labels. Remember that we can only definitively say when data is collected but can’t confirm that it is never collected.

Keeping that in mind, at least from what we saw, most of the declarations were correct but we did also observe missing declarations. Most notably, more than one quarter of apps transmitted tracking data that they didn’t declare. A handful of apps transmitted the user’s location without declaring that. Additionally, a little more than 5.7 % and 6.3 % of apps contacted known tracking and advertising servers, respectively, without declaring the corresponding purpose anywhere in their label.

By “tracking data”, we mean the data types Diagnostics, Other app performance data, and Device or other IDs. Google doesn’t clearly define what falls under those. For the purposes of this analysis, we consider the following information as falling under the respective type:

Diagnostics: roaming status, is device rooted?, is device an emulator?, network connection type, WiFi and cellular signal strength, charging status, battery percentage, sensor data (accelerometer, rotation), RAM usage, disk usage, uptime, volume
Other app performance data: device name, carrier, local IPs, BSSID
Device or other IDs: Google advertising ID, hashed Google advertising ID, IMEI, MAC address, public IP address (included in the request path or body), other unique user, session, or device IDs
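Our assignment of observed information to the three tracking-related data types can be written down as a lookup table, from which the undeclared types of an app follow directly. The sketch below uses the item names from our list as plain strings. The function name and the input format are assumptions for illustration, not our actual analysis code.

```python
# Which observed information we count under which data type (see the list above).
TRACKING_DATA_TYPES = {
    "Diagnostics": {
        "roaming status", "is device rooted?", "is device an emulator?",
        "network connection type", "WiFi and cellular signal strength",
        "charging status", "battery percentage",
        "sensor data (accelerometer, rotation)", "RAM usage", "disk usage",
        "uptime", "volume",
    },
    "Other app performance data": {
        "device name", "carrier", "local IPs", "BSSID",
    },
    "Device or other IDs": {
        "Google advertising ID", "hashed Google advertising ID", "IMEI",
        "MAC address", "public IP address",
        "other unique user, session, or device IDs",
    },
}


def undeclared_tracking_types(observed_fields, declared_types):
    """Return the tracking data types that were observed but not declared."""
    observed = set(observed_fields)
    declared = set(declared_types)
    return sorted(
        data_type
        for data_type, fields in TRACKING_DATA_TYPES.items()
        if fields & observed and data_type not in declared
    )


# An app that declares Diagnostics but also sends the carrier name.
print(undeclared_tracking_types({"carrier", "battery percentage"}, {"Diagnostics"}))
# ['Other app performance data']
```

An app counts as transmitting undeclared tracking data if this list is non-empty for it.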

These results are in line with what we previously saw for iOS privacy labels. These labels can be a helpful tool in making important information about data collection practices that was previously buried in privacy policies approachable and easier to grasp for users. But if the labels are solely based on self-declarations by app developers, they can also dangerously misrepresent the actual data collection, misleading users into wrongly believing that apps are privacy-friendly even when they aren’t.
But the declarations in the labels also highlight the vast collection of tracking and advertising data that is worryingly ubiquitous across the web and mobile and sometimes concerns data that is completely inappropriate to collect. Disclosing these practices is not enough. Tracking practices need to be significantly dialed back, and, at the very least, users need to be given a genuine and informed choice in the matter, as the GDPR already requires.

Analysis data set and source code

The data safety labels that the analysis in this post is based on were downloaded on September 7, 2022. We are publishing our full data set, including the recorded network traffic. We also have a separate CSV with just the worrying declarations described above.
The source code for the analysis is available on GitHub.


written by Benjamin Altpeter
on 2022-09-18 at 10:17
licensed under: Creative Commons Attribution 4.0 International License

Photo adapted after: “text photo” by Sam Moghadam Khamseh (Unsplash license)
