Google Flu Trends and Privacy

Background

How Google Flu Trends Works

Google Flu Trends is a Google utility for locating geographic areas where people are searching for the word "flu" and related terms. Google believes such searches correlate with outbreaks of influenza, and can potentially aid in influenza prevention. However, the ability to pinpoint the location of users through search engine queries raises significant privacy concerns.

Google Flu Trends is an extension of Google Trends, a technology that analyzes search queries submitted by Google users. User search data is stored on Google's servers, and retained by the search engine giant. This information includes six elements: 1) the search query itself; 2) the Internet Protocol (IP) Address of the searcher; 3) the date and time of the query; 4) the requested URL; 5) the browser and operating system being used; and 6) a unique cookie ID assigned to the browser. In addition, Google can retain a unique account identifier that tracks a user's activity across different computers. It is possible to use a database containing user search data to sort by time and location, to locate and identify the source of search queries, and to build individual profiles.
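To make the profiling risk concrete, here is a minimal sketch of how a log holding the six elements listed above could be grouped into per-browser search histories. The record layout, the field names, the `build_profiles` helper, and the sample entries are all hypothetical and chosen for illustration; nothing here describes Google's actual storage format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SearchLogEntry:
    """One hypothetical log record holding the six elements named in the text."""
    query: str
    ip_address: str
    timestamp: datetime
    requested_url: str
    user_agent: str  # browser and operating system
    cookie_id: str


def build_profiles(entries):
    """Group entries by cookie ID and sort each group by time."""
    profiles = defaultdict(list)
    for entry in entries:
        profiles[entry.cookie_id].append(entry)
    for history in profiles.values():
        history.sort(key=lambda e: e.timestamp)
    return dict(profiles)


# Made-up sample data using documentation-reserved IP ranges.
sample_log = [
    SearchLogEntry("pharmacy hours", "203.0.113.7", datetime(2008, 11, 12, 9, 42),
                   "/search?q=pharmacy+hours", "Firefox 3 / Windows XP", "cookie-A"),
    SearchLogEntry("weather", "198.51.100.23", datetime(2008, 11, 12, 9, 35),
                   "/search?q=weather", "Safari 3 / Mac OS X", "cookie-B"),
    SearchLogEntry("flu symptoms", "203.0.113.7", datetime(2008, 11, 12, 9, 30),
                   "/search?q=flu+symptoms", "Firefox 3 / Windows XP", "cookie-A"),
]

profiles = build_profiles(sample_log)
for cookie_id, history in profiles.items():
    print(cookie_id, [e.query for e in history])
```

Even in this toy example, a single persistent identifier is enough to turn separate queries into a time-ordered history for one browser.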

Google's analysis suggests that certain search terms are good indicators of flu activity. According to Google, Flu Trends uses aggregate search data to estimate flu activity. Google found that these computed statistical analyses were almost two weeks faster than traditional flu analysis by agencies such as the Centers for Disease Control and Prevention (CDC). Google is sharing Flu Trends data with the CDC, part of the US Department of Health and Human Services.
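As a rough illustration of what "aggregate search data" can mean, the sketch below computes the share of queries in each region and week that contain a flu-related word. This is not Google's model, which the page does not describe; the keyword list, the `flu_query_share` function, and the sample records are assumptions made for the example.

```python
from collections import Counter

# Assumed keyword list, for illustration only.
FLU_TERMS = {"flu", "influenza"}


def flu_query_share(records):
    """Return {(region, week): fraction of queries containing a flu term}."""
    totals = Counter()
    flu_hits = Counter()
    for region, week, query in records:
        key = (region, week)
        totals[key] += 1
        # Match on whole words so that unrelated words are not counted.
        if FLU_TERMS & set(query.lower().split()):
            flu_hits[key] += 1
    return {key: flu_hits[key] / totals[key] for key in totals}


# Made-up records: (region, ISO week, query text).
records = [
    ("Region A", "2008-W46", "flu symptoms"),
    ("Region A", "2008-W46", "movie times"),
    ("Region A", "2008-W46", "influenza vaccine"),
    ("Region A", "2008-W46", "weather"),
    ("Region B", "2008-W46", "weather"),
    ("Region B", "2008-W46", "Flu shot"),
    ("Region B", "2008-W46", "bus schedule"),
    ("Region B", "2008-W46", "news"),
]

shares = flu_query_share(records)
for key, share in sorted(shares.items()):
    print(key, share)
```

The output contains only regional fractions, but the input still has to carry a location for every query, which is where the privacy question arises.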

Google Flu Trends essentially uses Google Trends to track the frequency and location of users' searches for terms like "flu," "flu symptoms," "influenza," etc. Although Google has said that it will only reveal aggregate data, there are no clear legal or technological privacy safeguards that prevent the disclosure of individual search histories. Without such privacy safeguards, Google Flu Trends could be used to reidentify users who search for medical information. Such user-specific investigations could be compelled, even over Google's objection, by court order or Presidential authority. The Google Trends technology supports analysis of individual users' search histories. In some circumstances, it can be used to compile "a list of [a user's] top searches and clicks and other info about [her] search activity."

Privacy Concerns Involving Google Flu Trends

Search engine data retention raises substantial privacy concerns. Search logs include the query text, the date and time of the search, and persistent identifiers such as IP addresses, cookies, and unique account identifiers. This combination creates detailed, searchable profiles linked to individual Internet users. Even absent IP address data, cookies, or unique account identifiers, individual search histories can often be easily matched to users. Dr. Latanya Sweeney, professor of computer science at Carnegie Mellon University, has performed research regarding re-identification of ostensibly "anonymous" data. In Trail Re-identification: Learning Who You are From Where You Have Been, Dr. Sweeney writes, "Many [Internet users] falsely believe they cannot be identified. The term 're-identification' refers to correctly relating seemingly anonymous data to explicitly identifying information (such as the name or address) of the person who is the subject of those data." Dr. Sweeney further explains that re-identification algorithms "are extensible to tracking collocations of people, which is an objective of homeland defense surveillance." In Re-Identification of DNA through an Automated Linkage Process, Dr. Sweeney describes how "seemingly anonymous" medical database records "can be related to publicly available health information to uniquely and specifically identify the persons who are the subjects of the information" despite the fact that the database records "contain no accompanying explicit identifiers such as name, address, or Social Security number and contain no additional fields of personal information."

Google Flu Trends is based on user searches submitted to the search engine. Typically, Google collects several pieces of data before returning a search result, including IP addresses and unique cookie IDs. This information is stored on Google's servers, and users have no ability to control the data after it is submitted to the search engine. Google has stated that it will anonymize search data after a period of nine months, but technical experts have questioned the efficacy of the "anonymization" technique. Google obfuscates the fourth octet but retains the rest of the IP address. At most, the redacted IP address hides the user among about 254 other possible addresses. Moreover, the unique cookie assigned by Google to the browser remains unchanged over time and can easily be used by Google (or any entity with the power to subpoena Google) to trace a search query back to a specific user. Linking a search term to a specific user in this way can re-identify an individual whose searches Google had previously "de-identified."
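The weakness of fourth-octet obfuscation can be sketched in a few lines. The example below zeroes the last octet of an IPv4 address using Python's standard `ipaddress` module, then shows that records sharing a cookie ID remain linked after masking. The `mask_last_octet` helper and the sample log are hypothetical; zeroing the octet is only one possible reading of "obfuscates," and the page does not specify Google's exact procedure.

```python
import ipaddress
from collections import defaultdict


def mask_last_octet(ip: str) -> str:
    """Zero the fourth octet of an IPv4 address, keeping the first three."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


# Made-up log rows: (IP address, cookie ID, query).
log = [
    ("203.0.113.7", "cookie-A", "flu symptoms"),
    ("203.0.113.200", "cookie-B", "weather"),
    ("203.0.113.7", "cookie-A", "pharmacy hours"),
]

masked_log = [(mask_last_octet(ip), cookie, query) for ip, cookie, query in log]

# The masked addresses are now identical, but the cookie still separates users.
by_cookie = defaultdict(list)
for masked_ip, cookie, query in masked_log:
    by_cookie[cookie].append(query)

for cookie, queries in by_cookie.items():
    print(cookie, queries)

# Size of the block an address is hidden in after masking.
print(ipaddress.ip_network("203.0.113.0/24").num_addresses)
```

A /24 block holds 256 addresses, of which 254 are conventionally assignable to hosts, so the masked address offers a small anonymity set at best. The unchanged cookie ID undoes even that, since it groups each browser's queries regardless of what happens to the IP address.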

Health and medical information should be safeguarded from potential privacy violations. For example, a simple search for "AIDS" results in Google displaying web sites containing advice on symptoms, treatments and risk factors. Searches for a specific drug show not only the drug's uses, but also its side effects and interactions. If such personal data is not only made public, but also shared with the Government, privacy concerns are bound to skyrocket. Google has a responsibility to protect personal information, and health data is especially sensitive. When health information is collected from masses of people in a specific geographic area, strong legal and technological safeguards need to be in place.

The US Supreme Court has recognized the "individual interest in avoiding disclosure of personal matters" and "the interest in independence in making certain kinds of important decisions." Whalen v. Roe, 429 U.S. 589 (1977). Using search terms in contravention of the purpose for which they were collected would be a violation of such assured privacy safeguards.

Historically, identification through aggregated data has been subject to abuse. The Department of Homeland Security sought information from the US Census about Muslim Americans in the United States after 9/11. Census data was used during the Second World War to identify and then displace Japanese Americans. There are not sufficient safeguards against such uses. Therefore, automatic, permanent, one-way anonymization of such information is necessary to ensure that personal health data is not used in any way contrary to the users' preferences. The anonymization should be by design.

Identifying medical search data also raises significant First Amendment concerns. If users believe that their search histories are not anonymous, or can be accessed without their consent, they may be less forthcoming in medical discussions or may avoid web searches altogether. This could lead to the chilling of speech online. As the Supreme Court of Colorado observed, "The First Amendment to the United States Constitution protects more than simply the right to speak freely. It is established that it safeguards a wide spectrum of activities, including ... the right to receive information and ideas..." Tattered Cover v. City of Thornton, 44 P.3d 1044 (Colo. 2002). Further, the US Supreme Court has observed that "identification and fear of reprisals might deter perfectly peaceful discussions on public matters of importance." Talley v. State of California, 362 U.S. 60 (1960).

Additionally, the knowledge of search terms from certain areas can lead to adverse assumptions and inferences in education and employment. Colleges may look at online health profiles of people before admitting them, and employers may want to avoid hiring employees matching certain profiles. It may even lead insurance companies to identify areas prone to particular diseases. It is therefore essential that such data be assuredly and permanently anonymized. This can be done only if there is proper disclosure of what information is being retained, and how it is being used.

EPIC and Medical Privacy

EPIC advocates for strong medical privacy safeguards. In August 2007, EPIC and 16 experts in privacy and technology filed a "friend of the court" brief in IMS Health v. Ayotte, a case concerning a New Hampshire state law banning the sale of prescriber-identifiable prescription drug data for marketing purposes. The experts urged the First Circuit Court of Appeals to reverse the ruling of the lower court, which held that the NH Prescription Confidentiality Act violated the free speech rights of data mining companies. The experts said the lower court should be reversed because there is a substantial privacy interest in de-identified patient data that the lower court failed to consider. This privacy interest flows in part from the reality that data may not, in fact, be truly de-identified, and also from the fact that de-identified data impacts actual individuals.

In January 2006, EPIC urged the CDC to limit passenger data collection. EPIC said in comments to the Centers for Disease Control and Prevention that it should limit a proposed rule that would require airline and shipping industries to gather passenger information, maintain it electronically for at least 60 days, and release it to the CDC within 12 hours of a request. EPIC urged the CDC to narrow the scope of data collected and set strict security standards to keep passenger data secure from unauthorized access and misuse. The CDC also should require the clear and open disclosure that travelers can refuse to submit their information without facing penalties, EPIC said.

Last Updated: May 1, 2009
Page URL: http://www.epic.org/privacy/flutrends/default.html