[co-authors: Clarissa Kwee, Carol Lin]
What is data scraping?
Data scraping is an automated process through which computer programs extract vast amounts of data from the internet at a faster rate than manual data collection methods.
Some businesses scrape data for internal purposes, such as generating leads, or to create products and services available for public use, such as price comparison tools. Data brokers sell scraped data to third parties for purposes such as targeted advertising or market research.
One area which has driven rapid growth in data scraping is the training of generative artificial intelligence (AI) systems. Web crawlers gather significant volumes of information from the public internet, and the resulting data is used to train large language models.
Key privacy issues associated with data scraping
The collection and handling of personal information via data scraping raises significant privacy concerns, including potential contraventions of the Australian Privacy Principles (APPs) under the Privacy Act 1988 (Cth) (Privacy Act).
The fact that personal information is published online and available for scraping does not negate the need for compliance with the APPs.
What enforcement actions have been taken in response to data scraping?
As of August 2025, the Australian privacy regulator, the Office of the Australian Information Commissioner (OAIC), has made a number of determinations in respect of data scraping activities which were found to have interfered with individuals’ privacy in breach of the Privacy Act.
Clearview AI, Inc (Clearview)
The OAIC determined that Clearview had breached the Privacy Act through its use of a web crawler to collect online images of individuals. Clearview uploaded the images into a database to create a facial recognition tool, which was used by law enforcement agencies to identify individuals.
The issues identified by the OAIC included the following:
- Lack of Consent: Sensitive information was collected from several sources, including Australian servers, without the consent of the individuals, in breach of APP 3.
- Inadequate Practices and Policies: Clearview failed to implement proper practices and policies to ensure compliance with the APPs, as required under APP 1. Notably, Clearview continued to collect images from websites even after the OAIC investigated Clearview and found it to be in breach of the Privacy Act.
The OAIC also considered that Clearview had failed to collect personal information using lawful and fair means (as required under APP 3) and to provide certain information to individuals at the time of collection (as required under APP 5). However, these two points were not upheld following Clearview’s appeal to the Administrative Appeals Tribunal.
Master Wealth Control Pty Ltd trading as DG Institute (DG Institute) and Property Lovers Pty Ltd (Property Lovers)
DG Institute and Property Lovers scraped the personal information of individuals associated with divorces, bankruptcies and deceased estates from court lists and other publicly available data such as death and funeral notices. The information was matched with data obtained from third party databases and used to generate leads lists. Such leads lists, which contained personal information, were shared with participants of the property investment course run by DG Institute and later transferred to its associated entity, Property Lovers.
The OAIC found a number of contraventions of the APPs, including:
- Unfair Collection: The collection of personal information from publicly available sources for inclusion in the leads lists was unfair and breached APP 3.5, having regard to the purpose of the leads lists, the vulnerabilities of the individuals concerned and the possible adverse consequences for these individuals.
- Failure to notify individuals: DG Institute and Property Lovers failed to notify individuals of certain information when collecting personal information via data scraping (as required under APP 5). Here, the OAIC considered it was reasonable in the circumstances to notify the individuals at the point when information collected from the court lists was matched with information from third party sources, as the personal information in the leads lists would be sufficient to identify and contact the individuals directly.
- Failure to ensure quality of personal information: Insufficient steps were taken to ensure the personal information used and disclosed to compile their leads lists was accurate, up-to-date, complete and relevant, in breach of APP 10.
- Inadequate privacy policy: The privacy policies of DG Institute and Property Lovers did not adequately describe the data scraping activities. There was insufficient coverage of the types of personal information collected, how that information was collected and the purposes for which it was collected, used and disclosed.
Amongst other actions, the OAIC required the companies to destroy the leads lists, and the personal information collected to compile those lists. Property Lovers was also required to publish a written apology for its interference with the privacy of individuals.
OAIC’s position on data scraping
The OAIC and several of its international counterparts have released joint statements on data scraping and the protection of privacy (Joint Statements).
The key message in the Joint Statements is that website operators have an obligation to protect personal information on their websites from unlawful scraping, and that mass data scraping could lead to data breaches or the exploitation of personal information for commercial gain.
Additional guidance is also available for businesses scraping data, or using scraped data, to train generative AI models.
Practical tips
Businesses looking to scrape data, or to use scraped data, should bear in mind the following requirements:
- Privacy policies: Privacy policies should be reviewed and updated to ensure that they adequately describe how personal information will be collected, used and disclosed in connection with data scraping activities.
- Collection notices: Individuals must be notified of key information, either upon the initial scraping of the data, or subsequently when new data sets are generated. If using scraped data to train AI models, the categories of personal information used to develop the model and information about the websites scraped should also be included.
- Consent: Clearly explain what consent is for, the purposes of the processing (including if data is being used to train an AI model) and how a data subject may withdraw their consent.
- Fair and lawful means: Ensure that the scraping of publicly available personal information is lawful and fair. This could include, for example, collecting personal information directly from individuals rather than from third party sources.
- Data minimisation: Implement data minimisation techniques, including limiting the fields of personal information collected, and/or de-identifying any personal information collected before it is used or disclosed.
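By way of illustration only, the data minimisation point above could be implemented as an allow-list applied to each scraped record before it is stored. The following is a minimal Python sketch. The field names, the sample record and the `minimise` helper are all hypothetical and are not drawn from any OAIC guidance. Removing direct identifiers in this way will not necessarily render information de-identified for the purposes of the Privacy Act, so the sketch addresses only the "limiting the fields collected" limb.

```python
from typing import Any, Dict, Set

# Hypothetical allow-list: only the fields needed for the stated purpose.
ALLOWED_FIELDS: Set[str] = {"suburb", "listing_price", "listing_date"}


def minimise(record: Dict[str, Any], allowed: Set[str] = ALLOWED_FIELDS) -> Dict[str, Any]:
    """Return a copy of the record containing only allow-listed fields.

    Anything not explicitly allowed (names, emails, phone numbers and so on)
    is dropped before the record is stored, used or disclosed.
    """
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical scraped record, for illustration only.
scraped = {
    "name": "Jane Citizen",
    "email": "jane@example.com",
    "suburb": "Parramatta",
    "listing_price": 850000,
    "listing_date": "2025-08-01",
}

minimised = minimise(scraped)
print(minimised)
```

An allow-list is used here rather than a block-list so that any new or unexpected field on a scraped page is excluded by default.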
For website owners, best practice controls to protect websites against data scraping include:
- ‘Rate limit’ visits per account: Consider limiting the number of visits per hour or day, and further limit access if unusual activity is detected.
- Bot detection and monitoring: Take steps to detect bots, such as by using CAPTCHAs, and block IP addresses where data scraping activity is identified. Monitor patterns in bot activity, as they may reveal scraping activity.
- Monitor for scraping activity: Monitor websites or platforms for high activity, such as new accounts that quickly and aggressively begin searching for other users.
- Website terms of use: Where applicable, ensure that the website terms of use clearly prohibit data scraping. If data scraping is permitted, website terms should require data scrapers to comply with contractual terms, such as limitations on the type or volume of information that may be scraped, and the purposes for which the information may be used. You should also regularly monitor third parties’ compliance with the contractual limitations.
- Address suspected/confirmed data scraping: If data is scraped from your website, appropriate steps to take include enforcing website terms prohibiting data scraping, requiring the deletion of scraped information (and obtaining undertakings to this effect), and sending cease and desist letters. The OAIC should be notified if the data scraping constitutes an eligible data breach.