Web scraping is a sensitive issue. Should a third party be allowed to visit a website and use automated tools to gather and store information at scale from that website? What if that information includes personal data? What does the law say? Can it be prevented? This is what we’ll discuss.
What is web scraping?
Web scraping is the use of automation to collect data from websites. In effect, it is little different to a person visiting a website to see what can be discovered – except the use of bots makes it thousands of times quicker and more efficient across many more sites.
It is rarely, if ever, ad hoc. The organization conducting the scraping knows what information is being sought, and which sites should be visited. Examples include ecommerce sites seeking to learn competitors’ pricing and/or holiday season campaigns. Real estate agencies might scrape other agencies to learn what properties are being sold, where, and for what price.
“Web scraping is the extraction of website information,” explains Nick Rago, field CTO at Salt Security. “While web scraping has valid business purposes, such as research, analysis, and news distribution, it can also be used for malicious purposes, such as sensitive data mining.”
The scraped data is often in HTML format. This is sent to another application that converts it into a format suitable for analysis, such as a spreadsheet. A frequent purpose is to obtain information about competitors to allow the development of more competitive projects or offerings. There is, then, a clear business incentive to do so. But is it legal?
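The conversion step mentioned above, from scraped HTML to a spreadsheet-friendly format, might look something like the following minimal sketch. It uses only Python's standard library. The sample markup, the `TableExtractor` class, and the `html_table_to_csv` function are illustrative assumptions, not the tooling of any company named in this article.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text between tags is kept only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html: str) -> str:
    """Return the table cells found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical scraped fragment: a competitor's price list.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget, large</td><td>24.50</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

The resulting text can be opened directly in a spreadsheet. Real pages are rarely this tidy, so production scrapers typically rely on more forgiving parsers and per-site extraction rules.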
Legal or illegal
There is no clear statement on whether web scraping is legal or illegal – it is a sensitive issue that currently lacks comprehensive legal regulation or a clear industry consensus. Denas Grybauskas, head of legal at Oxylabs (a Lithuanian company providing proxies and specializing in web scraping) comments, “Web scraping is relatively new and thus shares the same problem with other new technologies – regulation is developing a lot slower than the technology itself.”
hiQ vs LinkedIn
LinkedIn sent a cease and desist notice to hiQ, a data analytics firm that scraped public LinkedIn profiles, claiming the scraping was in breach of the Computer Fraud and Abuse Act (CFAA). hiQ disagreed and took the issue to court. A five-year legal journey eventually ended with the Ninth Circuit ruling that scraping publicly available web data is not precluded under the CFAA. Fundamentally, scraping public data does not involve hacking the site.
The media led with headlines such as ‘Web scraping is legal’. This is an over-simplification. What the court ruled is that it is not illegal under the CFAA – and even this, frankly, could be overturned if the Supreme Court takes a different view. There may also be different regulations in different jurisdictions – both at state level within the US, and most certainly at the international level with regulations such as GDPR.
European regulators vs Clearview
Clearview.ai says of its services, “Our platform, powered by facial recognition technology, includes the largest known database of 30+ billion facial images sourced from public-only web sources, including news media, mugshot websites, public social media, and other open sources.” Put simply, Clearview scrapes all possible websites for facial images, to provide “a revolutionary, web-based intelligence platform for law enforcement to use as a tool to help generate high-quality investigative leads.” This is not currently illegal in the US.
In Britain, however, the privacy regulator fined Clearview $9.4 million for contravening the UK’s version of GDPR. The regulator commented, “The company not only enables identification of those people, but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”
But at around the same time, a study from the research university KU Leuven reported, “From an EU data protection perspective, the collection and processing of photographs and related information has no legal basis. The data protection principles are not respected, and data subjects cannot exercise their rights. But with no physical presence in the EU, Clearview AI does not seem to be concerned by the unenforceable decisions of the DPAs.”
The French data protection agency, CNIL, announced a €20 million (approximately $19.5 million) fine on Clearview on Thursday, October 20, 2022. Last year, CNIL ordered Clearview to stop processing personal data, but has not had a response.
Apart from international legislation, Clearview must now also take account of specific US state-level legislation. In May 2022, the firm agreed to settle a case brought by the ACLU accusing it of violating a strict biometric privacy law in the state of Illinois. The settlement stops Clearview from making its ‘faceprint’ database available to most businesses or other private entities in the US, but does not prevent Clearview from working with federal or state agencies other than those in Illinois.
The implication from such cases is that web scrapers need to consider what they are scraping, and what different regulations may come into play. Clearly, scraping personal data is likely to be subject to various privacy regulations around the world. “In addition to regulations that differ from region to region,” warns Grybauskas, “there’s a long list of laws that might become relevant in specific circumstances.”
A clearly illegal, and increasingly common, form of web scraping was seen in the Optus breach announced on September 22, 2022. Modern websites use APIs to serve dynamically generated content to the client/browser. As a result, malicious web scraping bots have begun to focus on the APIs. In this instance, the Guardian comments, “Reports suggest Optus had an application programming interface (API) available online that did not require authorization or authentication to access customer data.”
Rago takes up the story. “The API breach experienced recently by Australian telecommunications company, Optus, provides a good example of malicious API data scraping or web scraping, where the intent was to harvest sensitive data from a publicly exposed API. In that incident, the attacker leveraged an ‘open’ or unauthenticated API to exfiltrate thousands of user records. Thus, web scraping has evolved into API scraping, making it even more difficult to detect.”
Scott Gerlach, co-founder and CSO at StackHawk, confirms the importance of APIs in malicious web scraping. “Many site owners may think my app or website isn’t big enough to draw attention, but the data collected by an organization is very valuable to bad actors,” he told SecurityWeek. “And with more websites and apps moving towards API-driven architectures, you must also ensure the APIs transferring data are secure.”
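To make the point about unauthenticated APIs concrete, here is a minimal sketch of an endpoint handler that refuses to return customer data unless the request carries a valid bearer token. Everything in it is hypothetical: the `VALID_TOKENS` store, the `CUSTOMERS` records, and the `get_customer` handler are invented for illustration and say nothing about how Optus or any other organization actually built its systems. It covers authentication only; a real service would also need authorization checks to confirm that the caller is allowed to see the specific record requested.

```python
import hmac

# Hypothetical token store. A real service would validate tokens against
# an identity provider or secrets manager rather than a hard-coded set.
VALID_TOKENS = {"example-token-123"}

# Hypothetical customer records held behind the API.
CUSTOMERS = {
    "42": {"name": "Alice Example", "email": "alice@example.com"},
}


def is_authenticated(headers: dict) -> bool:
    """Return True only for a well-formed 'Authorization: Bearer <token>'."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return any(
        hmac.compare_digest(token.encode(), valid.encode())
        for valid in VALID_TOKENS
    )


def get_customer(customer_id: str, headers: dict):
    """Return (status_code, body) for a customer lookup request."""
    if not is_authenticated(headers):
        # The check happens before any data is touched.
        return 401, {"error": "authentication required"}
    record = CUSTOMERS.get(customer_id)
    if record is None:
        return 404, {"error": "not found"}
    return 200, record


anonymous = get_customer("42", {})
authorized = get_customer("42", {"Authorization": "Bearer example-token-123"})
print(anonymous[0], authorized[0])
```

An 'open' API of the kind described above is, in effect, this handler with the first check missing: any bot that can guess or enumerate customer identifiers can walk away with the records.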
The legal/illegal balancing act
It isn’t possible to say whether web scraping is legal or illegal. It depends on the method of scraping, the data scraped, the purpose of the scraping, and the jurisdiction concerned.
Aleksandras Sulzenko, the product owner at Oxylabs, seeks to navigate the lack of clear regulations on two fronts – which he describes as infrastructure and usage. The ‘infrastructure’ is basically the proxies he uses to deliver the service. He uses residential proxies, but only where the owner knows and consents to the usage and is rewarded for it.
‘Usage’ is the actual scraping. His primary concern is to do no harm to the website being scraped. So, he has three priorities: “We limit the rate of the requests to avoid causing any traffic harm; we go through extensive KYC procedures to be confident that our solution is only being used for legitimate purposes; and we only scrape publicly accessible data.”
The last of these means he doesn’t allow customers to scrape data that sits behind a login. This effectively avoids any possible conflict with the CFAA in the US, because nothing can be construed as hacking.
Defending against web scraping
While ‘legal’ web scraping is widely used in business, it remains a sensitive issue. This is most obvious where personal data is scraped. LinkedIn, for example, is basically a professional CV showpiece – so users of LinkedIn are actively advertising their personal details. Having those details collected and collated en masse, and then sold to strangers, is less appealing.
Clearview’s image scraping in the US is similar. Social media users post photos and selfies because they want to be known and recognized. But having those images scraped and sold on to third parties, including law enforcement, so that they can be recognized in real time in different locations by image recognition camera systems is not so welcome.
Web scraping is widespread in many different industry sectors. It’s just an aspect of doing business. Where the scraping process is designed to be ‘low and slow’, the ‘victim’ may even be unaware of its occurrence. Some companies may simply assume that it happens, because they do it themselves, scraping competitor data.
Where scraping is unwanted, the legal type of scraping practiced by Oxylabs can be defeated by requiring visitors to log in to an account. “You can prevent scraping by placing all the data you want to hide behind login requirements that can be strengthened by MFA,” comments Sulzenko. “But it’s a trade-off because this creates more friction for the legitimate customers you want to allow in.”
This is the trap faced, for example, by content and news sites. Take SecurityWeek itself. SecurityWeek wants its content to be seen and read freely. This means not requiring visitors to have an account that must be logged into. But that, in turn, means the content is more easily scraped and perhaps republished elsewhere under a different name. It happens.
Illegal scraping – the type performed by hackers – can only be mitigated by better security. “To prevent malicious web scraping, site owners need visibility into every API endpoint and the data exposed,” explains Gerlach. “Testing web interfaces and APIs for vulnerabilities frequently and early on improves overall security posture and provides insight to act quickly if needed.”
Rago adds, “Organizations must be careful that they only expose the information that they want exposed.” A retailer may want to openly share product, pricing, and inventory information, but probably doesn’t want to share customer and payment data. “To reduce risk,” he continued, “organizations need good visibility and governance around their data exposure and maintain proper security around web interfaces and the underlying APIs that transport this sensitive data.”
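One common way to expose only what is meant to be exposed is to build API responses from an explicit allowlist of fields, so that anything not named is dropped by default. The sketch below assumes a hypothetical retailer; the field names, `PUBLIC_PRODUCT_FIELDS`, and `to_public_view` are invented for illustration.

```python
# Fields the hypothetical retailer is happy to share with anyone.
PUBLIC_PRODUCT_FIELDS = frozenset({"name", "price", "in_stock"})


def to_public_view(record: dict, allowed=PUBLIC_PRODUCT_FIELDS) -> dict:
    """Return a copy of `record` containing only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical internal record mixing public and sensitive data.
internal_record = {
    "name": "Widget",
    "price": 9.99,
    "in_stock": 12,
    "supplier_cost": 4.10,
    "last_buyer_email": "buyer@example.com",
}

public_record = to_public_view(internal_record)
print(public_record)
```

The design choice is that new fields added to the internal record stay private until someone deliberately adds them to the allowlist. A blocklist has the opposite failure mode: anything nobody thought to hide is served to every scraper that asks.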