Terms of service documents are a common feature of organizations' websites. Although there is no blanket requirement for organizations to provide these documents, they often serve essential legal purposes. Users of a website are expected to agree to the contents of its terms of service, but they tend to ignore these documents because they are often lengthy and difficult to comprehend. As a step towards understanding the landscape of these documents at scale, we present a first-of-its-kind terms of service corpus containing 247,212 English-language terms of service documents obtained from company websites sampled from the Free Company Dataset. We examine the URLs and contents of the documents and find that some websites that purport to post terms of service do not actually provide them. We analyze the reasons for this unavailability and determine the overall availability of terms of service in a given set of website domains. We also find that some websites provide a single agreement that combines terms of service with a privacy policy, even though a privacy policy is often an obligatory separate document. Using topic modeling, we analyze the themes in these combined documents by comparing them with the themes found in separate terms of service and privacy policies. The results suggest that such single-page agreements miss some of the most prevalent topics found in typical privacy policies and terms of service documents, and that many disproportionately cover privacy policy topics relative to terms of service topics.