𝔖 Bobbio Scriptorium
✦   LIBER   ✦

Text characteristics of English language university Web sites

✍ Scribed by Mike Thelwall


Book ID: 101652521
Publisher: John Wiley and Sons
Year: 2005
Language: English
File size: 143 KB
Volume: 56
Category: Article
ISSN: 1532-2882

✦ Synopsis


Abstract

The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and for search engine and topic‐specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English‐speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms or computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.


📜 SIMILAR VOLUMES