Scraping SERPs for Archival Seeds: It Matters When You Start
Nwala, Alexander ; Weigle, Michele ; Nelson, Michael
Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, 2018-05-23, p.263-272
Title:
Scraping SERPs for Archival Seeds: It Matters When You Start
Author:
Nwala, Alexander ; Weigle, Michele ; Nelson, Michael
Subjects:
crawling ; discoverability ; collection building ; web archiving ; Computer Science - Digital Libraries
Is Part Of:
Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, 2018-05-23, p.263-272
Description:
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 to 0.54, and the weekly rate from 0.39 to 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story one day after its initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the probability of finding the same news stories diminishes rapidly to between 0.01 and 0.11. In addition to reporting these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that, because URIs of news stories are difficult to retrieve from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
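The two quantities the abstract reports (how fast stories are replaced on the SERP, and how likely a URI is to be refound after a given delay) can be illustrated with a small sketch. This is not the authors' code or their predictive models: the function names `replacement_rate` and `refind_probability`, the set-of-URIs representation of a daily scrape, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def replacement_rate(earlier: Set[str], later: Set[str]) -> float:
    """Fraction of URIs in the earlier snapshot that are missing from the later one."""
    if not earlier:
        return 0.0
    return len(earlier - later) / len(earlier)


def refind_probability(snapshots: List[Set[str]], lag: int) -> float:
    """Estimate the chance that a URI is still present `lag` days after it first appeared.

    `snapshots[d]` is the set of URIs scraped for one query on day d (hypothetical layout).
    """
    # Record the first day on which each URI was seen.
    first_seen: Dict[str, int] = {}
    for day, uris in enumerate(snapshots):
        for uri in uris:
            first_seen.setdefault(uri, day)

    # Only URIs whose first appearance leaves room for a snapshot `lag` days later count.
    eligible = [(uri, day) for uri, day in first_seen.items()
                if day + lag < len(snapshots)]
    if not eligible:
        return 0.0
    found = sum(1 for uri, day in eligible if uri in snapshots[day + lag])
    return found / len(eligible)


# Toy data: short labels stand in for news-story URIs (made up, not from the paper).
snapshots = [
    {"a", "b", "c", "d"},  # day 0
    {"a", "b", "e", "f"},  # day 1
    {"a", "g", "h", "i"},  # day 2
]

daily_rate = replacement_rate(snapshots[0], snapshots[1])
p_after_one_day = refind_probability(snapshots, lag=1)
p_after_two_days = refind_probability(snapshots, lag=2)
```

With real scrapes, each snapshot would hold the links collected from the first five SERPs for one query on one day, and the lag would range from one day up to the length of the collection period.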
Publisher:
ACM
Language:
English
Identifier:
ISBN: 1450351786
ISBN: 9781450351782
DOI: 10.1145/3197026.3197056
Source:
arXiv.org