What happens when a web site goes missing for technical or organizational reasons? It happens every day, and for many different reasons: fires, floods, computer viruses, the lack of a legacy plan or of continued interest after the death of a specialized content provider, inattention, or missing back-ups can all mean that a favorite web destination disappears. Frank McCown, Norou Diawara, and Michael Nelson of the Old Dominion University Computer Science Department studied whether, and how well, content from 300 randomly selected websites could be reconstructed from readily available web sources over a 14-week period. McCown presented their paper, Finding lost web content: Factors Affecting Website Reconstruction from the Web Infrastructure, on June 20 at JCDL in Vancouver.
What is the web infrastructure? Caches of data can serve as “repositories,” as can entities such as the Internet Archive. McCown referred to this data landscape as a whole as the Web infrastructure. Content can be recovered from web sites that disappeared long ago or only recently, particularly, as McCown pointed out, when a site ranks well: “If I can make my web site have a high page rank in Google it will be more recoverable.”
Warrick, a web repository crawler, used cache dates to find materials. They found PDFs and images as well as text that were “out of context,” with no provenance information. The tool has been used by different individuals and groups who wanted or needed the information that the web site had handled.
To evaluate “getting everything back,” they took live web sites, reconstructed them, and then compared the reconstructions with the originals. They also looked at how much change might be considered “a bad thing” by applying penalties to different types of missing resources and situations. The ability to reconstruct a site was generally stable from week to week: if they could reconstruct a site one week, they could do it the same way the next. Ultimately they recovered 61% of all content using the Warrick web repository crawler.
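To make the idea of penalizing missing resources concrete, here is a minimal Python sketch of how such a comparison could be scored. It is not the authors' actual measure: the function name, the resource types, and the weights are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical penalty weights per resource type (NOT taken from the paper).
PENALTY_WEIGHTS: Dict[str, float] = {"html": 1.0, "pdf": 0.75, "image": 0.5}


def reconstruction_penalty(original: Dict[str, str], recovered: Set[str]) -> float:
    """Return the weighted share of the original site that is missing (0.0 to 1.0).

    `original` maps each URL of the live site to its resource type;
    `recovered` is the set of URLs that turned up in the reconstruction.
    Unknown resource types fall back to a weight of 1.0 (an assumption).
    """
    total = sum(PENALTY_WEIGHTS.get(kind, 1.0) for kind in original.values())
    if total == 0:
        return 0.0  # an empty site has nothing to lose
    missing = sum(
        PENALTY_WEIGHTS.get(kind, 1.0)
        for url, kind in original.items()
        if url not in recovered
    )
    return missing / total


if __name__ == "__main__":
    # Made-up example site with four resources, two of which were recovered.
    site = {
        "/index.html": "html",
        "/about.html": "html",
        "/paper.pdf": "pdf",
        "/logo.png": "image",
    }
    found = {"/index.html", "/logo.png"}
    print(f"penalty: {reconstruction_penalty(site, found):.2f}")
```

Weighting by type lets a lost page of text count for more than a lost decorative image, which is one way to read the idea that some changes are more of “a bad thing” than others.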





