I find it frustrating when a website locks up a dataset behind a query-based search box. It makes sense only when the data are proprietary, and then only if the developer applies sufficient security precautions to prevent mass downloads.
To be clear, when you provide a dataset, a .txt or .pdf file is excellent for human eyes to read, but a .pdf requires OCR analysis to extract data that the developer already had in a perfectly good database.
I find that a pipe-delimited .csv file is the simplest format in which to receive data. For my websites, for example, a cemetery burial list of perhaps 5000 persons, with last name, given names, and other details, can easily be added to a compilation of many cemeteries, where a person can go and learn that Uncle Joe died in Montana and was buried there. The researcher wouldn't already know which cemetery website to visit. Let the cemetery run the cemetery and the cemetery website, and let me and other web developers download the burial lists and compile them into big databases that search engines will crawl.
It makes sense to trap the data behind a query-based search box only if they are proprietary, and even then the search box protects nothing unless there are sufficient safeguards. Some sites return the whole dataset when the end user simply hits the SEARCH button without entering any search terms.
When the data are public, it is best to provide one link to download a simple .txt file or .pdf file for human eyes, and another link to download the same data as a pipe-delimited .csv file for anybody who wants the whole dataset. A search box can help humans who want specific records, but remember that a Google-based search box will yield only the results that are crawled and indexed by Google. News media should provide a table of contents by the date the story ran.
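As a sketch of the pipe-delimited format described above (the field names and sample records here are hypothetical, invented for illustration), Python's standard `csv` module accepts any single-character delimiter, so a "pipe-delimited .csv" is trivial to parse:

```python
import csv
import io

# Hypothetical burial records, pipe-delimited as suggested above.
raw = (
    "last_name|given_names|cemetery|death_year\n"
    "Smith|Joseph A.|Rose Hill|1957\n"
    "Smith|Mary|Rose Hill|1961\n"
)

# csv.reader handles any single-character delimiter, not just commas.
reader = csv.reader(io.StringIO(raw), delimiter="|")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

for r in records:
    print(r["last_name"], r["given_names"], "-", r["cemetery"])
```

A downstream compiler could read each cemetery's file this way and append the rows into one large table, which is the whole point of publishing the raw list alongside the search box.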
You… do know what the “c” in csv stands for, right?
They’ve got it in big databases.
That’s… not what a query-based search box is for. The query-based search box is for when you want the list of burials for Rose Hill cemetery, where there are 250 people buried; you don’t need to pull a dataset of 250,000,000 records from across the country and load down the server with a request for data you don’t want.
The information may well be in the public domain, but that doesn’t mean the database is. In examples like the one you mention, often it has taken hundreds of volunteers hundreds of hours to compile the database, so they won’t feel inclined to give away the whole caboodle.
By the way, this post is just so developers will think about not only the visitor who needs the nearest agent but also the visitor who wants the whole list of agents. There are, of course, reasons not to make it easy to download the whole thing, but there are also developers who never imagined somebody would want to download the whole thing.
Gonna be honest, basically no developer is going to think about someone wanting to download their data unless they are specifically making a dataset for distribution. In fact, most websites put info in their terms of service PROHIBITING you from doing that…
PDF files are largely printable characters, where printable characters include carriage returns and line feeds. Binary data in PDF files, such as images, are encoded using printable characters. Some documents seem to attempt to disable copying, but normally text in PDF files is stored as text, so no OCR is needed to extract it.
There are many things that websites (the people) do that I wish I could change. For example, I think tracking protection (available as a built-in feature in most browsers) should not be confused with ad blocking (something we must use an extension to get).