Frustration with online datasets

bbs · February 5, 2024, 4:23pm

I find it frustrating when a website locks up a dataset behind a query-based searchbox. It makes sense only when the data are proprietary, and then only if the developer applies sufficient security precautions to prevent mass downloads.

To be clear, when you provide a dataset, a .txt or .pdf file is excellent for human eyes to read, but a .pdf requires OCR analysis to extract the data that the developer already had in a perfectly good database.

I find that a pipe-delimited .csv file is the simplest format to receive data. For my websites, for example, a cemetery burial list of perhaps 5000 persons by last name, given names and other details can easily be added to a compilation of many cemeteries, where a person can go and look to learn that Uncle Joe died in Montana and was buried there. The researcher wouldn’t already know which cemetery website to visit. Let the cemetery run the cemetery and the cemetery website, and let me and other web developers download the burial lists and compile them into big databases which search engines will crawl.
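
A rough sketch of that compilation step, assuming each cemetery publishes a pipe-delimited file with a header row. The column names (last_name, given_names, died), the cemetery names, and the compile_burials helper are all made up for illustration:

```python
import csv
import io

def compile_burials(sources):
    """Merge several pipe-delimited burial lists into one list of dicts.

    `sources` maps a cemetery name to the text of its file. Each file is
    assumed to start with the same header row (an assumption of this sketch).
    """
    combined = []
    for cemetery, text in sources.items():
        reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="|")
        for record in reader:
            record["cemetery"] = cemetery  # remember where the row came from
            combined.append(record)
    # Sort by name so one surname can be found across all cemeteries.
    combined.sort(key=lambda r: (r["last_name"], r["given_names"]))
    return combined

# Hypothetical sample data, not taken from any real burial list.
SOURCES = {
    "Rose Hill": "last_name|given_names|died\nSmith|Joe|1931\nAdams|Ruth|1950\n",
    "Pine Grove": "last_name|given_names|died\nSmith|Anna|1922\n",
}

combined = compile_burials(SOURCES)
for row in combined:
    print(row["last_name"], row["given_names"], row["died"], row["cemetery"])
```

In practice the files would come from downloads rather than inline strings, and real lists rarely share identical headers, so some column mapping would probably be needed first.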

It makes sense to trap the data behind a query-based search box only if they are proprietary, in which case, the query-based search box approach is useless to protect the data unless there are sufficient safeguards. Some give the whole dataset when the end-user simply hits the SEARCH button without entering any search terms.
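
One possible safeguard against the empty-search problem, as a minimal sketch only. The search_records function, the two-character minimum, and the result cap are assumptions for illustration, not anything a particular site does:

```python
def search_records(records, term, max_results=50):
    """Return at most `max_results` records whose last name contains `term`."""
    term = term.strip().lower()
    if len(term) < 2:
        # An empty or one-letter query would match nearly every record,
        # which is how a bare SEARCH click ends up dumping the dataset.
        raise ValueError("enter at least two characters to search")
    matches = [r for r in records if term in r["last_name"].lower()]
    return matches[:max_results]  # cap the size of any single response

# Hypothetical sample records.
RECORDS = [
    {"last_name": "Smith", "given_names": "Joe"},
    {"last_name": "Adams", "given_names": "Ruth"},
]

print(search_records(RECORDS, "smi"))
```

On its own this does not stop somebody from looping over many search terms, so it is only one piece of whatever protection proprietary data needs.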

When the data are public, it is best to provide one link to download a simple .txt file or .pdf file for human eyes, and another link to download the same data as a pipe-delimited .csv file for anybody who wants the whole dataset. A search box can help humans who want specific records, but remember that a Google-based search box will yield only the results that are crawled and indexed by Google. News media should provide a table of contents by the date the story ran.

m_hutley · February 5, 2024, 4:27pm

You… do know what the “c” in csv stands for, right?

They’ve got it in big databases.

That’s… not what a query-based search box is for. The query-based search box is for when you want the list of burials for Rose Hill cemetery, where there are 250 people buried: you don’t need to pull a dataset of 250,000,000 records from across the country and load down the server with a request for data you don’t want.

DaveMaxwell · February 5, 2024, 4:34pm

Or copyrighted, or privacy restricted (which is not the same as proprietary), or the data is updated constantly (people do make mistakes in data entry ALL THE TIME).

Or just the fact that they want people to come to their sites, not just come swipe a buttload of data then leave to never come back.

Gandalf · February 5, 2024, 4:50pm

The information may well be in the public domain, but that doesn’t mean the database is. In examples like the one you mention, often it has taken hundreds of volunteers hundreds of hours to compile the database, so they won’t feel inclined to give away the whole caboodle.

Thallius · February 5, 2024, 4:59pm

Wtf…. You want to tell us that not everything we find on the net is free? Incredible.

bbs · February 5, 2024, 6:11pm

That’s what they call it, just like “aluminum angle iron” in construction.

m_hutley · February 5, 2024, 6:14pm

Well, no… they call a Pipe-Separated Values file a PSV… and a Comma-Separated Values file a CSV. And a Tab-Separated Values file a TSV, and…

bbs · February 5, 2024, 6:31pm

Well, in Python I use csv.reader from the csv module, and I can specify the delimiter if it ain’t a comma.
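
Something like this minimal sketch, where the sample rows and the parse_delimited helper are made up for illustration and csv.reader with its delimiter argument is the standard-library part:

```python
import csv
import io

def parse_delimited(text, delimiter="|"):
    """Parse delimited text into a list of rows (each row a list of strings)."""
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# Hypothetical sample data; the quoted field contains the delimiter itself.
SAMPLE = 'last_name|given_names|died\nSmith|Joe|1931\n"O|Brien"|Mary Ann|1940\n'

rows = parse_delimited(SAMPLE)
for row in rows:
    print(row)
```

The same call reads a tab-separated file with delimiter="\t", and quoted fields may contain the delimiter without breaking the row.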

bbs · February 5, 2024, 6:33pm

By the way, this post is just so developers will think about not only the visitor who needs the nearest agent but also the visitor who wants the whole list of agents; there are of course reasons not to make it easy to download the whole thing but also there are developers who never imagined somebody would want to download the whole thing.

m_hutley · February 5, 2024, 6:40pm

Gonna be honest, basically no developer is going to think about someone wanting to download their data unless they are specifically making a dataset for distribution. In fact, most websites put info in their terms of service PROHIBITING you from doing that…

TechnoBear · February 6, 2024, 12:04pm

Off topic:

At least one UK government department doesn’t. They require data uploaded as a .csv file which is pipe-separated.

DaveMaxwell · February 6, 2024, 12:47pm

csv has kind of become the “standard” short name for any delimited file, especially with government agencies. Way too often do I hear “I need a csv file. Can you separate it by the pipe symbol?”

Gandalf · February 6, 2024, 1:39pm

My “csv” files are delimited by a caret ^ (caret separated!)

rpkamp · February 6, 2024, 4:35pm

I love to separate the fields with the letter “c” (“c” separated values)

My users aren’t amused. I don’t know why.

m_hutley · February 6, 2024, 5:23pm

I mean, if we had settled on DSV (Delimiter-Separated Values), the world would have been simpler. But, such is not history.

rpg_digital · February 6, 2024, 5:34pm

off-topic:
This is the same government that freely share their data by leaving it on the train?

SamuelCalifornia · February 9, 2024, 9:49pm

PDF files are entirely printable characters, where printable characters include carriage returns and line feeds. Data in PDF files such as images are encoded using printable characters. Some documents seem to attempt to disable copying but normally text data in PDF files are text.

There are many things that websites (the people) do that I wish I could change. For example, I think they should not call tracking protection (available as a built-in feature in most browsers) ad-blocking (something we must use an extension to get).

Martyr2 · February 9, 2024, 10:14pm

Then it must have “Bugs”… as in Bugs Bunny that is.