Reading unknown data - ideas

dante7 · April 17, 2011, 10:32pm

What would be the best way to analyze collected data if neither the structure nor the data type is known to the developer (assume that I’m using PHP)? In short, I’m trying to collect some personal (public) information about users by reading the HTML of specific pages, since I don’t have access to the databases behind those pages. The content is dynamic: the scope of the data varies from ‘name’ to ‘hobbies’ (for example “hobbies: sports, reading”, which can also be arranged as “interests - diving; football”). I understand that the situation isn’t easy, so maybe you could give me some advice or tips on how to approach it? Any help is appreciated.

sourcez · April 17, 2011, 11:52pm

Well, if you have recurring fields, or only a limited number of possible fields on the site, why not just grab the page HTML and pick them out? You could simply loop through an array of known field names, looking for each one in the content. Or does the user specify the content heading?

Sorry, just reread the post! Could you treat punctuation as delimiters? You’d then be grabbing data that would look a lot like keywords. This would work best if you could define the headers.
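A minimal sketch of the punctuation-as-delimiters idea, written in Python for brevity (in PHP, `preg_split` would play the same role). The function name and the choice of delimiter characters are assumptions made for illustration, not something taken from the pages being scraped.

```python
import re

def split_on_punctuation(text):
    """Split text into keyword-like fragments at common punctuation marks."""
    # Assumed delimiter set: comma, semicolon, colon, period, hyphen.
    parts = re.split(r"[,;:.\-]+", text)
    # Trim whitespace and drop empty fragments.
    return [part.strip() for part in parts if part.strip()]

print(split_on_punctuation("hobbies: sports, reading"))
print(split_on_punctuation("interests - diving; football"))
```

Note that treating the hyphen as a delimiter would also break hyphenated words apart, so the delimiter set probably needs tuning per site.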

dante7 · April 18, 2011, 11:33am

Thanks for the quick reply.
Treating punctuation as delimiters would solve only half of the problems and would, in fact, create new ones. The reason is that the content could be written in narrative sentences, for example:

We visited Hartford, Connecticut, last summer.

In this case, ‘Connecticut’ would be treated as a keyword, although the whole sentence should be treated as a single element.

I should’ve explained from the start that my idea is to read different “résumés”. This means that the content is dynamic, as I mentioned in my previous post. The good news is that people usually tend to separate their content with differently styled headers, like:

[B]Education[/B]

2005-2008 University of...
2001-2005 School of ...

[B]Hobbies[/B]

Art, literature

So in this case, some guidelines could be extracted on how to analyze this text. However, should the user avoid formatting the text, analysis of data would be a big problem (at least as I understand it now).
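As a rough illustration of using styled headers as guidelines, the sketch below splits text on `[B]...[/B]` markers like the ones in the example above. It is in Python for brevity, the function name is made up, and it assumes headers are always wrapped in exactly this tag pair, which real pages may not honor.

```python
import re

# Assumption: every section header is wrapped as [B]Header[/B].
HEADER = re.compile(r"\[B\](.+?)\[/B\]")

def split_by_headers(text):
    """Return {header: body} for each [B]header[/B] section in text."""
    # With one capture group, re.split yields
    # [preamble, header1, body1, header2, body2, ...].
    pieces = HEADER.split(text)
    sections = {}
    for header, body in zip(pieces[1::2], pieces[2::2]):
        sections[header.strip()] = body.strip()
    return sections

resume = """[B]Education[/B]

2005-2008 University of...
2001-2005 School of ...

[B]Hobbies[/B]

Art, literature
"""

for header, body in split_by_headers(resume).items():
    print(header, "->", repr(body))
```

Text with no recognizable header ends up in the discarded preamble, which is exactly the unformatted case described above.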

What would be the solution then? Should I store different headers (and their synonyms) and try to separate the content by searching for them? This approach is very risky because of the dynamic content: it could be almost anything. How would you do it?
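One way the header-synonym idea might look, sketched in Python (PHP's `preg_match` and associative arrays would work similarly). The synonym table, the function names, and the assumption that a header is followed by a colon or hyphen are all hypothetical.

```python
import re

# Hypothetical synonym table: canonical field -> header words that mean it.
SYNONYMS = {
    "hobbies": ["hobbies", "interests", "pastimes"],
    "education": ["education", "schooling", "studies"],
}

def canonical_field(header):
    """Map a raw header to a canonical field name, or None if unknown."""
    key = header.strip().lower()
    for field_name, words in SYNONYMS.items():
        if key in words:
            return field_name
    return None

def parse_line(line):
    """Split 'header: value' or 'header - value' into (field, raw_value)."""
    match = re.match(r"\s*([A-Za-z ]+?)\s*(?::|-)\s*(.+)", line)
    if not match:
        return None
    field_name = canonical_field(match.group(1))
    if field_name is None:
        return None
    return field_name, match.group(2).strip()

print(parse_line("hobbies: sports, reading"))
print(parse_line("interests - diving; football"))
print(parse_line("We visited Hartford, Connecticut, last summer."))
```

Lines whose header is not in the table return `None`, so narrative sentences are left alone rather than chopped into keywords. The cost is that the table has to be maintained by hand.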

wwb_99 · April 18, 2011, 1:30pm

What you typically want to do here is to define a data structure for your application, then let your scrapers take the data they find and push it into that structure. Most of the magic happens out there.

This is a really tricky problem, which is why Google makes the big bucks.
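A small sketch of that arrangement, in Python for brevity: the application owns one fixed structure, and each scraper pushes whatever it found into it. The `Profile` fields and the `push` helper are invented for illustration; a real application would choose its own fields.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Hypothetical target structure that every scraper fills in."""
    education: list = field(default_factory=list)
    hobbies: list = field(default_factory=list)
    # Anything a scraper found but could not classify.
    unclassified: list = field(default_factory=list)

def push(profile, field_name, values):
    """Store values under a known list field, else keep them as unclassified."""
    target = getattr(profile, field_name or "", None)
    if not isinstance(target, list):
        target = profile.unclassified
    target.extend(values)
    return profile

profile = Profile()
push(profile, "hobbies", ["Art", "literature"])
push(profile, "favourite colour", ["blue"])
print(profile)
```

Keeping an `unclassified` bucket means unknown content is preserved for later review instead of being dropped.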
