What would be the best way to analyze collected data if neither the structure nor the data type is known to the developer (assume that I’m using PHP)? In short, I’m trying to collect some personal (public) information about users by reading the HTML of specific pages (since I don’t have access to the databases behind these pages). The content is dynamic: the scope of the data varies from ‘name’ to ‘hobbies’ (for example “hobbies: sports, reading”, which can also be arranged as “interests - diving; football”). I understand that the situation isn’t easy, so maybe you could give me some advice or tips on how to approach this? Any help appreciated.
Well, if you have recurring fields, or only a limited number of possible fields on the site, why not just grab the page HTML and pick them out? You could simply loop through an array of known field names, looking for each one in the content. Or does the user specify the content heading?
Sorry, just reread the post! Could you treat punctuation as delimiters? You’d then be grabbing data that would look a lot like keywords. This would work best if you could define the headers.
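To make the delimiter idea concrete, here is a rough PHP sketch. It assumes each field sits on its own line as “label: values” or “label - values”, as in the two examples from the first post; the function names `splitOnPunctuation` and `parseLabelledLine` are made up for illustration.

```php
<?php
// Split a value string on commas and semicolons, dropping empty pieces.
function splitOnPunctuation(string $text): array
{
    $parts = preg_split('/\s*[,;]\s*/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Break a "label: values" or "label - values" line into a heading and keywords.
// Returns null when the line has no such separator (e.g. a narrative sentence).
function parseLabelledLine(string $line): ?array
{
    if (!preg_match('/^\s*([^:]+?)\s*(?::|\s-\s)\s*(.+)$/', $line, $m)) {
        return null;
    }
    return [
        'label'  => strtolower($m[1]),
        'values' => splitOnPunctuation($m[2]),
    ];
}

$a = parseLabelledLine('hobbies: sports, reading');
$b = parseLabelledLine('interests - diving; football');
```

Both calls give a lowercase label plus a list of keywords. A line without a “:” or “ - ” separator returns null, so it would need some other handling.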
Thanks for the quick reply.
Treating punctuation as delimiters would solve only half of the problems and would, in fact, create new ones. The reason is that the content could be written in narrative sentences, for example:
We visited Hartford, Connecticut, last summer.
In this case, ‘Connecticut’ would be treated as a keyword, although the whole sentence should be treated as a single element.
I should have explained from the start that my idea is to read different “résumés”. This means that the content is dynamic, as I mentioned in my previous post. The good news is that people usually tend to separate their content with distinctly styled headers, like:
[B]Education[/B]
2005-2008 University of...
2001-2005 School of ...
[B]Hobbies[/B]
Art, literature
So in this case, some guidelines could be extracted on how to analyze this text. However, should the user avoid formatting the text, analysis of data would be a big problem (at least as I understand it now).
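For the formatted case, a minimal sketch of that header-based splitting might look like this in PHP. It assumes the headers really are wrapped in `[B]...[/B]` as in the sample above; real pages would more likely use HTML tags, so the pattern would need adjusting. The function name `splitByBoldHeaders` is hypothetical.

```php
<?php
// Split résumé text into sections keyed by their [B]...[/B] header.
function splitByBoldHeaders(string $text): array
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // [text before first header, header 1, body 1, header 2, body 2, ...]
    $chunks = preg_split('/\[B\](.+?)\[\/B\]/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $sections = [];
    for ($i = 1; $i + 1 < count($chunks); $i += 2) {
        $header = trim($chunks[$i]);
        // One entry per non-empty line of the section body.
        $lines = preg_split('/\R/', trim($chunks[$i + 1]), -1, PREG_SPLIT_NO_EMPTY);
        $sections[$header] = array_map('trim', $lines);
    }
    return $sections;
}

$resume = "[B]Education[/B]\n2005-2008 University of...\n2001-2005 School of ...\n"
        . "[B]Hobbies[/B]\nArt, literature";
$sections = splitByBoldHeaders($resume);
// $sections['Hobbies'] is ['Art, literature']
```

Unformatted text gives no headers to split on, so this returns an empty array for it, which is exactly the hard case.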
What would be the solution then? Should I store a list of known headers (and their synonyms) and try to separate the content by searching for them? That approach is very risky because of the dynamic content: a header could be almost anything. How would you do it?
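The header-and-synonyms idea could be sketched roughly as below. The synonym table is invented for illustration and would have to be grown from real pages; `HEADER_SYNONYMS` and `canonicalHeader` are hypothetical names.

```php
<?php
// Hypothetical synonym table: canonical field => headings that map to it.
const HEADER_SYNONYMS = [
    'education' => ['education', 'studies', 'schooling', 'academic background'],
    'hobbies'   => ['hobbies', 'interests', 'pastimes', 'free time'],
];

// Map a heading found on a page to a canonical field name, or null if unknown.
function canonicalHeader(string $heading): ?string
{
    // Strip surrounding whitespace and trailing separators such as ":" or "-".
    $needle = strtolower(trim($heading, " \t\n\r:-"));
    foreach (HEADER_SYNONYMS as $canonical => $synonyms) {
        if (in_array($needle, $synonyms, true)) {
            return $canonical;
        }
    }
    return null; // unknown heading: set it aside rather than guess
}

$field = canonicalHeader('Interests'); // 'hobbies'
```

Returning null for unknown headings keeps the risk contained: unrecognized sections can be logged and reviewed, and the table extended over time.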
What you typically want to do here is define a data structure for your application first, then let your scrapers take whatever data they find and push it into that structure. Most of the magic happens in the scrapers.
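As a rough illustration of “structure first, scrapers push into it”, something like the following could work. The `Profile` class and its fields are assumptions based on the fields mentioned in this thread, and typed properties need PHP 7.4 or later.

```php
<?php
// Hypothetical target structure that every scraper fills in.
final class Profile
{
    public ?string $name = null;
    /** @var string[] */
    public array $education = [];
    /** @var string[] */
    public array $hobbies = [];
    /** @var array<string, string[]> sections that could not be classified */
    public array $unclassified = [];

    // Called by a scraper for each section it manages to extract.
    public function add(string $field, array $values): void
    {
        if ($field === 'education' || $field === 'hobbies') {
            // Merge with what other scrapers already found, without duplicates.
            $merged = array_merge($this->$field, $values);
            $this->$field = array_values(array_unique($merged));
        } else {
            $this->unclassified[$field] = $values;
        }
    }
}

$profile = new Profile();
$profile->add('hobbies', ['Art', 'literature']);
$profile->add('hobbies', ['Art', 'diving']);
// $profile->hobbies is now ['Art', 'literature', 'diving']
```

The point of the fixed structure is that each scraper can be as messy as the page it reads, while the rest of the application only ever sees `Profile` objects.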
This is a really tricky problem, which is why Google makes the big bucks.