We are a humble team trying to scrape websites across the entire Internet and turn them into databases. Eventually we would have a constantly updated collection of maybe 1,000,000 important websites whose data are re-fetched daily or hourly as the sites update.
Programs would then be able to traverse this ocean of data automatically, without having to parse text or web pages at all.
The data are extracted from the raw sources, then cleaned, normalized, tagged, and indexed, ready for searching, re-arranging, re-deploying, auto-analysis, and association, as programs will be able to actually understand them as JSON / RDF.
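To make the idea concrete, here is a minimal sketch of what one such normalized record could look like. The schema, field names, and URL are purely hypothetical examples, not the platform's actual format; the point is that a program can read the data directly as JSON instead of parsing HTML.

```python
import json

# Hypothetical example record: a scraped product page normalized into
# structured data that a program can consume without any HTML parsing.
record = {
    "@type": "Product",
    "source_url": "https://example.com/widgets/42",  # illustrative URL
    "fetched_at": "2024-01-15T08:00:00Z",
    "name": "Widget 42",
    "price": {"amount": 19.99, "currency": "USD"},
    "tags": ["hardware", "widget"],
}

# Serialize and restore: the round trip is lossless, so any consumer
# sees exactly the same structured fields the extractor produced.
serialized = json.dumps(record, sort_keys=True)
restored = json.loads(serialized)
print(restored["price"]["amount"])  # 19.99
```

A downstream program can then filter, join, or re-deploy these records by key, which is what "programs actually understand the data" means in practice.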
This is different from Google in that the data are all properly extracted and structured as databases, rather than raw text searched by keyword-based natural-language AIs.
Does this sound like a good business idea if we manage to pull it off?
The data will be constantly updated and pushed to your endpoint of choice: computer, email, SMS, website, app, server, API, ERP, SAP.
Would you buy a membership like this that enables you to get the data of any website?
What about copyright issues? Wouldn't this mean that my website, which, let's say, lists species of birds (a database I've spent years building, containing thousands of birds and associated data), would now be available for anyone to copy from your service and output onto their own website? Or am I missing something?
Whilst I agree plain facts can't be copyrighted, if I make a specific collection of those facts into a dataset, then that dataset belongs to me. It will have taken me time, effort, and money to create, and you would be profiting off of it.
Taking the bird example further: what if I had been doing field research for years and had physically captured (and then released) birds to measure their length and weight? Whilst those are facts, surely that information belongs to me. Or another example: if I sold electronics and measured every dimension so my users would know the size of a product, surely that dataset belongs to me, as I did the work.
I would adopt something like the YouTube model and pay contributors a royalty when their information is used. Everyone (well, a lot of people) would be happier with that arrangement.
I can understand where you are coming from, @Noppy, but if the data is already on a website I can get it anyway.
We are back to the same thing with photographs on the web. If you don’t want anyone to take it don’t upload it.
But the OP is not like a search engine, as he is not directing the user to the website to view the data, which may leave him open to prosecution.
1,000,000 is not many sites; you may only have a couple of sites with information I want to view, and personally I would not pay for the privilege.
I’m interested to know exactly how you will automate this process. It should be possible on a site that has very clear, semantic, well-structured content and mark-up, or one that offers structured data in some form. But the truth is an awful lot of sites have very shoddy, non-semantic, unstructured, and often invalid mark-up which machines would have difficulty interpreting.
Yes, true. That’s exactly where our platform provides value: the data of the entire Internet can now be easily used, re-mixed, and re-deployed without spending an arm and a leg.
A data platform like this is meant to be scraped, as the data are all public via APIs. So feel free to scrape as much as you want!
Less work for everybody.
Of course you have to pay a small fee to cover the costs.
That’s definitely a way to go. You are right. People’s work should be respected by compensation. That’s a tricky part that we are trying to solve.
But still, more than HALF of all Internet traffic now comes from bots crawling and scraping everything they find useful. Everyone is doing it for their own good. It’s a done deal, and I doubt that will ever change, with people simply stopping scraping and totally respecting other people’s work. God knows if their work is REALLY their own work in the first place, if you know what I mean.
Scraping consumes a lot of Internet bandwidth, especially with thousands of different crawlers and scrapers harvesting the same popular sites over and over again. The same piece of data is scraped by many different people writing the same scrapers, run many times over. It’s a tremendous waste of bandwidth, power, and human resources.
Our solution is to become the go-to place for data on the Internet, where people exchange data under licenses or for compensation, so scraping itself comes to an end, with each piece of data scraped only once but served thousands of times.
It’s a win-win-win for everybody. However, data-source websites can opt out any time they want.
There are far more sites on the Internet, I agree, but sites constantly updated with data / content valuable enough to be scraped number no more than 1,000,000.
How about offering a search over the data, where each data record displayed links back to its source, like Google? Would that make things better? I never thought of this. That’s genius, thank you!
What if our platform saves you 5-10 hours of work? What is that worth to you in dollars?
ISPs offer infrastructure access to the Internet, or human-consumable content, but not data.
We are still investigating the possibility of this business model and only have some prototypes. Maybe you would be interested in giving your insights and suggestions? We would be very much honored and grateful!
I’m not buying into what you’re selling, but if I were, I would recommend starting small with one industry / category and seeing how things go from there. As you have success, expand the portfolio / products to include other industries or categories. I would probably recommend doing some market analysis to determine the best market to start with in terms of ability to afford your service, willingness to pay, and demand. I will say that scraping websites without consent is a legal and technical gray area, not something I would recommend building a business upon. Working hand in hand with companies that supply the data is another story.
That arrangement would seem to imply the site’s agreement to being scraped. Of course, having the agreement stated explicitly, in writing, in a contract with each site would offer the best legal protection.
You can see we have HTML, JSON, Excel, CSV, and PDF for all of the data. More formats like MySQL and MS SQL will be added soon.
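Offering the same records in several formats is mostly a matter of conversion. As an illustrative sketch (with made-up field names, not the platform's actual schema), here is how structured JSON records can be flattened into the CSV form mentioned above:

```python
import csv
import io
import json

# Hypothetical input: two records as they might arrive from a JSON API.
records_json = '[{"name": "Robin", "length_cm": 14.0}, {"name": "Wren", "length_cm": 9.5}]'
records = json.loads(records_json)

# Flatten the records into CSV, one row per record, with a header row,
# so spreadsheet users get the same data as API consumers.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "length_cm"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

Exports to MySQL or MS SQL would follow the same pattern, generating `INSERT` statements from the same normalized records instead of CSV rows.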
Would you be interested in something like this considering we will also have the APPS (https://datasn.io/p/763) for you to combine, manipulate, transform, filter, remix, and integrate the data to your own system / website / app?
After investing too much in this project, we are actually pretty desperate right now. It seems we may have been solving the wrong problem, one that didn’t exist after all.
So I started this thread for your kind suggestions and ideas regarding the business model, seeking any help and insights we can find to brainstorm our way out.
It seems it may be a good time to change course from just data to one of the following areas:
Specialize in data crucial to AI and machine learning, e.g. providing tagged skin-disease images to train programs that identify skin diseases
Shift from data to apps that provide easy-to-use data-pipeline and integration tools, e.g. seamless integration of data from multiple sources for content publication or real-time information analysis
See the Internet as a global climate, with us as the global climate observatory that constantly publishes real-time data, analysis results, and (economic) weather reports.
What do you say about our ideas?
We would really appreciate any help we can get. It’s not easy to be a startup.
If any of you are interested in trying our product, just PM me and I’ll exclusively provide any SitePoint member free access to our entire data collection for 2 years. All we ask is that you provide feedback so we can tune our model to the market. Otherwise we are sitting ducks here.
After seeing the website I’m actually pleasantly surprised. Whether or not this is a viable business I don’t know but it does look like you know what you are doing. Most people just come on here with an idea but you do actually have a decent implementation.
I looked at the Car Parts API that you provided, and I am not sure what it can be used for.
I mean, what customer type are those API calls intended for? I.e., what use case is the data intended for?
Of your ideas, moving on to cover data for machine learning is not a bad one, as there will be high demand for this in the future. Assuming you can provide good data sets, the problem here as well will be marketing, i.e. getting your product out to the people who need it.
Another idea could be to provide datasets that solve a business requirement, for example datasets for validating zip codes, addresses, etc. for different countries. This would be different from your current data model, since it is data you cannot create by spidering other websites.
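A small sketch of what such a validation dataset might look like in practice. The country list and regex patterns below are simplified illustrations, not a complete or authoritative postal-code dataset; the value of the real product would be in its coverage and accuracy.

```python
import re

# Illustrative sample of a per-country postal-code dataset.
# These patterns are simplified examples, not authoritative rules.
POSTAL_PATTERNS = {
    "US": r"\d{5}(-\d{4})?",          # 90210 or 90210-1234
    "GB": r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",  # e.g. SW1A 1AA
    "DE": r"\d{5}",                   # five digits
}

def is_valid_postal_code(country: str, code: str) -> bool:
    """Return True if the code matches the sample pattern for the country."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return False  # country not covered by this sample dataset
    return re.fullmatch(pattern, code) is not None

print(is_valid_postal_code("US", "90210"))     # True
print(is_valid_postal_code("GB", "SW1A 1AA"))  # True
print(is_valid_postal_code("DE", "1234"))      # False: too short
```

Selling this as an API rather than a static file is what makes it a recurring business: formats change and new countries get added, and subscribers get the updates automatically.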