I’m starting work on a big project to publish a lot of government development data online. The data consists of different kinds of datasets: budgets, HR and personnel records, development milestones for various projects, zip codes, census info, etc. I’m looking to build a scalable architecture for this and would like some thoughts on which technology and architecture to pick. I have a working knowledge of databases and have worked with smaller ones, but this is the first time I’m exploring connecting and integrating several databases.
Here are the key requirements:
- All datasets can be assumed to be in the same database format (most likely MySQL).
- The datasets need to be able to connect to each other. In other words, I need to be able to run queries that span multiple datasets (for example: what is the average income of families with at least 2 kids in area code X?).
- The architecture needs to scale so that more datasets can be plugged in easily. I should be able to add a whole new dataset, maybe write a small wrapper, and have everything continue functioning normally.
- Any technologies used need to be open-source and/or free.
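
To make the cross-dataset requirement more concrete, here is a rough sketch of the kind of query I have in mind. It is only an illustration under several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, it treats each dataset as a separate database attached to one connection, and all table and column names (households, zip_codes, income, num_kids, area_code) as well as the sample rows are made up, not the real schema or data.

```python
import sqlite3


def build_demo_connection():
    """Create two separate in-memory databases on one connection.

    SQLite's ATTACH is used here as a stand-in for querying across
    several MySQL schemas; every name and row below is hypothetical.
    """
    conn = sqlite3.connect(":memory:")  # "main" plays the census dataset
    conn.execute("ATTACH DATABASE ':memory:' AS geo")  # second dataset

    conn.execute(
        "CREATE TABLE main.households ("
        "id INTEGER PRIMARY KEY, zip TEXT, income REAL, num_kids INTEGER)"
    )
    conn.execute(
        "CREATE TABLE geo.zip_codes (zip TEXT PRIMARY KEY, area_code TEXT)"
    )

    # Made-up sample rows, purely for illustration.
    conn.executemany(
        "INSERT INTO main.households (zip, income, num_kids) VALUES (?, ?, ?)",
        [
            ("10001", 50000.0, 2),
            ("10001", 70000.0, 3),
            ("10002", 90000.0, 1),
            ("20001", 40000.0, 2),
        ],
    )
    conn.executemany(
        "INSERT INTO geo.zip_codes (zip, area_code) VALUES (?, ?)",
        [("10001", "X"), ("10002", "X"), ("20001", "Y")],
    )
    return conn


def average_income(conn, area_code, min_kids=2):
    """Average income of households with at least min_kids kids in an area.

    The join spans both datasets: households live in one database,
    the zip-to-area mapping in the other.
    """
    row = conn.execute(
        """
        SELECT AVG(h.income)
        FROM main.households AS h
        JOIN geo.zip_codes AS z ON z.zip = h.zip
        WHERE z.area_code = ? AND h.num_kids >= ?
        """,
        (area_code, min_kids),
    ).fetchone()
    return row[0]  # None when no household matches


if __name__ == "__main__":
    demo = build_demo_connection()
    print(average_income(demo, "X"))
```

The point of the sketch is just that the join key (here the zip code) has to exist in both datasets; whatever architecture I end up with needs some way of keeping such shared keys consistent as new datasets are added.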
Eventually the goal is for people to be able to form and run queries via a web interface.
I’d really appreciate any thoughts or pointers on this.