Lately there seems to have been an increase in datafeed-driven affiliate content sites. I myself have made quite a few, and I have also seen the question of what exactly counts as duplicate content discussed a few times recently.
We all know Google treats duplicate content as a "don't," and that you risk being banned or penalized for it. But what exactly is duplicate content? It isn't just affiliate datafeed sites, such as those using Amazon AWS, that carry duplicate content. People often build sites from Wikipedia and DMOZ feeds; is that duplicate content? You could find a press release from Tivo on thousands of news, financial, or electronics websites. Is that duplicate content? What about game cheat sites that all list the same cheats?
I think we can all agree that when a single individual or business owns two websites with the exact same content, that is spam. But what about thousands of websites owned by different people that all use the same content? Amazon AWS (Amazon Web Services) sites are not unique; they offer only affiliate content, so it would seem Google would like to get rid of them in favor of listings for Amazon.com itself. In that situation it is easy to decide who should be listed, because there is a parent company everyone is an affiliate of.
What about game cheat sites, though? If you wanted to get rid of all the duplicate content, how would you decide which site stays? DMOZ editors have faced this issue for a long time: you have two sites with the same content, so which one gets listed? My solution when I was an editor was to list both, on the grounds that one site might be down when a user tries to visit it, so a certain amount of redundancy makes the directory more useful.
New datafeed-enabled affiliate programs show up every day, as do new datafeed-driven websites. Eventually there will be too many, and search engines will have to do something, but what? There will be far too many for manual review, and any automatic system risks hurting legitimate sites that share content, such as news sites and game cheat sites. You might be able to write an algorithm that detects most Amazon AWS sites, but what about the thousands of other affiliate programs out there? And even then you're only catching most of the websites; people will find a way around any filter.
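To illustrate why automated filtering is such a blunt instrument, here is a rough sketch of one common near-duplicate detection idea: break each page's text into overlapping word "shingles" and compare the sets. This is a hypothetical illustration, not any search engine's actual method, and the example texts are made up; the point is that a syndicated press release and a rewritten news story can score as "mostly duplicate" even though both sites are legitimate.

```python
def shingles(text, k=4):
    # Break text into overlapping k-word "shingles".
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    # Jaccard similarity: shared shingles divided by total distinct shingles.
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical example: a press release and a news story quoting most of it.
press_release = ("Tivo today announced a new DVR with expanded "
                 "recording capacity for the living room")
news_story = ("Tivo today announced a new DVR with expanded "
              "recording capacity aimed at home users")

print(similarity(press_release, news_story))
```

A filter that deletes every page scoring above some threshold would wipe out thousands of news sites carrying that press release legitimately, which is exactly the over-matching problem described above.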