A few minutes ago I got a new challenge from my boss, and to be honest, at the moment I have no clue whether, and if so how, this can be developed.
So I thought I would just ask all of you for your thoughts and discuss with you how to solve this problem:
We have a process where customers send questionnaires to us. A level 1 team does a first review and then forwards the questionnaire to specialists, who answer the questions the level 1 team can’t answer. The specialists send it back to the team, and the team sends it back to the customer.
What we now want to achieve is to reduce the need for the specialists. To do so, we want to save all answered questions in a database so we can “reuse” them if they come up again.
For this, the questionnaires must be “parsed” and split into sections, questions and answers.
The problem is that the questionnaires can come in any format. It can be a simple .txt file, a Word document with tables and checkmarks, an Excel file with multiple sheets and even macros, a PowerPoint and so on and so on…
In the end I know that it is impossible to have a 100% solution which can automatically handle all these different types of files, but it would be nice to get the best result for the least development effort.
So for example: Maybe the best way is to convert all the different file formats to a .txt file first? Or maybe there is already a library which is able to convert Excel to Word and PowerPoint to Word, so I can break these formats down to one?
Maybe there is another solution I do not know but you have heard of?
Any help is appreciated.
Who controls the creation of the questionnaires? Your company or the outside customers?
If the customers, I can’t think of a viable way to parse them since you’re not going to know the formats.
If the company, the easier answer would be to create a consistent questionnaire format (or a few if needed) which can then be parsed appropriately. I say easier because as soon as free-text answers apply to the questionnaire, all bets are off when trying to match up answers to questions.
Yes, the questionnaires are from the customers. That’s why they can come in any format. As I said, we do not need to support all formats from the start; we can also begin with one format, like Word or whatever, and add other ones later. But even in that case it would be nice to start with the one that covers most of the others.
It also doesn’t have to be a fully automatic process. I would also be grateful if the software did some kind of preliminary work and then showed the result to a team member, who then moves the right questions to the right answers where the software failed.
To be clear: this is not a small application which must be developed in one week. We are talking about a big, complex application in the end, and that’s why it’s more important for me to have a good basis to start with.
How big are the chances that questions will actually repeat? Is there some data on that? Otherwise calculating ROI on this will be very hard, which in turn makes it impossible to come up with a sensible budget.
That said, parsing natural language is always very hard. A poor man’s solution would keep keywords and check whether keywords overlap between the new question and old questions. You probably want something a little bit fuzzy here to deal with singulars and plurals and so on. Something like trigram matching would be a good first candidate for that. Or maybe soundex, but I don’t know if that supports anything other than English.
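To make the trigram idea a bit more concrete, here is a minimal sketch in plain Python. It is only one possible reading of "trigram matching": each question is broken into overlapping three-character pieces and two questions are compared by the Jaccard overlap of those sets. The function names and the 0.3 threshold are my own assumptions and would need tuning on real questionnaires.

```python
def trigrams(text):
    """Return the set of character trigrams of a normalised string."""
    # Lower-case and collapse whitespace so formatting differences don't matter.
    cleaned = " ".join(text.lower().split())
    if not cleaned:
        return set()
    # Pad so that the start and end of the text produce trigrams too.
    padded = f"  {cleaned} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_matches(question, known_questions, threshold=0.3):
    """Return (score, known_question) pairs at or above the threshold, best first."""
    scored = [(similarity(question, known), known) for known in known_questions]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

A singular/plural difference such as "backup" versus "backups" only changes a handful of trigrams, so the score stays high, while an unrelated question shares almost none. The candidates would still be shown to a level 1 team member rather than trusted blindly.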
In a further iteration you may want to look into machine learning algorithms. Something like K nearest neighbors. But that will only really work on a large enough sample set where the questions aren’t too different from each other. Neural networks might also work, but it’s near impossible to find out why they are spitting out the results they do, so maybe not very applicable here.
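As a rough illustration of the k nearest neighbors idea, the sketch below represents each question as a bag of word counts, ranks the already answered questions by cosine similarity and lets the k closest ones vote on which stored answer to suggest. Everything here (the function names, the answer ids, the choice of cosine similarity over word counts) is an assumption for the sake of the example, and it is no substitute for a proper library once there is enough data.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: maps each lower-cased word to its count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_answer(question, answered, k=3):
    """Suggest an answer id by majority vote among the k most similar questions.

    `answered` is a list of (question_text, answer_id) pairs.
    Returns (suggested_answer_id, neighbours), or (None, []) without data.
    """
    if not answered:
        return None, []
    q = vectorize(question)
    ranked = sorted(answered, key=lambda qa: cosine(q, vectorize(qa[0])), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(answer_id for _, answer_id in neighbours)
    return votes.most_common(1)[0][0], neighbours
```

Returning the neighbours along with the suggestion keeps the result explainable: the reviewer can see which old questions led to the proposed answer.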
In the end it really boils down to how many questions you have available and how many of those actually repeat. Garbage in, garbage out.
An easier approach may just be some sort of build-it-as-you-go. It may take longer to relieve the specialists, but it would be more effective than trying to automagically parse all those different types of formats.
Create a CMS or wiki and build a FAQ. When the level 1s get a question they can’t answer, they first search the CMS/wiki (keywords, tags, content, whatever). If they find it, they can copy/paste the answer and return it to the customer. If not, they open a new entry, which gets routed to the specialists. The specialists can either answer the question (which saves a new entry to the CMS/wiki for future use) or point them to a pre-existing entry (perhaps the question was worded badly).
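The search/escalate/answer loop described above could look roughly like this as an in-memory sketch. The class and method names are hypothetical, there is no persistence, and the "point to an existing entry" case is left out to keep it short. A real CMS or wiki would provide all of this for you.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    question: str
    answer: str
    tags: set = field(default_factory=set)


@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)
    open_tickets: list = field(default_factory=list)

    def search(self, *keywords):
        """Return entries whose question text or tags match every keyword."""
        wanted = [k.lower() for k in keywords]
        return [e for e in self.entries
                if all(k in e.question.lower() or k in e.tags for k in wanted)]

    def escalate(self, question):
        """Level 1 found nothing: queue the question for the specialists."""
        self.open_tickets.append(question)

    def resolve(self, question, answer, tags=()):
        """A specialist answers; the answer is stored for future reuse."""
        self.open_tickets.remove(question)
        entry = Entry(question, answer, {t.lower() for t in tags})
        self.entries.append(entry)
        return entry
```

The first time a question comes in, `search` returns nothing and the question is escalated. After `resolve`, the same search finds the stored entry, so the specialists are not needed for it again.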
If you’re smart, you build level one entries there too. Allows for easy training, and consistent answers.
If you do not want to reinvent the wheel, this sounds to me like a job for Stack Overflow for Teams.