  1. #1
    SitePoint Enthusiast adear11's Avatar
    Join Date
    Oct 2003
    Location
    Richland MS
    Posts
    56

    Need help indexing PDF files in different directories

    I am in need of some ideas about how to solve a problem for a client.

    Their site currently has a section containing board meeting minutes in PDF format. These PDFs have been indexed with htdig and are searchable by keyword.

    They want to add minutes from meetings of another group, and they want those indexed and searchable too.

    The PDF files will be in a structure similar to:

    mydomain.com/pdfs/BoardOfSupervisors/

    mydomain.com/pdfs/WaterAssociation/

    Each of these URLs will contain PDF versions of that group's minutes.

    My question is: what is the best way to separate the indexing of these URLs so that a search for Board of Supervisors minutes doesn't show results from the Water Association minutes? I can't have the two groups' results mixing together.

    Given that I'm a PHP programmer, I would prefer a solution that uses PHP. I haven't been able to determine the best course of action.

  2. #2
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    I'm not familiar with htdig, but if the URL is available, you could run it through parse_url() and extract the directory to provide separation.
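
    A minimal sketch of that idea, assuming each indexed document's URL is available as a string and that the group is the directory directly under /pdfs/, as in the first post. groupFromUrl() and the file name are made up for illustration; parse_url() is the only built-in this idea depends on.

    ```php
    <?php
    // Hypothetical helper: returns the directory directly under /pdfs/
    // for a file URL, or null when the URL does not fit that layout.
    function groupFromUrl(string $url): ?string
    {
        // The URL needs a scheme ("http://..."); without one, parse_url()
        // treats the host name as part of the path.
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            return null;
        }
        // "/pdfs/BoardOfSupervisors/x.pdf" -> ["pdfs", "BoardOfSupervisors", "x.pdf"]
        $segments = explode('/', trim($path, '/'));
        if (count($segments) >= 3 && $segments[0] === 'pdfs') {
            return $segments[1];
        }
        return null;
    }

    // Example file name is an assumption, not from the thread.
    echo groupFromUrl('http://mydomain.com/pdfs/BoardOfSupervisors/2009-01.pdf'), "\n";
    // prints "BoardOfSupervisors"
    ```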

  3. #3
    SitePoint Enthusiast adear11's Avatar
    Join Date
    Oct 2003
    Location
    Richland MS
    Posts
    56
    I don't want to index the files every time someone does a search. Htdig indexes all the current PDF files and then saves the index data to a database. Searches are run against that database. I'm not sure how parsing the URL would help.

  4. #4
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    I figured that maybe the URL is stored in the database.

    Edit:

    I'm not suggesting you index on the fly.
    I'm suggesting you filter your search results on the field that contains the URL.
    parse_url() was just mentioned to hint at this.
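
    To make the filtering idea concrete, here is a rough sketch. It assumes the search hands back rows that each carry a 'url' field; the row layout, the titles, the file names, and filterByDirectory() are all hypothetical, since the thread doesn't say how the htdig data is read from PHP.

    ```php
    <?php
    // Hypothetical result rows, standing in for whatever the search returns.
    $results = [
        ['title' => 'Board minutes', 'url' => 'http://mydomain.com/pdfs/BoardOfSupervisors/jan.pdf'],
        ['title' => 'Water minutes', 'url' => 'http://mydomain.com/pdfs/WaterAssociation/jan.pdf'],
        ['title' => 'Board minutes', 'url' => 'http://mydomain.com/pdfs/BoardOfSupervisors/feb.pdf'],
    ];

    // Keep only the rows whose URL path starts with /pdfs/<directory>/.
    function filterByDirectory(array $results, string $directory): array
    {
        $prefix = '/pdfs/' . $directory . '/';
        $matches = array_filter($results, function (array $row) use ($prefix): bool {
            $path = parse_url($row['url'], PHP_URL_PATH);
            return is_string($path) && strpos($path, $prefix) === 0;
        });
        // Re-index so the caller gets a plain 0..n-1 list.
        return array_values($matches);
    }

    foreach (filterByDirectory($results, 'BoardOfSupervisors') as $row) {
        echo $row['url'], "\n";
    }
    ```

    Nothing is re-indexed here; the filter only looks at rows an existing search has already produced.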

  5. #5
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Aren't they more likely to want to do 3 types of search?

    1 mydomain.com/pdfs/BoardOfSupervisors/ only

    2 mydomain.com/pdfs/WaterAssociation/ only

    3 Both of them

    If this is the case, I think you'd need to tell htdig to create 3 indexes, but since I don't know enough about htdig either, I can't say whether this is possible.

    And doesn't each set of minutes have a corresponding agenda too?
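
    One possible way to get those three search types from a single index, offered as a sketch rather than a tested setup: as far as I can tell from the ht://Dig documentation, htsearch accepts a `words` parameter for the query and a `restrict` parameter that limits results to URLs matching a pattern. Please check both against your installed version. The /cgi-bin/htsearch path, the scope names, and buildSearchUrl() are assumptions for illustration.

    ```php
    <?php
    // Assumption: htsearch is reachable at this path; adjust for the real install.
    const HTSEARCH_URL = '/cgi-bin/htsearch';

    // Hypothetical helper: builds the htsearch query string for one of the
    // three search types (one group, the other group, or both).
    function buildSearchUrl(string $keywords, string $scope): string
    {
        $restrict = [
            'board' => '/pdfs/BoardOfSupervisors/',
            'water' => '/pdfs/WaterAssociation/',
            'both'  => '', // no restriction: search everything indexed
        ];
        if (!array_key_exists($scope, $restrict)) {
            throw new InvalidArgumentException("Unknown scope: $scope");
        }
        $params = ['words' => $keywords];
        if ($restrict[$scope] !== '') {
            $params['restrict'] = $restrict[$scope];
        }
        // Pass '&' explicitly so the result does not depend on php.ini.
        return HTSEARCH_URL . '?' . http_build_query($params, '', '&');
    }

    echo buildSearchUrl('budget', 'water'), "\n";
    echo buildSearchUrl('budget', 'both'), "\n";
    ```

    If separate indexes turn out to be necessary after all, htsearch also appears to take a `config` parameter naming an alternative configuration file, so each group could have its own config and database. I haven't verified that route either.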

