  1. #1
    SitePoint Zealot ricklach's Avatar
    Join Date
    Nov 2004
    Location
    Victoria BC
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Looking for Ideas on Parsing a GEDCOM file

    I posted a message on this topic previously, but at the time I did not realize how difficult the task was going to be. Consequently, I am looking for suggestions on how best to tackle the problem. A GEDCOM file (http://homepages.rootsweb.com/~pmcbr...TO_FAMILY_LINK ) is a highly structured file that follows a fixed pattern. This is how the data for an individual is structured: [code=]INDIVIDUAL_RECORD: =

    n @<XREF:INDI>@ INDI {1:1}
    +1 RESN <RESTRICTION_NOTICE> {0:1}
    +1 <<PERSONAL_NAME_STRUCTURE>> {0:M}
    +1 SEX <SEX_VALUE> {0:1}
    +1 <<INDIVIDUAL_EVENT_STRUCTURE>> {0:M}
    +1 <<INDIVIDUAL_ATTRIBUTE_STRUCTURE>> {0:M}
    +1 <<LDS_INDIVIDUAL_ORDINANCE>> {0:M}
    +1 <<CHILD_TO_FAMILY_LINK>> {0:M}
    +1 <<SPOUSE_TO_FAMILY_LINK>> {0:M}
    +1 SUBM @<XREF:SUBM>@ {0:M}
    +1 <<ASSOCIATION_STRUCTURE>> {0:M}
    +1 ALIA @<XREF:INDI>@ {0:M}
    +1 ANCI @<XREF:SUBM>@ {0:M}
    +1 DESI @<XREF:SUBM>@ {0:M}
    +1 <<SOURCE_CITATION>> {0:M}
    +1 <<MULTIMEDIA_LINK>> {0:M}
    +1 <<NOTE_STRUCTURE>> {0:M}
    +1 RFN <PERMANENT_RECORD_FILE_NUMBER> {0:1}
    +1 AFN <ANCESTRAL_FILE_NUMBER> {0:1}
    +1 REFN <USER_REFERENCE_NUMBER> {0:M}
    +2 TYPE <USER_REFERENCE_TYPE> {0:1}
    +1 RIN <AUTOMATED_RECORD_ID> {0:1}
    +1 <<CHANGE_DATE>> {0:1}[/code]
    Items in <<>> brackets are themselves further structured. For example, <<PERSONAL_NAME_STRUCTURE>> has this structure: [code=]PERSONAL_NAME_STRUCTURE: =

    n NAME <NAME_PERSONAL> {1:1}
    +1 NPFX <NAME_PIECE_PREFIX> {0:1}
    +1 GIVN <NAME_PIECE_GIVEN> {0:1}
    +1 NICK <NAME_PIECE_NICKNAME> {0:1}
    +1 SPFX <NAME_PIECE_SURNAME_PREFIX> {0:1}
    +1 SURN <NAME_PIECE_SURNAME> {0:1}
    +1 NSFX <NAME_PIECE_SUFFIX> {0:1}
    +1 <<SOURCE_CITATION>> {0:M}
    +2 <<NOTE_STRUCTURE>> {0:M}
    +2 <<MULTIMEDIA_LINK>> {0:M}
    +1 <<NOTE_STRUCTURE>> {0:M}[/code]
    Not all fields exist in a record but every field must have some logic associated with it to save it into a database. A typical individual record may look like this: [code=] 1. 0 @I1@ INDI
    2. 1 REFN 1
    3. 1 NAME Maurice /Lampron/ Lacharité
    4. 2 GIVN Maurice
    5. 2 SURN Lampron
    6. 2 NSFX Lacharité
    7. 2 SOUR @S188@
    8. 3 PAGE Page 1
    9. 2 SOUR @S1457@
    10. 1 NAME Maurice /Laspron/ dit Lacharité
    11. 2 GIVN Maurice
    12. 2 SURN Laspron
    13. 2 NSFX dit Lacharité
    14. 2 SOUR @S31@
    15. 3 PAGE See File 1133-1
    16. 2 SOUR @S43@
    17. 3 PAGE See File 1927-1
    18. 2 SOUR @S98@
    19. 3 PAGE See Page 659
    20. 1 NAME Maurice /Lapron/ dit Lacharité
    21. 2 GIVN Maurice
    22. 2 SURN Lapron
    23. 2 NSFX dit Lacharité
    24. 2 SOUR @S247@
    25. 3 PAGE E-copy of the marriage entry for Maurice Lapron dit Lacharité and Jeann
    26. 4 CONC e Archambault.
    27. 1 SEX M
    28. 1 CHAN
    29. 2 DATE 27 NOV 2006
    30. 1 BIRT
    31. 2 DATE 26 AUG 1685
    32. 2 PLAC Nicolet River, Nicolet, Nicolet-Yamaska, Québec, Canada, 461300N0723700W
    33. 2 SOUR @S31@
    34. 3 PAGE See File 1133-1
    35. 2 SOUR @S284@
    36. 3 PAGE Extract from the church register
    37. 2 SOUR @S1457@
    38. 1 EVEN
    39. 2 TYPE Anecdote
    40. 2 NOTE Maurice, son of Jean Laspron dit Lacharite and Anne Michelle Renaud, wa
    41. 3 CONC s born on 26 Aug 1685 and baptized at the cripécuriale de Cressé, as Maur
    42. 3 CONC ice Lapron dit Lacharite on 2 Sep 1685 on the Nicolet River, Nicolet, Qué
    43. 3 CONC bec, Canada. The following is a transcript of the original church record
    44. 3 CONC :
    45. 3 CONT Le deuxième jour de septembre del'an mil six cent quatre vingt cinq pa
    46. 3 CONC r moy [moi], J.G. de Brurlon, curé de l'Eglise paroissiale de Notre Dam
    47. 3 CONC e des Trois Rivières, a esté [été] baptisé en la maison cripécuriale de C
    48. 3 CONC ressé, oû l'on dit la messe, Maurice, fils de Jean Lapron dit Lacharité e
    49. 3 CONC t de Michelle Anne Renaud sa femme, habitants du dit lieu de Cressé. L'E
    50. 3 CONC nfant est né du vingt sixième d'aoust [août] dela mesme [meme] année. So
    51. 3 CONC n parrain fut Maurice Cardin, fils de Pierre Loiseau et la marraine (Loru
    52. 3 CONC sse ?) Lemirre, femme de Pierre Pepin, tous habitants du dit lieu de Cres
    53. 3 CONC sé, lesquels ont déclaré ne ....., si signer, de ce (..quis ?) suivant l'
    54. 3 CONC ordonnance. - J.G. de Brurlon"
    55. 3 CONT He may also have been known as Maurice Laspron dit Lacharité.
    56. 3 CONT At around age 20 he must have moved to Pointe aux Trembles because on 1
    57. 3 CONC 3 Apr 1711 he married Marie Aubuchon, daughter of Jean Aubuchon dit Lespe
    58. 3 CONC rance and Marguerite Sédillot at Eglise Enfant-Jésus, Pointe aux Trembles
    59. 3 CONC , Ile de Montréal, Québec, Canada. He had two known children with Marie b
    60. 3 CONC ut there were probably more. Following Marie Aubuchon's death sometime be
    61. 3 CONC fore 1749, he married Marie Jeanne Archambault, daughter of Laurent Archa
    62. 3 CONC mbault and Anne Courtemanche on 7 Jan 1749 at L'Enfant Jesus, Pointe au
    63. 3 CONC x Trembles, Isle de Montréal, Québec, Canada.
    64. 3 CONT
    65. 3 CONT The coureurs de bois were a hardy and sometimes savage group of Frenchm
    66. 3 CONC en that illicitly traded with the Indians to get the pick of the firs an
    67. 3 CONC d sometimes get the better of them in a trade by getting them drunk. Sin
    68. 3 CONC ce Montréal was the headquarters of these lawless men, it is entirely pos
    69. 3 CONC sible that Maurice was indeed a coureurs de bois. It is known that he wa
    70. 3 CONC s employed by one of the fur trading companies, probably the Company of N
    71. 3 CONC ew France or of a Hundred Associates as it became known, circa 23 May 171
    72. 3 CONC 7. There is plenty more history to be discovered.
    73. 3 CONT
    74. 3 CONT He died on 19 Dec 1749, just 11 months after his second marriage, and wa
    75. 3 CONC s buried on 20 Dec 1749 in the cemetery, Pointe du Trembles, Pointe du Tr
    76. 3 CONC embles, Isle de Montréal, Québec, Canada. He was 64 years old.
    77. 1 DEAT
    78. 2 DATE 19 DEC 1749
    79. 2 PLAC Pointe-aux-Trembles, Montréal, Québec, Canada, 453900N0733000W
    80. 2 SOUR @S247@
    81. 3 PAGE E-copy of the burial entry for Maurice Lapron dit Lacharité.
    82. 1 BURI
    83. 2 DATE 20 DEC 1749
    84. 2 PLAC Cemetery, Pointe-aux-Trembles, Pointe-aux-Trembles, Montréal, Québec, Canada, 453900N0733000W
    85. 2 SOUR @S247@
    86. 3 PAGE E-copy of the burial entry for Maurice Lapron dit Lacharité.
    87. 1 EVEN
    88. 2 TYPE Baptism
    89. 2 DATE 02 SEP 1685
    90. 2 PLAC Cripecuriale de Cressé, Nicolet, Nicolet-Yamaska, Québec, Canada, 461300N0723700W
    91. 2 NOTE The following is from the original church record: "Baptism of Maurice La
    92. 3 CONC pron dit Lacharité - Le deuxième jour de septembre del'an mil six cent qu
    93. 3 CONC atre vingt cinq par moy [moi], J.G. de Brurlon, curé de l'Eglise paroissi
    94. 3 CONC ale de Notre Dame des Trois Rivières, a esté [été] baptisé en la maison c
    95. 3 CONC ripécuriale de Cressé, oû l'on dit la messe, Maurice, fils de Jean Lapro
    96. 3 CONC n dit Lacharité et de Michelle Anne Renaud sa femme, habitants du dit lie
    97. 3 CONC u de Cressé. L'Enfant est né du vingt sixième d'aoust [août] dela mesm
    98. 3 CONC e [meme] année. Son parrain fut Maurice Cardin, fils de Pierre Loiseau e
    99. 3 CONC t la marraine (Lorusse ?) Lemirre, femme de Pierre Pepin, tous habitant
    100. 3 CONC s du dit lieu de Cressé, lesquels ont déclaré ne ....., si signer, de c
    101. 3 CONC e (..quis ?) suivant l'ordonnance. - J.G. de Brurlon"
    102. 2 SOUR @S44@
    103. 1 OBJE
    104. 2 FORM JPEG
    105. 2 TITL Burial
    106. 2 FILE c:\The Master Genealogist\Documents\1-02 Maurice Lampron Lacharite Burial.jpg
    107. 1 OBJE
    108. 2 FORM JPEG
    109. 2 TITL Marriage
    110. 2 FILE c:\The Master Genealogist\Documents\1-03 Maurice Lampron Lacharite Marriage.jpg
    111. 1 OBJE
    112. 2 FORM JPEG
    113. 2 TITL Marriage
    114. 2 FILE c:\The Master Genealogist\Documents\1-01 Maurice Lampron Lacharite Marriage.jpg
    115. 1 FAMS @F1@
    116. 1 FAMS @F2@
    117. 1 FAMS @F3@
    118. 1 FAMC @F4@[/code]
    If we take the initial bit of data [code=]@I1@ INDI
    2. 1 REFN 1
    3. 1 NAME Maurice /Lampron/ Lacharité
    4. 2 GIVN Maurice
    5. 2 SURN Lampron
    6. 2 NSFX Lacharité
    7. 2 SOUR @S188@
    8. 3 PAGE Page 1
    9. 2 SOUR @S1457@
    10. 1 NAME Maurice /Laspron/ dit Lacharité
    11. 2 GIVN Maurice
    12. 2 SURN Laspron
    13. 2 NSFX dit Lacharité
    14. 2 SOUR @S31@
    15. 3 PAGE See File 1133-1
    16. 2 SOUR @S43@
    17. 3 PAGE See File 1927-1
    18. 2 SOUR @S98@
    19. 3 PAGE See Page 659
    20. 1 NAME Maurice /Lapron/ dit Lacharité
    21. 2 GIVN Maurice
    22. 2 SURN Lapron
    23. 2 NSFX dit Lacharité
    24. 2 SOUR @S247@
    25. 3 PAGE E-copy of the marriage entry for Maurice Lapron dit Lacharité and Jeann
    26. 4 CONC e Archambault.[/code]
    the first line (@I1@ INDI) identifies the individual with an ID of 1. This ID must be attached to whatever structure is used, e.g. individual[1] = ... Then each line must be parsed (and concatenated where necessary) to give, for example: [{REFN=> 1}, {NAME=> Maurice /Lampron/ Lacharité, primary=>true} (note: the first name is always the primary name; the others are alias names) {GIVN=> Maurice, SURN=> Lampron, NSFX=> Lacharité, {SOUR=> @S188@, PAGE=> "Page 1"}, SOUR=> @S1457@}... and so on]. This array would contain all the details for one individual, that is, all the data between the (@I1@ INDI) line and the following (@Ix@ INDI) line, where x could be any number. So the module would have to read all the lines between the two @Ix@ lines and place them into an array, and then that array would have to be parsed to separate out the individual components and place them into rows in a table. This is a rather simplistic approach, and it overlooks the fact that one individual can produce several hashes that look alike (for example, three NAME entries), and that is my perplexing problem. There must be an easier way to do this. All suggestions and approaches are welcome.
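One possible way to avoid the "similar-looking hashes" problem is to build a tree instead of a flat hash: each line becomes a node, and a node's level number decides which earlier node it hangs under. Repeated tags such as NAME or SOUR then simply become several children of the same parent. The sketch below is only an illustration under assumptions: `GedNode`, `GEDCOM_LINE` and `parse_gedcom` are made-up names, the forum's leading line numbers are assumed to be already stripped, and CONC/CONT are folded into the parent's value.

```ruby
# Hypothetical node type: one GEDCOM line plus its nested sub-lines.
GedNode = Struct.new(:level, :xref, :tag, :value, :children)

# Matches "LEVEL [@XREF@] TAG [VALUE]".
GEDCOM_LINE = /\A\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?\z/

def parse_gedcom(lines)
  roots = []
  stack = [] # currently open ancestors, shallowest first
  lines.each do |raw|
    m = GEDCOM_LINE.match(raw.chomp)
    next unless m

    node = GedNode.new(m[1].to_i, m[2], m[3], m[4].to_s.dup, [])
    # Close every open node that sits at the same level or deeper.
    stack.pop while stack.any? && stack.last.level >= node.level
    parent = stack.last
    if parent && %w[CONC CONT].include?(node.tag)
      # CONC continues the same line of text; CONT starts a new line.
      parent.value << (node.tag == 'CONT' ? "\n" : '') << node.value
      next
    end
    (parent ? parent.children : roots) << node
    stack.push(node)
  end
  roots
end

sample = <<~GED
  0 @I1@ INDI
  1 REFN 1
  1 NAME Maurice /Lampron/ Lacharité
  2 GIVN Maurice
  2 SOUR @S188@
  3 PAGE Page 1
  1 NAME Maurice /Lapron/ dit Lacharité
  2 SOUR @S247@
  3 PAGE E-copy of the marriage entry for Maurice Lapron dit Lacharité and Jeann
  4 CONC e Archambault.
GED

indi = parse_gedcom(sample.lines).first
names = indi.children.select { |n| n.tag == 'NAME' }
# By the convention mentioned in the post, the first NAME is the primary one.
names.each_with_index do |n, i|
  puts "#{i.zero? ? 'primary' : 'alias'}: #{n.value}"
end
```

Because the tree keeps duplicates as separate children in file order, the "first NAME is primary" rule falls out of the position in the list rather than needing unique keys.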
    Ruby, Ruby when will you be mine

  2. #2
    SitePoint Guru
    Join Date
    Aug 2005
    Posts
    986
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ah, now I understand what the numbers on each line mean.

    You could replace each level number n with 4n spaces of indentation. Then you replace each tag ATTR with ATTR: (the tag followed by a colon). Now you can parse this file as YAML.

  3. #3
    SitePoint Zealot ricklach's Avatar
    Join Date
    Nov 2004
    Location
    Victoria BC
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The line numbers down the side are not part of the original file; each line actually starts with something like 1 NAME Maurice /Lampron/ Lacharité. A 2 means that the line is a sub-tag, and a 3 is a continuation of a sub-tag or of a note/string. The problem I am having trouble wrapping my head around is how to take all the information for one individual, put it into some easily read array with unique tags, and then read that array and put its various components into a names table in a database. For example, the line 1 NAME Maurice /Lampron/ Lacharité is a "NAME" tag with a complete name. The next several lines take that name and assign its components to GEDCOM tags. (BTW, by convention the first name in the list of names is always the primary name, and this would have to be added to the table as primary=true.)
    2 GIVN Maurice: given name=Maurice
    2 SURN Lampron: surname=Lampron
    2 NSFX Lacharité: suffix=Lacharité
    2 SOUR @S188@: source=188, the ID number of the source
    3 PAGE Page 1: page="Page 1", a string that belongs to the note field of the source table
    2 SOUR @S1457@: source=1457, a second source
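The mapping listed above could be sketched roughly as follows. This is only an illustration: the column names (`given_name`, `surname`, `suffix`, `full_name`), the `citations` list and the `name_row` helper are assumptions and do not come from any existing schema, and CONC/CONT continuation lines are assumed to have been merged beforehand.

```ruby
# Hypothetical mapping from GEDCOM name-piece tags to column names.
NAME_COLUMNS = { 'GIVN' => :given_name, 'SURN' => :surname, 'NSFX' => :suffix }.freeze

# lines:   the "1 NAME ..." line followed by its level 2+ sub-lines.
# primary: true for the first NAME of an individual, false for aliases.
def name_row(lines, primary:)
  row = { full_name: nil, primary: primary, citations: [] }
  lines.each do |line|
    level, tag, value = line.strip.split(' ', 3)
    case [level, tag]
    when %w[1 NAME]
      row[:full_name] = value
    when %w[2 SOUR]
      # "@S188@" -> 188, the ID of the cited source record.
      row[:citations] << { source_id: value.to_s[/\d+/].to_i, page: nil }
    when %w[3 PAGE]
      # PAGE belongs to the most recent SOUR line.
      row[:citations].last[:page] = value if row[:citations].any?
    else
      column = NAME_COLUMNS[tag]
      row[column] = value if level == '2' && column
    end
  end
  row
end

primary_name = name_row(
  [
    '1 NAME Maurice /Lampron/ Lacharité',
    '2 GIVN Maurice',
    '2 SURN Lampron',
    '2 NSFX Lacharité',
    '2 SOUR @S188@',
    '3 PAGE Page 1',
    '2 SOUR @S1457@'
  ],
  primary: true
)
p primary_name
```

Keeping the citations as a list inside the row means a name with three sources is no harder to handle than a name with one; each entry can later become a row in the "citation" table.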

    The next name is an alias and follows the same pattern, except that this time primary=false and there are three sources. All sources would be entered into the "citation" table. Then there are tags that give birth, death, burial and marriage information; if those dates are suspect there can be secondary dates, and these show up in an "EVEN" (event) tag. So the problem as I see it is how to collect all the relevant parts of an entry (basically anything that starts with a 1), collect the sub-entries into coherent parts, plug that part of the data into the appropriate tables, and then go on to read one or more lines (again starting with a 1) to complete the next bit of information. So it seems that we need to count the number of lines in the file, read all the lines between the occurrences of 1, reset our counter to where we last left off, and then take the data we collected and turn it into chunks that can be added to a database. To further complicate things, I am a relative neophyte at Ruby on Rails, but in my favor I do like a challenge, and this one fits the bill. So if you see me ask some pretty basic questions, chalk it up to the learning process. The first question is how to build the model, the controller and an output screen for observing each step in making this thing work, because this touches virtually every table in the DB.
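The chunking described above may not need a line count or a manually reset counter. As a rough sketch, Ruby's `Enumerable#slice_before` can cut one individual's lines into groups that each begin at a level-1 line. The helper name `level_one_groups` is made up for illustration, and the input is assumed to be the lines of a single record without the forum's leading line numbers.

```ruby
# Split one individual's lines into groups that each start at a level-1 line.
# The level-0 header line (e.g. "0 @I1@ INDI") forms its own first group.
def level_one_groups(record_lines)
  record_lines
    .map(&:chomp)
    .reject(&:empty?)
    .slice_before { |line| line.match?(/\A[01] /) }
    .to_a
end

record = [
  '0 @I1@ INDI',
  '1 REFN 1',
  '1 NAME Maurice /Lampron/ Lacharité',
  '2 GIVN Maurice',
  '2 SURN Lampron',
  '1 SEX M',
  '1 BIRT',
  '2 DATE 26 AUG 1685'
]

level_one_groups(record).each do |group|
  # The second word of the first line is the tag that decides which table
  # (names, events, ...) the group would go to.
  tag = group.first.split[1]
  puts "#{tag}: #{group.size} line(s)"
end
```

Each group can then be handed to whatever code knows how to store that kind of entry, one group at a time.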

  4. #4
    SitePoint Guru
    Join Date
    Aug 2005
    Posts
    986
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I know that the leftmost numbers are not part of the file ;-).

    Once you have this data in a database it's trivial.

    I really think you should try to transform this into YAML (Ruby already has a YAML parser).

    Try this code (I'm not sure if it works, but I think the approach works):

    Code:
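    require 'yaml'
    
    # Rewrite each "LEVEL TAG value" line as "TAG: value", indented 4 spaces per level.
    yaml = File.readlines('file.txt', chomp: true).map do |l|
    	l.sub(/\A(\d+) (\S+)/) { "#{'    ' * $1.to_i}#{$2}:" }
    end
    
    puts yaml
    
    # Caveat: a tag with both a value and sub-lines (e.g. NAME) is not a valid
    # YAML mapping, so parsing can fail on real GEDCOM data.
    begin
    	p YAML.load(yaml.join("\n"))
    rescue Psych::SyntaxError => e
    	puts "YAML parse failed: #{e.message}"
    end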

  5. #5
    SitePoint Zealot ricklach's Avatar
    Join Date
    Nov 2004
    Location
    Victoria BC
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    At this stage I am willing to try anything. The question is how best to start? Should I create a "gedcom-yaml.rb" file? Do I use the console to inspect the results? I presume this methodology produces a new yaml document and it is that document that will get parsed? I am in unfamiliar territory now and need your guidance. I would like to develop this incrementally so I can inspect the results of each change and get a better appreciation of just what the code is doing. Rick

  6. #6
    SitePoint Guru
    Join Date
    Aug 2005
    Posts
    986
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, first make sure you understand yaml (it's very simple, so this is not a problem). Then figure out how to transform the gedcom file to yaml. You don't have to save the yaml in a file, just keep it in a string. Now you can use the yaml library to parse it.
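As a small illustration of YAML held in a string, the snippet below parses a hand-written fragment for the first NAME entry of the sample record. The layout is an assumption, not something the posts specify: repeated tags are written as lists, and `value` is a made-up key for the text that sits on the tag's own line.

```ruby
require 'yaml'

# Hand-written YAML for part of the first NAME entry in the sample record.
# "value" is a hypothetical key holding the text on the tag's own line.
NAME_YAML = <<~DOC
  NAME:
    - value: "Maurice /Lampron/ Lacharité"
      GIVN: Maurice
      SURN: Lampron
      SOUR:
        - value: "@S188@"
          PAGE: Page 1
        - value: "@S1457@"
DOC

data = YAML.load(NAME_YAML)
puts data['NAME'].first['SURN']
puts data['NAME'].first['SOUR'].map { |s| s['value'] }.inspect
```

The source IDs are quoted because a plain YAML value cannot start with `@`.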

  7. #7
    SitePoint Zealot ricklach's Avatar
    Join Date
    Nov 2004
    Location
    Victoria BC
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK, I have read up on YAML, but I do foresee a bit of a problem. Some of these GEDCOM files can be 10-50 MB in size, and that is a lot of information to put into a string. The PHP model for this transformation reads the file line by line, checks each line and, if required, reads another line, and so on. Given the size of the files, is YAML still the best approach?
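For files of that size, a line-by-line reader similar to the PHP approach can also be written in Ruby. The sketch below is one possible shape, not an established API: `each_record` is a made-up helper that reads from any IO-like object and hands over one level-0 record at a time, so only the current record is held in memory. The demo uses `StringIO` in place of a real file.

```ruby
require 'stringio'

# Yield one level-0 record at a time, as an array of lines.
def each_record(io)
  record = []
  io.each_line do |line|
    line = line.chomp
    next if line.empty?

    # A new level-0 line closes the record collected so far.
    if line.start_with?('0 ') && record.any?
      yield record
      record = []
    end
    record << line
  end
  yield record if record.any?
end

# With a real file this would be File.open(path) { |f| each_record(f) { ... } }.
input = StringIO.new(<<~GED)
  0 @I1@ INDI
  1 NAME Maurice /Lampron/ Lacharité
  1 SEX M
  0 @I2@ INDI
  1 SEX F
GED

each_record(input) do |lines|
  puts "#{lines.first}: #{lines.size} line(s)"
end
```

Each yielded record is small enough to parse and write to the database before the next one is read, whatever format is used for the intermediate step.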

