The Semantic Web for Family History

Summary

For thousands of years, people have kept track of their family history, so genealogy seems an obvious application for an RDF ontology and the Semantic Web.  I've investigated using RDF and the Semantic Web for family history, and the results of that investigation are on this page.  In my work, I created a program to translate files in GEDCOM format to basic XML.  I also wrote several stylesheets that transform that XML into the new GEDCOM XML format, HTML, and RDF.  On this page I discuss the history of genealogy and computers, genealogy markup languages, sources for genealogy data in GEDCOM format, the GEDCOM format, the GEDCOM XML format, the RDF format for family history that I use, a Java program to convert from GEDCOM to basic XML, the stylesheets I've created to transform data between these formats, the resulting data, the advantages of RDF, problems with RDF, and future directions.

History of Genealogy and Computers

In the 1980s, the Church of Jesus Christ of Latter-day Saints created computer software to help individuals keep track of their family history.  They called it PAF, or Personal Ancestry File.  Many companies have also created commercial genealogy programs to do the same thing.  The problem with having all these different programs is that each one stores its data in a different format.  Genealogists love to share data, and incompatible file formats are not conducive to sharing.  For this very reason, the LDS church created the GEDCOM format.  GEDCOM stands for Genealogical Data Communication.  It caught on quite well, and most current commercial and free genealogy programs can import and export GEDCOM files.  This makes it relatively easy to exchange data between genealogists.

The GEDCOM specification has grown and developed over time because newer genealogy programs are able to store more and more information about people.  GEDCOM must be as flexible as possible to allow communication between the many different types of family history programs out there.  The current GEDCOM specification even allows multimedia files, so people can store videos of their kids' birthdays or sound clips of their grandparents' anniversaries.  The current production version of GEDCOM is 5.5.  The next version is GEDCOM 6.0, also called GEDCOM XML.  It is currently in beta and the DTD can be found here.  Other organizations have proposed several other XML vocabularies for genealogy data; some of them are listed in the next section.

Genealogy Markup languages

Many markup languages have been or are being developed for genealogy.  Here are a few of them:

GEDCOM XML - As mentioned above, this is also referred to as GEDCOM 6.0.  It was prepared by the Family and Church History Department of The Church of Jesus Christ of Latter-day Saints.
GedML (Genealogical Data in XML) - Encodes genealogical data sets in XML.  It combines the well-established GEDCOM data model with the XML standard for encoding complex information.
GeniML (Genealogical Information Markup Language) - An XML vocabulary for recording and exchanging genealogical data.
GenXML - A file format for exchanging data between genealogy programs.  It is an alternative to GEDCOM 5.5.

Since GEDCOM XML is the next version of the GEDCOM standard, I believe it will be more popular than the others.  Therefore, I have chosen to work with GEDCOM XML.

Sources for Genealogy Data:

To get GEDCOM data, I found many GEDCOM files on the website www.genealogy.com/famousfolks.  I have also been working on my own genealogy for several years and have several files of my own to experiment with.  Another place to get GEDCOM files is www.familysearch.org.

Gedcom format: 

The Gedcom file format has two major sections.  The first section lists individuals and information about them.  An example is as follows:

0 @I12@ INDI
1 NAME Clarence Earl /Hanlin/
1 SEX M
1 BIRT
2 DATE JUN 1880
1 FAMS @F93@
1 FAMC @F95@
0 @I13@ INDI
1 NAME John William /ASKREN/
1 SEX M
1 BIRT
2 DATE 5 JUN 1893
1 DEAT
2 DATE 1938
1 FAMS @F5@
1 FAMC @F4@

The first number on each line shows the nesting level.  A '0' marks the beginning of a new record, as in "0 @I12@ INDI".  The characters between the '@' symbols are the unique identifier for the individual, and the "INDI" tag shows that this record is an individual.  The second line above starts with a '1', which means it gives more detail about that individual.  A couple of lines down, a line begins with a '2'.  Such a line gives more detail about the level-1 line above it, in this case the date of the birth.  The tags "FAMS" and "FAMC" refer to the families in which the individual is a spouse and a child, respectively.
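The line structure described above can be sketched in a few lines of Python.  This is only an illustration, not the Java converter discussed later on this page; the regular expression and the function name parse_line are my own assumptions, and it handles just the simple line shapes shown in the example.

```python
import re

# One GEDCOM line: level number, optional @ID@ cross-reference, tag, optional value.
# This pattern is an assumption that covers the sample lines, not the full 5.5 grammar.
LINE_RE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s+(.*))?$")

def parse_line(line):
    """Split one GEDCOM line into (level, xref_id, tag, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref_id, tag, value = match.groups()
    return int(level), xref_id, tag, value

sample = """0 @I12@ INDI
1 NAME Clarence Earl /Hanlin/
1 SEX M
1 BIRT
2 DATE JUN 1880
1 FAMS @F93@
1 FAMC @F95@"""

parsed = [parse_line(line) for line in sample.splitlines()]
for level, xref_id, tag, value in parsed:
    # Indent by level to make the nesting visible.
    print("  " * level + tag, xref_id or "", value or "")
```

Lines without a value, such as "1 BIRT", come back with value set to None; their details live on the deeper lines that follow.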

The second section of the GEDCOM file is a list of all the relationships.  An example is as follows:

0 @F5@ FAM
1 HUSB @I13@
1 WIFE @I21@
1 CHIL @I22@
1 CHIL @I23@
1 MARR
2 DATE 18 DEC 1915
2 PLAC Harrison Co,IN
1 DIV N

This shows that family F5 has a husband (I13), a wife (I21), and two children (I22 and I23).  The couple married on 18 DEC 1915 in Harrison Co., IN.
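As a rough sketch of how a program might read such a family record, the Python below collects the members and marriage details of the F5 example into a dictionary.  The function name read_family and the dictionary keys are hypothetical, chosen only for this illustration.

```python
def read_family(lines):
    """Collect husband, wife, children, and marriage details of one FAM record."""
    family = {"id": None, "husband": None, "wife": None,
              "children": [], "marriage": {}}
    parent_tag = None  # most recent level-1 tag, so level-2 lines know their context
    for line in lines:
        parts = line.split(maxsplit=2)
        level = int(parts[0])
        if level == 0:
            family["id"] = parts[1].strip("@")
        elif level == 1:
            parent_tag = parts[1]
            value = parts[2] if len(parts) > 2 else None
            if parent_tag == "HUSB":
                family["husband"] = value.strip("@")
            elif parent_tag == "WIFE":
                family["wife"] = value.strip("@")
            elif parent_tag == "CHIL":
                family["children"].append(value.strip("@"))
        elif level == 2 and parent_tag == "MARR":
            # Details nested under MARR, such as DATE and PLAC.
            family["marriage"][parts[1]] = parts[2]
    return family

record = """0 @F5@ FAM
1 HUSB @I13@
1 WIFE @I21@
1 CHIL @I22@
1 CHIL @I23@
1 MARR
2 DATE 18 DEC 1915
2 PLAC Harrison Co,IN
1 DIV N""".splitlines()

family = read_family(record)
print(family)
```

The husband's identifier, I13, is the same one used by the John William Askren record in the first section, which lists F5 under FAMS.  That shared identifier is how the two sections of the file are tied together.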


GEDCOM XML 6.0 format:

The tags that are important for my purposes are <FamilyRec>, <IndividualRec>, <EventRec>, and <GroupRec>.  FamilyRec is for families and IndividualRec is for individuals; the equivalent tags in GEDCOM 5.5 are FAM and INDI.  EventRec records events such as births, deaths, and marriages.  GEDCOM 5.5 does not have a separate event record.  Instead, it has tags for specific events such as birth (BIRT), marriage (MARR), and death (DEAT), which are nested inside the INDI and FAM records.  GroupRec can store information about a group such as a household, a neighborhood, an orphanage, or a group of homes.  GEDCOM 5.5 does not appear to have such a tag.

Examples of the markup:


<!-- Family Records -->

<FamilyRec Id="FM001">
    <HusbFath>
        <Link Target="IndividualRec" Ref="IN001"/>
    </HusbFath>
    <WifeMoth>
        <Link Target="IndividualRec" Ref="IN002"/>
        <FamilyNbr>2 </FamilyNbr>
        <!-- Her second marriage or family. -->
    </WifeMoth>
    <Child>
        <Link Target="IndividualRec" Ref="IN003"/>
        <ChildNbr>1</ChildNbr>
        <!-- Child's order in family, if birth dates unknown. -->
    </Child>
    <Child>
        <Link Target="IndividualRec" Ref=". . ."/>
        <ChildNbr>2</ChildNbr>
        <RelToFath>adopted</RelToFath>
    </Child>

    <BasedOn>
        <!-- Events justifying the creation of the family, and its members. -->
        <Link Target="EventRec" Ref="EV001"/>
        <Link Target="EventRec" Ref="EV002"/>
        <Note>
            . . .
        </Note>
    </BasedOn>
    <ExternalID Type="AFN" Id="4S3469Q"/>
    <ExternalID Type="submitter" Id="F8945"/>

    <!--
        This is the ID used by the system that produced this GEDCOM file.
        It can be used to communicate changes, differing opinions, and so on, to the file submitter.

    -->
    <Submitter>. . .</Submitter>
    <Note>. . .</Note>
    <Evidence>
        <Citation>
            <!--
                Normally a family is based on (see above) events, and the evidence citations are contained in the events.
                Evidence is allowed in family records for those cases where a family is documented,
                    such as in a family history, but no specific events are known.

            -->
        </Citation>
    </Evidence>
    <Enrichment>
        <Citation>
            <Link Target="SourceRec" Ref="SR002"/>
            <Caption>We Attend the Kunzle Family Reunion</Caption>
            <WhereInSource>
                5 min, 15 sec into the video, to 10 min, 30 sec.
            </WhereInSource>
            <Note>Our family is featured about 5 minutes into the video.</Note>
        </Citation>
    </Enrichment>
    <Changed Date="23 APR 1976" Time="13:25:12">
        <Note>Record created</Note>
    </Changed>
    <Changed Date=". . ." Time=". . .">
    <!-- The Contact here is the person responsible for the change. -->
        <Contact>
            <Link Target="ContactRec" Ref=". . ."/>
        </Contact>
        <Note>Adopted child added</Note>
    </Changed>
</FamilyRec>


<!-- Individual Records -->

<IndividualRec Id="IN001">
    . . .
</IndividualRec>

<IndividualRec Id="IN002">
    <IndivName>
        <NamePart Type="title">Duchess </NamePart>
        <NamePart Type="given name" Level="3"> Neta </NamePart>
        <NamePart Type="maiden name" Level="2"> Eskelson </NamePart>
            von
        <NamePart Type="surname" Level="1">Allen</NamePart>

        <IndNameVariation Method="romanji">
            . . .
        </IndNameVariation>
    </IndivName>
    <IndivName Type="alias">
            . . .
    </IndivName>
    <IndivName Type="nickname">
        . . .
    </IndivName>
    <Gender>F</Gender>
    <DeathStatus>dead</DeathStatus>
    <PersInfo Type ="occupation">
        <Information>seamstress</Information>
        <Date>FROM 1835 TO 1875</Date>
    </PersInfo>
    <PersInfo Type ="residence">
        <Date>FROM 10 JUL 1845 TO 25 MAY 1880</Date>
        <Place>. . .</Place>
    </PersInfo>
    <PersInfo Type ="attribute">
        <Information>5 ft. 4 in. tall, blond hair, blue eyes, well mannered</Information>
    </PersInfo>
    <AssocIndiv>
        <Link Target="IndividualRec" Ref=". . ."/>
        <Association>first ancestor</Association>
        <!--
            This shows how the associated person is related to this person.
            For example, the linked individual is my great uncle.
            The example shown is an oriental cultural requirement.

         -->
        <Note>. . .</Note>
        <Citation>. . .</Citation>
    </AssocIndiv>
    <DupIndiv> p. 30
        <Link Target="IndividualRec" Ref=". . ."/>
        <Note>. . .</Note>
        <Citation>. . .</Citation>
    </DupIndiv>
    <ExternalID Type=". . ." Id=". . ."/>
    <Submitter>. . .</Submitter>
    <Note>. . .</Note>
    <Evidence>. . .</Evidence>
    <Enrichment>. . .</Enrichment>
    <Changed>. . .</Changed>
</IndividualRec>
<IndividualRec Id="IN003">
    . . .
</IndividualRec>

<!-- Event Records -->

<EventRec Id="EV001" Type="marriage" VitalType="marriage">
    <Participant>
        <Link Target="IndividualRec" Ref="IN001"/>
        <Role>husband</Role>
        <Age>26</Age>
    </Participant>
    <Participant>
        <Link Target="IndividualRec" Ref="IN002"/>
        <Role>wife</Role>
        <Age>21</Age>
    </Participant>
    <Date Calendar="Julian">ABT 7 NOV 1834</Date>
    <Place>
        <PlaceName>
            <PlacePart Type="town" Level="4">Cove</PlacePart>,
            <PlacePart Type="county" Level="3">Cache</PlacePart>,
            <PlacePart Type="state" Level="2">Utah</PlacePart>,
            <PlacePart Type="country" Level="1">USA</PlacePart>
            <!--
                This would print as Cove, Cache, Utah, USA.
                In the data stream, each comma is followed by a blank, which can't be seen here.
                The line breaks that we have used for clarity are not in the actual data stream.

            -->
        </PlaceName>
        <Coordinates>18.153N 178.150E</Coordinates>
        <PlaceNameVar Method="kana">. . .</PlaceNameVar>
        <PlaceNameVar Method="romanji">. . .</PlaceNameVar>
    </Place>
    <Religion>Reformed Christian</Religion>
    <ExternalID Type=". . ." Id=". . ."/>
    <Submitter>. . .</Submitter>
    <Note>. . .</Note>
    <Evidence>
        <Citation>
            <Link Target="SourceRec" Ref="SR001"/>
            <WhereInSource>File No. 7895-09, p. 23</WhereInSource>
            <WhenRecorded>10 June 1903</WhenRecorded>
            <Extract>Text extracted from the source.</Extract>
            <Note>Certified copy in possession of Larry T. Smith, Sandy, Utah.</Note>
        </Citation>
    </Evidence>
    <Enrichment>. . ..</Enrichment>
    <Changed>. . .</Changed>
</EventRec>

<EventRec Id="EV002" Type="christening" VitalType="birth">

    <Participant>
        <Link Target="IndividualRec" Ref="IN001"/>
        <Role>father</Role>
    </Participant>
    <Participant>
        <Link Target="IndividualRec" Ref="IN002"/>
        <Role>mother</Role>
    </Participant>
    <Participant>
        <Link Target="IndividualRec" Ref="IN003"/>
        <Role>child</Role>
    </Participant>
    . . .
</EventRec>
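To illustrate how the Link elements tie family, individual, and event records together, here is a small Python sketch that reads a trimmed-down version of the markup above with the standard xml.etree.ElementTree module.  The outer GEDCOM wrapper element, the helper linked_ref, and the shape of the resulting dictionaries are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example records; the <GEDCOM> wrapper is assumed here
# only so that the fragment forms a single well-formed document.
sample = """<GEDCOM>
<FamilyRec Id="FM001">
    <HusbFath><Link Target="IndividualRec" Ref="IN001"/></HusbFath>
    <WifeMoth><Link Target="IndividualRec" Ref="IN002"/></WifeMoth>
    <Child><Link Target="IndividualRec" Ref="IN003"/><ChildNbr>1</ChildNbr></Child>
</FamilyRec>
<EventRec Id="EV001" Type="marriage" VitalType="marriage">
    <Participant><Link Target="IndividualRec" Ref="IN001"/><Role>husband</Role></Participant>
    <Participant><Link Target="IndividualRec" Ref="IN002"/><Role>wife</Role></Participant>
    <Date Calendar="Julian">ABT 7 NOV 1834</Date>
</EventRec>
</GEDCOM>"""

def linked_ref(element):
    """Return the Ref of the element's <Link> child, or None if it is missing."""
    if element is None:
        return None
    link = element.find("Link")
    return link.get("Ref") if link is not None else None

root = ET.fromstring(sample)

families = {}
for fam in root.iter("FamilyRec"):
    families[fam.get("Id")] = {
        "husband": linked_ref(fam.find("HusbFath")),
        "wife": linked_ref(fam.find("WifeMoth")),
        "children": [linked_ref(child) for child in fam.findall("Child")],
    }

events = {}
for event in root.iter("EventRec"):
    events[event.get("Id")] = {
        "type": event.get("Type"),
        "date": event.findtext("Date"),
        # Map each role (husband, wife, ...) to the individual it refers to.
        "roles": {p.findtext("Role"): linked_ref(p) for p in event.findall("Participant")},
    }

print(families)
print(events)
```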

RDF Format


There are various RDF formats that have been created for Family History.  Here are several that I found:

I decided to use the first one, which is based on DAML+OIL, because it seemed to be the simplest.

This RDF vocabulary has two main tags that I'm concerned with: Individual and Family.  Examples can be seen below:

<Individual rdf:ID="I06">
    <name>      <xsd:string rdf:value="Patsy /Schooler/"/>  </name>
    <sex>       <xsd:string rdf:value="F"/>  </sex>
    <spouseIn>  <Family rdf:resource="http://jay.askren.net/gedcom/rdf#F37"/>  </spouseIn>
    <childIn>   <Family rdf:resource="http://jay.askren.net/gedcom/rdf#F11"/>  </childIn>
    <death>
        <date>   <xsd:date rdf:value="26 Jan 1811"/>  </date>
        <place>  <xsd:string rdf:value="Kentucky"/>  </place>
    </death>
</Individual>


<Family rdf:ID="F37">
    <marriage>
        <date>   <xsd:date rdf:value="1763"/>  </date>
        <place>  <xsd:string rdf:value="Page Co., Virginia"/>  </place>
    </marriage>
</Family>

In the example above, the person's name is Patsy Schooler and she is female.  She is a spouse in family "F37", and the Family record for F37 shows that the marriage took place in 1763 in Page Co., Virginia.  She was a child in family "F11".  She died on 26 Jan 1811 in Kentucky.
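The lookup just described (from the individual to her family, then to the marriage) can be sketched in Python.  The fragment above does not show its namespace declarations, so the rdf:RDF wrapper and the default vocabulary namespace http://example.org/gedcom# used below are assumptions for illustration; only the rdf namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GED = "http://example.org/gedcom#"  # hypothetical namespace for the genealogy vocabulary

sample = f"""<rdf:RDF xmlns:rdf="{RDF}"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
         xmlns="{GED}">
  <Individual rdf:ID="I06">
    <name><xsd:string rdf:value="Patsy /Schooler/"/></name>
    <sex><xsd:string rdf:value="F"/></sex>
    <spouseIn><Family rdf:resource="http://jay.askren.net/gedcom/rdf#F37"/></spouseIn>
    <childIn><Family rdf:resource="http://jay.askren.net/gedcom/rdf#F11"/></childIn>
  </Individual>
  <Family rdf:ID="F37">
    <marriage>
      <date><xsd:date rdf:value="1763"/></date>
      <place><xsd:string rdf:value="Page Co., Virginia"/></place>
    </marriage>
  </Family>
</rdf:RDF>"""

def ged(tag):
    """Qualify a vocabulary tag with the (assumed) genealogy namespace."""
    return f"{{{GED}}}{tag}"

def first_child_attr(element, attr):
    """Return the rdf:<attr> attribute of the element's first child, if any."""
    if element is None or len(element) == 0:
        return None
    return element[0].get(f"{{{RDF}}}{attr}")

root = ET.fromstring(sample)

individual = root.find(ged("Individual"))
person = {
    "id": individual.get(f"{{{RDF}}}ID"),
    "name": first_child_attr(individual.find(ged("name")), "value"),
    # Keep only the fragment after '#', which is the family's rdf:ID.
    "spouse_in": first_child_attr(individual.find(ged("spouseIn")), "resource").split("#")[-1],
}

marriages = {}
for fam in root.findall(ged("Family")):
    marriage = fam.find(ged("marriage"))
    marriages[fam.get(f"{{{RDF}}}ID")] = {
        "date": first_child_attr(marriage.find(ged("date")), "value"),
        "place": first_child_attr(marriage.find(ged("place")), "value"),
    }

# Follow the spouseIn link from the individual to her family's marriage details.
print(person["name"], "married:", marriages[person["spouse_in"]])
```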


Here is a graph of a simple genealogy RDF document.

More resources regarding GEDCOM and DAML can be found here

A Java Program to convert from GEDCOM to simple XML.

I wrote a Java program to transform data in the GEDCOM 5.5 format to basic XML.  I also wrote many unit tests for the program to make sure it was working the way I expected it to.  To run the program, type the following command in the folder that contains the jar file:

java -jar GedComConverter.jar inputFile.ged outputFile.xml

where "inputFile.ged" is the input file and "outputFile.xml" is the name of the new file you wish to create.
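The general idea behind such a conversion, turning GEDCOM level numbers into nested XML elements, can be sketched as follows.  This Python sketch is not the Java program and does not reproduce its output format; the root element name "gedcom", the "id" attribute, and the choice to reuse GEDCOM tags as element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def gedcom_to_xml(lines):
    """Build an XML tree whose nesting mirrors the GEDCOM level numbers."""
    root = ET.Element("gedcom")  # hypothetical root element name
    stack = [root]               # stack[n + 1] is the open element at level n
    for line in lines:
        if not line.strip():
            continue
        parts = line.strip().split(maxsplit=2)
        level = int(parts[0])
        if parts[1].startswith("@"):
            # Record line such as "0 @I12@ INDI": the identifier comes before the tag.
            xref = parts[1].strip("@")
            tag, _, value = parts[2].partition(" ")
        else:
            xref = None
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        if level >= len(stack):
            raise ValueError(f"level jumps too far in line: {line!r}")
        del stack[level + 1:]    # close any elements deeper than this line
        element = ET.SubElement(stack[level], tag)
        if xref:
            element.set("id", xref)
        if value:
            element.text = value
        stack.append(element)
    return root

sample = """0 @I12@ INDI
1 NAME Clarence Earl /Hanlin/
1 BIRT
2 DATE JUN 1880
1 FAMS @F93@""".splitlines()

root = gedcom_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Keeping a stack of open elements means each line only needs to know its own level; the parent is always the element one level up.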

I ran some statistics on my Java code using Maven.  Of particular interest are the unit test reports, the JCoverage reports, the Javadocs, and the source and test cross-references (xref).  They are all in the Project Reports menu.

Stylesheets:

Convert from basic XML to GEDCOM XML
Convert from basic XML to simple HTML
Convert from basic XML to RDF
Convert from RDF to a family tree web page

Results:


Name               Original File  Basic XML  GEDCOM XML  Simple HTML  RDF   Family Tree HTML
Abraham Lincoln    view           view       view        view         view  view
Benjamin Franklin  view           view       view        view         view  view
Bill Cosby         view           view       view        view         view  view
Daniel Boone       view           view       view        view         view  view
Davy Crockett      view           view       view        view         view  view
George Washington  view           view       view        view         view  view
George W. Bush     view           view       view        view         view  view
Mark Twain         view           view       view        view         view  view
Thomas Edison      view           view       view        view         view  view
Walt Disney        view           view       view        view         view  view

Advantages of RDF

  1. RDF has a relatively flat and simple structure.  Because of this, it should be much easier to write a stylesheet that converts RDF to another XML format than to write one that converts between two arbitrary XML formats.  With ordinary XML one may need to search deep in the document tree, but RDF doesn't go that deep.
  2. It would be fairly easy to combine two different vocabularies for the same domain.  DAML has tags such as daml:intersectionOf, daml:unionOf, daml:complementOf, daml:inverseOf, daml:equivalentTo, daml:sameClassAs, and daml:sameIndividualAs.  With these tags, one could combine two vocabularies and declare that an Individual in one vocabulary is the same as a Person in another.
  3. RDF was designed around semantics, so it should be easier to make inferences, especially with the help of an RDF processor.
  4. Graphs can easily be drawn to show the relations between objects, such as this graph.
  5. Every RDF resource has a unique identifier, so it can be referred to unambiguously.
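As a small illustration of the kind of inference mentioned in the list above, the Python sketch below treats RDF statements as plain (subject, predicate, object) triples and derives parent-child statements from spouseIn and childIn.  A real RDF processor would do this with rules or an ontology; the predicate name "parentOf" and the function infer_parents are hypothetical and are not part of the vocabulary used on this page.

```python
from collections import defaultdict

# Statements taken from the F5 family example earlier on this page.
triples = [
    ("I13", "spouseIn", "F5"),
    ("I21", "spouseIn", "F5"),
    ("I22", "childIn", "F5"),
    ("I23", "childIn", "F5"),
]

def infer_parents(triples):
    """Derive (parent, 'parentOf', child) for spouses and children of the same family."""
    spouses = defaultdict(list)
    children = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "spouseIn":
            spouses[obj].append(subject)
        elif predicate == "childIn":
            children[obj].append(subject)
    inferred = set()
    for family, parents in spouses.items():
        for parent in parents:
            for child in children.get(family, []):
                inferred.add((parent, "parentOf", child))
    return inferred

inferred = infer_parents(triples)
for triple in sorted(inferred):
    print(triple)
```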

Potential Problems with RDF

  1. There's nothing to stop people from developing multiple RDF schemas for any given domain.  Scouring the web a little, I found four for genealogy.  XML and any other file format suffer from the same problem.
  2. RDF is relatively new, and tools for RDF are scarce compared to those for more seasoned technologies.
  3. Anyone who wants to can write RDF, including RDF that contains false or destructive data.  Someone could write RDF in which a person has a son who is also his father.  Such a cycle is impossible in real life and is thus bad data, and it could also potentially break an RDF processor that follows the links without checking.  This problem is neither new nor unique to RDF; the same could be said of any format.  Data received from other sources should be verified before putting a lot of faith in it.
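One way to verify incoming data against the "son who is also his father" problem is to check the parent links for cycles before trusting them.  The sketch below is a minimal depth-first search in Python; the input shape (a dictionary mapping each person to a list of parents) and the function name find_ancestry_cycle are assumptions for this example.

```python
def find_ancestry_cycle(parents_of):
    """Return a list of ids forming a cycle in the parent links, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(person, path):
        state[person] = IN_PROGRESS
        path.append(person)
        for parent in parents_of.get(person, []):
            status = state.get(parent, UNSEEN)
            if status == IN_PROGRESS:
                # The parent is already on the current path: someone is their own ancestor.
                return path[path.index(parent):] + [parent]
            if status == UNSEEN:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[person] = DONE
        return None

    for person in parents_of:
        if state.get(person, UNSEEN) == UNSEEN:
            cycle = visit(person, [])
            if cycle:
                return cycle
    return None

bad_data = {"son": ["father"], "father": ["son"]}
good_data = {"child": ["mother", "father"], "mother": ["grandmother"]}

print(find_ancestry_cycle(bad_data))
print(find_ancestry_cycle(good_data))
```

Because the search is recursive, a very deep family tree might need an iterative version, but the idea is the same.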

Future Directions:

RDF does have potential.  One could make a networkable database using RDF without too much trouble.  Each RDF record has a unique identifier and can be referred to unambiguously.  Along the same lines, RDF would be good for letting users share data across the network.  Data could be referred to and sent across the network easily because of the standard format and the unique identifiers.  Because genealogists love to share data, this would be a perfect use of RDF.  This is similar to Tim Berners-Lee's vision of the Semantic Web.  Agents could scour the internet searching for other agents that have genealogy data that would be helpful to them.  People could have their family history research done while they are out grocery shopping.  Taking this a step further, genealogy data could be combined with family disease history.  Such data would be extremely useful for family doctors.  The doctor's agent could check with a patient's agent to see whether the patient has a genetic predisposition to certain diseases.  I could imagine such data being extremely useful for medical research as well.