The Semantic Web for Family History
Summary
For thousands of years, people have been keeping track of their family
history. Genealogy therefore seems to be an obvious application of an
RDF ontology and the Semantic Web. I've investigated making use of RDF
and the Semantic Web for family history, and the results of my
investigation are on this web page. In my work, I created a program to
translate files in GEDCOM format to XML. I also wrote several
stylesheets which translate that data into GEDCOM XML, HTML, and RDF.
On this page I discuss the History of Genealogy and Computers,
Genealogy Markup Languages, Sources for Genealogy Data in GEDCOM
format, the GEDCOM format, the GEDCOM XML format, the RDF format for
Family History that I use, a Java Program to convert from GEDCOM to
basic XML, Stylesheets I've created to transform data between these
different formats, the resulting data that I transformed using my
Stylesheets, Advantages of RDF, Problems with RDF, and Future
Directions.
History of Genealogy and Computers
In the 1980s, the Church of Jesus Christ of Latter-day Saints created
computer software to help individuals keep track of their family
history: PAF, or Personal Ancestry File. Many companies have also
created commercial genealogy programs to do the same thing. The problem
with having all these different genealogy programs is that they store
their data in different formats. Genealogists love to share data, and
having all these different file formats is not conducive to sharing.
For this very reason, the LDS church created the GEDCOM format. GEDCOM
stands for Genealogical Data Communication. It caught on quite well,
and most current commercial and free genealogy programs can import and
export GEDCOM files. This makes it relatively easy to exchange data
between genealogists. The GEDCOM specification has grown and developed
over time, because newer genealogy programs are able to store more and
more information about people. GEDCOM must be as flexible as possible
to allow communication between the many different types of family
history programs out there. The current GEDCOM specification even
allows multimedia files, so people can store videos of their kids'
birthdays or sound clips of their grandparents' anniversaries. The
current production version of GEDCOM is version 5.5. The next version
is GEDCOM 6.0, also called GEDCOM XML. It is currently in beta and the
DTD can be found here.
Several other XML vocabularies for genealogy data have been proposed by
other organizations.
Genealogy Markup Languages
Many markup languages have been or are being developed for genealogy.
Here are a few of them:
GEDCOM XML - As mentioned above, this is also referred to as GEDCOM
6.0. It was prepared by the Family and Church History Department of The
Church of Jesus Christ of Latter-day Saints.
GedML - Genealogical Data in XML. It encodes genealogical data sets in
XML, combining the well-established GEDCOM data model with the XML
standard for encoding complex information.
GeniML - Genealogical Information Markup Language. An XML vocabulary
for recording and exchanging genealogical data.
GenXML - A file format for exchange of data between genealogy programs.
It is an alternative to GEDCOM 5.5.
Since GEDCOM XML is the next version of the GEDCOM standard, I believe
it will be more popular than the others. Therefore, I have chosen to
work with GEDCOM XML instead of the others.
Sources for Genealogy Data:
To get GEDCOM data, I found many GEDCOM files on the website
www.genealogy.com/famousfolks. I have also been working on my own
genealogy for several years now and have several of my own files I can
play with. Another place to get GEDCOM files is www.familysearch.org.
GEDCOM Format
The GEDCOM file format has two major sections. The first section
lists individuals and information about them. An example is as
follows:
0 @I12@ INDI
1 NAME Clarence Earl /Hanlin/
1 SEX M
1 BIRT
2 DATE JUN 1880
1 FAMS @F93@
1 FAMC @F95@
0 @I13@ INDI
1 NAME John William /ASKREN/
1 SEX M
1 BIRT
2 DATE 5 JUN 1893
1 DEAT
2 DATE 1938
1 FAMS @F5@
1 FAMC @F4@
The first number on each line shows the nesting level. A '0' marks the
beginning of a new record, as in "0 @I12@ INDI". The characters between
the '@' symbols are the unique identifier for the individual, and
"INDI" shows that this record is an individual. The second line above
starts with a '1', which means that it gives more detail about that
individual. A couple of lines down, a line begins with a '2'. It gives
more detail about the line above it, in this case the date of the
birth. The tags "FAMS" and "FAMC" refer to the families in which the
individual is a spouse and a child, respectively.
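The line structure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the parser used for this page: the names `GedcomLine` and `parse_line` are made up for the example, and the pattern handles only the simple "level, optional @id@, tag, optional value" shape seen in the samples.

```python
import re
from typing import NamedTuple, Optional

# level, optional @XREF@ identifier, tag, optional value
LINE_RE = re.compile(r"^(\d+)\s+(?:@([^@]+)@\s+)?(\w+)(?:\s+(.*))?$")


class GedcomLine(NamedTuple):
    level: int            # nesting depth: 0 starts a new record
    xref: Optional[str]   # identifier between the '@' symbols, if any
    tag: str              # e.g. INDI, NAME, BIRT, DATE, FAMS, FAMC
    value: Optional[str]  # remaining text on the line, if any


def parse_line(line: str) -> GedcomLine:
    """Split one GEDCOM line into its level, identifier, tag, and value."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return GedcomLine(int(level), xref, tag, value)


for raw in ["0 @I12@ INDI", "1 NAME Clarence Earl /Hanlin/", "1 BIRT", "2 DATE JUN 1880"]:
    print(parse_line(raw))
```

Note that a pointer such as "@F93@" on a FAMS line ends up in `value`, since only the record's own identifier precedes the tag.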
The second section of the GEDCOM file is a list of all the
relationships. An example is as follows:
0 @F5@ FAM
1 HUSB @I13@
1 WIFE @I21@
1 CHIL @I22@
1 CHIL @I23@
1 MARR
2 DATE 18 DEC 1915
2 PLAC Harrison Co,IN
1 DIV N
This shows that family F5 has a husband, a wife, and two children.
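To show how the two sections refer to each other, here is a small Python sketch that groups lines into records and follows the HUSB pointer of a family back to an individual. It is a simplified illustration: `read_records` is a hypothetical helper, it keeps only level-1 lines, and the sample text is trimmed from the examples above.

```python
from collections import defaultdict

SAMPLE = """\
0 @I13@ INDI
1 NAME John William /ASKREN/
1 FAMS @F5@
0 @F5@ FAM
1 HUSB @I13@
1 WIFE @I21@
1 CHIL @I22@
1 CHIL @I23@
"""


def strip_pointer(value):
    """Turn a pointer like '@I13@' into 'I13'; leave other values alone."""
    if len(value) > 2 and value.startswith("@") and value.endswith("@"):
        return value[1:-1]
    return value


def read_records(text):
    """Map each record identifier to (record type, {tag: [values]})."""
    records = {}
    current = None
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        level = int(parts[0])
        if level == 0:
            # Only records of the form "0 @ID@ TYPE" are kept in this sketch.
            if len(parts) == 3 and parts[1].startswith("@"):
                current = defaultdict(list)
                records[strip_pointer(parts[1])] = (parts[2], current)
            else:
                current = None
        elif level == 1 and current is not None:
            value = parts[2] if len(parts) > 2 else ""
            current[parts[1]].append(strip_pointer(value))
    return records


records = read_records(SAMPLE)
kind, family = records["F5"]
husband = records[family["HUSB"][0]][1]
print(kind, husband["NAME"][0], family["CHIL"])
```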
GEDCOM XML Format
In GEDCOM XML, the tags that are important for my purposes are
<FamilyRec>, <IndividualRec>, <EventRec>, and <GroupRec>. FamilyRec is
for families and IndividualRec is for individuals. The equivalent tags
in GEDCOM 5.5 are FAM and INDI. EventRec stands for events such as
births, deaths, and marriages. GEDCOM 5.5 does not have a general event
tag. It does, however, have tags for specific events such as birth
(BIRT), marriage (MARR), and death (DEAT). GroupRec can store
information about a group such as a household, a neighborhood, an
orphanage, or a group of homes. GEDCOM 5.5 does not appear to have such
a tag.
Examples of the markup:
<!-- Family Records -->
<FamilyRec Id="FM001">
  <HusbFath>
    <Link Target="IndividualRec" Ref="IN001"/>
  </HusbFath>
  <WifeMoth>
    <Link Target="IndividualRec" Ref="IN002"/>
    <FamilyNbr>2 </FamilyNbr>
    <!-- Her second marriage or family. -->
  </WifeMoth>
  <Child>
    <Link Target="IndividualRec" Ref="IN003"/>
    <ChildNbr>1</ChildNbr>
    <!-- Child's order in family, if birth dates unknown. -->
  </Child>
  <Child>
    <Link Target="IndividualRec" Ref=". . ."/>
    <ChildNbr>2</ChildNbr>
    <RelToFath>adopted</RelToFath>
  </Child>
  <BasedOn>
    <!-- Events justifying the creation of the family, and its members. -->
    <Link Target="EventRec" Ref="EV001"/>
    <Link Target="EventRec" Ref="EV002"/>
    <Note>. . .</Note>
  </BasedOn>
  <ExternalID Type="AFN" Id="4S3469Q"/>
  <ExternalID Type="submitter" Id="F8945"/>
  <!--
    This is the ID used by the system that produced this GEDCOM file.
    It can be used to communicate changes, differing opinions, and so on,
    to the file submitter.
  -->
  <Submitter>. . .</Submitter>
  <Note>. . .</Note>
  <Evidence>
    <Citation>
      <!--
        Normally a family is based on (see above) events, and the evidence
        citations are contained in the events. Evidence is allowed in family
        records for those cases where a family is documented, such as in a
        family history, but no specific events are known.
      -->
    </Citation>
  </Evidence>
  <Enrichment>
    <Citation>
      <Link Target="SourceRec" Ref="SR002"/>
      <Caption>We Attend the Kunzle Family Reunion</Caption>
      <WhereInSource>5 min, 15 sec into the video, to 10 min, 30 sec.</WhereInSource>
      <Note>Our family is featured about 5 minutes into the video.</Note>
    </Citation>
  </Enrichment>
  <Changed Date="23 APR 1976" Time="13:25:12">
    <Note>Record created</Note>
  </Changed>
  <Changed Date=". . ." Time=". . .">
    <!-- The Contact here is the person responsible for the change. -->
    <Contact>
      <Link Target="ContactRec" Ref=". . ."/>
    </Contact>
    <Note>Adopted child added</Note>
  </Changed>
</FamilyRec>
<!-- Individual Records -->
<IndividualRec Id="IN001">
  . . .
</IndividualRec>
<IndividualRec Id="IN002">
  <IndivName>
    <NamePart Type="title">Duchess </NamePart>
    <NamePart Type="given name" Level="3"> Neta </NamePart>
    <NamePart Type="maiden name" Level="2"> Eskelson </NamePart>
    von
    <NamePart Type="surname" Level="1">Allen</NamePart>
    <IndNameVariation Method="romanji">
      . . .
    </IndNameVariation>
  </IndivName>
  <IndivName Type="alias">
    . . .
  </IndivName>
  <IndivName Type="nickname">
    . . .
  </IndivName>
  <Gender>F</Gender>
  <DeathStatus>dead</DeathStatus>
  <PersInfo Type="occupation">
    <Information>seamstress</Information>
    <Date>FROM 1835 TO 1875</Date>
  </PersInfo>
  <PersInfo Type="residence">
    <Date>FROM 10 JUL 1845 TO 25 MAY 1880</Date>
    <Place>. . .</Place>
  </PersInfo>
  <PersInfo Type="attribute">
    <Information>5 ft. 4 in. tall, blond hair, blue eyes, well mannered</Information>
  </PersInfo>
  <AssocIndiv>
    <Link Target="IndividualRec" Ref=". . ."/>
    <Association>first ancestor</Association>
    <!--
      This shows how the associated person is related to this person.
      For example, the linked individual is my great uncle.
      The example shown is an oriental cultural requirement.
    -->
    <Note>. . .</Note>
    <Citation>. . .</Citation>
  </AssocIndiv>
  <DupIndiv>
    <Link Target="IndividualRec" Ref=". . ."/>
    <Note>. . .</Note>
    <Citation>. . .</Citation>
  </DupIndiv>
  <ExternalID Type=". . ." Id=". . ."/>
  <Submitter>. . .</Submitter>
  <Note>. . .</Note>
  <Evidence>. . .</Evidence>
  <Enrichment>. . .</Enrichment>
  <Changed>. . .</Changed>
</IndividualRec>
<IndividualRec Id="IN003">
  . . .
</IndividualRec>
<!-- Event Records -->
<EventRec Id="EV001" Type="marriage" VitalType="marriage">
  <Participant>
    <Link Target="IndividualRec" Ref="IN001"/>
    <Role>husband</Role>
    <Age>26</Age>
  </Participant>
  <Participant>
    <Link Target="IndividualRec" Ref="IN002"/>
    <Role>wife</Role>
    <Age>21</Age>
  </Participant>
  <Date Calendar="Julian">ABT 7 NOV 1834</Date>
  <Place>
    <PlaceName>
      <PlacePart Type="town" Level="4">Cove</PlacePart>,
      <PlacePart Type="county" Level="3">Cache</PlacePart>,
      <PlacePart Type="state" Level="2">Utah</PlacePart>,
      <PlacePart Type="country" Level="1">USA</PlacePart>
      <!--
        This would print as Cove, Cache, Utah, USA.
        In the data stream, each comma is followed by a blank, which can't
        be seen here. The line breaks that we have used for clarity are
        not in the actual data stream.
      -->
    </PlaceName>
    <Coordinates>18.153N 178.150E</Coordinates>
    <PlaceNameVar Method="kana">. . .</PlaceNameVar>
    <PlaceNameVar Method="romanji">. . .</PlaceNameVar>
  </Place>
  <Religion>Reformed Christian</Religion>
  <ExternalID Type=". . ." Id=". . ."/>
  <Submitter>. . .</Submitter>
  <Note>. . .</Note>
  <Evidence>
    <Citation>
      <Link Target="SourceRec" Ref="SR001"/>
      <WhereInSource>File No. 7895-09, p. 23</WhereInSource>
      <WhenRecorded>10 June 1903</WhenRecorded>
      <Extract>Text extracted from the source.</Extract>
      <Note>Certified copy in possession of Larry T. Smith, Sandy, Utah.</Note>
    </Citation>
  </Evidence>
  <Enrichment>. . .</Enrichment>
  <Changed>. . .</Changed>
</EventRec>
<EventRec Id="EV002" Type="christening" VitalType="birth">
  <Participant>
    <Link Target="IndividualRec" Ref="IN001"/>
    <Role>father</Role>
  </Participant>
  <Participant>
    <Link Target="IndividualRec" Ref="IN002"/>
    <Role>mother</Role>
  </Participant>
  <Participant>
    <Link Target="IndividualRec" Ref="IN003"/>
    <Role>child</Role>
  </Participant>
  . . .
</EventRec>
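Because events in GEDCOM XML point at individuals through Link elements, a reader has to follow the Ref attributes to find out who took part. The following Python sketch shows one way to do that with the standard library. The function name `summarize_event` is an assumption for illustration, and the input is a trimmed copy of the EV001 record above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the marriage event shown above.
EVENT_XML = """
<EventRec Id="EV001" Type="marriage" VitalType="marriage">
  <Participant>
    <Link Target="IndividualRec" Ref="IN001"/>
    <Role>husband</Role>
    <Age>26</Age>
  </Participant>
  <Participant>
    <Link Target="IndividualRec" Ref="IN002"/>
    <Role>wife</Role>
    <Age>21</Age>
  </Participant>
  <Date Calendar="Julian">ABT 7 NOV 1834</Date>
</EventRec>
"""


def summarize_event(xml_text):
    """Return the event id, type, date, and (individual id, role) pairs."""
    event = ET.fromstring(xml_text.strip())
    participants = [
        (p.find("Link").get("Ref"), p.findtext("Role"))
        for p in event.findall("Participant")
    ]
    return {
        "id": event.get("Id"),
        "type": event.get("Type"),
        "date": event.findtext("Date"),
        "participants": participants,
    }


summary = summarize_event(EVENT_XML)
print(summary)
```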
RDF Format
There are various RDF formats that have been created for Family
History. Here are several that I found:
I decided to use the first one, which is based on DAML+OIL, because it
seemed to be the simplest.
This version of RDF has two main tags that I'm concerned with:
Individual and Family. Examples can be seen below:
<Individual rdf:ID="I06">
  <name> <xsd:string rdf:value="Patsy /Schooler/"/> </name>
  <sex> <xsd:string rdf:value="F"/> </sex>
  <spouseIn> <Family rdf:resource="http://jay.askren.net/gedcom/rdf#F37"/> </spouseIn>
  <childIn> <Family rdf:resource="http://jay.askren.net/gedcom/rdf#F11"/> </childIn>
  <death>
    <date> <xsd:date rdf:value="26 Jan 1811"/> </date>
    <place> <xsd:string rdf:value="Kentucky"/> </place>
  </death>
</Individual>
<Family rdf:ID="F37">
  <marriage>
    <date> <xsd:date rdf:value="1763"/> </date>
    <place> <xsd:string rdf:value="Page Co., Virginia"/> </place>
  </marriage>
</Family>
In the example above, this person's name is Patsy Schooler and she is
female. She is a spouse in family "F37", which records a marriage in
1763 in Page Co., Virginia. She was a child in family "F11".
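The reading of the example given above (follow spouseIn to the Family, then read its marriage) can be sketched in Python as follows. This is an illustration only. The snippet above omits its namespace declarations, so the wrapper element and both non-RDF namespace URIs below are assumptions: `GEN_NS` is a hypothetical namespace for the genealogy vocabulary, and the xsd URI is a guess at the schema namespace in use. The helper names are made up as well.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GEN_NS = "http://example.org/gedcom#"  # hypothetical vocabulary namespace
XSD_NS = "http://www.w3.org/2000/10/XMLSchema#"  # assumed, not taken from the page
NS = {"rdf": RDF_NS, "g": GEN_NS}

RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:xsd="{XSD_NS}" xmlns="{GEN_NS}">
  <Individual rdf:ID="I06">
    <name> <xsd:string rdf:value="Patsy /Schooler/"/> </name>
    <spouseIn> <Family rdf:resource="http://jay.askren.net/gedcom/rdf#F37"/> </spouseIn>
  </Individual>
  <Family rdf:ID="F37">
    <marriage>
      <date> <xsd:date rdf:value="1763"/> </date>
      <place> <xsd:string rdf:value="Page Co., Virginia"/> </place>
    </marriage>
  </Family>
</rdf:RDF>
"""

RDF_ID = f"{{{RDF_NS}}}ID"
RDF_RESOURCE = f"{{{RDF_NS}}}resource"
RDF_VALUE = f"{{{RDF_NS}}}value"


def literal(element, path):
    """Return the rdf:value of the typed literal found under the given path."""
    node = element.find(path + "/*", NS)
    return node.get(RDF_VALUE) if node is not None else None


def marriages_of(root, individual_id):
    """List (name, marriage date, marriage place) for each family the person is a spouse in."""
    results = []
    for person in root.findall("g:Individual", NS):
        if person.get(RDF_ID) != individual_id:
            continue
        for ref in person.findall("g:spouseIn/g:Family", NS):
            family_id = ref.get(RDF_RESOURCE).split("#")[-1]
            for family in root.findall("g:Family", NS):
                if family.get(RDF_ID) == family_id:
                    results.append((
                        literal(person, "g:name"),
                        literal(family, "g:marriage/g:date"),
                        literal(family, "g:marriage/g:place"),
                    ))
    return results


root = ET.fromstring(RDF_XML.strip())
print(marriages_of(root, "I06"))
```

A real RDF processor would work on triples instead of walking the XML tree; the tree walk is used here only to keep the sketch within the standard library.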
Here is a graph of a simple genealogy RDF document.
More resources regarding GEDCOM and DAML can be found here.
A Java Program to convert between GEDCOM and simple XML.
I wrote a Java program to transform data in the GEDCOM 5.5 format to
basic XML. I also wrote many unit tests for the program to make sure
it was working the way I expected it to. To run the Java program, type
the following command in the folder which contains the jar file:
java -jar GedComConverter.jar inputFile.ged outputFile.xml
where "inputFile.ged" is the input file and "outputFile.xml" is the
name of the new file you wish to create.
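For readers who want to see the general idea of such a conversion, here is a minimal Python sketch. It is not the Java program described above and does not claim to reproduce its output: the root element name "GEDCOM", the "id" attribute, and the choice to reuse GEDCOM tags as element names are all assumptions made for illustration. The nesting level on each line decides which element becomes the parent.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(text):
    """Build an XML tree whose nesting mirrors the GEDCOM level numbers."""
    root = ET.Element("GEDCOM")  # assumed root element name
    stack = [root]  # stack[n] is the parent for a line at level n
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        level, rest = int(parts[0]), parts[1]
        tokens = rest.split(maxsplit=1)
        attrs = {}
        if tokens[0].startswith("@") and len(tokens) > 1:
            # Record line such as "0 @I12@ INDI": keep the identifier as an attribute.
            attrs["id"] = tokens[0].strip("@")
            tokens = tokens[1].split(maxsplit=1)
        # Drop elements deeper than the current level; well-formed input
        # never skips a level, so stack[level] exists at this point.
        del stack[level + 1:]
        element = ET.SubElement(stack[level], tokens[0], attrs)
        if len(tokens) > 1:
            element.text = tokens[1]
        stack.append(element)
    return root


sample = "0 @I12@ INDI\n1 NAME Clarence Earl /Hanlin/\n1 SEX M\n1 BIRT\n2 DATE JUN 1880\n"
tree = gedcom_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
```

Keeping the output this close to the input leaves all interpretation (names, dates, family links) to later stylesheet steps.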
I ran some statistics on my Java code using Maven. Of particular
interest are the unit test reports, the JCoverage reports, the
Javadocs, and the source and test xref (cross reference). They are all
in the Project Reports menu.
Stylesheets:
Convert From Basic XML to GEDCOM XML
Convert From Basic XML to simple HTML
Convert From Basic XML to RDF
Convert From RDF to a Family Tree Web Page
Results:
- RDF has several advantages over other formats. First, it is a
relatively flat and simple structure. Because of this, it should be
much easier to write a stylesheet that converts RDF to another XML file
format than to write one that converts between two arbitrary XML
formats. One may need to search deep in the tree of an XML document,
but RDF doesn't go that deep.
- Another nice advantage is that it would be fairly easy to combine two
different vocabularies for the same domain. DAML has tags such as
daml:intersectionOf, daml:unionOf, daml:complementOf, daml:inverseOf,
daml:equivalentTo, daml:sameClassAs, and daml:sameIndividualAs. With
these tags, one could easily combine two different vocabularies and
define that an Individual in one vocabulary is the same as a Person in
another vocabulary.
- RDF was created to be based on semantics, so it should be easier to
make inferences, especially with the help of an RDF processor.
- Nice graphs can be made to show the relations between objects, such
as this graph.
- The elements of RDF are uniquely identified, so they can be referred
to unambiguously.
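As a rough illustration of combining two vocabularies, the Python sketch below treats a list of class equivalences (the role played by a statement like daml:sameClassAs) as input and rewrites type statements so that equivalent classes share one name. It is a simplification, not how a DAML processor is implemented: the prefixes "a:" and "b:", the triples, and the function names are all made up for the example.

```python
def find(parent, term):
    """Follow equivalence links until reaching the representative term."""
    while parent.get(term, term) != term:
        term = parent[term]
    return term


def merge_vocabularies(triples, equivalences):
    """Rewrite rdf:type objects so equivalent classes share one representative."""
    parent = {}
    for left, right in equivalences:
        root_left, root_right = find(parent, left), find(parent, right)
        if root_left != root_right:
            parent[root_left] = root_right
    return {
        (s, p, find(parent, o) if p == "rdf:type" else o)
        for s, p, o in triples
    }


# Two hypothetical data sets that describe people with different class names.
triples = [
    ("a:I06", "rdf:type", "a:Individual"),
    ("a:I06", "a:name", "Patsy /Schooler/"),
    ("b:P1", "rdf:type", "b:Person"),
]
# Plays the role of a statement that a:Individual is the same class as b:Person.
equivalences = [("a:Individual", "b:Person")]

merged = merge_vocabularies(triples, equivalences)
print(sorted(merged))
```

After the merge, a query for everything of type b:Person finds records from both data sets.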
Problems with RDF:
- There's nothing to stop people from developing multiple RDF schemas
for any given domain. Scouring the web a little bit, I found four for
genealogy. XML or any other file format suffers from the same problem.
- RDF is relatively new, and tools for RDF are relatively scarce
compared to other more seasoned technologies.
- Anyone who wants to can write RDF, including RDF which has false or
destructive data. Someone could write RDF in which a person has a son
who is also his father. This cycle is not only impossible in real life,
and thus bad data, but it could also potentially break an RDF processor
that follows the links without checking for loops. This problem is not
new, nor is it unique to RDF. The same could be said of any format.
When receiving data from other sources, it should be verified before
putting a lot of faith into it.
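One check that can be run before trusting received data is a search for cycles in the parent links. The Python sketch below does this with a depth-first search. It is an illustration under stated assumptions: the input is a plain dictionary from each person to a list of parents (the identifiers are made up), and `find_ancestry_cycle` is a hypothetical helper, not part of any RDF tool.

```python
def find_ancestry_cycle(parents):
    """Return a list of people forming a cycle in the child -> parents map, or None."""
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(person, path):
        state[person] = IN_PROGRESS
        path.append(person)
        for parent in parents.get(person, ()):
            if state.get(parent) == IN_PROGRESS:
                # The parent is already on the current path: a person is
                # recorded as their own ancestor.
                return path[path.index(parent):] + [parent]
            if parent not in state:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[person] = DONE
        return None

    for person in list(parents):
        if person not in state:
            cycle = visit(person, [])
            if cycle:
                return cycle
    return None


good = {"I22": ["I13", "I21"], "I13": ["I4"]}
bad = {"son": ["father"], "father": ["son"]}
print(find_ancestry_cycle(good))
print(find_ancestry_cycle(bad))
```

The recursive search is fine for small files; very deep pedigrees would call for an iterative version.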
Future Directions:
RDF does have potential. One could make a networkable database using
RDF without too much trouble. Each RDF record has a unique identifier
and can be referred to unambiguously. Along the same lines, RDF would
be good for allowing users to share data across the network. Data
could be referred to and sent across the network easily because of the
standard format and the unique identifiers. Because genealogists love
to share data, this would be a perfect use of RDF. This is similar to
Tim Berners-Lee's vision of the Semantic Web. Agents could scour the
internet searching for other agents that have genealogy data that would
be helpful to them. People could have their family history research
done while they are out grocery shopping. Taking this a step further,
genealogy data could be combined with family disease history. Such data
would be extremely useful for family doctors. The doctor's agent could
check with the patient's agent to see if the patient has a genetic
predisposition to certain diseases. I could imagine such data being
extremely useful for medical research as well.