|
|
| Line 1: |
Line 1: |
| A full week of training and talks on GATE: text mining, information extraction, and language processing. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]
| | <div style="float:right; width: 48%"> |
| | == Current engagements == |
|
| |
|
| [[File:GATE_screenshot.png|900px|GATE developer screenshot]] | | * [http://www.innovationcell.com Innovation Cell] public health systems development |
| | * http://equalit.ie International security and human rights projects |
| | * [http://www.fungalgenomics.ca Genomics lab] at Concordia University |
| | * [http://www.asiancanadianwiki.org Asian Canadian wiki / Accès Asie] |
|
| |
|
| GATE is written in Java and is very Java-centric. This makes it portable and fast, but also heavyweight. A programming library is available. It's 14 years old and has many users and contributors.
| | == Cultivating projects with ... == |
|
| |
|
| = Using GATE developer =
| | * [http://www.asiancanadianwiki.org Asian Canadian Wiki] |
| | * [http://www.wikimontreal.net Wiki Montréal] |
| | * Other formative projects |
|
| |
|
| * GATE Developer is used to process sets of Language Resources (documents grouped into a Corpus) using Processing Resources. The results are typically saved to a serialized Datastore.
| | == Past projects that should be picked up again ... == |
| * ANNIE, VG (verb group) processors.
| |
| * Preserve formatting embeds tags in HTML or XML.
| |
| ** Different strengths using GATE's graph (node/offset) based XML vs. preserved formatting (original xml/html)
| |
|
| |
|
| = Information Extraction = | | * Making government and the social sphere comprehensible |
| | ** [http://zooid.org/~vid/presentations/publicwhip/ Public whip port to Canada] from the [http://www.mysociety.co.uk mysociety] project |
| | ** [http://offsait.info/smw/ Russia democracy project] - an initiative to help understand the structure and representatives of the Russian government. |
| | ** [http://subvention.zooid.org Documenting Quebec social orgs] - Graphing relationships and locations of organizations to provide a basis to capture experiences and advice |
| | ** ByDesign eLab - ecommons projects |
| | </div> |
| | <div style="float: left; width: 48%"> |
| | Hi, I'm David, see my home page at http://zooid.org/~vid |
|
| |
|
| * IR - retrieve docs | | * [[Bliki|Bliki blog]] |
| * IE - retrieve structured data | | * [[User:DavidM/Third person bio|Third person bio]] |
| | * [[User:DavidM/Statement of goals]] |
| | * {{ #ask: [[iCal::+]] |
| | | ?Start date = start |
| | | ?End date = end |
| | | ?location |
| | | format=icalendar}} |
| | {{ #ask: [[Activity::+]] |
| | |?Name |
| | |?Start date |
| | |?URL |
| | |format=timeline |
| | |mainlabel=- |
| | |timelinesize=400px |
| | |timelinebands=YEAR |
|
| |
|
| * Knowledge Engineering - rule based
| | }} |
| * Learning Systems - statistical
| | [[User:DavidM/Timeline | Timeline]] |
|
| |
|
| Old Bailey IE project - old english (Online)
| | </div> |
|
| |
|
| * POS - assigned in Token (noun, verb, etc)
| | <br style="clear: both" /> |
| | |
| * Gazetteer - gotcha, have to set initialization parameter listsURL before it's
| |
| loaded. Must also "save and reinitialize."
| |
| * Gazetteer creates Lookup annotations, then the transducer creates named entities
| |
| * Then the orthomatcher (coreference based on spelling features in common) associates those
| |
| | |
| * Annotation Key sets and annotation comparing
| |
| ** Need to list the Key set in the Document Reset setsToKeep parameter for any pre-annotated texts
| |
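To make the gazetteer step above concrete, here is a minimal sketch in plain Java of what a list lookup does: it scans the text for known entries and emits Lookup-style annotations as character offsets plus a major type. This is not the GATE API. The class <code>ToyGazetteer</code>, its <code>Lookup</code> type, and the sample entries are hypothetical names made up for illustration, and real gazetteers match on tokens rather than raw substrings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class, NOT part of GATE: illustrates offset-based lookups only.
public class ToyGazetteer {

    /** A Lookup-style annotation: start/end character offsets plus a major type. */
    public static final class Lookup {
        public final int start;
        public final int end;
        public final String majorType;

        Lookup(int start, int end, String majorType) {
            this.start = start;
            this.end = end;
            this.majorType = majorType;
        }

        @Override
        public String toString() {
            return "Lookup[" + start + "," + end + "," + majorType + "]";
        }
    }

    // Maps each list entry to the major type of the list it came from.
    private final Map<String, String> entries = new LinkedHashMap<>();

    public void add(String entry, String majorType) {
        entries.put(entry, majorType);
    }

    /** Finds every occurrence of every entry and returns lookups sorted by start offset. */
    public List<Lookup> annotate(String text) {
        List<Lookup> found = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String entry = e.getKey();
            int at = text.indexOf(entry);
            while (at >= 0) {
                found.add(new Lookup(at, at + entry.length(), e.getValue()));
                at = text.indexOf(entry, at + 1);
            }
        }
        found.sort(Comparator.comparingInt((Lookup l) -> l.start));
        return found;
    }

    public static void main(String[] args) {
        ToyGazetteer gazetteer = new ToyGazetteer();
        gazetteer.add("Montreal", "location");
        gazetteer.add("Canada", "location");
        for (Lookup l : gazetteer.annotate("Montreal is in Canada.")) {
            System.out.println(l);
        }
    }
}
```

In the pipeline described above, a later rule-based step would read these lookups (together with tokens) and decide which of them become named entities.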
| | |
| == Evaluation / Metrics ==
| |
| | |
| * Evaluation metric - scored mathematically against human-annotated text
| |
| * Scoring - performance measures for annotation types
| |
| | |
| * Result types - Correct, missing, spurious, partially correct (overlapped)
| |
| | |
| * Tools > Annotations Diff - comparing human vs machine annotation
| |
| | |
| * Corpus > Corpus quality assurance - compare by type
| |
| * (B has to be the generated set)
| |
| | |
| * Annotation set transfer (in tools) - transfer between docs in pipeline
| |
| ** useful for e.g. HTML that has boilerplate
| |
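The result types and scoring described above can be sketched in plain Java. This is an illustration rather than GATE's own code: the class <code>AnnotationScores</code> and its methods are hypothetical, annotations are reduced to <code>{start, end}</code> offset pairs, matching is a simple greedy pass, and a partially correct match is assumed to earn half credit (an "average" measure sitting between strict and lenient scoring).

```java
// Hypothetical sketch, NOT GATE code: compares a human (key) set with a generated (response) set.
public class AnnotationScores {

    /**
     * Classifies response spans against key spans.
     * Returns counts in the order {correct, partial, missing, spurious}.
     */
    public static int[] classify(int[][] key, int[][] response) {
        boolean[] keyUsed = new boolean[key.length];
        int correct = 0, partial = 0, spurious = 0;
        for (int[] r : response) {
            int exact = -1, overlap = -1;
            for (int i = 0; i < key.length; i++) {
                if (keyUsed[i]) continue;
                if (key[i][0] == r[0] && key[i][1] == r[1]) {
                    exact = i;
                    break;
                }
                // Spans overlap when each one starts before the other ends.
                if (overlap < 0 && key[i][0] < r[1] && r[0] < key[i][1]) {
                    overlap = i;
                }
            }
            if (exact >= 0) {
                keyUsed[exact] = true;
                correct++;
            } else if (overlap >= 0) {
                keyUsed[overlap] = true;
                partial++;
            } else {
                spurious++;
            }
        }
        int missing = 0;
        for (boolean used : keyUsed) {
            if (!used) missing++;
        }
        return new int[] {correct, partial, missing, spurious};
    }

    /** Precision with half credit for partial matches (an assumption of this sketch). */
    public static double precision(int correct, int partial, int spurious) {
        int denom = correct + partial + spurious;
        return denom == 0 ? 0.0 : (correct + 0.5 * partial) / denom;
    }

    /** Recall with half credit for partial matches (an assumption of this sketch). */
    public static double recall(int correct, int partial, int missing) {
        int denom = correct + partial + missing;
        return denom == 0 ? 0.0 : (correct + 0.5 * partial) / denom;
    }

    /** Harmonic mean of precision and recall. */
    public static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int[][] key = {{0, 8}, {15, 21}, {30, 35}};      // human annotations
        int[][] response = {{0, 8}, {15, 19}, {40, 45}}; // generated annotations
        int[] c = classify(key, response);
        double p = precision(c[0], c[1], c[3]);
        double r = recall(c[0], c[1], c[2]);
        System.out.printf("correct=%d partial=%d missing=%d spurious=%d%n", c[0], c[1], c[2], c[3]);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", p, r, f1(p, r));
    }
}
```

Treating partial matches as wrong gives a strict score, and treating them as fully correct gives a lenient one. Only the half-credit weight in <code>precision</code> and <code>recall</code> would change.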
| | |
| === To investigate ===
| |
| | |
| * markupAware for HTML/XML (keeps tags in editor)
| |
| * AnnotationStack
| |
| * Advanced Options
| |
| | |
| = JAPE =
| |
| | |
| * Rules based on tokens and lookups
| |
| | |
| == To review, gotchas ==
| |
| | |
| * Rule types : first takes only first match, excludes compound
| |
| ** a? b for "a b" will match "a b"
| |
| * multiplexor transducers
| |
| * multi-constraint statements
| |
| * macros
| |
| * To reuse created annotations has to be a separate rule
| |
| | |
| {{Blikied|Aug 30, 2010}}
| |