A full week of learning [http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering GATE] for text mining, information extraction, and language processing, plus talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]
<noinclude>
This is the "AAType" template.
It should be used with the form.
</noinclude><includeonly>
<div style="display: none">


[[File:GATE_screenshot.png|900px|GATE developer screenshot]]
{{ #ifeq: {{ #titleparts: {{FULLPAGENAME}}|1|-1}}|fr
|[[Category:Pages françaises]]{{ #vardefine: lang|fr}}{{#vardefine: otherlang|{{#replace: {{FULLPAGENAME}}|/fr|}} }}{{#vardefine: otherlange|{{#replace: {{FULLPAGENAMEE}}|/fr|}} }}
|[[Category:English pages]]{{#vardefine:lang|en}}{{#vardefine: otherlang|{{FULLPAGENAME}}/fr }}{{#vardefine: otherlange|{{FULLPAGENAMEE}}/fr }}
}}{{ #set: Lang={{#var: lang}} }}
</div><div style="float: right; width: 230px; padding: 1em">
{{ #ifeq: {{ #var: lang}}|en
|{{ #ifexist: {{#var: otherlang}}|[[{{#var: otherlang}}|Version française]]|[http://www.asiancanadianwiki.org/w/Special:FormEdit/AAType/{{#var: otherlange}}?AAType%5BImage%5D={{ #replace:{{{Image|}}}| |_}} Créer la version française] }}
|{{ #ifexist: {{#var: otherlang}}|[[{{#var: otherlang}}|English version]]|[http://www.asiancanadianwiki.org/w/Special:FormEdit/AAType/{{#var: otherlange}}?AAType%5BImage%5D={{ #replace:{{{Image|}}}| |_}} Create the English version] }}
}}
{{ #if: {{{Image|}}} |
{{ #set:Image={{{Image}}} }}
{{ #ifeq: {{#titleparts: {{{Image}}} }}|Http:
|{{{Image}}}
|[[File:{{{Image}}}|200px|thumb|{{PAGENAME}}]]
}}
|[[Category:Profiles with no image]] }}
{{ #if: {{{Featured|}}}|[[Category:Featured]]'''Featured'''}}{{ #set: Featured={{{Featured|}}} }}
<br class="cleared" />
</div>


GATE is written in Java and is very Java-centric. This makes it portable and fast, but also heavyweight. A programming library is available. At the time of the course it was 14 years old, with many users and contributors.
{{ #if: {{{Home page|}}} |
<div class="tpldiv"><span class="tpllabel">Home page</span>
<span class="tblvalue">[[Home page::{{{Home page|}}}]]</span>
</div>
| }}
{{ #if: {{{Location|}}} |
{{#arraymap:{{{Location}}}|,|x|[[Category:x]] |<nowiki> </nowiki>}}
<div class="tpldiv"><span class="tpllabel">Location</span>
<span class="tblvalue">{{#arraymap:{{{Location}}}|,|x|[[Location::x]] |<nowiki> </nowiki>}}</span>
<span class="hidden">[[GPS Location::{{ #geocode:{{{Location}}}, Canada}}]]</span>
</div>
| }}
{{ #if: {{{Arts|}}} |
{{#arraymap:{{{Arts}}}|,|x|[[Category:x]] |<nowiki> </nowiki>}}
<span class="hidden">{{#arraymap:{{{Arts}}}|,|x|[[Arts::x]] |<nowiki> </nowiki>}}</span>
| }}
{{ #if: {{{Type|}}} |
{{#arraymap:{{{Type}}}|,|x|[[Category:x]] |<nowiki> </nowiki>}}
<span class="hidden">{{#arraymap:{{{Type}}}|,|x|[[Type::x]] |<nowiki> </nowiki>}}</span>
| }}
{{ #if: {{{Aspects|}}} |
{{#arraymap:{{{Aspects}}}|,|x|[[Category:x]] |<nowiki> </nowiki>}}
<span class="hidden">{{#arraymap:{{{Aspects}}}|,|x|[[Aspects::x]] |<nowiki> </nowiki>}}</span>
| }}


= Using GATE Developer =
{{ #ask: [[Category:Book]][[Author::{{PAGENAME}}]]
 
|intro=<h2>Published books</h2>
* GATE Developer is used to process sets of Language Resources, grouped in a Corpus, using Processing Resources. The results are typically saved to a serialized Datastore.
|mainlabel=Title
* ANNIE, VG (verb group) processors.
|?Extended title
* "Preserve formatting" embeds annotation tags in the original HTML or XML.
|?Year of Publication
** GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths
|sort=Year of Publication
 
}}
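
The difference between offset-based (stand-off) annotations and tags embedded in the original text can be illustrated with a minimal sketch. It does not use GATE; the <code>Annotation</code> class and the <code>to_inline</code> function are hypothetical names invented for this example, and it assumes the annotations do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text instead of modifying it."""
    start: int  # offset of the first character
    end: int    # offset one past the last character
    type: str   # annotation type, e.g. "Person"


def to_inline(text, annotations):
    """Embed non-overlapping stand-off annotations as inline tags."""
    out = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor:
            # Inline markup cannot represent arbitrary overlaps; offsets can.
            raise ValueError("overlapping annotations are not handled by this sketch")
        out.append(text[cursor:ann.start])
        out.append(f"<{ann.type}>{text[ann.start:ann.end]}</{ann.type}>")
        cursor = ann.end
    out.append(text[cursor:])
    return "".join(out)


text = "John Smith paid 50 dollars."
annotations = [Annotation(16, 26, "Money"), Annotation(0, 10, "Person")]
inline = to_inline(text, annotations)
```

The stand-off form leaves the text untouched and can hold overlapping annotations, while the inline form is easier to read next to the original markup.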
= Information Extraction =
{{ #ask: [[Category:Film]][[Producer::{{PAGENAME}}]] OR [[Director::{{PAGENAME}}]]
 
|intro=<h2>Films</h2>
* IR (information retrieval) - retrieve documents
|mainlabel=Title
* IE (information extraction) - retrieve structured data
|?Description
 
|?Year
* Knowledge Engineering - rule based
|sort=Year
* Learning Systems - statistical
}}
 
__TOC__
Old Bailey IE project - old English (online)
</includeonly>
 
* POS - assigned in Token (noun, verb, etc)
 
* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it is loaded. You must also "save and reinitialize."
* The gazetteer creates Lookup annotations, then the transducer creates named entities
* Then the orthomatcher (coreference based on spelling features in common) associates those entities
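
The gazetteer-then-transducer flow above can be sketched in a few lines. This is not GATE code: the gazetteer entries, the <code>majorType</code> values and both function names are assumptions made up for the illustration, and the "transducer" here is only a lookup table rather than a JAPE grammar.

```python
import re

# Hypothetical gazetteer: list entry -> majorType feature
GAZETTEER = {
    "London": "location",
    "Old Bailey": "location",
    "John": "person_first",
}


def lookup(text, gazetteer):
    """Create Lookup annotations as (start, end, type, features) tuples."""
    annotations = []
    for entry, major_type in gazetteer.items():
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
            annotations.append((m.start(), m.end(), "Lookup", {"majorType": major_type}))
    return sorted(annotations, key=lambda a: a[0])


def name_entities(lookups):
    """Stand-in for the transducer: turn Lookups into named-entity annotations."""
    mapping = {"location": "Location", "person_first": "Person"}
    return [
        (start, end, mapping[features["majorType"]])
        for start, end, _type, features in lookups
        if features["majorType"] in mapping
    ]


text = "John was tried at the Old Bailey in London."
lookups = lookup(text, GAZETTEER)
entities = name_entities(lookups)
```

Each stage only reads annotations produced by the previous one, which is why the order of resources in the pipeline matters.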
 
* Annotation Key sets and annotation comparing
** Need to list the Key set in setsToKeep in Document Reset for any pre-annotated texts
 
== Evaluation / Metrics ==
 
* Evaluation metric - machine output scored mathematically against human-annotated text
* Scoring - performance measures for annotation types
 
* Result types - Correct, missing, spurious, partially correct (overlapped)
 
* Tools > Annotations Diff - comparing human vs machine annotation
 
* Corpus > Corpus quality assurance - compare by type
** (B has to be the generated set)
 
* Annotation set transfer (in tools) - transfer between docs in pipeline
** useful for, e.g., HTML that has boilerplate
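
The four result types map onto precision, recall and F-measure in the conventional way. The sketch below is a plain calculation, not a GATE API; the function name and the <code>partial_weight</code> parameter are assumptions for illustration. A weight of 0 treats partial matches as wrong (strict), 1 treats them as correct (lenient), and 0.5 is the average of the two.

```python
def scores(correct, missing, spurious, partial, partial_weight=0.5):
    """Return (precision, recall, f1) from annotation comparison counts."""
    matched = correct + partial_weight * partial
    response_total = correct + partial + spurious  # everything the machine produced
    key_total = correct + partial + missing        # everything the human annotated
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


strict = scores(correct=8, missing=1, spurious=1, partial=2, partial_weight=0.0)
average = scores(correct=8, missing=1, spurious=1, partial=2, partial_weight=0.5)
lenient = scores(correct=8, missing=1, spurious=1, partial=2, partial_weight=1.0)
```

Spurious annotations lower precision only and missing ones lower recall only, so the two numbers point to different kinds of fixes.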
 
= Other notes =
 
== Lucene data store and ANNIC ==
 
* Use <null> for default set
* Go to Datastore for queries
** e.g. {Person}({Token})+{Money}
* Useful for debugging JAPE and results
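
To make the example query concrete, here is a minimal sketch of what "a Person, then one or more Tokens, then Money" means over a sequence of annotation types. It is a simplification and not how the Lucene-backed search is implemented: it assumes a flat, non-overlapping sequence of annotations, and the function name and single-letter encoding are invented for the example.

```python
import re


def find_person_tokens_money(types):
    """Return (start, end) index spans matching Person, Token+, Money."""
    codes = {"Person": "P", "Token": "T", "Money": "M"}
    # Encode each annotation as one character so a regex can scan the sequence.
    encoded = "".join(codes.get(t, "?") for t in types)
    return [(m.start(), m.end()) for m in re.finditer(r"PT+M", encoded)]


sequence = ["Person", "Token", "Token", "Money", "Token", "Person", "Money"]
spans = find_person_tokens_money(sequence)
```

The trailing Person directly followed by Money is not reported, because the pattern requires at least one Token in between.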
 
[[File:GATE-lucene-person-money.png|800px]]
 
== To investigate ==
 
* markupAware for HTML/XML (keeps tags in editor)
* AnnotationStack
* Advanced Options
 
= JAPE =
 
* Rules based on tokens and lookups
 
== To review, gotchas ==
 
* Rule types: <code>first</code> takes only the first match and does not extend to a longer (compound) one
** a? b for "a b" will match "a b"
* multiplexor transducers
* multi-constraint statements
* macros
* Reusing newly created annotations has to be done in a separate rule
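
The "first match" behaviour can be pictured by analogy with lazy versus greedy regular expressions. This is only an analogy under stated assumptions, not JAPE itself: tokens are encoded as single letters, and the two function names are hypothetical. A lazy quantifier stops at the first successful match, while a greedy one keeps going for the longest.

```python
import re


def first_match(tokens):
    """Stop as soon as 'a' followed by at least one 'b' has matched."""
    return re.match(r"ab+?", tokens).group()


def longest_match(tokens):
    """Keep consuming 'b' tokens for the longest possible match."""
    return re.match(r"ab+", tokens).group()


tokens = "abbb"
short = first_match(tokens)
long = longest_match(tokens)

# The optional-element case from the notes: even a lazy "a?" must consume
# the leading "a" when the match is anchored at the start of "ab".
optional = re.match(r"a??b", "ab").group()
```

Under this analogy a rule ending in a repeated element behaves differently depending on the matching style, which is why such rules deserve a second look.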
 
= To follow up =
 
* WebSphinx crawler CREOLE plugin
 
= Demos =
 
* Mímir for querying large volumes of data (uses MG4J)
* Translating parts of speech between languages using Compound editor and Alignment editor
* Predicate extractor (MultiPaX)
** Mixed results at best
* OwlExporter
** NLP ontology
 
= Conclusions =
 
GATE can do a lot out of the box and benefits from years of development and broad connectivity, but to be useful to more than patient specialists it needs usability testing. Many things are non-obvious or too domain-specific, yet with a bit of work could be more broadly useful. Interaction could include far more immediate, useful and interesting-looking displays. A web-based version could have these features. However, the team seems somewhat ambivalent about development. :)
 
{{Blikied|Aug 30, 2010}}
 
[[Category:SemWeb]]

Latest revision as of 12:36, 4 August 2015
