Sandbox: Difference between revisions

From Asian Canadian Wiki
<div style="float:right; width: 48%">
A full week of learning [http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering GATE] (text mining, information extraction and language processing), plus talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]
== Current engagements ==


* [http://www.innovationcell.com Innovation Cell] public health systems development
[[File:GATE_screenshot.png|900px|GATE developer screenshot]]
* http://equalit.ie International security and human rights projects
* [http://www.fungalgenomics.ca Genomics lab] at Concordia University
* [http://www.asiancanadianwiki.org Asian Canadian wiki / Accès Asie]


== Cultivating projects with ... ==
GATE is written in Java and is very Java-centric, which makes it portable and fast, but heavyweight. A programming library is available. It is 14 years old and has many users and contributors.


* [http://www.asiancanadianwiki.org Asian Canadian Wiki]
= Using GATE Developer =
* [http://www.wikimontreal.net Wiki Montréal]
* Other formative projects


== Past projects that should be picked up again ... ==
* GATE Developer is used to process sets of Language Resources, grouped in a Corpus, using Processing Resources. They are typically saved to a serialized Datastore.
* ANNIE, VG (verb group) processors.
* "Preserve formatting" embeds annotation tags in the original HTML or XML.
** GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths


* Making government and the social sphere comprehensible
= Information Extraction =
** [http://zooid.org/~vid/presentations/publicwhip/ Public whip port to Canada] from the [http://www.mysociety.co.uk mysociety] project
** [http://offsait.info/smw/ Russia democracy project] - an initiative to help understand the structure and representatives of the Russian government.
** [http://subvention.zooid.org Documenting Quebec social orgs] - Graphing relationships and locations of organizations to provide a basis to capture experiences and advice
** ByDesign eLab -  ecommons projects
</div>
<div style="float: left; width: 48%">
Hi, I'm David, see my home page at http://zooid.org/~vid


* [[Bliki|Bliki blog]]
* IR (information retrieval) - retrieve documents
* [[User:DavidM/Third person bio|Third person bio]]
* IE (information extraction) - retrieve structured data
* [[User:DavidM/Statement of goals]]
* {{ #ask: [[iCal::+]]
| ?Start date = start
| ?End date = end
| ?location
| format=icalendar}}
{{ #ask: [[Activity::+]]
|?Name
|?Start date
|?URL
|format=timeline
|mainlabel=-
|timelinesize=400px
|timelinebands=YEAR


}}
* Knowledge Engineering - rule-based
[[User:DavidM/Timeline | Timeline]]
* Learning Systems - statistical


</div>
Old Bailey IE project - old English (Online)


<br style="clear: both" />
* POS (part of speech) - assigned on each Token (noun, verb, etc.)
 
* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it is loaded. You must also "save and reinitialize."
* Gazetteer creates Lookups, then the transducer creates named entities
* Then OrthoMatcher coreference (spelling features in common) associates those entities
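
The gazetteer step above can be pictured with a small Python sketch. This is not GATE's API (GATE is Java); the function name <code>gazetteer_lookups</code>, the list names and the sample sentence are all made up for illustration, and the matching is a plain whole-word search rather than whatever GATE's gazetteer does internally.

```python
import re

def gazetteer_lookups(text, lists):
    """Return sorted (start, end, major_type) Lookups for list entries found in text.

    Assumption: `lists` maps a hypothetical major type to its entries,
    standing in for the files a real gazetteer would load via listsURL.
    """
    lookups = []
    for major_type, entries in lists.items():
        for entry in entries:
            # Whole-word, case-sensitive match of the literal entry.
            pattern = r"\b" + re.escape(entry) + r"\b"
            for match in re.finditer(pattern, text):
                lookups.append((match.start(), match.end(), major_type))
    return sorted(lookups)

# Example input: a later rule-based step would turn these Lookups into entities.
sample_lists = {"location": ["Montreal", "London"], "person_first": ["David"]}
sample_lookups = gazetteer_lookups("David visited Montreal and London.", sample_lists)
```

Each Lookup is only a typed character span; deciding that a span really is a named entity is left to the transducer rules that run afterwards.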
 
* Annotation Key sets and comparing annotations
** Document Reset needs Key in setToKeep for any pre-annotated texts
 
== Evaluation / Metrics ==
 
* Evaluation metric - machine output is scored mathematically against human-annotated text
* Scoring - performance measures for annotation types
 
* Precision = correct / (correct + spurious)
* Recall = correct / (correct + missing)
* F-measure is the harmonic mean of precision and recall
* F = 2⋅(precision⋅recall) / (precision + recall)
* GATE supports strict, lenient and average scoring
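
A minimal Python sketch of the arithmetic above. The function name <code>prf</code> and the sample counts are made up. The handling of partially correct annotations is an assumption based on how GATE's scoring is usually described (strict counts them as wrong, lenient counts them as right, average is the mean of the two) and should be checked against the GATE documentation.

```python
def prf(correct, partial, missing, spurious, mode="strict"):
    """Return (precision, recall, f_measure) from annotation counts.

    With partial == 0 this reduces to the plain formulas:
    precision = correct / (correct + spurious)
    recall    = correct / (correct + missing)
    """
    if mode == "average":
        strict = prf(correct, partial, missing, spurious, "strict")
        lenient = prf(correct, partial, missing, spurious, "lenient")
        return tuple((s + l) / 2 for s, l in zip(strict, lenient))
    if mode not in ("strict", "lenient"):
        raise ValueError("mode must be strict, lenient or average")

    # Lenient scoring credits partial matches; strict scoring does not.
    matched = correct + (partial if mode == "lenient" else 0)
    response_total = correct + partial + spurious  # everything the system produced
    key_total = correct + partial + missing        # everything the human marked

    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Made-up counts: 6 correct, 2 partially correct, 2 missing, 2 spurious.
strict_scores = prf(6, 2, 2, 2, "strict")
lenient_scores = prf(6, 2, 2, 2, "lenient")
average_scores = prf(6, 2, 2, 2, "average")
```

The zero checks are there only so that an empty key or response set gives 0.0 instead of a division error.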
 
* Result types - Correct, missing, spurious, partially correct (overlapped)
 
* Tools > Annotations Diff - comparing human vs machine annotation
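
Conceptually, an annotation diff sorts each annotation into one of the result types listed above. The Python sketch below is a simplified illustration, not GATE's actual algorithm: annotations are assumed to be (start, end, type) tuples, and the name <code>diff_annotations</code> and the sample data are hypothetical.

```python
def diff_annotations(key, response):
    """Classify annotations as correct, partial, missing or spurious.

    key      -- human annotations, as (start, end, type) tuples
    response -- machine annotations, in the same form
    """
    remaining = list(response)
    result = {"correct": [], "partial": [], "missing": [], "spurious": []}
    unmatched_keys = []

    # Pass 1: identical offsets and type count as correct.
    for k in key:
        if k in remaining:
            remaining.remove(k)
            result["correct"].append(k)
        else:
            unmatched_keys.append(k)

    # Pass 2: an overlapping span of the same type counts as partially correct.
    for k in unmatched_keys:
        overlap = next(
            (r for r in remaining
             if r[2] == k[2] and r[0] < k[1] and k[0] < r[1]),
            None,
        )
        if overlap is not None:
            remaining.remove(overlap)
            result["partial"].append(k)
        else:
            result["missing"].append(k)

    # Whatever the machine produced that matched nothing is spurious.
    result["spurious"] = remaining
    return result

sample_key = [(0, 5, "Person"), (10, 16, "Location"), (20, 25, "Person")]
sample_response = [(0, 5, "Person"), (10, 14, "Location"), (30, 35, "Person")]
sample_diff = diff_annotations(sample_key, sample_response)
```

Exact matches are consumed first so that a response annotation that fits one key perfectly is not used up as a partial match for a neighbouring key.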
 
* Corpus > Corpus quality assurance - compare by type
** (set B has to be the generated set)
 
* Annotation set transfer (in tools) - transfers annotations between annotation sets within a pipeline
** useful for e.g. HTML that has boilerplate
 
 
== To investigate ==
 
* markupAware for HTML/XML (keeps tags in editor)
* AnnotationStack
* Advanced Options
 
= JAPE =
 
* Rules based on tokens and lookups
 
== To review, gotchas ==
 
* Rule types: "first" takes only the first match, excludes compound
** a? b for "a b" will match "a b"
* multiplexor transducers
* multi-constraint statements
* macros
* Reusing created annotations has to be done in a separate rule
 
= To follow up =
 
* WebSphinx crawler CREOLE plugin
 
= Demos =
 
* Mímir for querying large volumes of data (uses MG4J)
* Translating parts of speech between languages using Compound editor and Alignment editor
* Predicate extractor (MultiPaX)
** Mixed results at best
* OwlExporter
** NLP ontology
 
= Conclusions =
 
GATE can do a lot out of the box and benefits from years of development and breadth of connectivity, but it needs usability testing to be useful to more than patient specialists. Many things are non-obvious or too domain-specific, yet with a bit of work could be more broadly useful. Interaction could include many more immediate, useful and interesting-looking displays. A web-based version could have these features. However, the team seems somewhat ambivalent about development. :)
 
{{Blikied|Aug 30, 2010}}
 
[[Category:SemWeb]]

Revision as of 17:14, 20 June 2010

A full week of learning GATE (text mining, information extraction and language processing), plus talks. Session wiki


GATE is written in Java and is very Java-centric, which makes it portable and fast, but heavyweight. A programming library is available. It is 14 years old and has many users and contributors.

Using GATE Developer

  • GATE Developer is used to process sets of Language Resources, grouped in a Corpus, using Processing Resources. They are typically saved to a serialized Datastore.
  • ANNIE, VG (verb group) processors.
  • "Preserve formatting" embeds annotation tags in the original HTML or XML.
    • GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths

Information Extraction

  • IR (information retrieval) - retrieve documents
  • IE (information extraction) - retrieve structured data
  • Knowledge Engineering - rule-based
  • Learning Systems - statistical

Old Bailey IE project - old english (Online)

  • POS (part of speech) - assigned on each Token (noun, verb, etc.)
  • Gazetteer - gotcha: the initialization parameter listsURL has to be set before it is loaded. You must also "save and reinitialize."
  • Gazetteer creates Lookups, then the transducer creates named entities
  • Then OrthoMatcher coreference (spelling features in common) associates those entities
  • Annotation Key sets and comparing annotations
    • Document Reset needs Key in setToKeep for any pre-annotated texts

Evaluation / Metrics

  • Evaluation metric - machine output is scored mathematically against human-annotated text
  • Scoring - performance measures for annotation types
  • Precision = correct / (correct + spurious)
  • Recall = correct / (correct + missing)
  • F-measure is the harmonic mean of precision and recall
  • F = 2⋅(precision⋅recall) / (precision + recall)
  • GATE supports strict, lenient and average scoring
  • Result types - Correct, missing, spurious, partially correct (overlapped)
  • Tools > Annotations Diff - comparing human vs machine annotation
  • Corpus > Corpus quality assurance - compare by type
    • (set B has to be the generated set)
  • Annotation set transfer (in tools) - transfers annotations between annotation sets within a pipeline
    • useful for e.g. HTML that has boilerplate


To investigate

  • markupAware for HTML/XML (keeps tags in editor)
  • AnnotationStack
  • Advanced Options

JAPE

  • Rules based on tokens and lookups

To review, gotchas

  • Rule types: "first" takes only the first match, excludes compound
    • a? b for "a b" will match "a b"
  • multiplexor transducers
  • multi-constraint statements
  • macros
  • Reusing created annotations has to be done in a separate rule

To follow up

  • WebSphinx crawler CREOLE plugin

Demos

  • Mímir for querying large volumes of data (uses MG4J)
  • Translating parts of speech between languages using Compound editor and Alignment editor
  • Predicate extractor (MultiPaX)
    • Mixed results at best
  • OwlExporter
    • NLP ontology

Conclusions

GATE can do a lot out of the box and benefits from years of development and breadth of connectivity, but it needs usability testing to be useful to more than patient specialists. Many things are non-obvious or too domain-specific, yet with a bit of work could be more broadly useful. Interaction could include many more immediate, useful and interesting-looking displays. A web-based version could have these features. However, the team seems somewhat ambivalent about development. :)
