A full week of talks and hands-on learning with [http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering GATE] for text mining, information extraction and language processing. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]

[[File:GATE_screenshot.png|900px|GATE developer screenshot]]

GATE is written in Java and is very Java-centric. This makes it portable, fast, and heavyweight. A programming library is available. It's 14 years old and has many users and contributors.

= Using GATE developer =

* GATE developer is used to process sets of Language Resources in a Corpus using Processing Resources. They are typically saved to a serialized Datastore.
* ANNIE, VG (verb group) processors.
* ''Preserve formatting'' embeds tags in the original HTML or XML.
** Different strengths using GATE's graph (node/offset) based XML vs. preserved formatting (original XML/HTML)

= Information Extraction =

* IR - retrieve docs
* IE - retrieve structured data

* Knowledge Engineering - rule based
* Learning Systems - statistical

Old Bailey IE project - old English (Online)

* POS - assigned in Token (noun, verb, etc.)

* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it's loaded. Must also "save and reinitialize."
* Gazetteer creates Lookups, then the transducer creates named entities
* Then the orthomatcher (spelling features in common) coreference associates those

* Annotation Key sets and annotation comparing
** Need the setToKeep key in Document Reset for any pre-annotated texts

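The gazetteer, transducer and orthomatcher chain above can be pictured with a toy sketch in plain Python. This is not GATE's API: the gazetteer entries, the function names and the token-subset matching rule are all assumptions made up for illustration, and real GATE components are far more capable.

```python
import re

# Hypothetical gazetteer list: surface string -> majorType (not real GATE data).
GAZETTEER = {
    "John Smith": "person",
    "Smith": "person",
    "London": "location",
}

def gazetteer_lookups(text, gazetteer):
    """Return Lookup annotations as (start, end, majorType), preferring longer entries."""
    lookups = []
    taken = set()  # character offsets already covered by a longer match
    for entry in sorted(gazetteer, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                lookups.append((m.start(), m.end(), gazetteer[entry]))
    return sorted(lookups)

def transduce(lookups):
    """Stand-in for a rule-based transducer: promote Lookups to named entities."""
    type_map = {"person": "Person", "location": "Location"}
    return [(s, e, type_map[t]) for s, e, t in lookups if t in type_map]

def orthomatch(text, entities):
    """Link same-type entities when one name's tokens are a subset of the other's."""
    chains = []
    for i, (s1, e1, t1) in enumerate(entities):
        for s2, e2, t2 in entities[i + 1:]:
            a, b = set(text[s1:e1].split()), set(text[s2:e2].split())
            if t1 == t2 and (a <= b or b <= a):
                chains.append((text[s1:e1], text[s2:e2]))
    return chains

text = "John Smith arrived in London. Smith stayed a week."
entities = transduce(gazetteer_lookups(text, GAZETTEER))
print(entities)
print(orthomatch(text, entities))
```

Each stage only consumes the annotations produced by the previous one, which is the point of the pipeline ordering noted above.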
== Evaluation / Metrics ==

* Evaluation metric - computed mathematically against human-annotated text
* Scoring - performance measures for annotation types

* Precision = correct / (correct + spurious)
* Recall = correct / (correct + missing)
* F-measure combines precision and recall (harmonic mean)
* F = 2⋅(precision⋅recall) / (precision + recall)
* GATE supports average, strict, lenient

* Result types - correct, missing, spurious, partially correct (overlapped)

* Tools > Annotations Diff - comparing human vs machine annotation

* Corpus > Corpus quality assurance - compare by type
** (B has to be the generated set)

* Annotation set transfer (in tools) - transfer between docs in pipeline
** useful for e.g. HTML that has boilerplate

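The scoring described in this section can be worked through with a small Python sketch. It is not GATE code: the annotation tuples and function names are made up, the matching is a simple greedy pass, and it assumes the usual convention that strict counts partially correct matches as wrong, lenient counts them as right, and average gives them half credit.

```python
def classify(key, response):
    """Compare human (key) and machine (response) annotations, each (start, end, type)."""
    counts = {"correct": 0, "partial": 0, "missing": 0, "spurious": 0}
    unmatched = list(response)
    remaining_key = []
    # First pass: exact matches (same offsets and type) are correct.
    for k in key:
        if k in unmatched:
            unmatched.remove(k)
            counts["correct"] += 1
        else:
            remaining_key.append(k)
    # Second pass: same type with overlapping offsets is partially correct.
    for k in remaining_key:
        overlap = next((r for r in unmatched
                        if r[2] == k[2] and r[0] < k[1] and k[0] < r[1]), None)
        if overlap is not None:
            unmatched.remove(overlap)
            counts["partial"] += 1
        else:
            counts["missing"] += 1
    # Whatever the machine produced that matched nothing is spurious.
    counts["spurious"] = len(unmatched)
    return counts

def score(counts, mode="strict"):
    """Return (precision, recall, F-measure) for strict, lenient or average scoring."""
    weight = {"strict": 0.0, "lenient": 1.0, "average": 0.5}[mode]
    matched = counts["correct"] + weight * counts["partial"]
    p_den = counts["correct"] + counts["partial"] + counts["spurious"]
    r_den = counts["correct"] + counts["partial"] + counts["missing"]
    precision = matched / p_den if p_den else 0.0
    recall = matched / r_den if r_den else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

key = [(0, 10, "Person"), (22, 28, "Location"), (30, 35, "Person")]
response = [(0, 10, "Person"), (22, 26, "Location"), (40, 44, "Person")]
counts = classify(key, response)
print(counts)
for mode in ("strict", "lenient", "average"):
    print(mode, score(counts, mode))
```

With one annotation in each result type, the three modes differ only in how much credit the overlapping Location annotation earns.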
== To investigate ==

* markupAware for HTML/XML (keeps tags in editor)
* AnnotationStack
* Advanced Options

= JAPE =

* Rules based on tokens and lookups

== To review, gotchas ==

* Rule types: ''first'' takes only the first match, excludes compound
** a? b for "a b" will match "a b"
* multiplexor transducers
* multi-constraint statements
* macros
* Reusing created annotations has to be done in a separate rule

= To follow up =

* WebSphinx crawler CREOLE plugin

= Demos =

* Mímir for querying large volumes of data (uses MG4J)
* Translating parts of speech between languages using Compound editor and Alignment editor
* Predicate extractor (MultiPaX)
** Mixed results at best
* OwlExporter
** NLP ontology

= Conclusions =

While GATE can do a lot out of the box and benefits from years of development and breadth of connectivity, it needs usability testing to be useful to more than patient specialists. Many things are not obvious, or are too domain-specific, yet with a bit of work they could be more broadly useful. Interaction could include far more immediate, useful and interesting-looking displays; a web-based version could have these features. However, the team seems somewhat ambivalent about development. :)

{{Blikied|Aug 30, 2010}}

[[Category:SemWeb]]

Android phones are compromised by carrier and operating system locks. You can't travel and just swap the SIM card, remove unwanted carrier software, access functions the carrier removed (like tethering via Wifi hotspot), or install updated/custom software (do any devices have commitments to the imminent Android 3.0?). The Nexus One is the best "open source Android": you have to void the warranty, but root and custom OS installation are just a click away, without playing cat-and-mouse "jailbreaking" games. But the N1 doesn't have a keyboard, a feature many professional and technical people want.
HTC's new G2/Z has a keyboard and runs stock Android, so it's a good choice as an "open source Android" (though a handset with a slightly larger battery and a numeric row on the keyboard would be nice). Perhaps Google will release it as the "N2"; if not, the next best thing could be a group purchase of a particular handset from an unlocked provider (maybe http://www.puremobile.ca/HTC/HTC-Desire-Z-GSM-Phone/ in Canada).
If you're interested, write something in the comments below or stir it up on your favourite site.
Updates
Oct 20, 2010: Received word from PureMobile. They can do group buys with a minimum of ten people, and will get back with further details as soon as they have more info.
Rogers has also extended their hardware upgrade period to 30 months, from 24, meaning you're stuck with a compromised device longer.
Also, I'm really tired of commercially oriented tech blogs that focus on hoarding and are uncritical; they're compromised too. If you're interested in completely open data and participation concepts, please write a note below or edit this page.