open-csp / wiki-search
Faceted search for Semantic MediaWiki
Type:mediawiki-extension
Requires
- php: >= 7.4 < 8.3
- composer/installers: 1.*,>=1.0.1
- elasticsearch/elasticsearch: ^5.3|^6.0|^7.12
- ongr/elasticsearch-dsl: ~6.0|^7.2
- paquettg/php-html-parser: >=3.0
This document describes how to use the WikiSearch API. For a more beginner-friendly introduction to WikiSearch, you should read the documentation on the MediaWiki extension page.
Performing a search
Performs a search and returns the list of search results. If the API is in debug mode, this endpoint also returns the raw ElasticSearch query that was used to perform the search.
Parameters
Example request
Example request (cURL):
curl https://wiki.example.org/api.php \
-d action=query \
-d format=json \
-d meta=WikiSearch \
--data-urlencode 'filter=[{"value":"5","key":"Average rating","range":{"gte":5,"lte":6}}]' \
-d from=0 \
-d limit=10 \
-d pageid=698 \
--data-urlencode 'aggregations=[
{"type":"range","ranges":[
{"from":1,"to":6,"key":"1"},
{"from":2,"to":6,"key":"2"},
{"from":3,"to":6,"key":"3"},
{"from":4,"to":6,"key":"4"},
{"from":5,"to":6,"key":"5"}
],"property":"Average rating"}
]'
Example response:
{
"batchcomplete": "",
"result": {
"hits": "[<TRUNCATED, SEE BELOW FOR PARSING>]",
"total": 1,
"aggs": {
"Average rating": {
"meta": [],
"doc_count": 1,
"Average rating": {
"buckets": {
"1": {
"from": 1,
"to": 6,
"doc_count": 1
},
"2": {
"from": 2,
"to": 6,
"doc_count": 1
},
"3": {
"from": 3,
"to": 6,
"doc_count": 1
},
"4": {
"from": 4,
"to": 6,
"doc_count": 1
},
"5": {
"from": 5,
"to": 6,
"doc_count": 1
}
}
}
}
}
}
}
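The same request body can also be assembled in PHP. The following is a minimal sketch, assuming the parameters shown in the cURL example above; `buildWikiSearchParams()` is a hypothetical helper and not part of WikiSearch. It only builds the URL-encoded body and does not send it.

```php
<?php
// Sketch only: buildWikiSearchParams() is a hypothetical helper, not part of WikiSearch.

/**
 * Builds the request fields for a WikiSearch query.
 * List-valued parameters are JSON-encoded, as in the cURL example.
 */
function buildWikiSearchParams(
    int $pageId,
    array $filters = [],
    array $aggregations = [],
    int $from = 0,
    int $limit = 10
): array {
    return [
        'action' => 'query',
        'format' => 'json',
        'meta' => 'WikiSearch',
        'filter' => json_encode($filters),
        'from' => $from,
        'limit' => $limit,
        'pageid' => $pageId,
        'aggregations' => json_encode($aggregations),
    ];
}

$params = buildWikiSearchParams(
    698,
    [['value' => '5', 'key' => 'Average rating', 'range' => ['gte' => 5, 'lte' => 6]]],
    [[
        'type' => 'range',
        'ranges' => [['from' => 5, 'to' => 6, 'key' => '5']],
        'property' => 'Average rating',
    ]]
);

// http_build_query() URL-encodes every value, so the JSON survives transport.
$body = http_build_query($params);
```

The resulting `$body` can be sent to `api.php` with any HTTP client, for example through cURL's `CURLOPT_POSTFIELDS` option.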
Parsing the response
This section assumes you have successfully made a request to the API using PHP and have stored the raw API result in the variable $response.
The $response variable is a JSON-encoded string and needs to be decoded before it can be used:
$response = json_decode($response, true);
After having decoded the $response object, the response usually contains two keys (three if debug mode is enabled):
Generally, we are only interested in the API result object, so we can create a new variable only containing that field:
$result = $response["result"];
This $result field will look something like this:
{ "hits": "[<TRUNCATED, SEE BELOW FOR PARSING>]", "total": 1, "aggs": { "Average rating": { "meta": [], "doc_count": 1, "Average rating": { "buckets": { "1": { "from": 1, "to": 6, "doc_count": 1 }, "2": { "from": 2, "to": 6, "doc_count": 1 }, "3": { "from": 3, "to": 6, "doc_count": 1 }, "4": { "from": 4, "to": 6, "doc_count": 1 }, "5": { "from": 5, "to": 6, "doc_count": 1 } } } } } }
The hits field
The hits field contains a JSON-encoded string of the ElasticSearch search results. This field needs to be decoded using json_decode before it can be used. The field directly corresponds to the hits.hits field from the ElasticSearch response. See the ElasticSearch documentation for a detailed description of what this field looks like.
To get the associated page name of any search result, the subject.namespacename and subject.title hit-fields in the hits field may be concatenated using a colon, like so:
$hits = json_decode($result["hits"], true);

foreach ($hits as $hit) {
    $namespace_name = $hit["subject"]["namespacename"];
    $page_title = $hit["subject"]["title"];
    $page_name = sprintf("%s:%s", $namespace_name, $page_title);

    echo $page_name;
}
The subject.namespacename hit-field contains the name of the namespace in which the search result lives, and the subject.title hit-field contains the name of the page that matched the search (without a namespace prefix). To get the full URL for this page, you can prepend http://<wikiurl>/index.php/ to the page name.
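The URL construction described above could be sketched as follows. `pageUrlFromHit()` and the wiki URL are hypothetical. The sketch assumes that a hit in the main namespace has an empty namespace name and therefore needs no prefix, and it replaces spaces with underscores, which is the usual MediaWiki convention for page URLs.

```php
<?php
// Sketch only: pageUrlFromHit() is a hypothetical helper, not part of WikiSearch.

function pageUrlFromHit(string $wikiUrl, array $hit): string
{
    $namespace = $hit['subject']['namespacename'];
    $title = $hit['subject']['title'];

    // Assumption: an empty namespace name means the page needs no prefix.
    $pageName = $namespace === '' ? $title : $namespace . ':' . $title;

    // MediaWiki page URLs use underscores instead of spaces.
    return rtrim($wikiUrl, '/') . '/index.php/' . str_replace(' ', '_', $pageName);
}

// A hand-written hit containing only the two hit-fields used here.
$hit = ['subject' => ['namespacename' => 'Help', 'title' => 'Getting started']];

echo pageUrlFromHit('https://wiki.example.org', $hit), "\n";
// https://wiki.example.org/index.php/Help:Getting_started
```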
The hits field also contains the generated highlighted snippets, if they are available. These can be accessed through the highlight hit-field, like so:
$hits = json_decode($result["hits"], true);

foreach ($hits as $hit) {
    // Highlights are only present when they are available for this hit
    $highlights = $hit["highlight"] ?? [];

    foreach ($highlights as $highlight) {
        // $highlight is an array of highlighted snippets
        $highlight_string = implode("", $highlight);

        echo $highlight_string;
    }
}
See also the ElasticSearch Highlighting documentation.
The aggs field
The aggs field directly corresponds to the aggregations field from the ElasticSearch response. See the ElasticSearch documentation for further details.
The total field
The total field contains the total number of results found by ElasticSearch. This field is not influenced by the limit parameter and always displays the total number of results available, regardless of how many were actually returned.
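Because total ignores the limit, it can be used to compute pagination. The following is a minimal sketch with two hypothetical helpers. It assumes that the from parameter is a zero-based offset into the result list, as in ElasticSearch itself.

```php
<?php
// Sketch only: pageCount() and pageOffset() are hypothetical helpers.

/** Number of result pages needed to show $total results, $limit per page. */
function pageCount(int $total, int $limit): int
{
    return $limit > 0 ? (int) ceil($total / $limit) : 0;
}

/** Value for the `from` parameter of the 1-based page number $page. */
function pageOffset(int $page, int $limit): int
{
    return ($page - 1) * $limit;
}

echo pageCount(23, 10), "\n";  // 3
echo pageOffset(3, 10), "\n";  // 20
```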
Filters syntax
The filter parameter takes a list of objects. These objects have the following form:
PropertyRangeFilter
This filter only returns pages that have the specified property with a value in the specified range.
{
"key": "Age",
"range": {
"gte": 0,
"lt": 100
}
}
The above filter only includes pages where the property Age has a value that is greater than or equal to 0, but strictly less than 100.
The range parameter takes an object that allows the following properties:
- gte: Greater-than or equal to
- gt: Strictly greater-than
- lte: Less-than or equal to
- lt: Strictly less-than
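To make the meaning of the four bounds concrete, the sketch below mirrors them on the client side. `matchesRange()` is a hypothetical helper written for illustration; the actual filtering is performed by ElasticSearch.

```php
<?php
// Sketch only: matchesRange() is a hypothetical helper that mirrors the
// documented meaning of gte, gt, lte and lt.

function matchesRange(float $value, array $range): bool
{
    if (array_key_exists('gte', $range) && $value < $range['gte']) {
        return false;
    }
    if (array_key_exists('gt', $range) && $value <= $range['gt']) {
        return false;
    }
    if (array_key_exists('lte', $range) && $value > $range['lte']) {
        return false;
    }
    if (array_key_exists('lt', $range) && $value >= $range['lt']) {
        return false;
    }
    return true;
}

// The range from the Age example: 0 <= value < 100.
$range = ['gte' => 0, 'lt' => 100];

var_dump(matchesRange(0, $range));    // bool(true): gte includes the bound
var_dump(matchesRange(99.5, $range)); // bool(true)
var_dump(matchesRange(100, $range));  // bool(false): lt excludes the bound
```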
PropertyValueFilter
This filter only returns pages that have the specified property with the specified value.
{
"key": "Class",
"value": "Manual"
}
The above filter only includes pages where the property Class has the value Manual. The value may be any of the following data types:
- string
- boolean
- integer
- float
- double
See also: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-range-query.html
PropertyValuesFilter
This filter only returns pages that have the specified property with any of the specified values.
{
"key": "Class",
"value": ["foo", "bar"]
}
The above filter only includes pages where the property Class has the value foo or bar.
See also: https://www.elastic.co/guide/en/elasticsearch/reference/6.8//query-dsl-terms-query.html
HasPropertyFilter
This filter only returns pages that have the specified property with any value.
{
"key": "Class",
"value": "+"
}
The above filter only includes pages that have the property Class. It does not take the value of the property into account.
See also: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html
PropertyTextFilter
This filter only returns pages that have the specified property with a value that matches the given search query string.
{
"key": "Class",
"value": "Foo | (Bar + Quz)",
"type": "query"
}
The above filter executes the given query and only includes pages that matched the executed query. The query syntax is identical to the simple query syntax used by ElasticSearch.
PropertyFuzzyValueFilter
This filter only returns pages that have the specified property with approximately the specified value.
{
"key": "Class",
"value": "Manual",
"type": "fuzzy"
}
The above filter only includes pages where the property Class has a value similar to Manual. The value must be a string.
Additionally, the maximum edit distance can be specified through the fuzziness parameter:
{
"key": "Class",
"value": "Manual",
"type": "fuzzy",
"fuzziness": 6
}
fuzziness must either be the string "AUTO" to automatically determine the appropriate fuzziness (default), or a positive integer specifying the maximum edit distance.
See also: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-fuzzy-query.html
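The filter objects above can be written as PHP arrays and JSON-encoded to produce the value of the filter parameter. The following sketch simply lists one filter of each documented type. It illustrates the encoding only; sending all six in one request is not meant as a meaningful query.

```php
<?php
// Sketch only: one example of each documented filter type, as PHP arrays.

$filters = [
    // PropertyRangeFilter
    ['key' => 'Age', 'range' => ['gte' => 0, 'lt' => 100]],
    // PropertyValueFilter
    ['key' => 'Class', 'value' => 'Manual'],
    // PropertyValuesFilter
    ['key' => 'Class', 'value' => ['foo', 'bar']],
    // HasPropertyFilter
    ['key' => 'Class', 'value' => '+'],
    // PropertyTextFilter
    ['key' => 'Class', 'value' => 'Foo | (Bar + Quz)', 'type' => 'query'],
    // PropertyFuzzyValueFilter
    ['key' => 'Class', 'value' => 'Manual', 'type' => 'fuzzy', 'fuzziness' => 'AUTO'],
];

// The filter parameter expects the list as a JSON string.
$filterParameter = json_encode($filters);

echo $filterParameter, "\n";
```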
Aggregations syntax
The aggregations parameter takes a list of objects. These objects have the following form:
PropertyRangeAggregation
{
"type": "range",
"ranges": [
{ "to": 50 },
{ "from": 50, "to": 100 },
{ "from": 100 }
],
"property": "Price",
"name": "Prices" # Optional, property name when empty
}
Note: The from parameter is inclusive, and the to parameter is exclusive. This means that for an aggregation from (and including) 1 up to and including 5, the from and to parameters should be 1 and 6 (!) respectively.
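The off-by-one in the note above is easy to get wrong, so it can help to build ranges from inclusive bounds. The following is a minimal sketch for integer-valued properties; `inclusiveIntegerRange()` is a hypothetical helper.

```php
<?php
// Sketch only: inclusiveIntegerRange() is a hypothetical helper.

/**
 * Converts inclusive integer bounds into a range object, where
 * `from` is inclusive and `to` is exclusive.
 */
function inclusiveIntegerRange(int $first, int $last): array
{
    return ['from' => $first, 'to' => $last + 1];
}

$aggregation = [
    'type' => 'range',
    'ranges' => [
        inclusiveIntegerRange(1, 5),  // matches 1, 2, 3, 4 and 5
        inclusiveIntegerRange(6, 10), // matches 6 up to and including 10
    ],
    'property' => 'Average rating',
];

echo json_encode($aggregation), "\n";
// {"type":"range","ranges":[{"from":1,"to":6},{"from":6,"to":11}],"property":"Average rating"}
```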
PropertyAggregation
{
"type": "property",
"property": "Genre",
"name": "Genres" # Optional, property name when empty
}
Sortings syntax
The sortings parameter takes a list of objects. These objects have the following form:
PropertySort
{
"type": "property",
"property": "Genre",
"order": "asc" # Optional, undefined when empty
}
The above sorting orders the results by the value of the property Genre in ascending (asc) order. It is also possible to sort in descending (desc) order.
Note: Sorting on a property that does not exist will result in an error.
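A value for the sortings parameter could be built as in the sketch below. `propertySort()` is a hypothetical helper. It rejects anything other than the two documented orders before the request is made.

```php
<?php
// Sketch only: propertySort() is a hypothetical helper.

function propertySort(string $property, string $order = 'asc'): array
{
    if (!in_array($order, ['asc', 'desc'], true)) {
        throw new InvalidArgumentException("Order must be 'asc' or 'desc'");
    }

    return ['type' => 'property', 'property' => $property, 'order' => $order];
}

// The sortings parameter expects the list as a JSON string.
$sortings = json_encode([propertySort('Genre', 'desc')]);

echo $sortings, "\n";
// [{"type":"property","property":"Genre","order":"desc"}]
```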
Highlight API
Note: This API is mostly for internal use.
The highlight API has the following properties:
- query: The query to generate highlighted terms from
- properties: The properties over which the highlights need to be calculated
- page_id: The page ID of the page on which the highlights need to be calculated
- limit: The number of highlighted terms to calculate; this does not always correspond directly with the number of terms returned, since duplicates are removed after the query to ElasticSearch
- size: The (approximate) size of snippets to generate; leave blank to highlight individual words
Chained properties
WikiSearch provides support for creating filters with chained properties. Chained properties can be used in any filter. They can also be used as a search term property.
{
"key": "Subpage.Foobar",
"value": "+"
}
For instance, the above filter matches any page for which the value of the property "Subpage" is a page that contains the property "Foobar".
See also: https://www.semantic-mediawiki.org/wiki/Help:Subqueries_and_property_chains
Special properties
There are a number of special properties defined by Semantic MediaWiki that are worth pointing out. These properties act just like regular properties, but do not appear in Special:Browse.
- text_copy: (from SemanticMediaWiki documentation) this mapping is used to enable wide proximity searches on textual annotated elements. The text_copy field is a compound field for all strings to be searched when a specific property is unknown.
- text_raw: this mapping contains unstructured, unprocessed raw text from an article.
- attachment-title: this mapping contains the title of a file attachment.
- attachment-content: this mapping contains the content of a file attachment.
For example, if you want to search through PDF files linked through the Pdf property, you can use the chained property Pdf.attachment-content.
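Since chained properties can be used in any filter, the PDF example above could be expressed as a PropertyTextFilter. The sketch below assumes a wiki with a Pdf property; `chainedProperty()` and the search term are hypothetical.

```php
<?php
// Sketch only: chainedProperty() is a hypothetical helper, and the
// search term "invoice" is a placeholder.

/** Joins property names into a chained property such as "Pdf.attachment-content". */
function chainedProperty(string ...$properties): string
{
    return implode('.', $properties);
}

$filter = [
    'key' => chainedProperty('Pdf', 'attachment-content'),
    'value' => 'invoice',
    'type' => 'query',
];

echo json_encode([$filter]), "\n";
// [{"key":"Pdf.attachment-content","value":"invoice","type":"query"}]
```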
Hooks
WikiSearchBeforeElasticQuery
This hook is called right before the query is sent to ElasticSearch. It has the following signature:
function onWikiSearchBeforeElasticQuery( array &$query, array &$hosts ) {}
The hook has access to and can alter the given $query. It can also add or remove hosts from the $hosts array.
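A handler for this hook might look like the sketch below. The extra host name is a placeholder, and because the structure of $query is not described here, the handler leaves it untouched and only appends a host. The registration line uses MediaWiki's standard $wgHooks array.

```php
<?php
// Sketch only: the host name below is a placeholder, not a real default.

function onWikiSearchBeforeElasticQuery(array &$query, array &$hosts)
{
    $extraHost = 'elastic-2.example.org:9200'; // hypothetical second node

    // Add the host only once, even if the hook runs several times.
    if (!in_array($extraHost, $hosts, true)) {
        $hosts[] = $extraHost;
    }
}

// In LocalSettings.php, register the handler with MediaWiki.
$wgHooks['WikiSearchBeforeElasticQuery'][] = 'onWikiSearchBeforeElasticQuery';
```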
WikiSearchApplyResultTranslations
This hook is called right before returning the final results to the API. It can be used to alter the $results array. This can be useful to filter any pages the user is not allowed to see or add additional data to the query result.
It has the following signature:
function onWikiSearchApplyResultTranslations( array &$results ) {}
WikiSearchOnLoadFrontend
This hook must be implemented by any WikiSearch frontend. It gets called when the #loadSearchEngine parser function is called. It has the following signature:
function onWikiSearchOnLoadFrontend( string &$result, \WikiSearch\SearchEngineConfig $config, Parser $parser, array $parameters ) {}
- string &$result: The result of the call to the parser function. This is the text that will be transcluded on the page.
- SearchEngineConfig $config: The SearchEngineConfig object of the current page. The SearchEngineConfig object exposes the following methods:
  - getTitle(): Title: The Title associated with this SearchEngineConfig
  - getConditionProperty(): PropertyInfo: The PropertyInfo object associated with the property in the search condition (e.g. Class for Class=Foobar)
    - The PropertyInfo class exposes the following methods:
      - getPropertyID(): int: Returns the property ID
      - getPropertyType(): string: Returns the property type (e.g. txtField or wpgField)
      - getPropertyName(): string: Returns the name of the property (e.g. Class)
  - getConditionValue(): string: Returns the value in the condition (e.g. Foobar in Class=Foobar)
  - getFacetProperties(): array: Returns the facet properties in the config (facet properties are the properties that are not prefixed with ?). May be the name of a property (e.g. "Foobar") or a translation pair (e.g. "Foobar=Boofar")
  - getFacetPropertyIDs(): array: Returns a key-value pair list where the key is the ID of the facet property and the value the type of that property
  - getResultProperties(): array: Returns the result properties in the config as PropertyInfo objects (result properties are the properties prefixed with ?)
  - getResultPropertyIDs(): array: Returns a key-value pair list where the key is the name of the result property and the value the ID of that property
  - getSearchParameters(): array: Returns a key-value pair list of additional search parameters
- Parser $parser: The current Parser object
- array $parameters: The parameters passed to the #loadSearchEngine call
Config variables
WikiSearch has several configuration variables that influence its default behaviour.
- $wgWikiSearchElasticStoreIndex: Sets the name of the ElasticStore index to use (defaults to "smw-data-" . strtolower( wfWikiID() ))
- $wgWikiSearchDefaultResultLimit: Sets the number of results to return when no explicit limit is given (defaults to 10)
- $wgWikiSearchHighlightFragmentSize: Sets the maximum number of characters in the highlight fragment (defaults to 250)
- $wgWikiSearchHighlightNumberOfFragments: Sets the maximum number of highlight fragments to return per result (defaults to 1)
- $wgWikiSearchElasticSearchHosts: Sets the list of ElasticSearch hosts to use (defaults to ["localhost:9200"])
- $wgWikiSearchAPIRequiredRights: Sets the list of rights required to query the WikiSearch API (defaults to ["read", "wikisearch-execute-api"])
- $wgWikiSearchSearchFieldOverride: Sets the search page to redirect to when using Special:Search. The user is redirected to the specified wiki article with the query parameter search_query specified through the search page if it is available. Does not change the behaviour of the search snippets shown when using the inline search field.
- $wgWikiSearchMaxChainedQuerySize: Sets the maximum number of results to retrieve for a chained property query (defaults to 1000). Setting this to an extreme value may cause ElasticSearch to run out of memory when performing a large chained query.
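As an illustration, the fragment below sets several of these variables in LocalSettings.php. Every value shown is the documented default, so the fragment changes nothing by itself; it is meant as a starting point to adjust.

```php
<?php
// LocalSettings.php fragment. All values are the documented defaults.

$wgWikiSearchDefaultResultLimit = 10;
$wgWikiSearchHighlightFragmentSize = 250;
$wgWikiSearchHighlightNumberOfFragments = 1;
$wgWikiSearchElasticSearchHosts = [ 'localhost:9200' ];
$wgWikiSearchAPIRequiredRights = [ 'read', 'wikisearch-execute-api' ];
$wgWikiSearchMaxChainedQuerySize = 1000;
```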
Debug mode
To enable debug mode, set $wgWikiSearchEnableDebugMode to true.
Parser functions
WikiSearch defines two parser functions.
#WikiSearchConfig (case-sensitive)
The #WikiSearchConfig parser function is used to set several configuration variables that cannot be passed to the API for security reasons. It sets the search condition for that page, the list of facet properties, and the list of result properties.
{{#WikiSearchConfig:
|<facet property>
|?<result property>
}}
{{#WikiSearchConfig:
|Version
|Tag
|Space
|?Title
|?Version
}}
Note: Only one call to #WikiSearchConfig is allowed per page. Multiple calls will result in unexpected behaviour.
Search parameters
Certain configuration parameters can also be given through the search engine config. This section documents these parameters and their behaviour.
base query
The base query configuration parameter can be used to add a base query to the search. This base query is given as a Semantic MediaWiki query. A document will only be included in the search results if it matches both the base query and the generated query.
highlighted properties
The highlighted properties configuration parameter can be used to specify alternate properties that should be highlighted. Please note that these properties do need to be part of the search space.
search term properties
The search term properties configuration parameter can be used to specify alternate properties to search through when doing a free-text search. These properties may also be chained properties.
A weight can be added to each field in the search term properties by using the ^%d syntax. For example, to give additional weight to the title, you can do the following:
|search term properties=Title^10,Content^2,Pdf.attachment-content
The weight determines the ranking when sorting on relevance. A match in a field with a higher weight will count more towards the relevance score than a match in a field with a lower weight. When no weight is given, the weight is set to 1.
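To show how the ^%d syntax reads, the sketch below splits such a value into a map from property to weight, applying the default weight of 1. `parseSearchTermProperties()` is a hypothetical function written for illustration and is not the extension's own parser.

```php
<?php
// Sketch only: parseSearchTermProperties() is hypothetical and is not
// the parser used by WikiSearch.

/** Returns a map of property name to weight; unweighted properties get 1. */
function parseSearchTermProperties(string $value): array
{
    $weights = [];

    foreach (explode(',', $value) as $entry) {
        $entry = trim($entry);
        if ($entry === '') {
            continue;
        }

        // "Title^10" splits into the property "Title" and the weight 10.
        if (preg_match('/^(.+)\^(\d+)$/', $entry, $matches)) {
            $weights[$matches[1]] = (int) $matches[2];
        } else {
            $weights[$entry] = 1;
        }
    }

    return $weights;
}

print_r(parseSearchTermProperties('Title^10,Content^2,Pdf.attachment-content'));
// Title => 10, Content => 2, Pdf.attachment-content => 1
```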
default operator
The default operator configuration parameter can be used to change the default operator of the free-text search. The default operator inserted between each term is or, and this configuration parameter allows the administrator to change that to and if required.
post filter properties
The post filter properties configuration parameter can be used to specify which filters should be added as a post filter instead of a regular filter. This parameter takes a comma-separated list of property names. Each filter that applies to any of the given property names will be added as a post filter. The difference between post filters and regular filters is explained here.
This configuration parameter is especially useful when you have disjunct checkbox properties.
#WikiSearchFrontend (case-sensitive)
The #WikiSearchFrontend parser function is used to load the frontend. The parameters and return value of this parser function depend completely on the frontend.
Installation
- Download and place the file(s) in a directory called WikiSearch in your extensions/ folder.
- Add the following code at the bottom of your LocalSettings.php:
  wfLoadExtension( 'WikiSearch' );
- Run the update script which will automatically create the necessary database tables that this extension needs.
- Run Composer.
- Navigate to Special:Version on your wiki to verify that the extension is successfully installed.
Copyright
Faceted search for MediaWiki. Copyright (C) 2021- Marijn van Wezel, Robis Koopmans
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.