Table of Contents
Tutorial: Extending SWIRL
Background: The SWIRL Search Workflow
In a nutshell:
- User creates a query - example: http://localhost:8000/swirl/search/?q=ai
- Pre-query processing - example: SpellcheckQueryProcessor
Each Search Provider executes in parallel
- Query processing - example: AdaptiveQueryProcessor
- Connector - example: RequestsGet
- Result processing - example: MappingResultsProcessor
Parallel processing ends here
- Post-result processing - example: CosineRelevancyPostResultProcessor
- Ranked-results available via mixer - example: http://localhost:8000/swirl/results/?search_id=1
For more information, consult the Developer Guide, Workflow Overview.
Creating a SearchProvider
A SearchProvider is a configuration of a Connector. To connect to a given source, first verify that it supports a Connector that SWIRL already has. (See the next tutorial for information on creating new Connectors.)
For example, if trying to query a website using a URL like https://host.com/?q=my+query+here
that returns JSON or XML, create a new SearchProvider configuring the RequestsGet
Connector as follows:
-
Copy any of the Google PSE SearchProviders
- Modify the
url
andquery_template
to construct the query URL. Using the above example:{ "url": "https://my-host.com/", "query_template": "{url}?q={query_string}", }
To learn more about query and URL parameters, refer to the SearchProvider Guide.
- If the website offers the ability to page through results, or sort results by date (as well as relevancy), use the
PAGE=
andDATE_SORT=
query mappings to add support for these features through SWIRL. For example:TO DO
For more information, refer to the SearchProvider Guide, Query Mappings section.
-
Open the query URL in a browser and look through the JSON response. If using Visual Studio Code, right-click on the pasted JSON and select
Format Document
to make it easier to read. - Identify the results list, and number of results found and retrieved. Put these JSON paths in the
response_mappings
. Then, identify the JSON paths to use to extract the SWIRL default fieldstitle
,body
,url
,date_published
andauthor
from each item in the result list and add those to in theresult_mappings
, with the SWIRL field name on the left, and the source JSON path on the right. For example:"response_mappings": "FOUND=searchInformation.totalResults,RETRIEVED=queries.request[0].count,RESULTS=items", "result_mappings": "url=link,body=snippet,author=displayLink,cacheId,pagemap.metatags[*].['og:type'],pagemap.metatags[*].['og:site_name'],pagemap.metatags[*].['og:description'],NO_PAYLOAD",
-
Add credentials as required for the service. The format to use depends on the type of credential. Details are here: User Guide, Authentication & Credentials section.
-
Add a suitable tag that can be used to describe the source or what it knows about. Spaces are not permitted; good tags are clear and obvious when used in a query, like
company:tesla
ornews:openai
. For more about tags, see: Organizing SearchProviders with Active, Default and Tags - Review the finished SearchProvider:
{ "name": "My New SearchProvider", "connector": "RequestsGet", "url": "https://host.com/", "query_template": "{url}?q={query_string}", "query_processors": [ "AdaptiveQueryProcessor" ], "query_mappings": "", "result_processors": [ "MappingResultProcessor", "CosineRelevancyResultProcessor" ], "response_mappings": "FOUND=jsonpath.to.number.found,RETRIEVED=jsonpath.to.number.retrieved,RESULTS=jsonpath.to.result.list", "result_mappings": "url=link,body=snippet,author=displayLink,NO_PAYLOAD", "credentials": "bearer=your-bearer-token-here", "tags": [ "MyTag" ] }
-
Go to SWIRL
localhost:8000/swirl/searchproviders/
, logging in if necessary. Put the form at the bottom of the page intoRaw data
mode, and paste the SearchProvider into the box. Click thePOST
button to add it. The SearchProvider page will reload. - Go to the Galaxy UI at
localhost:8000/galaxy/
and run a search using the Tag you created for the new SearchProvider. Results should again appear in roughly the same period of time.
Creating a Connector
SWIRL Connectors are responsible for loading a SearchProvider, constructing and transmitting queries to a particular type of service, then saving the response - typically a result list.
Consider using your favorite coding AI to generate a Connector by passing it the Connector base classes, and information about the API you are trying to query.
If you are trying to send an HTTP/S request to an endpoint that returns JSON or XML, you don't need to create a Connector. Instead, Create a SearchProvider that configures the RequestsGet connector included with SWIRL.
To create a new Connector:
-
Create a new Connector file, e.g.
swirl/connectors/my_connector.py
- Copy the style of the
ChatGPT
connector as a starting point, orBigQuery
if targeting a database.class MyConnector(Connector): def __init__(self, provider_id, search_id, update, request_id=''): self.system_guide = MODEL_DEFAULT_SYSTEM_GUIDE super().__init__(provider_id, search_id, update, request_id)
-
In the
__init__
class, load and persist anything that will be needed when connecting and querying the service. - Import the Python package(s) to connect to the service. The ChatGPT connector uses the
openai
package, for example:import openai
- Modify the
execute_search
method to connect to the service. As you can see from the ChatGPT Connector, it first loads the OpenAI credentials, then constructs a prompt, sends the prompt viaopenai.ChatCompletion.create()
, then stores the response.
def execute_search(self, session=None):
logger.debug(f"{self}: execute_search()")
if self.provider.credentials:
openai.api_key = self.provider.credentials
else:
if getattr(settings, 'OPENAI_API_KEY', None):
openai.api_key = settings.OPENAI_API_KEY
else:
self.status = "ERR_NO_CREDENTIALS"
return
prompted_query = ""
if self.query_to_provider.endswith('?'):
prompted_query = self.query_to_provider
else:
if 'PROMPT' in self.query_mappings:
prompted_query = self.query_mappings['PROMPT'].format(query_to_provider=self.query_to_provider)
else:
prompted_query = self.query_to_provider
self.warning(f'PROMPT not found in query_mappings!')
if 'CHAT_QUERY_REWRITE_GUIDE' in self.query_mappings:
self.system_guide = self.query_mappings['CHAT_QUERY_REWRITE_GUIDE'].format(query_to_provider=self.query_to_provider)
if not prompted_query:
self.found = 0
self.retrieved = 0
self.response = []
self.status = "ERR_PROMPT_FAILED"
return
logger.info(f'CGPT completion system guide:{self.system_guide} query to provider : {self.query_to_provider}')
self.query_to_provider = prompted_query
completions = openai.ChatCompletion.create(
model=MODEL,
messages=[
{"role": "system", "content": self.system_guide},
{"role": "user", "content": self.query_to_provider},
],
temperature=0,
)
message = completions['choices'][0]['message']['content'] # FROM API Doc
self.found = 1
self.retrieved = 1
self.response = message.replace("\n\n", "")
return
-
ChatGPT depends on the OpenAI API key, which is provided to SWIRL via the
.env
file. To follow this pattern, create new values in.env
then modifyswirl_server/settings.py
to load them as Django settings, and set a reasonable default. -
Modify the
normalize_response()
method to store the raw response. This is literally no more (or less) than writing the result objects out as a Python list and storing that inself.results
:def normalize_response(self): logger.debug(f"{self}: normalize_response()") self.results = [ { 'title': self.query_string_to_provider, 'body': f'{self.response}', 'author': 'CHATGPT', 'date_published': str(datetime.now()) } ] return
There's no need to do this if self.response
is already a Python list.
- Add the new Connector to
swirl/connectors/__init__.py
from swirl.connectors.my_connector import MyConnector
- Restart SWIRL
% python swirl.py restart
- Create a SearchProvider to configure the new Connector, then add it to the SWIRL installation as noted in the Creating a SearchProvider section above. Don't forget a useful Tag so that you can easily target the new connector when ready to test.
To learn more about developing Connectors, refer to the Developer Guide, Developing New Connectors section.
Creating a QueryProcessor
A QueryProcessor is a stage executed either during Pre-Query or Query Processing. The difference between these is that the result of Pre-Query processing is applied to all SearchProviders, and Query Processing is executed by each individual SearchProvider.
In both cases, the goal is to modify the query sent to some group of SearchProviders.
Note: if you just want to rewrite the query using lookup tables or regular expressions, consider using QueryTransformations
instead.
To create a new QueryProcessor:
-
Create a new module like
swirl/processors/my_query_processor.py
. You can also add your new class to an existing module likeswirl/processors/Generic.py
. - Copy the
GenericQueryProcessor
class as a starting point, and rename it:class MyQueryProcessor(QueryProcessor): type = 'MyQueryProcessor' def process(self): # TO DO: modify self.query_string, and return it return self.query_string + ' modified'
- A common use case for a new QueryProcessor is to use some function or external service to rewrite the query. Here's an example of how to do the former. Given some function in package MyPackage:
def my_query_modification_function(query): # implementation return modified_query
- Create a new QueryProcessor that imports the function:
from MyPackage import my_query_modification_function
class MyQueryProcessor(QueryProcessor):
type = 'MyQueryProcessor'
def process(self):
new_query_string = my_query_modification(self.query_string)
if new_query_string:
return new_query_string
else:
return self.query_string
-
Save the new QueryProcessor.
- If you created a new module, add it to
swirl/processors/__init__.py
. This can be skipped if you added the new QueryProcessor to an existing module.from swirl.processors.my_processor import *
-
Add the new Processor to the appropriate
swirl.models
CHOICES block. - For Pre-Query processing, add it to the
Search
object (this is required for security reasons):
PRE_QUERY_PROCESSOR_CHOICES = [
('GenericQueryProcessor', 'GenericQueryProcessor'),
('SpellcheckQueryProcessor', 'SpellcheckQueryProcessor'),
('MyQueryProcessor','MyQueryProcessor')
]
- To make the new Processor a default for new Search objects, add it to the return list in this block:
def getSearchPreQueryProcessorsDefault(): return []
- For Query processing in one or more SearchProviders, add it here:
QUERY_PROCESSOR_CHOICES = [
('GenericQueryProcessor', 'GenericQueryProcessor'),
('SpellcheckQueryProcessor', 'SpellcheckQueryProcessor'),
('MyQueryProcessor','MyQueryProcessor')
]
- To make the new Processor a default for new SearchProvider objects, add it to the return list in this block:
def getSearchProviderQueryProcessorsDefault(): return ["AdaptiveQueryProcessor"]
- Add the new QueryProcessor to either a
Search.pre_query_processing
pipeline or at least oneSearchProvider.query_processing
pipeline:
SearchProvider
:
"query_processors": [
"AdaptiveQueryProcessor",
"MyQueryProcessor"
],
Search
:
{
"query_string": "news:ai",
"pre_query_processors": [
"MyQueryProcessor"
],
}
- Restart SWIRL
% python swirl.py restart
- Go to the Galaxy UI (
http://localhost:8000/galaxy/
) and run a search; if using a query processor be sure to target that SearchProvider with a tag.
For example if you added a QueryProcessor to a SearchProvider query_processing
pipeline with a Tag of "news", the query would be http://localhost:8000/swirl/search/?q=news:some+query
instead. Results should appear in a just a few seconds. In the messages
block, a message indicating that the new QueryProcessor rewrote the query should appear:
MyQueryProcessor rewrote <Some-Connector> query to: <modified-query>
Creating a ResultProcessor
A ResultProcessor is a stage executed by each SearchProvider, after the Connector has retrieved results. ResultProcessors operate on results and transform them as needed for downstream consumption or presentation.
The GenericResultProcessor
and MappingResultProcessor
stages are intended to normalize JSON results. GenericResultProcessor
searches for exact matches to the SWIRL schema (as noted in the SearchProvider example) and copies them over. MappingResultProcessor
applies result_mappings
to normalize the results, again as shown in the SearchProvider example above. In general, adding stages after these is a good idea, unless the SearchProvider is expected to respond in a SWIRL schema format.
To create a new ResultProcessor:
-
Create a new file, e.g.
swirl/processors/my_result_processor.py
-
Copy the
GenericResultProcessor
class as a starting point, and rename it. -
Implement the
process()
method. This is the only one required.Process()
operates onself.results
, which will contain all the results from a given SearchProvider, in Python list format. Modify items in the result list, and report the number updated. - For example, here is a
process()
method that adds a field 'my_field1' to each result item, with the value "test":def process(self): if not self.results: return updated = 0 for item in self.results: # TO DO: operate on each item and count number updated item['my_field1'] = 'test' updated = updated + 1 # note: there is no need to save in this type of Processor # save modified self.results self.processed_results = self.results # save number of updated self.modified = updated return self.modified
- Save the new module and add it to
swirl/processors/__init__.py
from swirl.processors.my_processor import MyResultProcessor
- Add the new module to the following swirl.models CHOICES block (this is required for security reasons.)
RESULT_PROCESSOR_CHOICES = [ ('GenericResultProcessor', 'GenericResultProcessor'), ('DuplicateHalfResultProcessor', 'DuplicateHalfResultProcessor'), ('TestResultProcessor', 'TestResultProcessor'), ('MappingResultProcessor', 'MappingResultProcessor'), ('DateFinderResultProcessor','DateFinderResultProcessor'), ('DedupeByFieldResultProcessor', 'DedupeByFieldResultProcessor'), ('LenLimitingResultProcessor', 'LenLimitingResultProcessor'), ('CleanTextResultProcessor','CleanTextResultProcessor'), ('RequireQueryStringInTitleResultProcessor','RequireQueryStringInTitleResultProcessor'), ('AutomaticPayloadMapperResultProcessor', 'AutomaticPayloadMapperResultProcessor'), ('CosineRelevancyResultProcessor','CosineRelevancyResultProcessor'), ('MyResultProcessor','MyResultProcessor') ]
- To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block:
def getSearchProviderResultProcessorsDefault(): return ["MappingResultProcessor","DateFinderResultProcessor","CosineRelevancyResultProcessor"]
- Add the new module to at least one
SearchProvider.result_processing
pipeline:"result_processors": [ "MappingResultProcessor", "MyResultProcessor", "CosineRelevancyResultProcessor" ], ...etc...
- Restart SWIRL
% python swirl.py restart
-
Go to the Galaxy UI (
http://localhost:8000/galaxy/
) and run a search; be sure to target at least one SearchProvider that has the new ResultProcessor. For example if you added a ResultProcessor to a SearchProviderresult_processing
pipeline with the Tag "news", the query would need to behttp://localhost:8000/swirl/search/?q=news:some+query
instead of the above. - Results should appear in a just a few seconds. In the
messages
block a message indicating that the new ResultProcessor updated a number of results should appear, and the content should be modified as expected.MyResultProcessor updated 5 results from: MyConnector",
To learn more about writing Processors, refer to the Developer Guide
Creating a PostResultProcessor
A PostResultProcessor is a stage executed after all SearchProviders have returned results. They operate on all the results for a given Search.
To create a new ResultProcessor:
-
Create a new module like
swirl/processors/my_post_result_processor.py
- Copy the template below as a starting point, and rename it:
class MyPostResultProcessor(PostResultProcessor): type = 'MyPostResultProcessor' def process(self): updated = 0 for results in self.results: if not results.json_results: continue for item in results.json_results: # TO DO: operate on each result item item['my_field2'] = "test" updated = updated + 1 # end for # call results.save() if any results were modified if updated > 0: results.save() # end for ############################################ self.results_updated = updated return self.results_updated
-
Modify the
process()
method operating on the items and saving each result set as shown. - Add the new module to
swirl/processors/__init__.py
from swirl.processors.my_post_result_processor import MyPostResultProcessor
- Add the new module to the following
swirl.models
CHOICES block (this is required for security reasons. ):POST_RESULT_PROCESSOR_CHOICES = [ ('CosineRelevancyPostResultProcessor', 'CosineRelevancyPostResultProcessor'), ('DropIrrelevantPostResultProcessor','DropIrrelevantPostResultProcessor'), ('DedupeByFieldPostResultProcessor', 'DedupeByFieldPostResultProcessor'), ('DedupeBySimilarityPostResultProcessor', 'DedupeBySimilarityPostResultProcessor'), ('MyPostResultProcessor','MyPostResultProcessor') ]
- To make the new ResultProcessor a default for new SearchProvider objects, add it to the return list in this block:
def getSearchPostResultProcessorsDefault(): return ["DedupeByFieldPostResultProcessor","CosineRelevancyPostResultProcessor,"MyPostResultProcessor"]
- Add the new module to a
Search.post_result_processing
pipeline:{ "query_string": "news:ai", "post_result_processors": [ "DedupeByFieldPostResultProcessor", "CosineRelevancyPostResultProcessor", "MyPostResultProcessor" ], ...etc... }
- Restart SWIRL
% python swirl.py restart
-
Go to the Galaxy UI (
http://localhost:8000/galaxy/
) and run a search; be sure to target at least one SearchProvider that has the new PostResultProcessor. For example if you added a PostResultProcessor to a Searchpost_result_processing
pipeline with the Tag "news", the query would need to behttp://localhost:8000/swirl/search/?q=news:some+query
instead of the above. - Results should appear in a just a few seconds. In the
messages
block a message indicating that the new PostResultProcessor updated a number of results should appear, and the content should be modified as expected.MyPostResultProcessor updated 10 results from: MySearchProvider
To learn more about writing Processors, refer to the Developer Guide