1
2
3
4
5
6
7Internet Engineering Task Force (IETF) T. Sirainen
8Request for Comments: 6203 March 2011
9Category: Standards Track
10ISSN: 2070-1721
11
12
13 IMAP4 Extension for Fuzzy Search
14
15Abstract
16
17 This document describes an IMAP protocol extension enabling a server
18 to perform searches with inexact matching and assigning relevancy
19 scores for matched messages.
20
21Status of This Memo
22
23 This is an Internet Standards Track document.
24
25 This document is a product of the Internet Engineering Task Force
26 (IETF). It represents the consensus of the IETF community. It has
27 received public review and has been approved for publication by the
28 Internet Engineering Steering Group (IESG). Further information on
29 Internet Standards is available in Section 2 of RFC 5741.
30
31 Information about the current status of this document, any errata,
32 and how to provide feedback on it may be obtained at
33 http://www.rfc-editor.org/info/rfc6203.
34
35Copyright Notice
36
37 Copyright (c) 2011 IETF Trust and the persons identified as the
38 document authors. All rights reserved.
39
40 This document is subject to BCP 78 and the IETF Trust's Legal
41 Provisions Relating to IETF Documents
42 (http://trustee.ietf.org/license-info) in effect on the date of
43 publication of this document. Please review these documents
44 carefully, as they describe your rights and restrictions with respect
45 to this document. Code Components extracted from this document must
46 include Simplified BSD License text as described in Section 4.e of
47 the Trust Legal Provisions and are provided without warranty as
48 described in the Simplified BSD License.
49
50
51
52
53
54
55
56
57
58Sirainen Standards Track [Page 1]
59
60RFC 6203 IMAP4 FUZZY Search March 2011
61
62
631. Introduction
64
65 When humans perform searches in IMAP clients, they typically want to
66 see the most relevant search results first. IMAP servers are able to
67 do this in the most efficient way when they're free to internally
68 decide how searches should match messages. This document describes a
69 new SEARCH=FUZZY extension that provides such functionality.
70
712. Conventions Used in This Document
72
73 In examples, "C:" indicates lines sent by a client that is connected
74 to a server. "S:" indicates lines sent by the server to the client.
75
76 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
77 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
78 document are to be interpreted as described in RFC 2119 [KEYWORDS].
79
803. The FUZZY Search Key
81
82 The FUZZY search key takes another search key as its argument. The
83 server is allowed to perform all matching in an implementation-
84 defined manner for this search key, including ignoring the active
85 comparator as defined by [RFC5255]. Typically, this would be used to
86 search for strings. For example:
87
88 C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
89 S: * SEARCH 1 5 10
90 S: A1 OK Search completed.
91
92 Besides matching messages with a subject of "IMAP break", the above
93 search may also match messages with subjects "broken IMAP", "IMAP is
94 broken", or anything else the server decides that might be a good
95 match.
96
97 This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
98 search:
99
100 C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
101 S: * SEARCH 1 4
102 S: A2 OK Search completed.
103
104 How the server handles multiple separate FUZZY search keys is
105 implementation-defined.
106
107 Fuzzy search algorithms might change, or the results of the
108 algorithms might be different from search to search, so that fuzzy
109 searches with the same parameters might give different results for
110 1) the same user at different times, 2) different users (searches
111
112
113
114Sirainen Standards Track [Page 2]
115
116RFC 6203 IMAP4 FUZZY Search March 2011
117
118
119 executed simultaneously), or 3) different users (searches executed at
120 different times). For example, a fuzzy search might adapt to a
121 user's search habits in an attempt to give more relevant results (in
122 a "learning" manner). Such differences can also occur because of
123 operational decisions, such as load balancing. Clients asking for
124 "fuzzy" really are requesting search results in a not-necessarily-
125 deterministic way and need to give the user appropriate warning about
126 that.
127
1284. Relevancy Scores for Search Results
129
130 Servers SHOULD assign a search relevancy score for each matched
131 message when the FUZZY search key is given. Relevancy scores are
132 given in the range 1-100, where 100 is the highest relevancy. The
133 relevancy scores SHOULD use the full 1-100 range, so that clients can
134 show them to users in a meaningful way, e.g., as a percentage value.
135
136 As the name already indicates, relevancy scores specify how relevant
137 to the search the matched message is. It's not necessarily the same
138 as how precisely the message matched. For example, a message whose
139 subject fuzzily matches the search string might get a higher
140 relevancy score than a message whose body had the exact string in the
141 middle of a sentence. When multiple search keys are matched fuzzily,
142 how the relevancy score is calculated is server-dependent.
143
144 If the server also advertises the ESEARCH capability as defined by
145 [ESEARCH], the relevancy scores can be retrieved using the new
146 RELEVANCY return option for SEARCH:
147
148 C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
149 S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
150 S: B1 OK Search completed.
151
152 In the example above, the server would treat "hello", "help", and
153 other similar strings as fuzzily matching the misspelled "Helo".
154
155 The RELEVANCY return option MUST NOT be used unless a FUZZY search
156 key is also given. Note that SEARCH results aren't sorted by
157 relevancy; SORT is needed for that.
158
1595. Fuzzy Matching with Non-String Search Keys
160
161 Fuzzy matching is not limited to just string matching. All search
162 keys SHOULD be matched fuzzily, although exactly what that means for
163 different search keys is left for server implementations to decide --
164 including deciding that fuzzy matching is meaningless for a
165 particular key, and falling back to exact matching. Some suggestions
166 are given below.
167
168
169
170Sirainen Standards Track [Page 3]
171
172RFC 6203 IMAP4 FUZZY Search March 2011
173
174
175 Dates:
176 A typical example could be when a user wants to find a message
177 "from Dave about a week ago". A client could perform this search
178 using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE
179 24-Jan-2009). The server could return messages outside the
180 specified date range, but the further away the message is, the
181 lower the relevancy score.
182
183 Sizes:
184 These should be handled similarly to dates. If a user wants to
185 search for "about 1 MB attachments", the client could do this by
186 sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again, the
187 further away the message size is from the specified range, the
188 lower the relevancy score.
189
190 Flags:
191 If other search criteria match, the server could return messages
192 that don't have the specified flags set, but with lower relevancy
193 scores. SEARCH SUBJECT "xyz" FUZZY ANSWERED, for example, might
194 be useful if the user thinks the message he is looking for has the
195 ANSWERED flag set, but he isn't sure.
196
197 Unique Identifiers (UIDs), sequences, modification sequences: These
198 are examples of keys for which exact matching probably makes sense.
199 Alternatively, a server might choose, for instance, to expand a UID
200 range by 5% on each side.
201
2026. Extensions to SORT and SEARCH
203
204 If the server also advertises the SORT capability as defined by
205 [SORT], the results can be sorted by the new RELEVANCY sort criteria:
206
207 C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
208 S: * SORT 5 10 1
209 S: C1 OK Sort completed.
210
211 The message with the highest score is returned first. As with the
212 RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
213 unless a FUZZY search key is also given.
214
215 If the server also advertises the ESORT capability as defined by
216 [CONTEXT], the relevancy scores can be retrieved using the new
217 RELEVANCY return option for SORT:
218
219 C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
220 "Helo"
221 S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
222 S: C2 OK Sort completed.
223
224
225
226Sirainen Standards Track [Page 4]
227
228RFC 6203 IMAP4 FUZZY Search March 2011
229
230
231 Furthermore, if the server advertises the CONTEXT=SORT (or
232 CONTEXT=SEARCH) capability, then the client can limit the number of
233 returned messages to a SORT (or a SEARCH) by using the PARTIAL return
234 option. For example, this returns the 10 most relevant messages:
235
236 C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
237 "World"
238 S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
239 S: C3 OK Sort completed.
240
2417. Formal Syntax
242
243 The following syntax specification uses the augmented Backus-Naur
244 Form (BNF) as described in [ABNF]. It includes definitions from
245 [RFC3501], [IMAP-ABNF], and [SORT].
246
247 capability =/ "SEARCH=FUZZY"
248
249 score = 1*3DIGIT
250 ;; (1 <= n <= 100)
251
252 score-list = "(" [score *(SP score)] ")"
253
254 search-key =/ "FUZZY" SP search-key
255
256 search-return-data =/ "RELEVANCY" SP score-list
257 ;; Conforms to <search-return-data>, from [IMAP-ABNF]
258
259 search-return-opt =/ "RELEVANCY"
260 ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
261
262 sort-key =/ "RELEVANCY"
263
2648. Security Considerations
265
266 Implementation of this extension might enable denial-of-service
267 attacks against server resources. Servers MAY limit the resources
268 that a single search (or a single user) may use. Additionally,
269 implementors should be aware of the following: Fuzzy search engines
270 are often complex with non-obvious disk space, memory, and/or CPU
271 usage patterns. Server implementors should at least test the fuzzy-
272 search behavior with large messages that contain very long words
273 and/or unique random strings. Also, very long search keys might
274 cause excessive memory or CPU usage.
275
276 Invalid input may also be problematic. For example, if the search
277 engine takes a UTF-8 stream as input, it might fail more or less
278 badly when illegal UTF-8 sequences are fed to it from a message whose
279
280
281
282Sirainen Standards Track [Page 5]
283
284RFC 6203 IMAP4 FUZZY Search March 2011
285
286
287 character set was claimed to be UTF-8. This could be avoided by
288 validating all the input and, for example, replacing illegal UTF-8
289 sequences with the Unicode replacement character (U+FFFD).
290
291 Search relevancy rankings might be susceptible to "poisoning" by
292 smart attackers using certain keywords or hidden markup (e.g., HTML)
293 in their messages to boost the rankings. This can't be fully
294 prevented by servers, so clients should prepare for it by at least
295 allowing users to see all the search results, rather than hiding
296 results below a certain score.
297
2989. IANA Considerations
299
300 IMAP4 capabilities are registered by publishing a standards track or
301 IESG-approved experimental RFC. The "Internet Message Access
302 Protocol (IMAP) 4 Capabilities Registry" is available from
303 http://www.iana.org/.
304
305 This document defines the SEARCH=FUZZY IMAP capability. IANA has
306 added it to the registry.
307
30810. Acknowledgements
309
310 Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo, and Dave
311 Cridland have helped with this document.
312
31311. Normative References
314
315 [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for
316 Syntax Specifications: ABNF", STD 68, RFC 5234,
317 January 2008.
318
319 [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4",
320 RFC 5267, July 2008.
321
322 [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
323 Command for Controlling What Kind of Information Is
324 Returned", RFC 4731, November 2006.
325
326 [IMAP-ABNF] Melnikov, A. and C. Daboo, "Collected Extensions to
327 IMAP4 ABNF", RFC 4466, April 2006.
328
329 [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
330 Requirement Levels", BCP 14, RFC 2119, March 1997.
331
332 [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
333 4rev1", RFC 3501, March 2003.
334
335
336
337
338Sirainen Standards Track [Page 6]
339
340RFC 6203 IMAP4 FUZZY Search March 2011
341
342
343 [RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
344 Message Access Protocol Internationalization", RFC 5255,
345 June 2008.
346
347 [SORT] Crispin, M. and K. Murchison, "Internet Message Access
348 Protocol - SORT and THREAD Extensions", RFC 5256,
349 June 2008.
350
351Author's Address
352
353 Timo Sirainen
354
355 EMail: tss@iki.fi
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394Sirainen Standards Track [Page 7]
395
396