Searching does not work properly for CJK ideographs #1004

nevikw39 · 2020-04-14T14:57:42Z

Hello,

Telegram's searching ability is poor when it comes to Chinese-Japanese-Korean ideographs, which leads to difficulty in promoting it around Taiwan.

I tried to find out the cause. I took a look at MessagesDb.cpp and found that Telegram uses SQLite to store messages and the FTS5 module to build a search table.

And that is the point. FTS5 splits a string into tokens and puts them into a hash table. Suppose there is a text "Telegram search". Only "Telegram" and "search" would match the text, whereas neither "Tele" nor "a" would get any result. Unfortunately, Chinese characters are all categorized as "Letter", which is considered part of a token. Hence, a whole Chinese text like "我好想要中文搜尋", which contains consecutive Chinese characters without any delimiter, would be viewed as a single token. That is, none of "想要", "中文" or "搜尋" would match the text.
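The behaviour can be reproduced outside of Telegram with a small sketch. It assumes a Python build whose bundled SQLite has FTS5 enabled and relies on the default unicode61 tokenizer; the table name `messages` and the column `body` are made up for illustration and are not the schema used in MessagesDb.cpp.

```python
import sqlite3


def search(conn, query):
    """Return the stored texts whose FTS5 index matches `query`."""
    rows = conn.execute(
        "SELECT body FROM messages WHERE messages MATCH ?", (query,)
    )
    return [row[0] for row in rows]


# Hypothetical in-memory table, indexed with the default tokenizer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(body)")
conn.executemany(
    "INSERT INTO messages(body) VALUES (?)",
    [("Telegram search",), ("我好想要中文搜尋",)],
)

# Whole tokens match; fragments of a token do not.
for query in ["Telegram", "Tele", "我好想要中文搜尋", "中文"]:
    print(repr(query), "->", search(conn, query))
```

Only the full token "Telegram" and the full Chinese string should return a row; "Tele" and "中文" should return nothing, because each is only a fragment of an indexed token.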

I have two ideas. The simple one is to insert an invisible separator such as '\a' between every Chinese character. The other is to implement a custom tokenizer.
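The first idea could look roughly like the sketch below. It is only an illustration under several assumptions: a space is used as the separator instead of '\a' for readability, only the basic CJK Unified Ideographs block and Extension A are detected, and the helper names (`is_cjk`, `separate_cjk`, `phrase_query`) as well as the table layout are hypothetical. The original text is kept in an UNINDEXED column, and the query is turned into an FTS5 phrase so that the separated characters must appear consecutively.

```python
import sqlite3


def is_cjk(ch):
    """True for the basic CJK Unified Ideographs block and Extension A only."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def separate_cjk(text, sep=" "):
    """Surround every CJK ideograph with `sep` so it becomes its own token."""
    return "".join(sep + ch + sep if is_cjk(ch) else ch for ch in text)


def phrase_query(text):
    """Wrap the separated text in double quotes to form an FTS5 phrase query."""
    return '"' + separate_cjk(text).replace('"', '""') + '"'


def search(conn, text):
    """Return the original texts that contain `text` as a token sequence."""
    rows = conn.execute(
        "SELECT original FROM messages WHERE messages MATCH ?",
        (phrase_query(text),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# `original` is stored but not indexed; `tokens` holds the separated text.
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(original UNINDEXED, tokens)")
for text in ["Telegram search", "我好想要中文搜尋"]:
    conn.execute(
        "INSERT INTO messages(original, tokens) VALUES (?, ?)",
        (text, separate_cjk(text)),
    )

for query in ["中文", "想要", "文中", "Telegram"]:
    print(repr(query), "->", search(conn, query))
```

With this layout every ideograph is indexed as a separate token, so the index grows and single-character queries match very broadly. That trade-off is one reason a custom tokenizer might still be the better long-term option.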

Nonetheless, I can hardly understand how MessagesDb.cpp works. Actually, I don't know how Telegram performs search tasks or how search_id is generated.

So, how can we solve this problem? I would like to contribute to Telegram.

Thanks.

levlam · 2020-04-14T16:48:39Z

You have found a client-side search, which is enabled only for secret chat messages. The best way to improve it is to contribute directly to SQLite's FTS extension.
Search for messages in all other chats is done server-side, so there is no way to improve it on TDLib's side.

nevikw39 · 2020-04-15T00:21:17Z

OK I see.

So, there is no way to check out Telegram server side code?

levlam · 2020-04-15T06:04:50Z

No.

kouhe3 · 2020-06-13T23:54:33Z

Telegram's search is based on "words", and words are delimited by punctuation or spaces.
This is an English-based search method, which is very convenient for English. For example, "hello" can't be found by searching for "he"; "hello" must be used. This fits English: when I want to find "he" messages, I don't want to see "hello" messages. But this approach is not convenient for Chinese and other languages. Chinese is based on Chinese characters.

https://congcong0806.github.io/2019/11/04/TelegramSearch/

nathancchu · 2022-02-11T22:32:48Z

Any updates on this? Not being able to search CJK characters effectively is a huge pain when using Telegram.

tylvn · 2022-04-09T17:43:19Z

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or is it technically hard to achieve?

githubhjs · 2022-10-03T03:50:07Z

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or is it technically hard to achieve?

Actually, I have found lots of CJK users on Telegram. But the search issue is keeping that number from growing.

devuterian · 2024-02-07T22:17:14Z

Still waiting for a fix... this is important.

tonytonyjan · 2024-05-18T13:36:05Z

I was eager to have this feature before.

I eventually switched to Discord with my friends.

TokuiNico · 2024-11-13T08:48:17Z

This feature is important for me.

Platway · 2024-11-21T07:42:20Z

They think their language is the entire world. What a narrow prejudice!

Valentyna12 · 2024-12-21T13:57:07Z

I agree with you.

tdlib deleted a comment from Valentyna12 Feb 8, 2024
