You can't do this with sprintf(), but you may be able to with snprintf(), depending on your platform.
You need to know how many characters you are replacing (but since you are putting them in the middle of a string, you probably know that anyway).
This works because some implementations of snprintf() do NOT guarantee that a terminating null character is written, presumably for compatibility with functions like strncpy().
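char message[] = "Hello 123, it's good to see you.";  /* [] leaves room for the terminating '\0'; [32] would not */
snprintf(&message[6], 3, "Joe");  /* a non-terminating implementation writes exactly 'J', 'o', 'e' */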
After this, "123" is replaced with "Joe".
On implementations where snprintf() guarantees null termination even when the output is truncated, this will not work: the call writes "Jo" followed by a null character, which cuts the string short. So if portability is a concern, you should avoid this.
Most Windows-based versions of snprintf() exhibit the non-terminating behavior.
But MacOS and BSD (and possibly Linux) appear to always null-terminate.
You might want to look into the category of "similarity measures" or "distance measures" (which, in data mining lingo, is different from "classification").
Basically, a similarity measure is a mathematical way to take two sets of data, run some computation on them, and get back a number that tells you how similar they are.
With similarity measures, this number is between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical".
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
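To make that concrete, here is a minimal Python sketch of both measures plus the ranking step. The function names and the second "Reporting" category are made up for illustration, and the sentence is assumed to be already split into lowercase words; treat it as a starting point rather than a finished implementation.

```python
import math

def jaccard_index(a, b):
    """Jaccard index of two word sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

def rank_categories(sentence_words, categories):
    """Score the sentence against every category, best match first."""
    scores = {name: jaccard_index(sentence_words, keywords)
              for name, keywords in categories.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

categories = {
    "Customer Transactions": {"deposits", "deposit", "customer", "account", "accounts"},
    "Reporting": {"report", "summary", "statement"},  # hypothetical second category
}
sentence_words = {"the", "system", "applies", "deposits", "to",
                  "customer", "specified", "account"}
print(rank_categories(sentence_words, categories))
```

Jaccard works directly on sets of words, while cosine similarity expects the two inputs as vectors of the same length, which is why the choice depends on how you structure your data.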
As an example:
From your question:
Customer Transactions: deposits, deposit, customer, account, accounts
So you can construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "customer transactions" keyword list, you have 5 words, and (this will sound obvious, but) each of those words is present in your search string. Stay with me.
So now you take your sentence:
The system applies deposits to customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, account, customer}
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
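Here is a rough Python sketch of how that vector could be built. The `normalize` helper is my own assumption about how to handle the "customer's" nuance (lowercase, strip punctuation, drop a trailing possessive); real text will probably need something more careful, such as stemming.

```python
KEYWORDS = ["deposits", "deposit", "customer", "account", "accounts"]

def normalize(word):
    """Assumed cleanup: lowercase, strip punctuation, drop a trailing 's."""
    word = word.lower().strip(".,;:!?\"'()")
    if word.endswith("'s"):
        word = word[:-2]
    return word

def to_vector(sentence, keywords):
    """One element per keyword: 1 if it occurs in the sentence, else 0."""
    words = {normalize(w) for w in sentence.split()}
    return [1 if kw in words else 0 for kw in keywords]

sentence = "The system applies deposits to customer's specified account."
print(to_vector(sentence, KEYWORDS))  # [1, 0, 1, 1, 0]
```

Without the possessive handling, "customer's" would not match "customer" and the third element would come out as 0.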
So we could say: in how many positions do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They only differ by 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
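In Python, the comparison above could be sketched like this. The `hamming_similarity` function is one common way to normalize the distance into the 0-to-1 range (1 minus distance divided by length); it is my addition, so pick whatever normalization suits your data.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(u, v) if x != y)

def hamming_similarity(u, v):
    """Assumed normalization: 1 - distance / length, a value between 0 and 1."""
    if not u:
        raise ValueError("vectors must not be empty")
    return 1 - hamming_distance(u, v) / len(u)

keywords_vector = [1, 1, 1, 1, 1]
sentence_vector = [1, 0, 1, 1, 0]
print(hamming_distance(keywords_vector, sentence_vector))    # 2
print(hamming_similarity(keywords_vector, sentence_vector))  # 0.6
```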
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
Good luck!
The main characteristics of the problem are:
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy in the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU and/or space intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (Conceptual, i.e. no focus on performance tricks at this time)

Parameters:
  CatKWs = an array/hash of lists of strings. Each list contains the possible
           keywords for a given category.
           usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
  NbCats = integer number of pre-defined categories
Variables:
  CatAccu = an array/hash of numeric values, with one entry for each of the
            possible categories.
            usage: CatAccu[3] = 4 (if array) or CatAccu['CustTx'] += 1 (hash)
  TotalKwOccurences = counts the total number of keyword matches (counts
            multiple when a word is found in several pre-defined categories)

Pseudo code: (for categorizing one input item)
  1. TotalKwOccurences = 0
     for x in 1 to NbCats
       CatAccu[x] = 0    // reset the accumulators
  2. for each word W in Item
       for each x in 1 to NbCats
         if W found in CatKWs[x]
           TotalKwOccurences++
           CatAccu[x]++
  3. for each x in 1 to NbCats    // skip if TotalKwOccurences is 0
       CatAccu[x] = CatAccu[x] / TotalKwOccurences    // calculate rating
  4. Sort CatAccu by value
  5. Return the ordered list of (CategoryID, rating) for all corresponding
     CatAccu[x] values above a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the categories for a given item, but it may be significant when comparing the ratings of different items.
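For what it's worth, the basic algorithm could look roughly like this in Python. This is only a sketch: the whitespace tokenizing, the punctuation stripping and the "Reports" category are assumptions of mine, and a real version would need better word normalization.

```python
def categorize(item, cat_kws, threshold=0.0):
    """Rate each category by its share of all keyword matches found in `item`."""
    cat_accu = {cat: 0 for cat in cat_kws}        # step 1: reset the accumulators
    total_kw_occurrences = 0
    for word in item.lower().split():             # step 2: count the matches
        word = word.strip(".,;:!?\"'()")
        for cat, keywords in cat_kws.items():
            if word in keywords:
                total_kw_occurrences += 1
                cat_accu[cat] += 1
    if total_kw_occurrences == 0:
        return []                                 # nothing matched, nothing to rate
    ratings = {cat: count / total_kw_occurrences  # step 3: calculate the ratings
               for cat, count in cat_accu.items()}
    ranked = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)  # step 4
    return [(cat, rating) for cat, rating in ranked if rating > threshold]  # step 5

cat_kws = {
    "CustTx": {"deposits", "deposit", "customer", "account", "accounts"},
    "Reports": {"report", "account", "summary"},  # hypothetical second category
}
item = "The system applies deposits to customer's specified account."
print(categorize(item, cat_kws))
```

Here "account" counts once for each category that lists it, so "CustTx" gets 2 of the 3 matches and "Reports" gets 1. Note that "customer's" is not matched by this naive tokenizer.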
Now, several simple improvements come to mind: (I'd seriously consider the first two, and give some thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized and other factors...)
Aside from the calculation of the rating per se, we should also consider:
The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined dictionary of category keywords (typically, training sets are used to determine this very list of category keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
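As an illustration of such a metric, here is a minimal Python sketch that measures plain accuracy against a hand-labeled reference set. Everything in it is hypothetical: the tiny keyword lists, the three reference items and the `top_category` classifier are placeholders for your own data and algorithm.

```python
def top_category(item, cat_kws):
    """Hypothetical classifier: the category with the most keyword hits, or None."""
    words = {w.strip(".,;:!?\"'()") for w in item.lower().split()}
    hits = {cat: len(words & keywords) for cat, keywords in cat_kws.items()}
    best = max(hits, key=hits.get)  # ties go to the first category listed
    return best if hits[best] > 0 else None

def accuracy(cat_kws, reference_set):
    """Fraction of reference items whose top category matches the expected label."""
    if not reference_set:
        raise ValueError("reference set must not be empty")
    correct = sum(1 for item, expected in reference_set
                  if top_category(item, cat_kws) == expected)
    return correct / len(reference_set)

# Made-up keyword lists and labeled items, for illustration only.
CAT_KWS = {
    "CustTx": {"deposit", "deposits", "customer", "account"},
    "Reports": {"report", "summary"},
}
REFERENCE = [
    ("Apply deposits to the customer account.", "CustTx"),
    ("Print a monthly summary report.", "Reports"),
    ("Customer requests a report.", "Reports"),  # a tie, resolved the wrong way
]
print(accuracy(CAT_KWS, REFERENCE))  # 2 of 3 items are categorized correctly
```

Tuning the keyword lists until this number hits 100% on a small reference set is exactly the kind of specialization that leads to overfitting.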
To summarize: stick to simple approaches; the context doesn't leave much room to be very fancy anyway. Consider introducing a way of measuring the effectiveness of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and may prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).