Monday, 3 May 2010

* AUTOPSY OF A PLAGIARIZED THESIS, CONTINUED

This article was posted online on 4 May 2010. It follows "Serials plagiaires [1]" and forms one panel of a triptych whose centerpiece is the article "Comités de sélection, plagiat et les mystères de Paris 8".

FOREWORD
At the end of the article "Serials Plagiaires [1]", published on this blog on 12 April, we announced that an article would go online in early May "on the publication context and the nature of the two articles (...) co-signed by M. Sanan, M. Rammal and K. Zreik and incorporated without further precaution, by copy-paste and translation, by Majed Sanan into his thesis."

On the morning of 1 May, this blog received an email sent from the address "msanan". The content of this email being what it is, we have no reason to doubt its authenticity. In short, in this email Majed Sanan threatens us (in defamatory terms) with prosecution for defamation against him for having reported his plagiarisms. This in the name of, I quote, "art. 29 of the law of 29 July 1881 on freedom".
Hélène Maurel-Indart, Professor of Literature at the Université François-Rabelais in Tours and a specialist in literary plagiarism, also dealt, among other cases, with plagiarism of academic work in the research connected with her HDR (supervised by Antoine Compagnon) (Plagiats, les coulisses de l'écriture. Ed. de la Différence, 2007). On that account she too was the target of the kind of bad-faith lawsuit with which we are now threatened (the case Bernard Edelman v. Hélène Maurel-Indart; see her site, http://www.leplagiat.net/, under the heading "un odieux procès"). For that trial, H. Maurel-Indart received firm legal support from her university.
To tell the truth, perhaps we should hope that our opponents go astray and take the judicial route. Beyond the exposure such a trial would give this blog, and therefore their plagiarisms, it would be a fine opportunity to address the status of plagiarized theses and of plagiarized "scientific" articles, as well as the freedom to criticize academic work (an issue already at the center of the debates in the trial Bernard Edelman v. Hélène Maurel-Indart).

Given its defamatory character, this email has been forwarded to the Presidency of Université Paris 8, together with a request for "functional" (legal) protection with a view to filing a complaint.
Nevertheless, this "msanan" email remains interesting, so starkly does it shed light on a plagiarist's reasons, justifications and means of defense. That is why we make this email available in the very place its author wanted (comment no. 9 on the article "Serials Plagiaires [1]").

* *

AUTOPSY OF A PLAGIARIZED THESIS, CONTINUED


SECOND-ORDER PLAGIARISMS

The article ARABIC DOCUMENTS CLASSIFICATION USING N-GRAM, co-signed by Majed Sanan, Mahmoud Rammal and Kaldoun Zreik and reproduced in full in the annex, was published in the proceedings of ICHSL6 (International Conferences on Human System Learning), edited in 2008 by Saïd Tazi and Kaldoun Zreik and published by Éditions Europia (see note 1). It was published (see the attached cover) under the logo of the "IEEE, section France", the French branch of an association founded in the United States and very well known to computer scientists.
Translated into French, this article was copied in full, without any clear mention of its source, by Majed Sanan into the third part of his thesis, "Expérimentation" (see the article Serials Plagiaires [1]). The thesis was supervised by Kaldoun Zreik, a co-signatory of that same article.

Any first-order plagiarism in this English-language article is therefore also a second-order plagiarism in Majed Sanan's thesis.
The "bibliography of plagiarisms" that follows thus completes the bibliography of plagiarisms of M. Sanan's thesis (see Serials Plagiaires [1]).
Other aspects of this plagiarized article, notably its variants in a diversified editorial context, will be addressed in a later article: LE SOUFFLE DE SHANGHAI ET LES ERREMENTS DE LA BIBLIOMÉTRIE.

In the annex, we have reproduced this article using the same color codes as for the plagiarized thesis of Marie-France Ango Obiang (see "Nancy 2 : le plagiat c'est ça").


BIBLIOGRAPHY OF PLAGIARISMS

DUNLOP Mark D. (1994). Free Text Retrieval: The Magic Explained. LibTech International '94 conference. [Online] author's personal site: http://personal.cis.strath.ac.uk/~mdd/research/publications/libtech/

JALAM Radwan (2003). Apprentissage automatique et catégorisation de textes multilingues. Thesis (computer science) defended at Université Lumière - Lyon 2. [Online] agrocampus-ouest site: http://www2.agrocampus-ouest.fr/math/jalam/these/these_radwan.pdf

KHREISAT Laila (2006). Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study. At the 2006 International conference in Data Mining, Part of the 2006 World Congress in Computer Sciences. DMIN 2006 78-82. [Online] WCCS site: http://ww1.ucmss.com/books/LFS/CSREA2006/DMI5552.pdf

NADKARNI Prakash M. [Online] Yale Center for Medical Informatics site: http://ycmi.med.yale.edu/nadkarni/db_course/IR_Frame.htm

PENG Fuchun (2003). Language Independent Text Learning with Statistical n-Gram Language Models. Thesis for the degree of "Doctor of Philosophy of Computer science". University of Waterloo, Canada. [Online] University of Waterloo site: http://ai.uwaterloo.ca/~f3peng/publication/thesis.pdf

SOUCY Pascal and MINEAU Guy (2005). Beyond TFIDF Weighting for Text Categorization in the Vector Space Model. 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland. [Online] IJCAI site: http://www.ijcai.org/papers/0304.pdf

* *

A PLAGIARISM SHAPED LIKE A "GRAS MAIGRE" (FAT/LEAN) GRAVY BOAT

Among the plagiarisms highlighted by their coloring (see annex), let us take up two of them, published in navy blue and pointing, as their source, to Radwan Jalam's thesis alone. These are the two plagiarized sequences below. To distinguish them more clearly from one another, the second switches below from regular-weight navy blue to bold emerald blue.

JALAM Radwan, 2006. (thesis 2003, p. 11).
The categorization of documents comports a choice of a learning technique (or classifier). The main classifiers used are the following:
- Discriminated factorial analysis (2)
- Neuronal network (3)
- K-neighbors (4)
- Decision tree (5)
- Bayesian network (6).
JALAM Radwan (thesis 2003, p. 11)
[2] Lebart and Salem, 1994. [3] Wiener et al. 1995. Schütze et al. 1995. Stricker. 2000. [4] Yang and Chute, 1994. Yang et Liu. 1999. [5] Lewis and Ringuette, 1994. Apté et al. 1994. [6] Borko and Bernick, 1964. Lewis, 1998. Andropsopulos et al., 2000. Chai et al., 2002. Adam et al., 2002.
Let us now quote the source text, single and in French, of these 2 sequences as it appears on page 11 of Radwan Jalam's thesis:
1.3.2. Choix de classifieurs
La catégorisation de textes comporte un choix de technique d’apprentissage (ou classifieur) disponibles. Parmi les méthodes d’apprentissage les plus souvent utilisées figurent l’analyse factorielle discriminante [Lebart and Salem, 1994], la régression logistique [Hull, 1994], les réseaux de neurones [Wiener et al., 1995, Schütze et al., 1995, Stricker, 2000], les plus proches voisins [Yang and Chute, 1994, Yang and Liu, 1999], les arbres de décision [Lewis and Ringuette, 1994, Apté et al., 1994], les réseaux bayesiens [Borko and Bernick, 1964, Lewis, 1998, Androutsopoulos et al., 2000, Chai et al., 2002, Adam et al., 2002], les machines à vecteurs supports [Joachims, 1998, Joachims 1999, Joachims 2000, Dumais et al., 1998 HE et al., 2000] et (...)
The 2 plagiarized sequences, separated in the article, come from the same source sequence in the thesis. By a method reminiscent of certain gravy boats known as "gras maigre" (fat/lean), with a single reservoir but two pouring spouts, the plagiarist split the source sequence (Jalam's thesis) in two to make two plagiarisms of it: the first, the list of "classifieurs" translated as "classifiers", used in the body of the text (here in regular weight), and the second, the bibliographic references (in bold), placed at the end of the article.
Note that in this case, even if the source text had not been translated from French into English, this kind of splitting of the source text in two would have shielded the 2 resulting plagiarized sequences from detection by "anti-plagiarism" software.


RENDER UNTO CAESAR WHAT BELONGS TO ...

Since the article has three co-authors, one may be tempted to attribute to each his own plagiarisms. But only clues are available. It is probable that Majed Sanan is the one who holds the file of the 2667 texts of the Lebanese parliament and feeds it into a machine that turns out tables and statistics, the signs of intense research activity. For the rest, the following elements, or clues, can be noted:

Plagiarisms drawn from the article by Laila KHREISAT:
On Google, queries combining the name of Laila Khreisat with that of M. Sanan, or that of M. Rammal, return no results.
By contrast, the Google query combining the names Laila Khreisat and Kaldoun Zreik leads to the article Classification of Arabic Information Extraction methods, co-signed by three authors, A. S. Al Hajjar, M. Hajjar and K. Zreik, in 2009, and therefore later than the article co-signed by Majed Sanan, Mahmoud Rammal and Kaldoun Zreik. It borrows, without the required quotation marks, from the same article by Laila Khreisat that is plagiarized in the article co-signed by Majed Sanan, Mahmoud Rammal and Khaldoun Zreik.

In this latter article, Classification of Arabic Information Extraction methods, the plagiarism of Laila Khreisat's text takes a mixed form that is very common today. The borrowed text is indeed preceded by its source, but in the total absence of quotation marks the reader is led to believe that the text referring to Laila Khreisat's work was actually written by the co-authors of the article, and is thus the fruit of a work of synthesis and writing, whereas it is merely an assemblage of simple copy-pastes (in red) from the cited article by Laila Khreisat. This is a form of plagiarism that the public law section of the CNU, for its part, deems "unworthy of academics" (see La lutte contre le plagiat à l'université est mal partie):
L. Khreisat (2006) presented the N-Gram Frequency Statistics technique for classifying arabic text documents. The technique employs a dissimilarity measure called the “Manhattan distance”, and “Dice’s measure”. (...) A corpus of arabic text documents was collected from online arabic newspapers, 40% of the corpus was used as training classes and the remaining 60% of the corpus was used for classification. All documents, whether training documents or documents to be classified went through a preprocessing normalization phase that remove the punctuation marks, the stop words, the diacritics, (...)
Let us note in passing that Laila Khreisat, the plagiarized author, began her studies at Yarmouk University in Jordan before earning two master's degrees and a Ph.D. at Columbia University and the City University of New York. Majed Sanan, the plagiarizing doctoral student, studied at the University of Beirut, then chose Caen and later Paris 8 to write his plagiarized thesis. Perhaps he would have found it harder to get this kind of work validated in New York.

Plagiarism drawn from Fuchun PENG:
On Google, the query combining the names Fuchun Peng and K. Zreik leads to the thesis of D. NGuyen (computer science thesis, Université de Caen, 2006): Extraction d'Information à partir de documents Web multilingues : une approche d'analyse structurelle. Thesis supervised by Kaldoun Zreik (rapporteurs: Imad Saleh and Saïd Tazi). Fuchun Peng is cited in its bibliography for another article on n-grams.
By contrast, on Google, combining the name of Fuchun Peng with that of M. Sanan, or that of M. Rammal, returns no results.
No Google query pairing the names Nadkarni, Mineau or Soucy with M. Sanan, M. Rammal or K. Zreik returns any results.

* *

UNFINISHED STUDY

The second article co-signed by Majed Sanan, Mahmoud Rammal and Kaldoun Zreik and "copy-pasted" by M. Sanan into his thesis is entitled "L’accès multilingue à l’information scientifique et technologique : limitations des moteurs de recherche en langue arabe".
This article was published online in the proceedings of CIDE 10, the 10th Colloque International sur le Document Electronique, organized at the initiative of K. Zreik and held in Nancy from 2 to 4 July 2007 (http://cide10.inist.fr/article.php3?id_article=14).

The analysis of this article from the standpoint of its plagiarisms is not complete, for the following reason:
The online platform IEEE Xplore Digital Library, run by the IEEE, offers the article "Internet Arabic Search Engines Studies", co-signed by M. Sanan, M. Rammal and K. Zreik.
On IEEE Xplore, this article is presented as part of the proceedings of the ICTTA 2008 conference (Information and Communication Technologies: from Theory to Applications). We have not yet had access to this document. Our hypothesis is that this English-language article could be, at least in parts, fairly close to the article published in French in the CIDE 10 proceedings. If our hypothesis proves correct, plagiarisms from English sources that we have not yet spotted in the CIDE 10 version will be more easily detected by analyzing their English versions.

We will therefore confine ourselves to an unfinished bibliography of the plagiarisms in this CIDE 10 article. For the reasons mentioned above, this "bibliography" is also to be added, as second-order plagiarisms, to the bibliography of plagiarisms of M. Sanan's thesis.

Bibliography of plagiarisms:

HARMANANI Haidar, KEIROUZ Walid and RAHEEL Saeed (2006). A Rule-Based Extensible Stemmer for Information Retrieval with Application to Arabic. In The International Arab Journal of Information Technology, Vol. 3, No. 3, July 2006. [Online] site of the Colleges of Computing and Information Society (CCIS): http://www.ccis2k.org/iajit/PDF/vol.3,no.3/12-Haidar.pdf

KADRI Youssef and NIE Jian-Yun (2004). Traduction des requêtes pour la recherche d’information translinguistique anglais-arabe. Colloque JEP-TALN, Traitement Automatique de l'Arabe, Fès. [Online] Université d'Aix site: www.lpl.univ-aix.fr/jep-taln04/proceed/actes/arabe2004/TFYK31.pdf

XU Jinxi, FRASER Alexander, WEISCHEDEL Ralph (2002). Empirical studies in strategies for arabic retrieval. [Online] University of Stuttgart site:

* *
Note (1)
The context in which articles co-signed by these 3 authors were published by Éditions Europia will be addressed in a forthcoming article, LE SOUFFLE DE SHANGHAI ET LES ERREMENTS DE LA BIBLIOMÉTRIE.
Éditions Europia is a department of Europia productions, a communications company founded and run by K. Zreik.
Europia organizes numerous scientific conferences and symposia (Europia, ICHSL, CIDE, HyperUrbain, CAC, PhDit, DEECE, O1Design) aimed mainly at architects working with computing, at computer scientists, and at researchers in information and communication sciences. The proceedings of these conferences are then published by Éditions Europia.

Jean-Noël Darde
MCF - Université Paris 8

* * *

FULL-TEXT ARTICLE

Reminder: we have reproduced this article using the same color codes as for the conclusion of the plagiarized thesis of Marie-France Ango Obiang (see "Nancy 2 : le plagiat c'est ça !").
All the colored sequences are documented plagiarized sequences. The sequences in black are therefore either sequences written by the co-authors themselves or plagiarisms that have not been identified.
The words in Arabic, the equations (see equation 1, 2...), the tables (table 1, 2...) and the illustrations (figure...) do not appear as such below, but their positions are indicated.

ARABIC DOCUMENTS CLASSIFICATION USING N-GRAM
Majed Sanan, Paris 8, University Paris, France
Mahmoud Rammal, Lebanese University, Beirut - Lebanon
Khaldoun Zreik, Paris 8, University Paris - France

KHREISAT Laila (2006 p. 1)
I. INTRODUCTION
The rapid growth of the internet has increased the number of online documents available. This has led to the development of automated text and document classification systems that are capable of automatically organizing and classifying documents. Text classification (or categorization) is the process of structuring a set of documents according to a group structure that is known in advance. There are several different methods for text classification, including statistical-based algorithms, Bayesian classification, distance-based algorithms, k-nearest neighbors, decision tree-based methods... ([4] to name a few).
Text classification techniques are used in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, sorting through digitized paper archives, automated indexing of scientific articles, classification of news stories, and searching for interesting information on the web (WWW).


The majority of these systems is designed to handle documents written in non-Arabic language, developing text classification systems for Arabic documents is a challenging task due to the complex and rich nature of the Arabic language.(...) The Arabic language consists of 28 letters. The language is written from right to left. It has very complex morphology, and the majority of words have a tri-letter root. The rest have either a quad-letter root, penta-letter root, or hexa-letter root.
In our approach, we will use only the similarity measures and compare the results in order to know the convenient measure in classification using N-grams. And because that classification is one method of text mining we will explain in the following paragraph the steps of text mining, then we will see the preprocessing and indexing of texts before to be classified. At the next paragraph, we will explain the different similarity measures that we will use in our approach, and then the effectiveness measure used to calculate the precision and recall of each class. At paragraph 6 we will explain our approach and experiments, and finally we will see the conclusion and future approaches.

PPT : Text Based Information Retrieval - Document Mining
II TEXT MINING
A Definition
Text mining is defined[1] as the non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data. Text Mining is the process of applying automatic methods to analyse and structure textual data in order to create useable knowledge from previously unstructured information.
B Text mining methods
There are many text mining applications or methods. Four of these methods are the following:
* Information Retrieval This method consists of indexing and retrieval of textual documents.
* Information Extraction; It means extraction of partial knowledge in the text.
* Web Mining It consists on indexing and retrieval of textual documents and extraction of partial knowledge using the web.
* classification Given: a collection of labelled documents (training set), the goal is to find a model for the class as a function of the values of the features


C Text mining steps
These steps concern principally the manner in which a text is represented (or structured), the choice of predicted algorithm to use, and then how to evaluate the obtained results to guarantee a good generalization of the model applied.

1) Representation of the information
In this step, we have to segment the unstructured information and put the units segmented into a table. But we have to choose the descriptors (important terms in documents) which can be chosen as words, lemmas, stemmas, or n-grams (characters or words or phrases).
And finally in some cases we have to think how to reduce the dimension of this textual space.
2) Automatic categorization of documents
This is the second step, the text categorization can be defined as the process that permit to associate a category(ies) or class(es) to a text (or document), in function of information contained in this text.
This association is very long and expensive then we think about the automation of this process. The functional link between a class and a document, that is called “a prediction model”, is estimated by a machine learning method.

JALAM Radwan, 2006. (thesis, p. 11).
The categorization of documents comports a choice of a learning technique (or classifier). The main classifiers used are the following :
- Discriminated factorial analysis (2)
- Neuronal network (3)
- K-neighbors (4)
- Decision tree (5)
- Bayesian network (6).
In this final step, we have to evaluate the obtained results to guarantee a good generalization of the model applied.

KHREISAT Laila (2006. p.2)
III TEXT PREPROCESSING AND INDEXING
All text documents went through a preprocessing stage. This was necessary due to the variations in the way text can be represented in Arabic. The preprocessing was performed for the documents to be classified and the training classes themselves. Preprocessing consisted of the following steps:
1) Convert text files to UTF-8 encoding.
2) Remove punctuation marks, diacritics, non-letters, stop words. The definitions of these were obtained from the Khoja stemmer.
3) Replace initial ?? with ??
4) Replace final ?? followed by ??? with (???)

XU Jinxi, FRASER Alexander, WEISCHEDEL Ralph (2002, p. 2)
A
Spelling normalization and mapping
Arabic orthography is highly variable. For instance, changing the letter YEH (?) to ALEF MAKSURA (?) at the end of a word is very common (Not surprisingly, the shapes of the two letters are very similar.). Since variations of this kind usually result in an “invalid” word, in our experiments we detected such “errors” using a stemmer (the Buckwalter Stemmer) and restored the correct word ending.
A more problematic type of spelling variation is that certain glyphs combining HAMZA or MADDA with ALEF (e.g. ??, ?? and ??) are sometimes written as a plain ALEF (??), possibly because of their similarity in appearance. Often, both the intended word and what is actually written are valid words.
This is much like confusing “résumé” with “resume” in English. Since both the intended word and the written form are correct words, it is impossible to correct the spellings without the use of context.
We explored two techniques to address the problem.
1) With normalization technique, we replace all occurrences of the diacritical ALEFs by the plain ALEF.
2) With the mapping technique, we map a word with the plain ALEF to a set of words that can potentially be written as that word by changing diacritical ALEFs to the plain ALEF. In this absence of training data, we will assume that all the words in the set are equally probable.
Both techniques have pros and cons. The normalization technique is simple, but it increases ambiguity. The mapping technique, on the other hand, does not introduce additional ambiguity, but it is more complex.
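
The first technique quoted above (normalization) can be illustrated with a minimal Python sketch. It is not part of the reproduced article: the function name `normalize_alef` is hypothetical, and the list of "diacritical ALEF" code points is an assumption, since the Arabic characters do not survive in the reproduction above.

```python
# Assumed mapping: the three composed ALEF forms are folded into plain ALEF.
# The exact set of characters used by the quoted authors is not given above.
ALEF_VARIANTS = {
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> plain ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> plain ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> plain ALEF
}
_ALEF_TABLE = str.maketrans(ALEF_VARIANTS)


def normalize_alef(text: str) -> str:
    """Replace every diacritical ALEF in `text` with the plain ALEF."""
    return text.translate(_ALEF_TABLE)
```

As the quoted passage says, this is simple but lossy: two distinct valid words that differ only in their ALEF form become identical after normalization.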

XU Jinxi, FRASER Alexander, WEISCHEDEL Ralph (2002, p. 2)
B Arabic stemming
Arabic has a complex morphology. Most Arabic words (except some proper nouns and words borrowed from other languages) are derived from a root(root). A root usually consists of three letters. We can view a word as derived by first applying a pattern (patern) to a root to generate a stem and then attaching prefixes and suffixes to the stem to generate the word (7) (Khoja and Garside, 2001). For this reason, an Arabic stemmer can be either root-based or stem-based.

C Character N-grams
Broken plurals (...) are very common in Arabic. There is no existing rule-based algorithm to reduce them to their singular forms, and it seems that it would be not be straight-forward to create such an algorithm. As such, broken plurals are not handled by current Arabic stemmers. One technique to address this problem is to use character n-grams. Although broken plurals are not derived by attaching word affixes, many of the letters in broken plurals are the same as in the singular forms (though sometimes in a different order). If words are divided into character n-grams, some of the n-grams from the singular and plural forms will probably match. This technique can also handle words that have a stem but cannot be stemmed by a stemmer for various reasons. For example, the Buckwalter stemmer uses a list of valid stems to ensure the validity of the resulting stems. Although the list is quite large, it is still not complete. N-grams in this case provide a fallback where exact word match fails. In previous (this) work [8], experiments have been made (we have experimented) with n-grams created from stems as well as n-grams from words. N-grams were created by applying a shifting window of n characters over a word or stem. If the word or stem has fewer than n characters, the whole word or stem was returned. The following table shows some results of these experiments. Two methods of creating n-grams were tried: from words and from stems. Retrieval scores in Table I (2) show that stem-based n-grams are better than word-based n-grams for retrieval. The probable reason is that some of the word-based n-grams are prefixes or suffixes, which can cause false matches between documents and queries (.)
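
The "shifting window" described in the passage above can be sketched in a few lines of Python. This is an illustration only, not code from any of the cited works; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(token: str, n: int) -> list[str]:
    """Slide a window of n characters over `token`.

    If the token has fewer than n characters, the whole token is
    returned as the only element, as the quoted passage describes.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

The same function applies to words or to stems: the only difference between the two methods mentioned in the passage is what is passed in as `token`.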

PENG Fuchun (2003, p. 10)
However, character level n-gram models offer the following benefits and have been successfully used in many IR problems:
1. Language independence and simplicity: character level n-gram models are applicable to any language, and even non-language sequences such as music or gene sequences.
2. Robustness: Character level n-gram models are relatively insensitive to spelling variations and errors, particularly in comparison to word features.
3. Completeness: the vocabulary of character tokens is much smaller than any word vocabulary and normally is known in advance. Therefore, the sparse data problem is much less serious in character N-gram models of the same order.

SIMILARITY AND MATCHING MEASURES IN VECTOR SPACE MODEL

In this paragraph we will explain some similarity and matching measures in vector space model often used in Information Retrieval, but we will use them in similarity between a document and a class.
Now that we have the document in a form that minimizes the information we need to consider when matching documents to classes we have to do some matching.
Under the vector and probabilistic models, the document is initially indexed in the same way as the classes.
A TF*ICF weight
In fact I have used the TF*ICF and apply it to the class, then the query will be replaced by the document to be classified and the document will be replaced by the class, then I have defined the new weight TF*ICF.

In TF*ICF, ICF stands for inverse class frequency and TF stands for term frequency (“*” indicates multiplication).

http://en.wikipedia.org/wiki/Tf-idf
The term frequency (count) in the given class is simply the number of times a given term appears in that class (document). This count is usually normalized to prevent a bias towards longer classes (which may have a higher term frequency regardless of the actual importance of that term in the class (document)) to give a measure of the importance of the term t i within the particular class (document dj). (...)

(see equation 1)

where n i is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms (in document dj).

The inverse class frequency is a measure of the general importance of the term (obtained by dividing the number of all classes by the number of classes containing the term, and then taking the logarithm of that quotient).

(see equation 2)

with
* |C|, total number of classes in the corpus
and |{c:ti∈c}|, number of classes (documents) where the term t i appears (that is n i ≠ 0) (...).

Then

(see equation 3)

A high weight in TF–IDF is reached by a high-term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.(...)
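
Since equations 1 to 3 are not reproduced here, the following Python sketch shows one reading of the TF*ICF weight as described in words above: term frequency normalized by the total number of term occurrences in the class, multiplied by the logarithm of the number of classes divided by the number of classes containing the term. The function name `tf_icf`, the input shape and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_icf(classes: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return a TF*ICF weight for every (class, term) pair.

    `classes` maps a class name to the list of term occurrences it contains.
    """
    n_classes = len(classes)

    # Number of classes in which each term appears at least once.
    class_freq: Counter = Counter()
    for terms in classes.values():
        class_freq.update(set(terms))

    weights: dict[str, dict[str, float]] = {}
    for name, terms in classes.items():
        counts = Counter(terms)
        total = sum(counts.values())
        weights[name] = {
            term: (count / total) * math.log(n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights
```

With this reading, a term present in every class gets a weight of zero, which is the filtering effect on common terms that the passage mentions.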

NADKARNI Prakash (course: http://ycmi.med.yale.edu/nadkarni/db_course/IR_Frame.htm)
ICF (IDF) is defined as log (no of classes (documents) in the collection/no of classes containing this document in the collection). This reflects the fact that uncommon documents are more likely to useful in narrowing down the selection of class (documents) than very common documents. TF is defined as log (frequency of term in this class) this reflects the fact that if a keyword occurs multiple times in a class, that class is more likely to be relevant than a class where the keyword occurs just once.

DUNLOP Mark D. (1994, p. 11)
B Dice's coefficient
Now that the documents are both represented as vectors, the vector space model considers the similarity of them to be based on the angle between the two vectors in space. Up until this point (and with the probabilistic model), the vector has simply been a convenient mathematical model for storing a list of terms and their weights. The vector space model then makes the jump to processing them as if they were real geometrical vectors in a space with thousands of dimensions. Although this seems rather strange initially, it is based on an extension of a simple matching routine for non-weighted indexes. Consider non-weighted indexes for the above document and the sample class, these are basically a list of four words for the document and fourteen for the class. A rough measure of matching strength is the number of terms they have in common, in this case two. This does not take into account how large each class is and would have a tendency to match larger documents, so we could divide by the number of terms in total between the document and the class. This leads to Dice's coefficient:

(see equation 4)

where |D∩C| is the number of terms common to the document and class, |D| is the number of terms in the document, |C| the number in the class, and m the matching value – the fraction is doubled to give a maximum value, for matching a class with itself, of 1 instead of 0.5.
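
Equation 4 is not reproduced here, but the description reads as m = 2|D∩C| / (|D| + |C|). A minimal Python sketch under that reading follows; the function name `dice_coefficient` and the convention of returning 0 for two empty sets are assumptions.

```python
def dice_coefficient(doc_terms: set[str], class_terms: set[str]) -> float:
    """Dice's coefficient between a document and a class, both given as
    sets of non-weighted index terms."""
    if not doc_terms and not class_terms:
        return 0.0  # assumed convention: nothing to compare
    common = len(doc_terms & class_terms)
    return 2 * common / (len(doc_terms) + len(class_terms))
```

For the example given in the passage (four terms in the document, fourteen in the class, two in common), this gives 2 × 2 / (4 + 14) = 4/18, roughly 0.22.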


C Cosine coefficient
When considering weighted terms, like those we indexed, it is not possible to simply count the number of terms in common. Instead the vector space model multiplies the term weights together. For the vast majority of terms either the document or the class will have a zero weight, hence the resulting weight is zero. These individual new weights are then summed to give the top line of the matching algorithm. For a document D of N terms and for a class vector C, this leads to:

(see equation 5)

where ‖D‖ is the length of the document (= total number of terms in document D), D i is the weight of term i in vector D, and N is the total number of individual terms (the dimensionality of C). In geometry, this equation is used to calculate the cosine of the angle between the two vectors, hence this matching routine is known as the cosine coefficient.

Although quite simple to understand this approach has no sound bases in information theory there is no theoretical reason for this to be a good matching algorithm. The cosine coefficient does, however, perform well in practice, is reasonably easy to code and is used in many retrieval systems.
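
Equation 5 is not reproduced here. The sketch below implements the standard cosine coefficient, in which the sum of the products of the term weights is divided by the Euclidean lengths of the two vectors. That choice of norm is an assumption: the passage above glosses ‖D‖ as the total number of terms in the document, which is not the usual definition. The function name `cosine_coefficient` and the sparse dictionary representation are likewise illustrative.

```python
import math


def cosine_coefficient(doc: dict[str, float], cls: dict[str, float]) -> float:
    """Cosine of the angle between two sparse weighted term vectors."""
    # Only terms present in both vectors contribute a non-zero product.
    dot = sum(weight * cls[term] for term, weight in doc.items() if term in cls)
    norm_doc = math.sqrt(sum(w * w for w in doc.values()))
    norm_cls = math.sqrt(sum(w * w for w in cls.values()))
    if norm_doc == 0.0 or norm_cls == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_doc * norm_cls)
```

Storing only non-zero weights matches the remark in the passage that, for the vast majority of terms, one of the two vectors has a zero weight.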


V. CLASSIFICATION EFFECTIVENESS
( PRECISION AND RECALL)

KHREISAT Laila (2006 p. 2)
Precision and recall are defined [9] as follows:

(see equation 6)

where CC, number of correct categories (classes) found
TCF, total number of categories found
TC, total number of correct categories
Ever since the 1960s information retrivial (IR) effectiveness is evaluated using the twin measures of recall and precision (10).
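
Equation 6 is not reproduced here. Assuming the usual definitions, precision is CC / TCF and recall is CC / TC, and the two are commonly combined through their harmonic mean, the F1 measure. The Python sketch below follows those assumptions; the function names and the choice of returning 0 when a denominator is zero are illustrative.

```python
def precision(cc: int, tcf: int) -> float:
    """CC / TCF: share of the categories found that are correct."""
    return cc / tcf if tcf else 0.0


def recall(cc: int, tc: int) -> float:
    """CC / TC: share of the correct categories that were found."""
    return cc / tc if tc else 0.0


def f1_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For instance, with 3 correct categories among 4 found, out of 6 correct categories in total, precision is 0.75, recall is 0.5 and F1 is 0.6.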

Pascal Soucy, Guy W. Mineau (2005)
To combine the two measures (precision and recall) in a single value, the F-measure is often used. The F-measure reflects the relative importance of recall versus precision. When as much importance is granted to precision as to recall, we have the F1-measure (...)
which is an estimation of the breakeven point where precision and recall meet if classifier parameters are tuned to balance precision and recall.
We can then also use a single-number measure of effectiveness, as follows:
(see equation 7)
where F1 is the harmonic mean of precision and recall [10]. For this study, relevance has been defined conceptually as:

VI. OUR APPROACH

A. Corpus
We work on the meeting minutes of the Lebanese parliament. Our database contains all minutes from 1922 until 2005. The meetings are rich in different kinds of information (economic, political, juridical, social, etc.). In addition, we have a database for the official journal that contains all laws and decrees. The texts of the official journal are classified manually. Our work is to apply the classification of the official journal to the parliament minutes, of which Lebanese official journal documents form the main part. Our corpus is therefore the Lebanese official journal documents for the year 2002, which form about 2,667 classified (labelled) documents.

B. Methodology
In our approach, we have chosen the N-gram method to represent information. To categorize the texts, we use a learning method: we assume that some texts are already categorized (the learning texts) and use them to derive the prediction method with the N-gram technique.

JALAM Radwan (thesis 2003, p. 11)
SKETCH: TEXT CATEGORIZATION PROCESS

In our experiment, the learning texts will be the 2,667 Lebanese official journal documents (for the year 2002).
These documents are classified and each document belongs to one or multiple predefined classes.

We have two levels of classification: each document belongs to a class of Level 1 and then to a subclass of Level 2.
In general, we have three main classifications (Level 1):
1- Administrative classification.
2- Juridical classification.
3- Thematic classification.

Each classification contains different classes; for example, the administrative classification has 137 classes. Table 2 shows some of these classes.

TABLE 2: Administrative classes

Using these classified documents, we segment them with the N-gram method (three characters) and then segment the classes of both levels.
We then try to find the candidate words for each class (levels 1 and 2) and for each document using the vector space model.
After that, we apply the similarity and matching measures to documents and classes in order to classify the pre-classified documents automatically.
Finally, we determine the most suitable measure using the precision and recall parameters. Once that measure is known, it can be used in the future to classify new documents.


C. Experimental software
To segment the 2,667 documents that form the corpus using the N-gram method, I wrote a program (in VB.net) that applies N-grams of 3, 4, and 5 characters and gives the results in a table (top 50).

D. Experiment
KHREISAT (2006, p. 3)
Generating the N-gram profile consisted of the following steps:
1) Split the text into tokens consisting only of letters. All digits are removed.
2) Compute all possible N-grams, for n = 3 (Trigrams).
3) Compute the frequency of occurrence of each N-gram.
4) Sort the N-grams according to their frequencies from most frequent to least frequent. Discard the frequencies.
5) This gives us the N-gram profile for a document.
For training class documents, the N-gram profiles were saved in text files. Each document to be classified went through the text preprocessing phase, and then the N-gram profile was generated as described above. The N-gram profile of each text document (document profile) was compared against the profiles of all documents in the training classes (class profile) in terms of similarity. Specifically, two measures were used.
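
The profile generation steps above can be sketched in Python as follows. This is not the VB.net program mentioned earlier, only an illustration: the function name `ngram_profile`, the `top` parameter, and the alphabetical tie-break between equally frequent N-grams are assumptions. Tokens are taken here to be maximal runs of Unicode letters, so digits and punctuation act as separators; this works for Arabic script as well as Latin script.

```python
import re
from collections import Counter


def ngram_profile(text, n=3, top=None):
    """Return the N-grams of a text, most frequent first."""
    # Step 1: tokens made only of letters ([^\W\d_] = word character
    # that is neither a digit nor an underscore).
    tokens = re.findall(r"[^\W\d_]+", text)
    # Steps 2 and 3: count every N-gram inside each token.
    counts = Counter()
    for token in tokens:
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    # Step 4: sort by decreasing frequency (ties broken alphabetically
    # so the result is deterministic), then drop the frequencies.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    # Step 5: the ranked list is the profile; optionally keep the top entries.
    return ranked if top is None else ranked[:top]


print(ngram_profile("banana bandana 2002"))
# ['ana', 'ban', 'and', 'dan', 'nan', 'nda']
```

Calling `ngram_profile(text, n=3, top=50)` would give a "top 50" table comparable in shape to the one described for the experimental software.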

FIGURE 3: Classify documents using 3-grams

FIGURE 5: Classification result

E. Results

We calculate the precision and recall obtained with each similarity method and then choose the most suitable method.

Level 1: TF*ICF
Table 3: Level 1, TF*ICF Classification

Table 4: Level 1, Cosine coefficient Classification

Table 5: Level 1, Dice Classification

Table 6: Level 2, TF*ICF Classification

Table 7: Level 2, Cosine coefficient Classification

Table 8: Level 2, Dice Classification


F. Discussion

We observe that, at both levels, the cosine coefficient gave the best results of the three measures used. In Level 1, the average F1 obtained with the cosine coefficient as similarity method is 0.4764, and in Level 2 it is 0.3099.
It is therefore the best of the three measures, but it is still insufficient.
In the results above, the term NaN means Not a Number: the denominator is zero, so the value is not defined.
In Level 1, the precision and recall for the juridical class using TF*ICF and the Dice coefficient are zero, because no correct classes or categories were found. This can be explained by the breadth of juridical documents: a juridical document can also be considered as belonging to an administrative or thematic class.
In Level 2, the worst case was the Dice coefficient, which produced 403 NaN values, perhaps because the Dice coefficient divides the intersection between the document and the class by the sum of the document and the class. The ratio may then be very low for a document that belongs to a certain class.
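
The zero scores and NaN values discussed here can be reproduced with a small Python sketch of the F1 measure, the harmonic mean 2PR / (P + R). The function name `f1_measure` is hypothetical. Python raises an error on division by zero, so the sketch returns NaN explicitly whenever the value is undefined.

```python
import math


def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; NaN when undefined."""
    denominator = precision + recall
    # Undefined when both measures are zero, or when one of them is
    # already NaN because its own denominator was zero.
    if math.isnan(denominator) or denominator == 0:
        return float("nan")
    return 2 * precision * recall / denominator


print(f1_measure(0.75, 0.5))         # ordinary case
print(f1_measure(0.0, 0.0))          # nan: no correct category found
print(f1_measure(float("nan"), 0.0)) # nan: precision itself undefined
```

A class for which no correct category is found has precision and recall equal to zero, so its F1 is NaN under this definition.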

VII. CONCLUSIONS AND FUTURE WORK
XU Jinxi, FRASER Alexander, WEISCHEDEL Ralph (2002)
Arabic is one of the most widely used languages in the world, yet there are relatively few studies on the retrieval and classification of Arabic documents.

KHREISAT Laila (2006, p. 4)
This paper presented the results of classifying Arabic text documents using the N-gram frequency statistics technique employing three similarity (dissimilarity) measures: TF*ICF, cosine coefficient, and Dice's measure of similarity.
Results showed that N-gram text classification using the cosine coefficient (Dice) measure outperforms classification using Dice's measure and the TF*ICF weight (the Manhattan measure).
This work evaluated a number of similarity measures for the classification of Arabic documents, using the Lebanese parliament documents, and especially the Arabic corpus of Lebanese official journal documents, as the test bed.
We have proposed a segmentation method (N-gram) applied to Arabic documents, and our goal is to find the most suitable similarity measure, the one that gives the strongest results when applied to Lebanese official journal documents.
The N-gram method is good, but still insufficient for the classification of Arabic documents; in the future we will therefore have to look at a new approach, such as a distributional or symbolic approach, in order to increase effectiveness.

REFERENCES
[1] Wikipedia Encyclopedia. Available at: http://en.wikipedia.org/

JALAM Radwan (thesis 2003, p. 11)
[2] Lebart and Salem, 1994.

[3] Wiener et al. 1995. Schütze et al. 1995. Stricker. 2000.

[4] Yang and Chute, 1994. Yang and Liu, 1999.

[5] Lewis and Ringuette, 1994. Apté et al. 1994.

[6] Borko and Bernick, 1964. Lewis, 1998. Andropsopulos et al., 2000. Chai et al., 2002. Adam et al., 2002.

[7] Shereen Khoja and Roland Garside, Stemming Arabic text, Computer Science Department, Lancaster University, Lancaster, UK, www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps, 1999.

[8] James Mayfield, Paul McNamee, Cash Costello, Christine Piatko and Amit Banerjee. JHU/APL at TREC 2001: Experiments in Filtering and in Arabic, Video, and Web Retrieval, in E. Voorhees and D. Harman (eds). Proceedings of the Tenth Text REtrieval Conference (TREC 2001). Gaithersburg, Maryland. July 2002.

[9] Ahmed Abdelali, Improving Arabic Information Retrieval Using Local Variations in Modern Standard Arabic, New Mexico Institute of Mining and Technology, 2004.

[10] Victor Lavrenko. Center for Intelligent Information Retrieval. University of Massachusetts Amherst. Hopkins IR Workshop (2005).

* * * * *
